Once we have individual data types like text, images, or audio transformed into numerical representations and properly aligned (as discussed in "Aligning Data from Multiple Sources"), a significant question arises: How can we tell if the information coming from these different sources is related? For instance, how do we determine if a given piece of text accurately describes an image, or if the sound in a video matches the visual scene? This section introduces the basic ideas behind comparing information across different modalities.
The ability to compare information is fundamental to many multimodal AI tasks. Imagine searching your photo library using a text query like "sunset over mountains." The system needs to compare your text query (one modality) with the content of your images (another modality) to find the best matches. Or, consider a system that verifies if a person speaking in a video is the same person whose voice is heard. This also requires comparing information from visual and audio streams.
At its core, comparing information across modalities means assessing how similar or different their content is. If an image shows a cat playing with a yarn ball, and a text description says "A feline engages with a woolen toy," we'd intuitively say these two pieces of information are very similar. If the text said "A dog barks at a car," it would be very different. AI systems aim to quantify this similarity or dissimilarity.
To do this, they work with the numerical representations we learned about earlier (e.g., vectors for text, feature sets for images). If two pieces of information from different modalities are semantically related, their numerical representations should also reflect this relationship in some way. This often means their representations might be "close" to each other in a mathematical sense or share certain predictable patterns.
To make comparisons concrete, AI systems often compute a similarity score. This is typically a number that indicates how related two pieces of information are. A higher score might mean more similar, and a lower score less similar (or vice versa, depending on the specific measure). There are several ways to approach this.
One intuitive way to think about similarity is through distance. If we can represent both an image and a piece of text as points in some high-dimensional space (even if these spaces are initially different), we could, in principle, measure the "distance" between them. If the points are close, the items are considered more similar. For example, if V_I is a vector representing an image and V_T is a vector representing a text caption, a simple measure like the Euclidean distance between V_I and V_T could be used, provided they live in the same space and have the same dimensions. However, data from different modalities often live in very different types of numerical spaces initially.
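As a minimal sketch, assuming both items have already been mapped into the same four-dimensional space, the Euclidean distance can be computed directly with NumPy. The vectors here are made-up placeholders, not outputs of any real encoder.

```python
import numpy as np

# Hypothetical 4-dimensional representations, assuming the image and the
# caption have already been projected into the same space.
v_image = np.array([0.9, 0.1, 0.4, 0.7])
v_text = np.array([0.8, 0.2, 0.5, 0.6])

# Euclidean distance: smaller values mean the points are closer,
# which we interpret as the contents being more similar.
distance = np.linalg.norm(v_image - v_text)
print(f"Euclidean distance: {distance:.3f}")
```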
A very common and effective measure, especially when dealing with high-dimensional data like text embeddings or image features, is cosine similarity. Instead of just looking at the distance between two points (vectors), cosine similarity measures the cosine of the angle between them.
Imagine two vectors as arrows starting from the same point. If the arrows point in nearly the same direction, the angle between them is small and the cosine similarity is close to 1. If they are perpendicular, the similarity is 0, and if they point in opposite directions, it approaches -1.
The formula for cosine similarity between two vectors A and B is:

similarity = cos(θ) = (A ⋅ B) / (∥A∥ ∥B∥)

Here, A ⋅ B is the dot product of the vectors, and ∥A∥ and ∥B∥ are their magnitudes (or lengths). The beauty of cosine similarity is that it is less sensitive to the magnitude of the vectors and more focused on their orientation, the "direction" of their content. This is often useful because the length of a feature vector might not be as informative as the pattern of its values.
For instance, a short sentence and a long descriptive paragraph about a "dog" might have feature vectors of different magnitudes, but their cosine similarity could still be high if they point in a similar direction in the feature space, indicating they both relate to "dog-ness."
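In code, cosine similarity is only a few lines of NumPy. The vectors below are illustrative placeholders standing in for the feature vectors an encoder might produce; note that the second vector has a larger magnitude than the first but points in roughly the same direction.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b (ranges from -1 to 1)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative feature vectors: the second has a larger magnitude
# but points in roughly the same direction as the first.
short_sentence = np.array([0.2, 0.8, 0.1])
long_paragraph = np.array([0.5, 2.1, 0.3])
unrelated_text = np.array([0.9, 0.0, -0.7])

print(cosine_similarity(short_sentence, long_paragraph))  # close to 1.0
print(cosine_similarity(short_sentence, unrelated_text))  # much lower
```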
Comparing information becomes much more direct if we can project data from different modalities into a shared representation space (sometimes called a common embedding space). In such a space, items that are semantically similar are positioned close to each other, regardless of their original modality.
For example, an image of a bicycle, the word "bicycle," and even the sound of a bicycle bell might all be mapped to nearby points in this shared space. Once data is in this common format, comparing them can be as straightforward as calculating distances or cosine similarities between their points in this shared space.
In this diagram, items related to "dog" from both text and image modalities are mapped to nearby points in a shared space. Similarly, items related to "cat" form another cluster. This makes it easier to see that the text 'dog playing' is more similar to the image of a dog playing than to the image of a cat sleeping.
Achieving such a shared space is a significant goal of many multimodal learning techniques, which we'll touch upon later in the course.
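To make the idea concrete, here is a minimal sketch assuming two hypothetical linear projections, W_image and W_text, that map modality-specific features into a shared 64-dimensional space. In a real system these projections would be learned during multimodal training rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical modality-specific features with different dimensionalities,
# standing in for the outputs of an image encoder and a text encoder.
image_features = rng.normal(size=512)
text_features = rng.normal(size=256)

# Stand-in projection matrices into a shared 64-dimensional space.
# In practice these would be learned, not sampled at random.
W_image = rng.normal(size=(64, 512))
W_text = rng.normal(size=(64, 256))

image_embedding = W_image @ image_features
text_embedding = W_text @ text_features

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Once both items live in the same space, comparison is a single call.
print(cosine_similarity(image_embedding, text_embedding))
```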
Let's consider a couple of straightforward scenarios:
Image-Text Matching: Given an image and a candidate text caption, the system computes a similarity score between their representations. A high score suggests the caption describes the image, while a low score suggests a mismatch (see the sketch after this list).
Audio-Visual Synchrony: Given the audio and visual streams of a video, the system compares their representations over time to check whether the sound matches what is shown, for example whether lip movements line up with the speech being heard.
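Here is a small sketch of the image-text matching scenario. It assumes the image and the candidate captions already have embeddings in a shared space (all the vectors below are placeholders) and ranks the captions by cosine similarity to the image.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings, assumed to already live in a shared space.
image_embedding = np.array([0.7, 0.2, 0.6, 0.1])
captions = {
    "A cat plays with a yarn ball": np.array([0.68, 0.25, 0.55, 0.12]),
    "A dog barks at a car":         np.array([0.1, 0.9, -0.2, 0.4]),
    "Sunset over mountains":        np.array([-0.3, 0.4, 0.1, 0.8]),
}

# Score each caption against the image and rank from best to worst match.
ranked = sorted(
    captions.items(),
    key=lambda item: cosine_similarity(image_embedding, item[1]),
    reverse=True,
)
for caption, embedding in ranked:
    print(f"{cosine_similarity(image_embedding, embedding):.3f}  {caption}")
```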
While the idea of comparing information seems intuitive, it comes with its own set of challenges. As noted above, data from different modalities start out in very different numerical spaces, so raw representations are rarely directly comparable and usually must first be mapped into a shared space. Even then, similarity scores need careful interpretation: what counts as "similar enough" depends on the task, and representations that were not explicitly trained to align across modalities may place semantically related items far apart.
Understanding how to represent, align, and then compare information from different modalities lays a critical foundation. These comparisons are not just an end goal; they are often a stepping stone towards more complex tasks where information from multiple sources is integrated to make a decision or generate new content, as we will see in later chapters.