The chapter introduction explained that we need ways to bring information from different modalities together. We've looked at fusion strategies, where data streams are merged. Now, let's explore another powerful approach: learning shared representations.
Imagine you're trying to get two people who speak different languages to understand each other. You could translate word-for-word (which is a bit like some fusion methods), or you could try to get them both to understand a set of common ideas or symbols. Shared representations in multimodal AI are more like the latter.
The main idea is to transform data from different modalities, like an image and a piece of text, into a common format or "space" where they can be directly compared. Think of it as creating a universal language that both images and text (or audio, etc.) can be translated into. If an image and a sentence describe the same thing, their translations into this universal language should be very similar.
A shared representation, sometimes called a joint embedding space, is a vector space where representations (or "embeddings") from two or more different modalities can coexist and be meaningfully compared.
Let's say we have an image, $x_{img}$, and a text caption, $x_{txt}$, that describes it. The goal is to learn two mapping functions, $f_{img}(\cdot)$ and $f_{txt}(\cdot)$, such that:

$$z_{img} = f_{img}(x_{img}), \qquad z_{txt} = f_{txt}(x_{txt})$$

Here, $z_{img}$ and $z_{txt}$ are vectors in the same high-dimensional space (e.g., both might be 300-dimensional vectors). The "shared" part means that if $x_{img}$ and $x_{txt}$ are semantically related (e.g., the image depicts what the text describes), then their corresponding vectors $z_{img}$ and $z_{txt}$ should be "close" to each other in this space.
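To make "close to each other" concrete, here is a toy illustration using NumPy with made-up 4-dimensional vectors (real embedding spaces typically have hundreds of dimensions). Cosine similarity is one common way to measure how aligned two embeddings are:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: close to 1.0 when vectors point the same way, near 0 (or negative) when they don't."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings, invented for illustration only.
z_img = np.array([0.9, 0.1, 0.0, 0.4])          # image of a dog catching a ball
z_txt_match = np.array([0.8, 0.2, 0.1, 0.5])    # caption: "a dog catching a ball"
z_txt_other = np.array([-0.1, 0.9, 0.7, -0.3])  # caption: "a bowl of soup"

print(cosine_similarity(z_img, z_txt_match))  # high, roughly 0.98
print(cosine_similarity(z_img, z_txt_other))  # much lower, slightly negative here
```

A well-trained shared space produces exactly this pattern: high similarity for the matching caption, low similarity for an unrelated one.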
Creating these common spaces is useful for several reasons, the most important being that content from different modalities can be compared directly: an image and a sentence can be scored for similarity simply by measuring how close their vectors are, which in turn makes cross-modal search and retrieval possible.
Typically, shared representations are learned using neural networks. Each modality will often have its own network (or "encoder") responsible for processing its specific type of data and transforming it into an initial feature vector. Then, these feature vectors are projected into the shared embedding space.
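A rough sketch of this two-encoder-plus-projection setup, assuming PyTorch, might look like the following. The linear "encoders", layer sizes, and names are illustrative stand-ins; in practice the encoders would be a pretrained vision model and a pretrained language model producing feature vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceModel(nn.Module):
    """Two modality-specific encoders followed by projection heads into one shared space."""

    def __init__(self, img_feat_dim=2048, txt_feat_dim=768, shared_dim=300):
        super().__init__()
        # Stand-in "encoders": real systems use pretrained vision and language backbones here.
        self.image_encoder = nn.Sequential(nn.Linear(img_feat_dim, 1024), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Linear(txt_feat_dim, 1024), nn.ReLU())
        # Projection heads map both modalities into the same shared_dim-dimensional space.
        self.image_proj = nn.Linear(1024, shared_dim)
        self.text_proj = nn.Linear(1024, shared_dim)

    def forward(self, img_feats, txt_feats):
        z_img = self.image_proj(self.image_encoder(img_feats))
        z_txt = self.text_proj(self.text_encoder(txt_feats))
        # L2-normalize so cosine similarity between the two is just a dot product.
        return F.normalize(z_img, dim=-1), F.normalize(z_txt, dim=-1)
```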
The magic happens during the training process. The model is trained on pairs (or triplets) of data from different modalities. For instance, it might be given:

- A positive pair: an image together with a caption that actually describes it.
- A negative pair: the same image together with a caption pulled from an unrelated example.
- In the triplet setting, all three at once: an anchor (say, the image), a positive (its caption), and a negative (an unrelated caption).
The learning algorithm, guided by a loss function, tries to adjust the network parameters so that:

- Embeddings of positive (matched) pairs end up close together in the shared space.
- Embeddings of negative (mismatched) pairs end up far apart.
This process, often involving techniques like contrastive learning or triplet loss (though the specifics are beyond our current scope), effectively "shapes" the shared space so that semantic similarity across modalities translates into proximity in the vector space.
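Those specifics are beyond our scope, but a small sketch can still show the shape of the idea. The function below is a simplified contrastive (InfoNCE-style) loss, assuming PyTorch, a batch of L2-normalized embeddings where row i of the image batch matches row i of the text batch, and an illustrative temperature value:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_img, z_txt, temperature=0.07):
    """Simplified contrastive loss for a batch of matched image/text embedding pairs.

    z_img, z_txt: (batch, shared_dim) L2-normalized embeddings; row i of each is a matched pair.
    """
    # Similarity of every image in the batch with every text in the batch.
    logits = z_img @ z_txt.t() / temperature                    # (batch, batch)
    targets = torch.arange(z_img.size(0), device=z_img.device)  # matched pairs sit on the diagonal
    # Pull matched pairs together and push mismatched pairs apart, in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```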
Here's a diagram illustrating the general idea for images and text:
Data from different modalities (image, text) are processed by their respective encoders and mapped into a shared semantic space where their vector representations can be directly compared.
For example, if you have an image of a "blue car parked by a red fire hydrant" and the text "a blue car next to a red hydrant," their representations $z_{img}$ and $z_{txt}$ in the shared space would ideally be very close. If the text were "a cat sleeping on a mat," its vector $z_{txt\_other}$ should be far from $z_{img}$.
Shared representation learning is a powerful technique for many multimodal tasks, especially those involving search, retrieval, and comparison.
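For example, cross-modal retrieval reduces to a nearest-neighbor search in the shared space. The sketch below uses random NumPy vectors as stand-ins for a small gallery of image embeddings and fakes a text query by perturbing one of them; in a real system both would come from trained encoders like those sketched above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in gallery: 5 image embeddings in a 300-dimensional shared space, L2-normalized.
image_embeddings = rng.normal(size=(5, 300))
image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)

# Fake a text query that describes image 2 by adding a little noise to its embedding.
query_text_embedding = image_embeddings[2] + 0.1 * rng.normal(size=300)
query_text_embedding /= np.linalg.norm(query_text_embedding)

# For normalized vectors, cosine similarity is just a dot product.
scores = image_embeddings @ query_text_embedding
best_match = int(np.argmax(scores))
print(best_match)  # expected: 2, the image whose content the query text describes
```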
Benefits:

- Content from different modalities can be compared with simple distance or similarity measures.
- Cross-modal search and retrieval become straightforward: embed the query from one modality and look for nearby embeddings from the other.
- The learned embeddings can often be reused as general-purpose features for downstream tasks.
Challenges:

- Training usually requires large amounts of paired (aligned) data across modalities.
- Forcing very different kinds of data into one space can discard modality-specific detail.
- Choosing informative negative examples and tuning the loss so the space is well structured can be difficult in practice.
While methods like early or late fusion, which we discussed previously, directly combine features or decisions, shared representation learning focuses on creating a common ground for understanding. This approach provides a different, and often very effective, way to integrate information from multiple modalities. Next, we'll look at another related idea: coordinated representations.