Okay, we've talked about what we want to do when we have data from different sources, like images and text. We've seen ideas like fusion, where we mix the data together, and representation learning, where we try to get the data into a common format or find connections between formats. Now, let's look at how we actually build systems that do this. We'll examine some fundamental structures, or "architectures," using neural networks to bring these ideas to life. Think of these as basic blueprints that you'll see often in multimodal AI.
One very common way to handle multiple types of data is to first process each type separately and then combine their processed information.
Imagine you have an image and a piece of text that you want a system to understand together.
- Individual Processing (Encoding):
- The image goes into an "image encoder." This part of the system is specialized to understand images. It might be a Convolutional Neural Network (CNN) that looks for edges, shapes, and textures, and eventually outputs a set of numbers (a feature vector, let's call it $v_{\text{image}}$) that represents the important visual information.
- Similarly, the text goes into a "text encoder." This part is good at understanding language. It might be a Recurrent Neural Network (RNN) or a Transformer-based model that converts words into numerical representations and tries to capture the meaning of the sentence, also outputting a feature vector, $v_{\text{text}}$.
- Combining Features: Once we have these two feature vectors, $v_{\text{image}}$ and $v_{\text{text}}$, we need to combine them. A simple and effective way is to just stick them together, end-to-end. This operation is called concatenation:
$$v_{\text{combined}} = \text{concat}(v_{\text{image}}, v_{\text{text}})$$
This new vector, $v_{\text{combined}}$, now holds information from both the image and the text in a single representation.
- Joint Processing: This combined vector is then fed into another part of the neural network, often called a "joint processing unit" or "fusion layer." This unit, typically made of one or more dense (fully connected) layers, learns to make sense of the combined information to perform a specific task. This could be anything from answering a question about the image (Visual Question Answering) to generating a caption for the image.
This architectural pattern allows each modality to be initially understood by a specialized component before their insights are merged. It's quite flexible because the point of fusion can happen early (if $v_{\text{image}}$ and $v_{\text{text}}$ are very raw features) or a bit later (if they are more processed, abstract features), aligning well with the early and intermediate fusion strategies we discussed previously.
Separate encoders process image and text data. Their extracted features ($v_{\text{image}}$, $v_{\text{text}}$) are then combined (e.g., concatenated into $v_{\text{combined}}$) and fed into a joint processing unit to produce a final output.
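To make this concrete, here is a minimal sketch of the pattern in PyTorch. PyTorch is simply our choice of framework for these sketches, and every size in it is a made-up placeholder: a tiny CNN image encoder, a bag-of-embeddings text encoder with an assumed 5,000-token vocabulary, and a 10-class output.

```python
import torch
import torch.nn as nn

class ImageTextFusionModel(nn.Module):
    """Separate encoders, joint processing: encode each modality on its own,
    concatenate the feature vectors, then process them jointly."""

    def __init__(self, num_classes=10, vocab_size=5000):
        super().__init__()
        # Image encoder: a very small CNN that turns an RGB image into v_image.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (batch, 32)
        )
        # Text encoder: embed token ids and average them into v_text.
        self.embedding = nn.Embedding(vocab_size, 64)
        # Joint processing unit: dense layers over the concatenated vector.
        self.fusion = nn.Sequential(
            nn.Linear(32 + 64, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, images, token_ids):
        v_image = self.image_encoder(images)               # (batch, 32)
        v_text = self.embedding(token_ids).mean(dim=1)     # (batch, 64)
        v_combined = torch.cat([v_image, v_text], dim=1)   # concatenation
        return self.fusion(v_combined)                     # task-specific logits

# Quick shape check with random inputs.
model = ImageTextFusionModel()
images = torch.randn(4, 3, 64, 64)            # a batch of 4 RGB images
token_ids = torch.randint(0, 5000, (4, 12))   # 4 sentences of 12 token ids each
print(model(images, token_ids).shape)         # torch.Size([4, 10])
```

Notice that the only "multimodal" line is the `torch.cat` call: everything before it is ordinary unimodal encoding, and everything after it is an ordinary feed-forward network.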
Another important architectural approach aligns with the idea of learning shared representations. The aim here is to project information from different modalities into a common space where their representations can be directly compared or meaningfully combined, as if they "speak the same language."
Architectures designed for this often include:
- Modality-Specific Encoders: As with the previous pattern, each type of data (e.g., an image, a piece of audio) begins its journey through its own specialized encoder. Let's say these encoders output initial representations $e_{\text{image}}$ and $e_{\text{audio}}$.
- Projection to a Shared Space: The critical step is transforming these initial representations into a common, shared representational space. This is typically done by additional neural network layers (projection heads) that map $e_{\text{image}}$ and $e_{\text{audio}}$ to new vectors, $s_{\text{image}}$ and $s_{\text{audio}}$, respectively, which live in this shared space.
$$s_{\text{image}} = \text{projection}_{\text{image}}(e_{\text{image}})$$
$$s_{\text{audio}} = \text{projection}_{\text{audio}}(e_{\text{audio}})$$
The training process for such an architecture often involves an objective function (like a contrastive loss or a triplet loss, which are more advanced topics) that encourages $s_{\text{image}}$ and $s_{\text{audio}}$ to be close together in the shared space if the image and audio correspond (e.g., an image of a dog barking and an audio clip of a bark) and far apart if they don't.
- Operations in the Shared Space: Once representations from different modalities are in this shared space, they can be used for various tasks:
- Similarity Measurement: Calculate how similar an image is to a piece of text or audio.
- Cross-Modal Retrieval: Find the most relevant images for a given text query, or vice versa.
- Combined Processing: Features in the shared space can be further processed by another network.
This type of architecture is fundamental to models that learn joint embeddings, effectively building a bridge between how different types of data initially represent information.
Data from Modality A and Modality B are independently encoded (producing $e_A$, $e_B$) and then projected (producing $s_A$, $s_B$) into a shared representation space where they can be compared or used for other joint tasks.
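Here is a minimal sketch of the projection step, again in PyTorch. We assume the encoder outputs $e_{\text{image}}$ and $e_{\text{audio}}$ are already available as vectors (512- and 128-dimensional here, purely for illustration), and we L2-normalize the projected vectors so that a plain dot product acts as a cosine similarity in the shared space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    """Project per-modality features into a shared space where matching
    image/audio pairs can be compared directly."""

    def __init__(self, image_dim=512, audio_dim=128, shared_dim=256):
        super().__init__()
        # In a full system these would sit on top of CNN / spectrogram encoders;
        # here we take the encoder outputs e_image and e_audio as given.
        self.project_image = nn.Linear(image_dim, shared_dim)  # image projection head
        self.project_audio = nn.Linear(audio_dim, shared_dim)  # audio projection head

    def forward(self, e_image, e_audio):
        # Map each modality into the shared space and L2-normalize, so a dot
        # product between the results is a cosine similarity.
        s_image = F.normalize(self.project_image(e_image), dim=-1)
        s_audio = F.normalize(self.project_audio(e_audio), dim=-1)
        return s_image, s_audio

projector = SharedSpaceProjector()
e_image = torch.randn(8, 512)   # stand-in encoder outputs for 8 images
e_audio = torch.randn(8, 128)   # stand-in encoder outputs for 8 audio clips
s_image, s_audio = projector(e_image, e_audio)

# Similarity between every image and every audio clip (an 8 x 8 matrix).
# A contrastive training objective would push the diagonal entries (true
# image/audio pairs) up and the off-diagonal entries (mismatches) down.
similarity = s_image @ s_audio.T
print(similarity.shape)  # torch.Size([8, 8])
```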
Sometimes, it's more effective to let each modality be processed almost entirely by its own specialized model, and then combine their individual outputs or predictions at the very end. This approach is known as a late fusion architecture.
Here's how it typically works:
- Independent Unimodal Models: You have separate, often complete, AI models for each modality.
- For instance, one model might analyze a movie review (text) to predict its sentiment (positive, negative, or neutral). Let's say its prediction is $p_{\text{text}}$.
- Another model might analyze the user's tone of voice from an audio recording of them speaking the review to detect their emotion (happy, sad, angry). Let's call its prediction $p_{\text{audio}}$.
- Individual Predictions/Outputs: Each of these unimodal models produces its own output. These outputs could be class probabilities (e.g., 70% positive, 30% negative for text), class labels, or even numerical scores.
- Decision-Level Combination: The final step is to combine these individual predictions to arrive at an overall multimodal decision. This combination can be achieved in several ways:
- Simple Rules: Averaging the probability scores, taking a majority vote if outputs are class labels, or using a weighted average if one modality is considered more reliable.
- A Small Meta-Model: A simple model, like a logistic regression or a small neural network, can be trained to take the unimodal predictions ($p_{\text{text}}$, $p_{\text{audio}}$) as input and learn the best way to combine them into a final prediction, $p_{\text{final}}$:
$$p_{\text{final}} = \text{meta\_model}(p_{\text{text}}, p_{\text{audio}})$$
Late fusion is particularly useful when the modalities are quite distinct, or when you already have high-performing unimodal models that you wish to combine without needing to retrain everything from scratch. It's generally simpler to implement than early or intermediate fusion because the core unimodal processing paths remain largely independent.
Independent models process Modality A and Modality B (here, text and audio), producing predictions $p_{\text{text}}$ and $p_{\text{audio}}$. These predictions are then combined by a decision fusion unit to yield the final multimodal output $p_{\text{final}}$.
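Here is a small sketch of the decision-level combination in PyTorch, with hand-made probability values standing in for the outputs of real text and audio models. The three-way (negative, neutral, positive) label set and the 0.7/0.3 weights are just illustrative assumptions.

```python
import torch
import torch.nn as nn

# Suppose two independent unimodal models have already produced class
# probabilities over (negative, neutral, positive) for 4 movie reviews.
p_text = torch.tensor([[0.1, 0.2, 0.7],
                       [0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3],
                       [0.3, 0.3, 0.4]])
p_audio = torch.tensor([[0.2, 0.2, 0.6],
                        [0.5, 0.4, 0.1],
                        [0.1, 0.6, 0.3],
                        [0.6, 0.2, 0.2]])

# Simple rule: average the two probability distributions.
p_final_avg = (p_text + p_audio) / 2

# Weighted average, if the text model is considered more reliable.
p_final_weighted = 0.7 * p_text + 0.3 * p_audio

# Small meta-model: a linear layer (plus softmax) that could be trained on
# labelled examples to learn how best to combine the unimodal predictions.
meta_model = nn.Sequential(
    nn.Linear(3 + 3, 3),   # input is [p_text, p_audio] concatenated
    nn.Softmax(dim=-1),
)
p_final_learned = meta_model(torch.cat([p_text, p_audio], dim=1))

print(p_final_avg.argmax(dim=1))      # decisions from the simple rule
print(p_final_learned.argmax(dim=1))  # decisions from the (untrained) meta-model
```

Note that nothing upstream of this snippet needs to change: the unimodal models can be trained, swapped, or improved independently, which is exactly the appeal of late fusion.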
These architectural patterns are constructed using the standard tools from the neural network toolkit. While Chapter 4 will go into more detail on specific components, it's good to know that these architectures typically use:
- Encoder Networks:
- For images, Convolutional Neural Networks (CNNs) are very common for extracting visual features.
- For text, Recurrent Neural Networks (RNNs like LSTMs or GRUs) or Transformer networks are often used to process sequences of words.
- For audio, CNNs (often applied to spectrogram representations of sound) or RNNs can be employed to capture temporal patterns.
- Combining and Transforming Layers:
- Dense layers (also called fully connected layers) are workhorses for combining feature vectors from different sources, for projecting features into shared spaces, or as part of the joint processing unit. They are versatile for learning complex relationships between numerical inputs.
- Task-Specific Output Layers: The final layer(s) of the network will be tailored to the specific task. For example, a layer with a softmax activation function is used for classification tasks (like identifying an object or a sentiment category), while a linear layer might be used for regression tasks (like predicting a score); the short sketch after this list shows both kinds of output head.
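As a tiny illustration of these building blocks (all feature sizes are arbitrary), here is how dense layers combine two feature vectors and feed two different task-specific heads, one with a softmax for classification and one purely linear for regression:

```python
import torch
import torch.nn as nn

# Feature vectors from two different encoders (sizes chosen arbitrarily).
v_image = torch.randn(4, 32)
v_text = torch.randn(4, 64)
v_combined = torch.cat([v_image, v_text], dim=1)   # (4, 96)

# Dense layers acting as the joint processing / fusion unit.
trunk = nn.Sequential(nn.Linear(96, 128), nn.ReLU())

# Task-specific heads on top of the shared trunk.
classification_head = nn.Linear(128, 5)   # 5-way classification -> softmax over logits
regression_head = nn.Linear(128, 1)       # regression -> a single linear output

hidden = trunk(v_combined)
class_probs = torch.softmax(classification_head(hidden), dim=-1)
score = regression_head(hidden)
print(class_probs.shape, score.shape)     # torch.Size([4, 5]) torch.Size([4, 1])
```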
You don't need to master all these layer types at this stage. The important takeaway is that these multimodal architectures are assembled by connecting different types of neural network layers, each good at a particular kind of processing, to effectively handle and integrate information from multiple data sources.
It's helpful to explicitly connect these basic architectures back to the fusion strategies we discussed earlier in this chapter:
- The Separate Encoders, Joint Processing architecture is highly flexible. It can implement:
- Early Fusion: If the combination (e.g., concatenation) happens with raw data or very low-level features extracted by shallow encoders.
- Intermediate Fusion: If more processed, abstract features from deeper within the encoders are combined. The first diagram we looked at typically represents this form of intermediate fusion.
- Architectures for Shared Representations are fundamentally about learning a common ground. While fusion is primarily about combining data streams, shared representation learning focuses on transforming modalities so they become comparable or can be seamlessly integrated within this common space. This often involves specialized training objectives designed to align the different modalities.
- The Late Fusion Architecture directly implements the late fusion strategy. Here, the combination occurs at the decision or prediction level, after each modality has been extensively and independently processed by its own dedicated model.
These patterns are not rigid, mutually exclusive categories. Many advanced multimodal systems in practice might blend elements from these basic structures. For instance, a system could use separate encoders, project their outputs to a semi-shared space for some initial joint processing (an intermediate fusion step), and perhaps even incorporate an attention mechanism (which we'll touch upon next) before a final output layer. However, understanding these fundamental architectural patterns provides a solid foundation for making sense of more complex multimodal AI systems you might encounter.