In this chapter, we've explored several ways to combine information from different modalities. Sometimes, seeing these methods laid out visually can make them click. This practice session is all about helping you build mental models of how fusion and representation strategies work by drawing and interpreting diagrams. Don't worry about artistic skill; the goal is clarity.
Remember, multimodal fusion can happen at different stages: early (at the input or data level), at an intermediate point (after feature extraction), or late (at the decision level).
Let's try to visualize these.
Your Turn: Sketch It Out!
Before looking at our examples, grab a piece of paper or open a drawing tool. Try to sketch a simple block diagram for each of these fusion types: Early, Intermediate, and Late. Think about where the data from each modality enters, how much processing each modality gets on its own, at what point the information is combined, and what the final output is.
This exercise helps solidify your understanding. Once you've given it a try, compare your sketches with the diagrams below.
Early fusion, also known as input-level or data-level fusion, involves combining information from different modalities at the very beginning of the process. This often means concatenating raw data or basic features extracted from each modality.
A diagram showing early fusion. Data from Modality A and Modality B are directly combined at the fusion stage, producing combined features that are then fed into a multimodal model.
In this diagram:
- Modality A (e.g., raw image pixels) and Modality B (e.g., raw text tokens) are the inputs.
- The inputs are combined at the Early Fusion stage. This could be as simple as stacking their feature vectors side by side, as in $v_{\text{fused}} = \text{concat}(v_A, v_B)$ (see the short code sketch after this list).
- The Combined Features are then processed by a single Multimodal Model to produce an Output.
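To make the concatenation step concrete, here is a minimal PyTorch sketch of early fusion. The feature sizes, model layers, and class count are arbitrary placeholders, not prescriptions.

```python
import torch
import torch.nn as nn

# Placeholder feature vectors for one sample from each modality
# (the dimensions are arbitrary and only for illustration).
v_a = torch.randn(512)   # e.g., flattened image pixels or basic image features
v_b = torch.randn(128)   # e.g., pooled text token embeddings

# Early fusion: concatenate the modalities before any joint processing.
v_fused = torch.cat([v_a, v_b], dim=0)   # shape: (640,)

# A single multimodal model then consumes the combined features.
multimodal_model = nn.Sequential(
    nn.Linear(640, 256),
    nn.ReLU(),
    nn.Linear(256, 10),   # e.g., 10 output classes
)
output = multimodal_model(v_fused)
```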
The main point here is that fusion happens before any significant independent processing of each modality.

Intermediate fusion, or feature-level fusion, occurs after each modality has undergone some initial processing and feature extraction. These extracted features are then combined.
A diagram illustrating intermediate fusion. Modality A and Modality B are first processed by separate feature extractors. The resulting features (Features A and Features B) are then combined at the fusion stage. This combined representation is used by a subsequent multimodal model.
Here's what's happening:
- Modality A and Modality B are first processed by their own Feature Extractor (e.g., a Convolutional Neural Network for images, a word embedding layer for text).
- This produces Features A and Features B, which are more abstract representations than the raw data.
- These features are combined at the Intermediate Fusion stage (see the code sketch after this list).
- The Combined Features feed into the Multimodal Model.
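A minimal sketch of this pattern follows, again with placeholder dimensions and deliberately simplified "feature extractors" (a real system would likely use a CNN and a proper text encoder here):

```python
import torch
import torch.nn as nn

# Stand-in feature extractors; real systems might use a CNN for images
# and an embedding layer or transformer for text.
image_extractor = nn.Sequential(nn.Linear(3072, 256), nn.ReLU())
text_extractor = nn.Sequential(nn.Linear(300, 256), nn.ReLU())

# Downstream model that operates on the fused features.
multimodal_model = nn.Linear(256 + 256, 10)

# A batch of placeholder inputs (sizes chosen arbitrarily).
image_batch = torch.randn(8, 3072)
text_batch = torch.randn(8, 300)

# 1. Each modality is processed independently first.
features_a = image_extractor(image_batch)   # (8, 256)
features_b = text_extractor(text_batch)     # (8, 256)

# 2. Intermediate fusion: combine the extracted features.
combined_features = torch.cat([features_a, features_b], dim=1)   # (8, 512)

# 3. The fused representation feeds the rest of the model.
logits = multimodal_model(combined_features)
```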
Fusion happens after some unimodal processing but before the final decision-making part of the model.

Late fusion, also called decision-level fusion, involves processing each modality independently through separate models. The outputs or decisions from these models are then combined to produce a final result.
A diagram of late fusion. Modality A is processed by Model A to produce Prediction A. Independently, Modality B is processed by Model B to produce Prediction B. These individual predictions are then combined at the fusion stage to yield a final prediction.
In late fusion:
- Modality A is fed into Model A, which produces Prediction A.
- Modality B is fed into Model B, producing Prediction B.
- The individual predictions are combined at the Late Fusion stage (e.g., by averaging, voting, or a simple learned layer; see the code sketch after this list).
- This yields the Final Prediction.
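For instance, averaging class probabilities from two independent models is one common late-fusion rule. This sketch uses random placeholder probabilities and an assumed weighting just to show the mechanics:

```python
import torch

# Placeholder class probabilities from two fully independent unimodal
# models, for a batch of 4 samples over 10 classes (random values here).
prediction_a = torch.softmax(torch.randn(4, 10), dim=1)
prediction_b = torch.softmax(torch.randn(4, 10), dim=1)

# Late fusion by weighted averaging of the two decisions.
weight_a, weight_b = 0.6, 0.4   # assumed weights; these could also be learned
fused = weight_a * prediction_a + weight_b * prediction_b

final_prediction = fused.argmax(dim=1)   # one class index per sample
```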
The main idea is that each modality is fully processed by its own model, and only the high-level results are merged.

Think About It: Look back at your own sketches and these examples. Did your diagrams capture where each modality is processed on its own and where the information comes together?
Beyond just when to fuse, we also discussed how modalities relate to each other in terms of their representations. Two main ideas are shared representations and coordinated representations.
A shared representation (or joint embedding) aims to map data from different modalities into a single, common vector space. In this shared space, representations from different modalities can be directly compared. For example, an image of a cat and the word 'cat' might be projected to nearby points in this space.
A diagram illustrating a shared representation space. Modality A is transformed into "Image Rep" and Modality B into "Text Rep." Both representations exist within the same "Shared Space (S)," allowing for direct comparison or interaction.
In this diagram:
- Modality A (e.g., an image) goes through Encoder A.
- Modality B (e.g., text) goes through Encoder B.
- Both encoders project their outputs (Rep A and Rep B) into the same Shared Space (S), where they can be compared directly (see the code sketch after this list).
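Here is a minimal sketch of the idea, with two linear layers standing in for real image and text encoders and all dimensions chosen arbitrarily:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two stand-in encoders that both project into the same 64-dimensional
# shared space (input sizes are placeholders for real encoder outputs).
encoder_a = nn.Linear(2048, 64)   # e.g., image features -> shared space
encoder_b = nn.Linear(300, 64)    # e.g., text features  -> shared space

image_features = torch.randn(1, 2048)
text_features = torch.randn(1, 300)

# Normalize so that a dot product equals cosine similarity.
rep_a = F.normalize(encoder_a(image_features), dim=1)
rep_b = F.normalize(encoder_b(text_features), dim=1)

# Because both representations live in the same space, they can be
# compared directly, which is what cross-modal retrieval relies on.
similarity = (rep_a * rep_b).sum(dim=1)
```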
The goal is that similar items, regardless of their original modality, end up close together in this shared space. This is powerful for tasks like cross-modal retrieval (finding images based on text queries).

Coordinated representations, on the other hand, don't necessarily force everything into one identical space. Instead, they focus on learning mappings or correlations between separate representation spaces for each modality. The spaces are 'coordinated' such that you can translate or relate information from one to the other, even if their structures are different.
A diagram of coordinated representation spaces. Modality A is encoded into "Rep A," associated with its own "Space A," while Modality B is encoded into "Rep B," associated with its "Space B." A "Coordination (Learned Mapping)" mechanism allows these distinct representations to be related or transformed into one another.
In this setup:
- Modality A is encoded into Rep A, which is associated with Space A.
- Modality B is encoded into Rep B, associated with Space B.
- The two spaces are linked by a Coordination (Learned Mapping). This mapping allows the system to understand relationships between Rep A and Rep B, even if they are in different mathematical spaces. For example, the model might learn to translate an image representation in Space A to a text description representation in Space B (sketched in code after the quick check below).

Quick Check: What's the main difference in how the representations from different modalities are treated in a shared space versus coordinated spaces? One brings them together into a common area; the other builds bridges between separate areas. Which is which?
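Returning to the coordinated setup, here is a minimal sketch in which each modality keeps its own space and a learned linear mapping acts as the bridge between them. All layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Each modality keeps its own representation space: images live in a
# 128-dimensional Space A, text in a 96-dimensional Space B (both sizes
# are arbitrary for this sketch).
encoder_a = nn.Linear(2048, 128)   # image features -> Space A
encoder_b = nn.Linear(300, 96)     # text features  -> Space B

# The coordination mechanism: a learned mapping from Space A to Space B.
# In practice it would be trained so that matching image/text pairs end
# up close together once the mapping is applied.
a_to_b = nn.Linear(128, 96)

image_features = torch.randn(1, 2048)
text_features = torch.randn(1, 300)

rep_a = encoder_a(image_features)   # lives in Space A
rep_b = encoder_b(text_features)    # lives in Space B

# Translate the image representation into Space B so the two can be
# related there, even though their native spaces differ.
rep_a_in_b = a_to_b(rep_a)
distance = torch.norm(rep_a_in_b - rep_b, dim=1)
```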
Let's try to apply these ideas. Imagine you are tasked with building a simple system that helps identify a species of bird. You have two types of input: a photograph of the bird and a short audio recording of its call.
Your goal is to predict the bird species.
Activity: Pick a fusion strategy (early, intermediate, or late) for this system, sketch a block diagram of how the image and audio inputs flow through it, and note why you chose that strategy over the alternatives.
There is no single 'right' answer here. The goal is to think through the design choices using the visual language we've been practicing.
For example, if you chose Intermediate Fusion:
An example diagram for the bird identification scenario using intermediate fusion. Image and audio inputs are processed by respective feature extractors. The resulting features are combined and then fed to a classifier to predict the bird species.
This diagram shows one way to approach the bird identification task. You could also sketch out an early or late fusion approach for comparison.
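If you want to take the intermediate-fusion sketch one step further, here is a toy code version of the same bird-identification pipeline. The number of species, feature sizes, and the stand-in extractors are all assumptions made for illustration:

```python
import torch
import torch.nn as nn

NUM_SPECIES = 20   # assumed number of bird species for this sketch

# Stand-in feature extractors; a real system might use a CNN on the photo
# and a spectrogram-based network on the audio clip.
image_extractor = nn.Sequential(nn.Linear(3072, 128), nn.ReLU())
audio_extractor = nn.Sequential(nn.Linear(1024, 128), nn.ReLU())

# Classifier that consumes the fused features.
classifier = nn.Linear(128 + 128, NUM_SPECIES)

def predict_species(image_vec: torch.Tensor, audio_vec: torch.Tensor) -> torch.Tensor:
    """Intermediate fusion: extract features per modality, concatenate, classify."""
    image_feats = image_extractor(image_vec)
    audio_feats = audio_extractor(audio_vec)
    fused = torch.cat([image_feats, audio_feats], dim=-1)
    return classifier(fused).argmax(dim=-1)   # predicted species index

# Placeholder inputs for a single bird observation.
species_id = predict_species(torch.randn(1, 3072), torch.randn(1, 1024))
```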
Hopefully, drawing and looking at these diagrams has made the different ways of combining multimodal data a bit clearer. These are simplified views, of course. Real-world multimodal systems can be much more complex, often blending these strategies. However, understanding these fundamental patterns of fusion and representation is a great first step. As you encounter more advanced multimodal architectures, try to see if you can spot these basic building blocks at play.