In this chapter, we've explored several ways to combine information from different modalities. Sometimes, seeing these methods laid out visually can make them click. This practice session is all about helping you build mental models of how fusion and representation strategies work by drawing and interpreting diagrams. Don't worry about artistic skill; the goal is clarity.

## Part 1: Understanding Fusion Levels through Diagrams

Remember, multimodal fusion can happen at different stages:

- **Early Fusion:** Combining raw data or very low-level features.
- **Intermediate Fusion:** Merging features after some initial processing of each modality.
- **Late Fusion:** Combining the outputs or decisions from models trained on individual modalities.

Let's try to visualize these.

### Your Turn: Sketch It Out!

Before looking at our examples, grab a piece of paper or open a drawing tool. Try to sketch a simple block diagram for each of these fusion types: Early, Intermediate, and Late. Think about:

- Where does the data for each modality (e.g., image, text) enter the system?
- Where are features extracted (if at all before fusion)?
- At what point do the different data streams come together?
- What happens after they are combined?

This exercise helps solidify your understanding. Once you've given it a try, compare your sketches with the diagrams below.

### Example Diagram: Early Fusion

Early fusion, also known as input-level or data-level fusion, involves combining information from different modalities at the very beginning of the process. This often means concatenating raw data or basic features extracted from each modality.

```dot
digraph EarlyFusion {
  rankdir=TB;
  bgcolor="transparent";
  node [shape=box, style="filled", fontname="sans-serif"];
  edge [fontname="sans-serif"];

  Modality_A [label="Modality A\n(e.g., Image Pixels)", fillcolor="#a5d8ff"];
  Modality_B [label="Modality B\n(e.g., Text Tokens)", fillcolor="#b2f2bb"];
  Fusion [label="Early Fusion\n(e.g., Concatenate)", fillcolor="#ffec99", shape=ellipse];
  Combined_Features [label="Combined Features", fillcolor="#ffd8a8"];
  Multimodal_Model [label="Multimodal Model", fillcolor="#d0bfff"];
  Output [label="Output / Prediction", fillcolor="#ced4da"];

  Modality_A -> Fusion;
  Modality_B -> Fusion;
  Fusion -> Combined_Features;
  Combined_Features -> Multimodal_Model;
  Multimodal_Model -> Output;
}
```

*A diagram showing early fusion. Data from Modality A and Modality B are directly combined at the fusion stage, producing combined features that are then fed into a multimodal model.*

In this diagram:

- Modality A (e.g., raw image pixels) and Modality B (e.g., raw text tokens) are the inputs.
- They are immediately fed into the Early Fusion stage. This could be as simple as stacking their feature vectors side-by-side, as in the equation $v_{\text{fused}} = \text{concat}(v_A, v_B)$.
- The Combined Features are then processed by a single Multimodal Model to produce an Output.

The main point here is that fusion happens before any significant independent processing of each modality.
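To make the "stack the vectors side-by-side" idea concrete, here is a minimal Python sketch of early fusion. The input sizes, the random data, and the single linear "multimodal model" are illustrative assumptions, not a prescribed setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for raw inputs: a flattened grayscale image and a text vector.
image_pixels = rng.random(64 * 64)   # Modality A: 4096 raw pixel values
text_tokens = rng.random(300)        # Modality B: a 300-dim token/count vector

# Early fusion: concatenate the (nearly) raw inputs before any
# modality-specific processing, i.e. v_fused = concat(v_A, v_B).
v_fused = np.concatenate([image_pixels, text_tokens])   # shape (4396,)

# A single multimodal model consumes the combined vector; here it is just an
# untrained linear layer with 10 output classes, purely for illustration.
W = rng.standard_normal((10, v_fused.shape[0])) * 0.01
prediction = int(np.argmax(W @ v_fused))
print(v_fused.shape, prediction)
```

One practical consequence you can already see in the sketch: the fused vector's dimensionality is the sum of the raw input sizes, so early fusion can get large quickly.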
### Example Diagram: Intermediate Fusion

Intermediate fusion, or feature-level fusion, occurs after each modality has undergone some initial processing and feature extraction. These extracted features are then combined.

```dot
digraph IntermediateFusion {
  rankdir=TB;
  bgcolor="transparent";
  node [shape=box, style="filled", fontname="sans-serif"];
  edge [fontname="sans-serif"];

  Modality_A_Input [label="Modality A\n(e.g., Image)", fillcolor="#a5d8ff"];
  Feature_Extractor_A [label="Feature Extractor A\n(e.g., CNN)", fillcolor="#74c0fc"];
  Features_A [label="Features A", fillcolor="#4dabf7"];
  Modality_B_Input [label="Modality B\n(e.g., Text)", fillcolor="#b2f2bb"];
  Feature_Extractor_B [label="Feature Extractor B\n(e.g., Word Embeddings)", fillcolor="#8ce99a"];
  Features_B [label="Features B", fillcolor="#69db7c"];
  Fusion [label="Intermediate Fusion\n(e.g., Concatenate, Attention)", fillcolor="#ffec99", shape=ellipse];
  Combined_Features [label="Combined Features", fillcolor="#ffd8a8"];
  Multimodal_Model [label="Multimodal Model", fillcolor="#d0bfff"];
  Output [label="Output / Prediction", fillcolor="#ced4da"];

  Modality_A_Input -> Feature_Extractor_A;
  Feature_Extractor_A -> Features_A;
  Modality_B_Input -> Feature_Extractor_B;
  Feature_Extractor_B -> Features_B;
  Features_A -> Fusion;
  Features_B -> Fusion;
  Fusion -> Combined_Features;
  Combined_Features -> Multimodal_Model;
  Multimodal_Model -> Output;
}
```

*A diagram illustrating intermediate fusion. Modality A and Modality B are first processed by separate feature extractors. The resulting features (Features A and Features B) are then combined at the fusion stage. This combined representation is used by a subsequent multimodal model.*

Here's what's happening:

- Modality A and Modality B are first processed by their own Feature Extractor (e.g., a Convolutional Neural Network for images, a word embedding layer for text).
- This produces Features A and Features B, which are more abstract representations than the raw data.
- These features are then combined in the Intermediate Fusion stage.
- The resulting Combined Features feed into the Multimodal Model.

Fusion happens after some unimodal processing but before the final decision-making part of the model.
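The same pattern might look like the sketch below in code. The two extractor functions are hypothetical stand-ins (random projections) for real unimodal encoders such as a CNN or an embedding layer; only the overall structure (extract per modality, then fuse the features) is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real unimodal feature extractors (a CNN for
# images, an embedding layer for text). Random projections keep the sketch
# runnable; in practice you would plug in trained encoders.
def extract_image_features(image, out_dim=128):
    W = rng.standard_normal((out_dim, image.size)) * 0.01
    return np.tanh(W @ image.ravel())

def extract_text_features(token_vector, out_dim=64):
    W = rng.standard_normal((out_dim, token_vector.size)) * 0.01
    return np.tanh(W @ token_vector)

image = rng.random((64, 64))
text = rng.random(300)

features_a = extract_image_features(image)   # "Features A" in the diagram
features_b = extract_text_features(text)     # "Features B" in the diagram

# Intermediate fusion: combine *features*, not raw inputs.
combined_features = np.concatenate([features_a, features_b])

# Downstream multimodal model, again just an untrained linear classifier here.
W_clf = rng.standard_normal((10, combined_features.size)) * 0.01
prediction = int(np.argmax(W_clf @ combined_features))
print(combined_features.shape, prediction)
```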
### Example Diagram: Late Fusion

Late fusion, also called decision-level fusion, involves processing each modality independently through separate models. The outputs or decisions from these models are then combined to produce a final result.

```dot
digraph LateFusion {
  rankdir=TB;
  bgcolor="transparent";
  node [shape=box, style="filled", fontname="sans-serif"];
  edge [fontname="sans-serif"];

  Modality_A_Input [label="Modality A\n(e.g., Image)", fillcolor="#a5d8ff"];
  Model_A [label="Model A", fillcolor="#74c0fc"];
  Prediction_A [label="Prediction A", fillcolor="#4dabf7"];
  Modality_B_Input [label="Modality B\n(e.g., Text)", fillcolor="#b2f2bb"];
  Model_B [label="Model B", fillcolor="#8ce99a"];
  Prediction_B [label="Prediction B", fillcolor="#69db7c"];
  Fusion [label="Late Fusion\n(e.g., Averaging, Voting)", fillcolor="#ffec99", shape=ellipse];
  Final_Prediction [label="Final Prediction", fillcolor="#ced4da"];

  Modality_A_Input -> Model_A;
  Model_A -> Prediction_A;
  Modality_B_Input -> Model_B;
  Model_B -> Prediction_B;
  Prediction_A -> Fusion;
  Prediction_B -> Fusion;
  Fusion -> Final_Prediction;
}
```

*A diagram of late fusion. Modality A is processed by Model A to produce Prediction A. Independently, Modality B is processed by Model B to produce Prediction B. These individual predictions are then combined at the fusion stage to yield a final prediction.*

In late fusion:

- Modality A is fed into Model A, which produces Prediction A.
- Separately, Modality B is fed into Model B, producing Prediction B.
- These individual predictions (or scores, or class probabilities) are then combined in the Late Fusion stage (e.g., by averaging, voting, or a simple learned layer).
- This yields the Final Prediction.

The main idea is that each modality is fully processed by its own model, and only the high-level results are merged.
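Because late fusion operates on model outputs, it is often just a few lines of arithmetic. The probability vectors below are made-up numbers standing in for the outputs of two independently trained unimodal models; averaging, weighting, and voting are common combination rules, not the only ones.

```python
import numpy as np

# Made-up class probabilities standing in for the outputs of two
# independently trained unimodal models (3 classes each).
prediction_a = np.array([0.70, 0.20, 0.10])   # e.g., image model
prediction_b = np.array([0.40, 0.50, 0.10])   # e.g., text model

# Late fusion by simple averaging of the probability vectors.
averaged = (prediction_a + prediction_b) / 2

# Weighted averaging, e.g., if one modality is known to be more reliable.
w_a, w_b = 0.8, 0.2
weighted = w_a * prediction_a + w_b * prediction_b

# Majority-style voting on the hard decisions (ties broken arbitrarily here).
votes = [int(prediction_a.argmax()), int(prediction_b.argmax())]
voted = max(set(votes), key=votes.count)

print(averaged.argmax(), weighted.argmax(), voted)
```

The weighted variant is one concrete answer to the "one modality is much more reliable" question that comes up next.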
**Think About It:** Look back at your own sketches and these examples.

- What are the main visual differences between these three diagrams?
- How does the point of "joining" the information streams change?
- Can you think of a simple scenario where one might be preferred over the others? For instance, if one modality is much more reliable than another, how might that influence your choice of fusion strategy?

## Part 2: Visualizing Representation Strategies

Part 1 was about *when* information is combined; representation strategies are about *how* the learned representations of the different modalities relate to each other. Two main ideas are shared representations and coordinated representations.

### Shared Representation Space

A shared representation (or joint embedding) aims to map data from different modalities into a single, common vector space. In this shared space, representations from different modalities can be directly compared. For example, an image of a cat and the word 'cat' might be projected to nearby points in this space.

```dot
digraph SharedRepresentation {
  rankdir=TB;
  bgcolor="transparent";
  node [shape=box, style="filled", fontname="sans-serif"];
  edge [fontname="sans-serif"];

  Modality_A [label="Modality A\n(e.g., Image)", fillcolor="#a5d8ff"];
  Encoder_A [label="Encoder A", fillcolor="#74c0fc"];
  Rep_A [label="Image Rep\n(in S)", fillcolor="#4dabf7", shape=ellipse];
  Modality_B [label="Modality B\n(e.g., Text)", fillcolor="#b2f2bb"];
  Encoder_B [label="Encoder B", fillcolor="#8ce99a"];
  Rep_B [label="Text Rep\n(in S)", fillcolor="#69db7c", shape=ellipse];
  Shared_Space_Label [label="Shared Space (S)", shape=plaintext, fontcolor="#495057"];

  Modality_A -> Encoder_A -> Rep_A;
  Modality_B -> Encoder_B -> Rep_B;
  {rank=same; Rep_A; Rep_B; Shared_Space_Label}
}
```

*A diagram illustrating a shared representation space. Modality A is transformed into "Image Rep" and Modality B into "Text Rep." Both representations exist within the same "Shared Space (S)," allowing for direct comparison or interaction.*

In this diagram:

- Modality A (e.g., an image) goes through Encoder A.
- Modality B (e.g., text) goes through Encoder B.
- Both encoders project their outputs (the Image Rep and the Text Rep) into the same Shared Space (S).

The goal is that similar items, regardless of their original modality, end up close together in this shared space. This is powerful for tasks like cross-modal retrieval (finding images based on text queries).

### Coordinated Representation Spaces

Coordinated representations, on the other hand, don't necessarily force everything into one identical space. Instead, they focus on learning mappings or correlations between separate representation spaces for each modality. The spaces are 'coordinated' such that you can translate or relate information from one to the other, even if their structures are different.

```dot
digraph CoordinatedRepresentation {
  rankdir=TB;
  bgcolor="transparent";
  node [shape=box, style="filled", fontname="sans-serif"];
  edge [fontname="sans-serif", color="#495057"];

  Modality_A [label="Modality A\n(Image)", fillcolor="#a5d8ff"];
  Encoder_A [label="Encoder A", fillcolor="#74c0fc"];
  Rep_A [label="Rep A\n(Image Features)", fillcolor="#4dabf7", shape=ellipse];
  Modality_B [label="Modality B\n(Text)", fillcolor="#b2f2bb"];
  Encoder_B [label="Encoder B", fillcolor="#8ce99a"];
  Rep_B [label="Rep B\n(Text Features)", fillcolor="#69db7c", shape=ellipse];
  Mapping_Node [label="Coordination\n(Learned Mapping)", fillcolor="#ffc078", shape=diamond, style="filled,dashed", fontcolor="#495057"];
  Space_A_Label [label="Space A", shape=plaintext, fontcolor="#495057"];
  Space_B_Label [label="Space B", shape=plaintext, fontcolor="#495057"];

  Modality_A -> Encoder_A -> Rep_A;
  Modality_B -> Encoder_B -> Rep_B;
  Rep_A -> Space_A_Label [style=dotted, arrowhead=none, color="#adb5bd"];
  Rep_B -> Space_B_Label [style=dotted, arrowhead=none, color="#adb5bd"];
  Rep_A -> Mapping_Node [style=dashed, dir=both, color="#f76707"];
  Rep_B -> Mapping_Node [style=dashed, dir=both, color="#f76707"];
  {rank=same; Rep_A; Mapping_Node; Rep_B;}
  {rank=same; Space_A_Label; Space_B_Label;}
}
```

*A diagram of coordinated representation spaces. Modality A is encoded into "Rep A," associated with its own "Space A," while Modality B is encoded into "Rep B," associated with its "Space B." A "Coordination (Learned Mapping)" mechanism allows these distinct representations to be related or transformed into one another.*

In this setup:

- Modality A is encoded into Rep A, which is associated with Space A.
- Modality B is encoded into Rep B, associated with Space B.
- The important part is the Coordination (Learned Mapping).

This mapping allows the system to understand relationships between Rep A and Rep B, even if they are in different mathematical spaces. For example, the model might learn to translate an image representation in Space A to a text description representation in Space B.

**Quick Check:** What's the main difference in how the representations from different modalities are treated in a shared space versus coordinated spaces? One brings them together into a common area; the other builds bridges between separate areas. Which is which?
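If it helps to see the contrast in code, here is a minimal numerical sketch. The projection matrices and the linear mapping are random stand-ins for components that would normally be learned (for example, with a contrastive or translation objective); the point is only where the two modalities end up: one common space versus two separate spaces connected by a map.

```python
import numpy as np

rng = np.random.default_rng(0)

image_features = rng.random(128)   # output of some image encoder
text_features = rng.random(300)    # output of some text encoder

def l2_normalize(v):
    return v / np.linalg.norm(v)

# --- Shared representation: project both modalities into one 64-dim space S --
# W_img and W_txt stand in for learned projection heads.
W_img = rng.standard_normal((64, 128)) * 0.1
W_txt = rng.standard_normal((64, 300)) * 0.1
image_in_S = l2_normalize(W_img @ image_features)
text_in_S = l2_normalize(W_txt @ text_features)
# Because both vectors live in S, they can be compared directly,
# e.g., cosine similarity for cross-modal retrieval.
similarity = float(image_in_S @ text_in_S)

# --- Coordinated representations: keep separate spaces, learn a mapping ------
# M stands in for a learned map from the 128-dim image space to the
# 300-dim text space; each modality keeps its own representation space.
M = rng.standard_normal((300, 128)) * 0.1
image_mapped_to_text_space = M @ image_features
# The mapped image vector can now be related to text_features in *its* space.
distance = float(np.linalg.norm(image_mapped_to_text_space - text_features))

print(f"similarity in shared space: {similarity:.3f}")
print(f"distance after mapping into text space: {distance:.3f}")
```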
## Part 3: Bringing It All Together: A Scenario

Let's try to apply these ideas. Imagine you are tasked with building a simple system that helps identify a species of bird. You have two types of input:

- An image of the bird.
- An audio recording of the bird's song.

Your goal is to predict the bird species.

**Activity:**

1. **Choose a Fusion Strategy:** Which fusion strategy (Early, Intermediate, or Late) seems like a reasonable starting point for this problem? Sketch a block diagram for your chosen strategy, labeling the bird image and bird song as inputs.
2. **Consider Representations:**
   - If you were aiming for a shared representation, what would that mean for the image features and audio features? What might be an advantage?
   - If you opted for coordinated representations, how would that work? What might be a scenario where this is more flexible?

There is no single 'right' answer here. The goal is to think through the design choices using the visual language we've been practicing.

For example, if you chose Intermediate Fusion:

- You'd have an image feature extractor (perhaps a pre-trained CNN) and an audio feature extractor (e.g., one that produces MFCCs or spectrogram features).
- The outputs of these extractors would then be combined (e.g., concatenated, or fed into an attention mechanism) before going to a classifier that predicts the bird species.

```dot
digraph BirdScenario {
  rankdir=TB;
  bgcolor="transparent";
  node [shape=box, style="filled", fontname="sans-serif"];
  edge [fontname="sans-serif"];

  Image_Input [label="Bird Image", fillcolor="#a5d8ff"];
  Image_Encoder [label="Image Feature\nExtractor (CNN)", fillcolor="#74c0fc"];
  Image_Features [label="Image Features", fillcolor="#4dabf7"];
  Audio_Input [label="Bird Song (Audio)", fillcolor="#b2f2bb"];
  Audio_Encoder [label="Audio Feature\nExtractor (MFCCs)", fillcolor="#8ce99a"];
  Audio_Features [label="Audio Features", fillcolor="#69db7c"];
  Fusion [label="Intermediate Fusion", fillcolor="#ffec99", shape=ellipse];
  Combined_Rep [label="Combined Bird\nRepresentation", fillcolor="#ffd8a8"];
  Classifier [label="Species Classifier", fillcolor="#d0bfff"];
  Species_Output [label="Predicted Bird Species", fillcolor="#ced4da"];

  Image_Input -> Image_Encoder -> Image_Features -> Fusion;
  Audio_Input -> Audio_Encoder -> Audio_Features -> Fusion;
  Fusion -> Combined_Rep -> Classifier -> Species_Output;
}
```

*An example diagram for the bird identification scenario using intermediate fusion. Image and audio inputs are processed by respective feature extractors. The resulting features are combined and then fed to a classifier to predict the bird species.*

This diagram shows one way to approach the bird identification task. You could also sketch out an early or late fusion approach for comparison.

## Reflection

Hopefully, drawing and looking at these diagrams has made the different ways of combining multimodal data clearer. These are simplified views, of course. Multimodal systems can be much more complex, often blending these strategies. However, understanding these fundamental patterns of fusion and representation is a great first step. As you encounter more advanced multimodal architectures, try to see if you can spot these basic building blocks at play.
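To close out the practice session, here is a minimal Python sketch of the Part 3 intermediate-fusion pipeline, mirroring the diagram box for box. The extractor functions, the species list, and the untrained random classifier are all hypothetical placeholders; in a real system the image branch might be a pre-trained CNN and the audio branch an MFCC or spectrogram pipeline, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholder extractors; they ignore their inputs and return
# random vectors so the structure of the pipeline can run end to end.
def image_feature_extractor(bird_image):
    return rng.random(256)       # "Image Features" box in the diagram

def audio_feature_extractor(bird_song):
    return rng.random(40)        # "Audio Features" box (e.g., pooled MFCCs)

SPECIES = ["robin", "sparrow", "blackbird", "wren"]   # illustrative label set

def classify_bird(bird_image, bird_song):
    img_feat = image_feature_extractor(bird_image)
    aud_feat = audio_feature_extractor(bird_song)
    combined = np.concatenate([img_feat, aud_feat])   # Intermediate Fusion
    # "Species Classifier": an untrained linear layer, purely for structure.
    W = rng.standard_normal((len(SPECIES), combined.size)) * 0.01
    return SPECIES[int(np.argmax(W @ combined))]

# The placeholder extractors ignore their inputs, so None is fine here.
print(classify_bird(bird_image=None, bird_song=None))
```

Swapping the final concatenation and classifier for per-modality classifiers whose probabilities are averaged would give you the late-fusion variant to compare against, as suggested above.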