Imagine you have two experts, each specializing in a different type of information. One expert analyzes written reports, and the other analyzes audio recordings. To get a final, combined opinion, you wouldn't ask them to merge their raw notes (like in early fusion) or their intermediate summaries (like in intermediate fusion). Instead, you'd let each expert come to their own individual conclusion first. Then, you'd find a way to combine these separate conclusions into a single, more informed decision. This is the essence of late fusion.
In late fusion, also known as decision-level fusion, we combine information at the very end of the processing pipeline. Each modality (like text, image, or audio) is first processed independently by its own specialized model. These individual models each produce their own output, which is typically a prediction, a classification score, or a probability distribution. Only after these individual predictions are made do we combine them to arrive at a final, multimodal prediction.
This approach stands in contrast to early fusion, where raw data or low-level features are combined (like the $v_{\text{fused}} = \text{concat}(v_{\text{image}}, v_{\text{text}})$ example you saw earlier), or intermediate fusion, which combines features at a middle stage of processing.
The process of late fusion can be broken down into a few steps:
1. Process each modality independently with its own specialized model (for example, a text classifier for the written report and an audio classifier for the recording).
2. Collect each model's individual output, such as a class label, a confidence score, or a probability distribution.
3. Combine these individual predictions with a fusion rule or a learned fusion model to produce the final multimodal prediction.
Here's a diagram illustrating the late fusion process:
This diagram shows two separate modalities being processed by their respective models. The outputs (Prediction A and Prediction B) are then fed into a fusion mechanism, which produces the final combined prediction.
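To make the diagram concrete, here is a minimal sketch in Python, assuming two stand-in unimodal models that each return class probabilities and a fusion step that simply averages them. The model functions and the numbers are placeholders for illustration, not a specific library's API.

```python
import numpy as np

# Placeholder unimodal models: in practice these would be trained
# image/audio classifiers that each return class probabilities.
def image_model(image) -> np.ndarray:
    # Pretend prediction over two classes, e.g. [event, no event]
    return np.array([0.70, 0.30])

def audio_model(audio) -> np.ndarray:
    return np.array([0.60, 0.40])

def fuse(pred_a: np.ndarray, pred_b: np.ndarray) -> np.ndarray:
    # Decision-level fusion: combine the finished predictions,
    # here with a simple unweighted average.
    return (pred_a + pred_b) / 2

# Each modality is processed independently; only the predictions
# meet in the fusion step.
prediction_a = image_model(image=None)   # Prediction A
prediction_b = audio_model(audio=None)   # Prediction B
final = fuse(prediction_a, prediction_b)
print(final)  # [0.65 0.35]
```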
Once you have the individual predictions from each unimodal model, there are several ways to combine them (a short code sketch of these strategies follows the list):
Averaging/Weighted Averaging: If the outputs are numerical scores or probabilities, you can simply average them. For instance, if an image model predicts a 70% probability of an event occurring and an audio model predicts a 60% probability, the averaged prediction would be 65%. You can also use a weighted average if you believe one modality or model is more reliable than another. If $P_A$ is the probability from model A and $P_B$ is from model B, the fused probability $P_{\text{fused}}$ could be: $P_{\text{fused}} = w_A \cdot P_A + w_B \cdot P_B$, where $w_A$ and $w_B$ are weights that sum to 1 (e.g., $w_A = 0.6$, $w_B = 0.4$).
Voting: If the models output class labels, you can use a majority vote. If three models predict "A", "A", and "B", the final prediction would be "A". This can be extended to weighted voting, where each model's vote is weighted by its confidence score.
Maximum/Minimum Rule: You could take the prediction from whichever model is most confident (maximum rule) or, when a more conservative estimate is appropriate, from the least confident (minimum rule).
Product Rule: Multiplying probabilities can be effective, especially if the probabilities are well-calibrated and represent independent evidence.
Learned Fusion Function: A more sophisticated approach is to train another small model (sometimes called a "meta-learner" or "gating network") that learns the best way to combine the predictions. This fusion model takes the outputs of the unimodal models as its input and is trained to produce the final prediction. This could be a simple logistic regression, a support vector machine (SVM), or even a small neural network.
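As a rough illustration of these strategies, the sketch below assumes each unimodal model outputs a probability vector over the same set of classes; the variable names, weights, and values are made up for the example.

```python
import numpy as np

# Probability outputs from three independent unimodal models (assumed values).
p_a = np.array([0.70, 0.20, 0.10])  # e.g. text model
p_b = np.array([0.50, 0.30, 0.20])  # e.g. image model
p_c = np.array([0.40, 0.35, 0.25])  # e.g. audio model, used for the voting example

# Weighted averaging: P_fused = w_A * P_A + w_B * P_B, with weights summing to 1.
w_a, w_b = 0.6, 0.4
p_weighted = w_a * p_a + w_b * p_b

# Majority voting on hard labels: each model votes for its most likely class.
votes = [p_a.argmax(), p_b.argmax(), p_c.argmax()]
majority_class = np.bincount(votes, minlength=len(p_a)).argmax()

# Product rule: multiply per-class probabilities and renormalize.
p_product = p_a * p_b
p_product = p_product / p_product.sum()

# Maximum rule: keep the prediction of whichever model is most confident.
p_max = p_a if p_a.max() >= p_b.max() else p_b

print(p_weighted, majority_class, p_product, p_max)
```

A learned fusion function can be sketched in the same spirit: stack the unimodal outputs into a single feature vector and fit a small model on top. The toy data below stands in for held-out unimodal predictions and their true labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stacked unimodal outputs as meta-features, with true labels (toy data).
# Each row is [P_A(positive), P_A(negative), P_B(positive), P_B(negative)].
X_meta = np.array([
    [0.9, 0.1, 0.8, 0.2],
    [0.2, 0.8, 0.3, 0.7],
    [0.6, 0.4, 0.7, 0.3],
    [0.1, 0.9, 0.2, 0.8],
])
y = np.array([1, 0, 1, 0])

meta_learner = LogisticRegression()
meta_learner.fit(X_meta, y)
print(meta_learner.predict([[0.7, 0.3, 0.4, 0.6]]))  # fused decision for a new sample
```

In practice the meta-learner is usually trained on predictions from a validation split rather than the data the unimodal models were trained on, so that their training fit does not leak into the fusion step.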
Late fusion offers several practical benefits, making it a popular choice in many applications:
Modularity: Each modality's model can be designed, trained, and updated independently, and you can reuse pre-existing unimodal systems without retraining anything jointly.
Flexibility: Different model types can be mixed freely (a transformer for text, a convolutional network for images, and so on), since only their outputs need to be compatible.
Robustness to missing modalities: If one input is unavailable at prediction time, you can still fall back on the outputs of the remaining models.
While late fusion is straightforward and flexible, it's not without its limitations:
Lost cross-modal interactions: Each model only ever sees its own input, so subtle low-level relationships between modalities are gone by the time fusion happens.
Limited correction: The final decision is only as good as the individual predictions; if every unimodal model is misled in the same way, the fusion step has no additional signal to correct them.
Let's consider analyzing the sentiment of a product review that includes both text and a short video clip of the reviewer.
Text Model: A text sentiment analysis model processes the written review (e.g., "This product is amazing and works perfectly!") and outputs a probability distribution over sentiment classes such as positive, neutral, and negative.
Video Model: A video analysis model (which might internally look at facial expressions and listen to tone of voice) processes the video clip and outputs its own probability distribution over the same sentiment classes.
Late Fusion: We can now combine these predictions, for example with a weighted average of the two probability distributions, taking the class with the highest fused probability as the final sentiment (see the sketch below).
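Here is a minimal sketch of that combination step, using hypothetical probability values and weights chosen purely for illustration:

```python
import numpy as np

classes = ["positive", "neutral", "negative"]

# Hypothetical unimodal outputs for this review (assumed values).
p_text  = np.array([0.90, 0.07, 0.03])   # text model: strongly positive
p_video = np.array([0.60, 0.25, 0.15])   # video model: positive, but less confident

# Weighted average, trusting the text model slightly more (assumed weights).
w_text, w_video = 0.6, 0.4
p_fused = w_text * p_text + w_video * p_video

print(dict(zip(classes, p_fused.round(3).tolist())))
print("Final sentiment:", classes[p_fused.argmax()])
# e.g. {'positive': 0.78, 'neutral': 0.142, 'negative': 0.078}
#      Final sentiment: positive
```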
In this scenario, late fusion allows us to leverage the strengths of specialized models for text and video independently, and then combine their insights to make a more robust sentiment prediction.
Late fusion provides a flexible and often effective way to combine information from different sources when you want to keep the processing of each modality separate until the decision-making stage. It’s particularly useful when working with pre-existing unimodal systems or when modularity is a high priority. However, always consider whether you might be missing out on valuable low-level interactions between modalities that earlier fusion methods could capture.