In the preceding chapters, we examined the core ideas, data representations, integration techniques, and model components essential to multimodal AI. We now turn to how these elements work together in practical applications.
This chapter introduces several applications in which combining different data types provides capabilities beyond what a single data type offers. By studying them, you will gain a practical understanding of how multimodal AI systems are designed and what kinds of tasks they can perform.
We will cover:
5.1 Image Captioning Systems: Generating Text from Images
5.2 Visual Question Answering: Interacting with Images Through Questions
5.3 Text-to-Image Synthesis: Creating Visuals from Descriptions (Introduction)
5.4 Speech Recognition Enhanced by Visual Cues (Introduction)
5.5 Multimodal Sentiment Analysis: Understanding Opinions from Multiple Cues
5.6 Inputs and Outputs in Multimodal Applications
5.7 Practice: Brainstorming a Multimodal Solution
© 2025 ApX Machine Learning