Chapter 5: Introductory Applications of Multimodal AI

In the preceding chapters, we examined the core ideas, data representations, integration techniques, and model components of multimodal AI. We now turn to how these elements work together in concrete applications.

This chapter introduces several applications in which combining different data types enables tasks that a single data type cannot support on its own. We will cover:

Image captioning systems, which generate textual descriptions from images.
Visual Question Answering (VQA), enabling interaction with images through natural language questions.
Text-to-image synthesis (introduction), where AI systems create visual content from text.
Visually enhanced speech recognition (introduction), where visual information augments the audio signal.
Multimodal sentiment analysis, for understanding opinions using cues from multiple sources.
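
One way to keep these five applications straight is to record, for each one, which modalities go in and what comes out. The snippet below is only an illustrative sketch: the `Application` class and the helper function are hypothetical names introduced here, and the specific modalities listed for speech recognition ("audio", "video") and sentiment analysis ("text", "audio", "video") are assumptions about typical setups, not definitions taken from this chapter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Application:
    """Hypothetical record describing one multimodal application."""
    name: str
    inputs: Tuple[str, ...]  # modalities the system consumes
    output: str              # what the system produces


# Modalities marked "assumed" are typical choices, not fixed by the text.
APPLICATIONS = [
    Application("image captioning", ("image",), "text"),
    Application("visual question answering", ("image", "text"), "text"),
    Application("text-to-image synthesis", ("text",), "image"),
    # assumed: audio plus video of the speaker
    Application("visually enhanced speech recognition", ("audio", "video"), "text"),
    # assumed: text, audio, and video cues
    Application("multimodal sentiment analysis", ("text", "audio", "video"), "sentiment label"),
]


def multi_input_names(apps: List[Application]) -> List[str]:
    """Return the names of applications that take more than one input modality."""
    return [app.name for app in apps if len(app.inputs) > 1]


if __name__ == "__main__":
    for app in APPLICATIONS:
        print(f"{app.name}: {' + '.join(app.inputs)} -> {app.output}")
```

Under these assumptions, image captioning and text-to-image synthesis each take a single input modality and produce a different one, while the other three combine several inputs into one output.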

By studying these applications, you will gain a practical understanding of how multimodal AI systems are designed and which kinds of tasks they can perform.

Sections

5.1 Image Captioning Systems: Generating Text from Images
5.2 Visual Question Answering: Interacting with Images Through Questions
5.3 Text-to-Image Synthesis: Creating Visuals from Descriptions (Introduction)
5.4 Speech Recognition Enhanced by Visual Cues (Introduction)
5.5 Multimodal Sentiment Analysis: Understanding Opinions from Multiple Cues
5.6 Inputs and Outputs in Multimodal Applications
5.7 Practice: Brainstorming a Multimodal Solution