The previous chapter focused on acoustic modeling, the process of mapping audio signals to sequences of phonetic units or characters. However, achieving high-accuracy speech recognition often requires more than acoustic evidence alone. The prior likelihood of the resulting word sequence, $P(W)$, and the system's ability to handle variations in speech are equally significant. An ASR system aims to find the most probable word sequence $\hat{W}$ given the acoustic input $X$, i.e., it maximizes $P(W \mid X)$. Language models help estimate $P(W)$, and adaptation techniques help the model generalize across different speakers, environments, and channels.
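Written out with Bayes' rule, this standard decoding objective separates into the two components this chapter builds on: an acoustic-model term and a language-model term (the notation here is the conventional noisy-channel formulation):

$$
\hat{W} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} \;=\; \arg\max_{W} P(X \mid W)\,P(W),
$$

where $P(X \mid W)$ is supplied by the acoustic model, $P(W)$ by the language model, and $P(X)$ does not depend on $W$, so it can be dropped during the search.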
This chapter introduces methods to integrate linguistic constraints and handle variability. We will cover:
3.1 Neural Language Models for ASR
3.2 Shallow Fusion and Deep Fusion
3.3 Contextual ASR
3.4 Speaker Adaptation Techniques
3.5 Environment and Channel Adaptation
3.6 Unsupervised and Semi-Supervised Learning for ASR
3.7 Multi-Lingual and Cross-Lingual ASR
3.8 Practice: Fine-tuning ASR with Adaptation Data