The previous chapter focused on acoustic modeling: mapping audio signals to sequences of phonetic units or characters. However, high-accuracy speech recognition usually requires more than acoustic evidence alone. An ASR system seeks the most probable word sequence W given the acoustic input X, that is, it maximizes P(W∣X), which by Bayes' rule is proportional to P(X∣W)P(W). The prior probability of the word sequence, P(W), and the system's ability to handle variability in speech are therefore just as significant as the acoustic score. Language models help estimate P(W), and adaptation techniques help the model generalize across different acoustic conditions X.
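To make the decoding objective concrete, the following is a minimal sketch of rescoring competing hypotheses by combining acoustic and language-model evidence in the log domain. All names, scores, and probabilities here are invented for illustration; real systems score over lattices or beams, not a two-item list.

```python
import math

# Toy language model: assumed P(W) values for two candidate transcripts.
lm_probs = {
    "recognize speech": 0.7,
    "wreck a nice beach": 0.3,
}

# Candidate transcripts with acoustic log-probabilities log P(X|W)
# (hypothetical values; the second is acoustically slightly better).
hypotheses = [
    ("recognize speech", -12.0),
    ("wreck a nice beach", -11.5),
]

def rescore(hyps, lm):
    """Return argmax_W [log P(X|W) + log P(W)] over the hypothesis list."""
    best = max(hyps, key=lambda h: h[1] + math.log(lm[h[0]]))
    return best[0]

print(rescore(hypotheses, lm_probs))  # → "recognize speech"
```

Even though "wreck a nice beach" scores higher acoustically, the language model prior log P(W) tips the combined score toward the linguistically more plausible transcript, which is precisely why estimating P(W) well matters.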
This chapter introduces methods to integrate linguistic constraints and handle variability. We will cover:
3.1 Neural Language Models for ASR
3.2 Shallow Fusion and Deep Fusion
3.3 Contextual ASR
3.4 Speaker Adaptation Techniques
3.5 Environment and Channel Adaptation
3.6 Unsupervised and Semi-Supervised Learning for ASR
3.7 Multi-Lingual and Cross-Lingual ASR
3.8 Practice: Fine-tuning ASR with Adaptation Data
© 2025 ApX Machine Learning