Raw audio signals are transformed into clean, information-rich feature vectors, such as MFCCs or log-mel spectrograms. To interpret these features and translate them into text, a model is necessary. This is the primary responsibility of the acoustic model. It serves as the bridge between the processed audio signal and the linguistic content it represents.
The acoustic model's fundamental job is to map an input sequence of acoustic features to a sequence of probabilities over a set of characters or phonemes. Think of it as the component that "listens" to the features and makes a guess about what sound is being produced at every moment in time.
To understand its role, let's visualize where the acoustic model fits within a complete ASR system. It takes the output from the feature extraction stage and feeds its own output to the decoding stage, which is responsible for assembling the final text transcription.
The ASR pipeline showing the acoustic model's position. It processes feature vectors and passes its output to the decoder.
The input to an acoustic model is the sequence of feature vectors we generated, which we can denote as $X$:

$$
X = (x_1, x_2, \ldots, x_T)
$$

Here, $T$ is the total number of time steps or frames in the audio, and each $x_t$ is a feature vector (e.g., a vector of 40 MFCCs) for that time step.
The model's output is a sequence of probability distributions, one for each time step. If our target vocabulary consists of all lowercase letters ('a' through 'z'), a space character, and an apostrophe, the model would output a vector of 28 probabilities at each time step. Each probability, $P(c \mid x_t)$, represents the model's confidence that the character $c$ was spoken during the time interval corresponding to the feature vector $x_t$.
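To make these shapes concrete, here is a minimal sketch in PyTorch. The single linear layer is a hypothetical stand-in for a real acoustic model, and the dimensions (100 frames, 40 MFCCs, 28 symbols) are illustrative assumptions rather than fixed choices.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 100 frames of 40 MFCC features each,
# and a 28-symbol vocabulary (a-z, space, apostrophe).
T, n_features, n_symbols = 100, 40, 28

features = torch.randn(T, n_features)     # X = (x_1, ..., x_T)

# Placeholder "acoustic model": one linear layer standing in for
# whatever network maps each frame to symbol scores.
acoustic_model = nn.Linear(n_features, n_symbols)

logits = acoustic_model(features)          # shape (T, 28)
probs = torch.softmax(logits, dim=-1)      # P(c | x_t) for every frame t

print(probs.shape)       # torch.Size([100, 28])
print(probs[0].sum())    # each row sums to 1.0
```

Whatever network sits in the middle, the contract stays the same: one probability distribution over the symbol set per input frame.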
A significant difficulty arises from the fact that the input sequence length $T$ is almost never the same as the output text sequence length $N$. For example, a one-second audio clip saying "hello" might be represented by 100 feature vectors ($T=100$), but the target transcription has only 5 characters ($N=5$).
This length mismatch creates the alignment problem: how do we map the 100 input vectors to the 5 output characters? Which specific feature vectors correspond to the 'h', 'e', 'l', 'l', and 'o' sounds? Furthermore, spoken characters have different durations; the 'l' sound in "hello" is longer than the 'h'.
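To get a feel for how underdetermined this is, the following toy sketch counts the ways to carve even a short, 10-frame clip into five contiguous runs, one per character of "hello". The frame count and the contiguous-segmentation scheme are purely illustrative assumptions.

```python
from itertools import combinations

n_frames, target = 10, "hello"

# Choosing 4 cut points among the 9 gaps between frames yields 5 runs,
# i.e. one candidate alignment of frames to characters.
alignments = []
for cuts in combinations(range(1, n_frames), len(target) - 1):
    bounds = (0, *cuts, n_frames)
    runs = [list(range(bounds[i], bounds[i + 1])) for i in range(len(target))]
    alignments.append(list(zip(target, runs)))

print(len(alignments))  # 126 candidate alignments for just 10 frames
print(alignments[0])    # [('h', [0]), ('e', [1]), ('l', [2]), ('l', [3]), ('o', [4, ..., 9])]
```

With 100 frames the number of candidate alignments is astronomically larger, and the training data contains no labels saying which one is correct.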
In older ASR systems, this was handled with complex techniques that required a "forced alignment" to be created beforehand, a process that was often brittle and computationally expensive.
This is where modern deep learning approaches shine. By using architectures like RNNs combined with a specialized loss function like Connectionist Temporal Classification (CTC), the model can learn this alignment automatically. The network can be trained on pairs of audio features and text transcripts without any explicit information about which sound occurs at which specific time. It learns to output a probability stream that the decoder can then process to find the most likely text sequence.
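As a rough sketch of how this looks in practice, PyTorch's `nn.CTCLoss` can be applied directly to a stream of per-frame log-probabilities and an unaligned target sequence. The shapes, the blank index, and the character-to-index encoding below are assumptions for illustration, not a prescribed setup.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a batch of 1 utterance, 100 frames, 28 symbols
# plus one extra "blank" class required by CTC (index 0 here).
T, batch, n_classes = 100, 1, 29

# Stand-in for the acoustic model's output: per-frame log-probabilities.
log_probs = torch.randn(T, batch, n_classes).log_softmax(dim=-1)

# Target transcription "hello" encoded as symbol indices
# (illustrative encoding: 1 = 'a', 2 = 'b', ...).
targets = torch.tensor([[8, 5, 12, 12, 15]])
input_lengths = torch.tensor([T])    # 100 feature frames
target_lengths = torch.tensor([5])   # 5 characters

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # CTC sums over all valid alignments internally
```

Notice that no frame-to-character alignment is ever supplied; the loss itself marginalizes over every valid alignment, which is exactly what lets the network learn the timing on its own.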
In the sections that follow, we will build exactly this type of model, starting with the recurrent architectures that are naturally suited for processing sequential data like speech.