A digital signal, x[n], represents an analog sound wave's amplitude at discrete points in time, resulting in a sequence of numbers. This is known as the time-domain representation. It directly answers the question: "What was the signal's amplitude at a specific moment?"
While essential, this view has significant limitations for speech recognition.
When we plot the amplitude of a digital audio signal against time (or sample number), we get a waveform. This visualization is useful for spotting general characteristics: louder passages appear as larger amplitude swings, and moments of silence show up as stretches where the amplitude stays near zero.
Let's look at a simple waveform. The plot below shows the amplitude of a signal over a fraction of a second.
A waveform showing changes in signal amplitude over time.
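As a concrete illustration, here is a minimal sketch of how such a plot could be produced with NumPy and Matplotlib. The 440 Hz tone with a rising-and-falling amplitude envelope is a synthetic stand-in for a real recording, chosen so the loud and quiet regions are easy to see; the 16 kHz sample rate is simply a common choice for speech audio.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthesize a short stand-in signal: a 440 Hz tone whose amplitude
# rises and falls, so the plot shows loud and quiet regions.
sample_rate = 16000                       # samples per second
t = np.arange(0, 0.5, 1.0 / sample_rate)  # 0.5 seconds of time stamps
envelope = np.hanning(t.size)             # slow amplitude variation
x = envelope * np.sin(2 * np.pi * 440 * t)

plt.plot(t, x)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Time-domain waveform")
plt.show()
```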
You can spot the louder and quieter parts, but discerning the actual content of the speech, like the difference between an "s" sound and a "sh" sound, is nearly impossible. These distinct sounds are characterized by their frequency content, not just their overall amplitude. For a machine learning model to learn the building blocks of speech, it needs a richer representation.
The frequency domain provides a different perspective. Instead of plotting amplitude versus time, we plot signal strength (or energy) versus frequency. This view answers the question: "Which frequencies are present in the signal and how dominant are they?"
To move from the time domain to the frequency domain, we use a mathematical tool called the Fourier Transform. For digital signals, we use its counterpart, the Discrete Fourier Transform (DFT), which is most often computed using an efficient algorithm called the Fast Fourier Transform (FFT).
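For reference, the DFT of an N-sample signal x[n] produces N complex coefficients, one per frequency bin k:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1.$$

Each coefficient X[k] is a complex number whose magnitude and angle correspond to the strength and phase of that frequency component, as described next.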
The Fourier Transform decomposes a signal into the set of sine and cosine waves of different frequencies that, when added together, reconstruct the original signal. The output of an FFT gives us two important pieces of information for each frequency component: its magnitude (how much of that frequency is present) and its phase (the offset of the wave). For speech analysis, we are primarily interested in the magnitude.
If we took a signal composed of two pure sine waves, one at 200 Hz and another at 500 Hz, its frequency spectrum would look like this:
A frequency spectrum showing two distinct peaks at 200 Hz and 500 Hz.
This is far more informative. We can clearly see the constituent frequencies that make up the signal, something that was hidden in the time-domain waveform.
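The sketch below reproduces that example numerically with NumPy. The one-second duration, 8 kHz sample rate, and the 1.0 and 0.5 amplitudes are arbitrary choices made so the two peaks fall exactly on FFT bins.

```python
import numpy as np

sample_rate = 8000                          # Hz; anything comfortably above 1000 Hz works here
t = np.arange(0, 1.0, 1.0 / sample_rate)    # one second of samples
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 500 * t)

# rfft returns the complex spectrum for non-negative frequencies only,
# which is all we need for a real-valued signal.
spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(x.size, d=1.0 / sample_rate)
magnitude = np.abs(spectrum)                # discard phase, keep magnitude

# The two largest magnitude bins sit at 200 Hz and 500 Hz.
peaks = np.sort(freqs[np.argsort(magnitude)[-2:]])
print(peaks)                                # [200. 500.]
```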
There's a catch. Speech is a non-stationary signal, meaning its statistical properties, particularly its frequency content, change over time. When you say the word "speech," the frequencies that form the "s" sound are very different from those that form the "ee" sound.
If we apply a single FFT to the entire audio clip of the word "speech," we would get a single frequency spectrum that averages the frequencies of the "s," "p," and "ee" sounds. We would know which frequencies were present in the word overall, but we would lose all information about when they occurred. This temporal information is critical for distinguishing "cats" from "stack."
The solution is to analyze the signal in small, manageable chunks where we can assume the signal is stationary (its properties don't change much). This is the job of the Short-Time Fourier Transform (STFT).
The STFT works by:

1. Slicing the signal into short, overlapping frames, typically 20 to 40 milliseconds long, over which the signal can be treated as approximately stationary.
2. Multiplying each frame by a window function (such as a Hann window) to taper its edges and reduce spectral leakage.
3. Computing an FFT on each windowed frame to obtain that frame's frequency spectrum.
The result is a sequence of frequency spectra, one for each frame. By stacking these spectra side-by-side, we create a representation that shows how the frequency content of the signal evolves over time. This process is the foundation for creating spectrograms, which we will cover in the next section.
The process of applying the Short-Time Fourier Transform (STFT) to an audio signal.
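Below is a minimal sketch of that framing, windowing, and FFT loop in NumPy. The 25 ms frame length and 10 ms hop are common choices in speech processing rather than requirements, and in practice you would typically reach for an optimized implementation such as scipy.signal.stft or librosa.stft.

```python
import numpy as np

def stft_magnitudes(x, frame_length=400, hop_length=160):
    """Slice x into overlapping frames, window each one, and return
    the magnitude spectrum of every frame (frames x frequency bins)."""
    window = np.hanning(frame_length)
    spectra = []
    for start in range(0, len(x) - frame_length + 1, hop_length):
        frame = x[start:start + frame_length] * window    # taper frame edges
        spectra.append(np.abs(np.fft.rfft(frame)))         # keep magnitude only
    return np.array(spectra)

# Toy input: a 200 Hz tone followed by a 500 Hz tone, sampled at 16 kHz.
sample_rate = 16000
t = np.arange(0, 0.5, 1.0 / sample_rate)
x = np.concatenate([np.sin(2 * np.pi * 200 * t),
                    np.sin(2 * np.pi * 500 * t)])

S = stft_magnitudes(x)
print(S.shape)   # (number of frames, frame_length // 2 + 1)
```

Because each row of the output corresponds to one frame in time, the early rows show energy concentrated near 200 Hz and the later rows near 500 Hz, which is exactly the time-frequency information a single whole-signal FFT would have averaged away.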
This time-frequency analysis is fundamental to modern ASR. It transforms a one-dimensional, ambiguous waveform into a rich, two-dimensional image that clearly separates the acoustic patterns that our models need to learn. By understanding both the "when" from the time domain and the "what" from the frequency domain, we set the stage for extracting powerful features for our speech recognition models.