While a waveform shows us the amplitude of a signal over time, it conceals the frequency information that is critical for distinguishing one sound from another. A standard Fourier Transform, on the other hand, summarizes all frequencies present across the entire audio clip, but its magnitude spectrum gives no indication of when those frequencies occurred. This is a significant problem for a signal like speech, where the frequency content changes from one moment to the next.
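To make this limitation concrete, here is a minimal NumPy sketch. The sample rate, the two tone frequencies, and all variable names are illustrative assumptions, not values from this section. It builds two signals that contain the same two tones in opposite order and compares their magnitude spectra.

```python
import numpy as np

sr = 8000                      # assumed sample rate in Hz
t = np.arange(sr // 2) / sr    # time stamps for 0.5 s

low = np.sin(2 * np.pi * 440 * t)    # 440 Hz tone
high = np.sin(2 * np.pi * 1760 * t)  # 1760 Hz tone

# Same ingredients, opposite order in time
low_then_high = np.concatenate([low, high])
high_then_low = np.concatenate([high, low])

# Magnitude spectrum of each full-length signal
mag_a = np.abs(np.fft.rfft(low_then_high))
mag_b = np.abs(np.fft.rfft(high_then_low))

# Swapping the halves is a circular shift, which only changes the phase
same_spectrum = np.allclose(mag_a, mag_b)
print(same_spectrum)
```

The two waveforms sound clearly different, yet their magnitude spectra match, which is why a single transform over the whole clip is not enough for speech.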
To analyze speech effectively, we need a method that preserves both time and frequency information. This is precisely the role of the spectrogram, a visual representation that shows how the frequency content of a signal evolves over time. It is one of the most fundamental tools in speech processing.
A spectrogram is generated using a procedure called the Short-Time Fourier Transform (STFT). Instead of analyzing the entire audio signal at once, the STFT breaks the signal into short, overlapping frames (also called windows), typically 20 to 30 milliseconds long. Within each of these short frames, we can treat the signal as approximately stationary, meaning its frequency properties do not change much within that small time slice.
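As a rough illustration of the framing arithmetic, the sketch below converts a frame length in milliseconds into samples and counts how many frames fit in a clip. The 16 kHz sample rate, the 25 ms frame, and the 10 ms hop are assumed values chosen for the example; only the 20 to 30 ms range comes from the text above.

```python
sr = 16000          # assumed sample rate in Hz
frame_ms = 25       # frame length, inside the 20 to 30 ms range
hop_ms = 10         # assumed step between consecutive frame starts

frame_len = sr * frame_ms // 1000   # samples per frame
hop_len = sr * hop_ms // 1000       # samples between frame starts
overlap = frame_len - hop_len       # samples shared by adjacent frames

n_samples = sr * 1                  # a 1-second clip
n_frames = 1 + (n_samples - frame_len) // hop_len  # count full frames only

print(frame_len, hop_len, overlap, n_frames)
```

With these assumed settings each frame shares more than half of its samples with its neighbor, so no short event falls entirely between frames.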
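The process is as follows:
1. The audio signal is segmented into short, overlapping frames.
2. A Fast Fourier Transform (FFT) is computed for each frame.
3. The per-frame spectra are stacked side by side in time order.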
This effectively creates a three-dimensional representation where the x-axis is time, the y-axis is frequency, and the color or intensity of each point represents the amplitude or power of a given frequency at a particular moment in time.
The audio signal is segmented into overlapping frames. An FFT is computed for each frame, and the results are stacked to form the final spectrogram.
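The framing, FFT, and stacking steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production implementation: the `spectrogram` helper is hypothetical, the frame and hop sizes are assumed, and the Hann window is a common addition that this section does not itself describe.

```python
import numpy as np

def spectrogram(signal, sr, frame_len=400, hop_len=160):
    """Hypothetical helper: return (times, freqs, power_db) for a 1-D signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)  # assumed taper to reduce spectral leakage

    # 1. Slice into overlapping frames, shape (n_frames, frame_len)
    starts = hop_len * np.arange(n_frames)
    frames = np.stack([signal[s:s + frame_len] for s in starts])

    # 2. FFT of each windowed frame, converted to power
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2

    # 3. Stack so rows are frequency bins and columns are time frames
    power_db = 10 * np.log10(power.T + 1e-10)  # small offset avoids log(0)

    times = (starts + frame_len / 2) / sr         # frame centers in seconds
    freqs = np.fft.rfftfreq(frame_len, d=1 / sr)  # bin centers in Hz
    return times, freqs, power_db

# Usage: a 1 kHz test tone sampled at an assumed 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
times, freqs, power_db = spectrogram(tone, sr)

# The strongest frequency row should sit at the tone's frequency
peak_hz = freqs[power_db.mean(axis=1).argmax()]
print(power_db.shape, peak_hz)
```

The returned array can be passed to any image-plotting routine, with `times` on the x-axis and `freqs` on the y-axis, to draw the spectrogram.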
A spectrogram turns an audio signal into an image, allowing us to see the distinct patterns of speech. Let's look at a spectrogram for the spoken word "speech" and identify its components.
The spectrogram for the word "speech" shows changes in frequency over time.
Vowels: The most prominent features of a spectrogram are the formants, the resonant frequencies of the vocal tract, which appear as dark, horizontal bands. The pattern and spacing of formants define different vowel sounds. For instance, in the word "speech," the vowel /i/ is characterized by a low first formant and a high second formant.
Fricatives: The /s/ sound in "speech" is a fricative, and the final /ch/ sound (strictly an affricate) ends in a fricative-like release. Fricatives are consonants produced by forcing air through a narrow channel, creating turbulent, high-frequency noise. On a spectrogram, they appear as a diffuse, messy cloud of energy scattered across a wide range of high frequencies.
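One rough way to put a number on "diffuse" energy is spectral flatness, the ratio of the geometric mean to the arithmetic mean of the power spectrum. This measure is not part of this section and is brought in only for illustration. In the sketch below, white noise is a crude stand-in for a fricative and a pure tone is a crude stand-in for a narrow band of energy; the `spectral_flatness` helper and all parameter values are assumptions.

```python
import numpy as np

def spectral_flatness(frame):
    """Hypothetical helper: geometric mean / arithmetic mean of the power spectrum."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2 + 1e-12  # offset avoids log(0)
    return np.exp(np.mean(np.log(power))) / np.mean(power)

sr = 16000   # assumed sample rate in Hz
n = 400      # one 25 ms frame at that rate
rng = np.random.default_rng(0)

noise = rng.standard_normal(n)                       # noise-like frame
tone = np.sin(2 * np.pi * 1000 * np.arange(n) / sr)  # single-frequency frame

flat_noise = spectral_flatness(noise)
flat_tone = spectral_flatness(tone)
print(flat_noise > flat_tone)
```

Values near 1 mean energy is spread evenly across frequencies, while values near 0 mean it is concentrated in a few bins, which matches the visual contrast between a noisy cloud and a sharp horizontal band.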
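Plosives: These are consonants like /p/, /t/, and /k/, which are created by blocking airflow and then releasing it suddenly. The /p/ in "speech" is an example. On a spectrogram, the blockage appears as a brief gap with little energy, followed by a short, sharp burst at the release that shows up as a thin vertical stripe.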
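The sketch below illustrates why a burst shows up as a narrow vertical feature: a very short event touches only a few analysis frames. The 5 ms noise burst is a synthetic stand-in for a plosive release, and the sample rate, frame size, hop size, and burst position are all assumed values.

```python
import numpy as np

sr = 16000                      # assumed sample rate in Hz
frame_len, hop_len = 400, 160   # assumed 25 ms frames with a 10 ms hop

signal = np.zeros(sr)           # 1 s of silence
burst_at = 8000                 # assumed burst position (0.5 s)
rng = np.random.default_rng(0)
signal[burst_at:burst_at + 80] = rng.standard_normal(80)  # 5 ms noise burst

# Energy of each overlapping frame
n_frames = 1 + (len(signal) - frame_len) // hop_len
starts = hop_len * np.arange(n_frames)
energy = np.array([np.sum(signal[s:s + frame_len] ** 2) for s in starts])

# Indices of the frames that contain any part of the burst
active = np.flatnonzero(energy > 0)
print(len(active), "of", n_frames, "frames contain the burst")
```

Only a handful of adjacent frames carry any energy, so on a spectrogram the event occupies a thin strip along the time axis, surrounded by silence.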
Learning to read a spectrogram gives us a powerful way to visualize the phonetics of speech. The horizontal formants for vowels, the noisy clouds for fricatives, and the sudden bursts for plosives provide a distinct visual signature for each sound, making it possible for us to analyze and process speech signals.
© 2026 ApX Machine Learning