Once audio is digitized, it exists as a long sequence of numbers representing amplitude values. While this format is perfect for a computer, it’s not very intuitive for a human to look at. To understand the structure of a sound, we need to visualize it. Visualizations are not just for our benefit; they form the basis for the features that machine learning models use to make sense of speech. The two most common ways to visualize audio are waveforms and spectrograms.
The most direct way to visualize digital audio is with a waveform. A waveform is a simple two-dimensional plot where the horizontal axis represents time and the vertical axis represents amplitude. Amplitude corresponds to the intensity or "loudness" of the sound at each moment. Positive and negative values represent the oscillations of the sound wave, while values close to zero indicate silence.
Let's look at a waveform for the spoken phrase "Hello world."
A waveform of the phrase "Hello world." The two distinct bursts of energy correspond to the two words, separated by a brief pause.
From the waveform, you can easily identify the parts of the recording that contain sound versus those that contain silence. The peaks and valleys show the sound's intensity. However, the waveform has a significant limitation: it doesn't tell us anything about the frequency content of the sound. We can see that a sound was made, but we can't distinguish between a high-pitched "eee" sound and a low-pitched "ooo" sound just by looking at the amplitude. To an ASR system, this frequency information is essential for telling phonemes apart.
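To make this concrete, here is a minimal sketch of how a waveform plot like the one above could be produced in Python. It assumes librosa and matplotlib are available; the filename hello_world.wav is a placeholder for whatever recording you want to inspect.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load the recording; the filename is a placeholder for your own file.
# sr=None keeps the file's native sampling rate instead of resampling.
audio, sr = librosa.load("hello_world.wav", sr=None)

# Each sample index divided by the sampling rate gives its time in seconds.
times = np.arange(len(audio)) / sr

plt.figure(figsize=(10, 3))
plt.plot(times, audio, linewidth=0.5)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform")
plt.tight_layout()
plt.show()
```

Plotting amplitude against time in this way is all a waveform is; no frequency analysis is involved yet.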
To see the frequency content of an audio signal, we use a spectrogram. A spectrogram is a much richer visualization that shows how the frequencies present in a sound change over time. Think of it as a series of frequency snapshots stacked next to each other.
To create a spectrogram, the audio signal is broken down into small, overlapping time segments called frames. For each frame, a mathematical operation called the Fast Fourier Transform (FFT) is used to determine the amount of energy present at different frequency bands. The result is a 2D plot with time on the horizontal axis, frequency on the vertical axis, and color to indicate the energy or amplitude of each frequency at each point in time.
A spectrogram of the same "Hello world" phrase. The color intensity shows the energy, with yellow being high energy and blue being low energy.
This visualization is far more informative. You can now see distinct patterns: the vowels in each word appear as stacked horizontal bands of energy (the harmonics and formants of the voice), the consonants show up as shorter bursts spread across a wider range of frequencies, and the pause between the two words appears as a dark, low-energy column.
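As a rough illustration of the process described above, the following sketch computes a spectrogram using a short-time Fourier transform from librosa. The filename, frame length, and hop length are placeholder choices for illustration, not the settings used for the figure above.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load the recording; the filename is a placeholder for your own file.
audio, sr = librosa.load("hello_world.wav", sr=None)

# Short-time Fourier transform: split the signal into overlapping frames
# (n_fft samples long, hop_length samples apart) and take an FFT of each frame.
n_fft = 1024        # frame length in samples
hop_length = 256    # step between successive frames
stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)

# Converting magnitudes to decibels makes quieter structure visible.
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

plt.figure(figsize=(10, 4))
plt.imshow(spec_db, origin="lower", aspect="auto",
           extent=[0, len(audio) / sr, 0, sr / 2], cmap="viridis")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.colorbar(label="Energy (dB)")
plt.title("Spectrogram")
plt.tight_layout()
plt.show()
```

The result is exactly the kind of time-frequency image shown above: each column is the FFT of one frame, and the color encodes how much energy each frequency band carries at that moment.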
Spectrograms are more than just a useful tool for human analysis. They are the foundation for the features that ASR systems learn from. The visual patterns we can see in a spectrogram correspond to the acoustic patterns that a machine learning model needs to identify. The next sections cover pre-emphasis, framing, windowing, and especially the creation of MFCCs; each is a step in a process that builds on this spectrogram-like representation to produce a compact and effective set of features for the acoustic model.