Once audio is digitized, it exists as a long sequence of numbers representing amplitude values. While this format is perfect for a computer, it’s not very intuitive for a human to look at. To understand the structure of a sound, we need to visualize it. Visualizations are not just for our benefit; they form the basis for the features that machine learning models use to make sense of speech. The two most common ways to visualize audio are waveforms and spectrograms.
The most direct way to visualize digital audio is with a waveform. A waveform is a simple two-dimensional plot where the horizontal axis represents time and the vertical axis represents amplitude. Amplitude corresponds to the intensity or "loudness" of the sound at each moment. Positive and negative values represent the oscillations of the sound wave, while values close to zero indicate silence.
Let's look at a waveform for the spoken phrase "Hello world."
A waveform of the phrase "Hello world." The two distinct bursts of energy correspond to the two words, separated by a brief pause.
From the waveform, you can easily identify the parts of the recording that contain sound versus those that contain silence. The peaks and valleys show the sound's intensity. However, the waveform has a significant limitation: it doesn't tell us anything about the frequency content of the sound. We can see that a sound was made, but we can't distinguish between a high-pitched "eee" sound and a low-pitched "ooo" sound just by looking at the amplitude. To an ASR system, this frequency information is essential for telling phonemes apart.
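To make this concrete, here is a minimal sketch of how a waveform plot like the one above could be produced in Python. It assumes librosa and matplotlib are available; the filename hello_world.wav is a placeholder for whatever recording you want to inspect.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load the recording; the filename is a placeholder for your own file.
# sr=None keeps the file's native sampling rate instead of resampling.
audio, sr = librosa.load("hello_world.wav", sr=None)

# Each sample index divided by the sampling rate gives its time in seconds.
times = np.arange(len(audio)) / sr

plt.figure(figsize=(10, 3))
plt.plot(times, audio, linewidth=0.5)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform")
plt.tight_layout()
plt.show()
```

Plotting amplitude against time in this way is all a waveform is; no frequency analysis is involved yet.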
To see the frequency content of an audio signal, we use a spectrogram. A spectrogram is a much richer visualization that shows how the frequencies present in a sound change over time. Think of it as a series of frequency snapshots stacked next to each other.
To create a spectrogram, the audio signal is broken down into small, overlapping time segments called frames. For each frame, a mathematical operation called the Fast Fourier Transform (FFT) is used to determine the amount of energy present at different frequency bands. The result is a 2D plot with time on the horizontal axis, frequency on the vertical axis, and color to indicate the energy or amplitude of each frequency at each point in time.
A spectrogram of the same "Hello world" phrase. The color intensity shows the energy, with yellow being high energy and blue being low energy.
This visualization is far more informative. You can now see distinct patterns: the vowels in each word appear as stacked horizontal bands of energy (the harmonics and formants of the voice), the consonants show up as shorter bursts spread across a wider range of frequencies, and the pause between the two words appears as a dark, low-energy column.
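As a rough illustration of the process described above, the following sketch computes a spectrogram using a short-time Fourier transform from librosa. The filename, frame length, and hop length are placeholder choices for illustration, not the settings used for the figure above.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load the recording; the filename is a placeholder for your own file.
audio, sr = librosa.load("hello_world.wav", sr=None)

# Short-time Fourier transform: split the signal into overlapping frames
# (n_fft samples long, hop_length samples apart) and take an FFT of each frame.
n_fft = 1024        # frame length in samples
hop_length = 256    # step between successive frames
stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)

# Converting magnitudes to decibels makes quieter structure visible.
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

plt.figure(figsize=(10, 4))
plt.imshow(spec_db, origin="lower", aspect="auto",
           extent=[0, len(audio) / sr, 0, sr / 2], cmap="viridis")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.colorbar(label="Energy (dB)")
plt.title("Spectrogram")
plt.tight_layout()
plt.show()
```

The result is exactly the kind of time-frequency image shown above: each column is the FFT of one frame, and the color encodes how much energy each frequency band carries at that moment.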
Spectrograms are more than just a useful tool for human analysis. They are the foundation for the features that ASR systems learn from. The visual patterns we can see in a spectrogram correspond to the acoustic patterns that a machine learning model needs to identify. The next sections cover pre-emphasis, framing, windowing, and especially the creation of MFCCs; each is a step in a process that builds on this spectrogram-like representation to produce a compact and effective set of features for the acoustic model.