Processing audio signals is fundamental to speech recognition, and converting raw audio into machine-readable features is the first step of the pipeline. In this hands-on section, we will use popular Python libraries to load an audio file, visualize its properties, and extract Mel-Frequency Cepstral Coefficients (MFCCs).
This hands-on exercise will solidify your understanding of the audio processing pipeline. You will see how a few lines of code can perform complex transformations, turning a sound wave into the structured data needed for the next stage: acoustic modeling.
Before we can process audio, we need the right tools. We will primarily use librosa, a powerful Python package for audio and music analysis. For visualization, we will use matplotlib.
If you don't have these libraries installed, you can add them to your Python environment using pip:
pip install librosa matplotlib
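If you want to confirm the installation succeeded, a quick version check is enough. This short snippet only relies on the standard __version__ attribute that both packages expose:
import librosa
import matplotlib
# Print the installed versions to confirm both packages can be imported
print(f"librosa version: {librosa.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")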
With our environment ready, let's start by loading an audio file.
The first step in any audio processing task is to load the audio data into memory. The librosa library makes this straightforward. The librosa.load() function reads an audio file and returns two important things: the audio signal as a time series and the sampling rate.
Let's see it in action. Create a Python script and add the following code. You can use any .wav or .mp3 file you have. For this example, we'll assume we have a file named speech_sample.wav.
import librosa
# Define the path to your audio file
audio_file_path = 'speech_sample.wav'
# Load the audio file
# librosa automatically resamples to 22050 Hz by default
# y is the audio time series, sr is the sampling rate
y, sr = librosa.load(audio_file_path)
print(f"Audio Time Series (first 10 samples): {y[:10]}")
print(f"Sampling Rate: {sr} Hz")
print(f"Total number of samples: {len(y)}")
print(f"Duration of the audio: {len(y) / sr:.2f} seconds")
The output shows you the raw amplitude values, the sampling rate librosa used, and the total length of the audio. This simple step has already converted a sound file into a numerical format our program can work with.
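The default resampling to 22050 Hz is convenient for exploration, but many speech systems work at the file's native rate or at 16 kHz. As a small sketch (using the same hypothetical speech_sample.wav), the sr argument of librosa.load() controls this behavior:
import librosa
audio_file_path = 'speech_sample.wav'
# sr=None keeps the file's original sampling rate instead of resampling
y_native, sr_native = librosa.load(audio_file_path, sr=None)
print(f"Native sampling rate: {sr_native} Hz")
# sr=16000 resamples to 16 kHz, a rate commonly used in speech recognition
y_16k, sr_16k = librosa.load(audio_file_path, sr=16000)
print(f"Resampled to {sr_16k} Hz: {len(y_16k)} samples")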
A raw list of numbers is not very intuitive. A better way to understand an audio signal is to visualize it. As we learned, a waveform plots the amplitude of the audio signal over time. This gives us a visual representation of the sound's volume and energy.
We can use matplotlib along with librosa.display to create a clean plot.
import librosa
import librosa.display
import matplotlib.pyplot as plt
# Load the audio file
audio_file_path = 'speech_sample.wav'
y, sr = librosa.load(audio_file_path)
# Plot the waveform
plt.figure(figsize=(12, 4))
librosa.display.waveshow(y, sr=sr, color='#4263eb')
plt.title('Audio Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
Running this code will produce a plot that shows the rise and fall of the audio's amplitude, clearly indicating where speech is present and where there are pauses.
A waveform showing the amplitude of an audio signal fluctuating over a short period.
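If you want a numerical counterpart to this visual impression of loud and quiet regions, a simple option is short-time energy. The sketch below uses librosa's RMS feature with its default frame settings; the 0.01 threshold is only an illustrative value, not a tuned speech detector:
import librosa
import numpy as np
y, sr = librosa.load('speech_sample.wav')
# Root-mean-square energy per frame (defaults: frame_length=2048, hop_length=512)
rms = librosa.feature.rms(y=y)[0]
# Timestamps (in seconds) for each frame, using the matching default hop length
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr)
# Crude illustration: treat frames above an arbitrary energy threshold as "speech"
threshold = 0.01
speech_times = times[rms > threshold]
print(f"Frames above threshold: {len(speech_times)} of {len(rms)}")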
While a waveform is useful, it doesn't show us the frequency content of the signal. A spectrogram does exactly that by displaying which frequencies are present at each point in time. It is one of the most important visualizations in speech processing.
To create a spectrogram, we first need to compute the Short-Time Fourier Transform (STFT) of our signal. The STFT breaks the signal into short, overlapping frames and calculates the frequency spectrum for each frame. librosa provides a function for this.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
# Load the audio file
audio_file_path = 'speech_sample.wav'
y, sr = librosa.load(audio_file_path)
# Compute the Short-Time Fourier Transform (STFT)
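# By default, librosa uses n_fft=2048 and hop_length=512 samples (roughly 93 ms windows with 23 ms hops at 22050 Hz)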
D = librosa.stft(y)
# Convert the amplitude spectrogram to a decibel (dB) scale
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)
# Plot the spectrogram
plt.figure(figsize=(12, 5))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='log', cmap='viridis')
plt.colorbar(format='%+2.0f dB', label='Intensity (dB)')
plt.title('Log-Frequency Spectrogram')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.tight_layout()
plt.show()
The resulting plot shows time on the x-axis, frequency on the y-axis (often on a logarithmic scale to better represent human hearing), and the intensity of each frequency at each time with color. Brighter colors indicate higher energy. You can often see horizontal bands corresponding to the formants in speech.
A spectrogram displaying the intensity of different frequency bins over time. Brighter areas indicate higher energy at that frequency.
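Before moving on to MFCCs, it can help to look at the intermediate step between a spectrogram and MFCCs: the Mel spectrogram, which passes the spectrum through a perceptually motivated filterbank. This is a sketch using librosa's defaults (128 Mel bands), not a required part of the pipeline code above:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
y, sr = librosa.load('speech_sample.wav')
# Apply the Mel filterbank to the power spectrogram (128 Mel bands by default)
S_mel = librosa.feature.melspectrogram(y=y, sr=sr)
# Convert power values to decibels for display
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)
plt.figure(figsize=(12, 5))
librosa.display.specshow(S_mel_db, sr=sr, x_axis='time', y_axis='mel', cmap='viridis')
plt.colorbar(format='%+2.0f dB', label='Intensity (dB)')
plt.title('Mel Spectrogram')
plt.tight_layout()
plt.show()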
Finally, we arrive at the main goal of this chapter: feature extraction. We will now compute the MFCCs, which are the standard features used in many traditional speech recognition systems. As a reminder, MFCCs are a compact representation of the spectral envelope, designed to capture phonetically important characteristics of speech.
The librosa.feature.mfcc() function does all the hard work for us. It takes the audio time series and sampling rate as input and performs all the steps: framing, windowing, FFT, Mel filterbank application, and the discrete cosine transform (DCT).
import librosa
import librosa.display
import matplotlib.pyplot as plt
# Load the audio file
audio_file_path = 'speech_sample.wav'
y, sr = librosa.load(audio_file_path)
# Compute MFCCs from the audio signal
# librosa computes 20 MFCCs by default; here we request 13, a common choice for speech recognition
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(f"Shape of the MFCC matrix: {mfccs.shape}")
# Visualize the MFCCs
plt.figure(figsize=(12, 5))
librosa.display.specshow(mfccs, sr=sr, x_axis='time', cmap='coolwarm')
plt.colorbar(label='Coefficient value')
plt.title('MFCCs')
plt.xlabel('Time (s)')
plt.ylabel('MFCC')
plt.tight_layout()
plt.show()
When you run this code, print(mfccs.shape) will output something like (13, 216): a matrix with 13 rows (one per MFCC coefficient) and 216 columns (one per time frame; the exact count depends on the length of your audio). This matrix is the final feature representation of our audio file. It compactly summarizes the spectral envelope over time, discarding much of the detail, such as fine pitch structure, that matters less for telling speech sounds apart.
This is the very data that an acoustic model will use to figure out which sounds were spoken.
A visualization of an MFCC matrix. Each column is a feature vector for a short time frame, and each row is a specific MFCC coefficient.
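In many traditional ASR pipelines, the static MFCCs are augmented with their first and second temporal derivatives (delta and delta-delta features), so the model also sees how the spectrum changes over time. A minimal sketch, recomputing the same 13 MFCCs as above and stacking the derivatives on top:
import librosa
import numpy as np
# Recompute the 13 MFCCs from the same file as before
y, sr = librosa.load('speech_sample.wav')
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
# First- and second-order differences of each coefficient over time
delta_mfccs = librosa.feature.delta(mfccs)
delta2_mfccs = librosa.feature.delta(mfccs, order=2)
# Stack into a single matrix: 13 static + 13 delta + 13 delta-delta rows
features = np.vstack([mfccs, delta_mfccs, delta2_mfccs])
print(f"Combined feature shape: {features.shape}")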
In this practical session, you have written Python code for the entire audio preprocessing pipeline. Starting from a standard audio file, you loaded it into a numerical time series, visualized its waveform and spectrogram, and extracted a matrix of MFCC features.
You have transformed an unstructured sound wave into a highly structured, information-rich format. This feature matrix is ready to be fed into a machine learning model. In the next chapter, we will explore the first major model in our ASR pipeline: the Acoustic Model, which learns to map these very features to the fundamental units of speech.
For further reading, the librosa documentation provides practical guides and API references for audio analysis and feature extraction.