While Python has built-in capabilities for handling text and numbers, processing audio requires specialized tools. Raw audio data, a sequence of numbers representing signal amplitude over time, needs to be loaded from a file into a format that programs can manipulate, most commonly a NumPy array. This is where Librosa, a powerful and popular Python package for music and audio analysis, comes into play. It provides the essential functions to load, save, and analyze audio, serving as the foundation for our speech recognition pipeline.
Before we can work with audio files, you'll need to install the Librosa library. Since it's available on the Python Package Index (PyPI), you can install it with pip. It is also recommended to install soundfile, which Librosa uses as a backend for reading and writing audio files.
pip install librosa soundfile
The primary function for loading an audio file is librosa.load(). This single function handles many complexities, such as decoding different audio formats (like .wav, .mp3) and converting the signal into a standardized numerical format.
When you call librosa.load(), it returns two important values:
y: A NumPy array containing the audio signal's amplitude values. By default, Librosa normalizes the data so that the values range from -1.0 to 1.0. This is the digital signal x[n] we discussed previously.
sr: The number of samples per second of audio.
Let's see it in action. Suppose we have an audio file named sample_utterance.wav. We can load it as follows:
import librosa
# Define the path to your audio file
audio_path = 'sample_utterance.wav'
# Load the audio file
y, sr = librosa.load(audio_path)
print(f"Audio time series shape: {y.shape}")
print(f"Sampling rate: {sr} Hz")
If you run this code, you might see output similar to this:
Audio time series shape: (110250,)
Sampling rate: 22050 Hz
This tells us that our audio signal y is a one-dimensional array with 110,250 samples. The sampling rate sr is 22,050 Hz, which is the default for librosa.load(). This means that Librosa has automatically resampled the audio to this rate during the loading process. The duration of the audio is simply the number of samples divided by the sampling rate: 110,250 / 22,050 = 5 seconds.
For speech recognition, using a consistent sampling rate across all your data is important. While 22,050 Hz is a common default in audio processing, a sampling rate of 16,000 Hz (or 16 kHz) is a widely adopted standard for ASR. This rate is sufficient to capture the frequencies necessary for human speech and helps reduce the computational cost of processing the data.
You can control the sampling rate using the sr argument in librosa.load():
sr=16000: Loads the audio and resamples it to 16,000 Hz.
sr=None: Loads the audio at its original, native sampling rate, preventing any resampling.
Let's load our file again, but this time preserving its native sampling rate and then resampling it to 16 kHz.
# Load with the native sampling rate
y_native, sr_native = librosa.load(audio_path, sr=None)
print(f"Native sampling rate: {sr_native} Hz")
print(f"Shape with native sr: {y_native.shape}")
# Load and resample to 16 kHz
y_16k, sr_16k = librosa.load(audio_path, sr=16000)
print(f"New sampling rate: {sr_16k} Hz")
print(f"Shape with 16kHz sr: {y_16k.shape}")
Notice how the shape of the array changes when we resample the audio. If the original file was sampled at 44.1 kHz, resampling it to 16 kHz would result in a smaller array, as there are fewer samples needed to represent the same duration of audio.
Librosa is not just for loading data; it provides a suite of tools for audio manipulation. A common preprocessing step is to trim leading and trailing silence from an utterance. This is useful because it removes parts of the signal that contain no speech, focusing the ASR model's attention on the relevant audio. The librosa.effects.trim function does exactly this.
import librosa.effects
# Load the audio file
y, sr = librosa.load(audio_path, sr=16000)
# Trim silence from the beginning and end
y_trimmed, _ = librosa.effects.trim(y)
print(f"Original length: {len(y)}")
print(f"Trimmed length: {len(y_trimmed)}")
The time series array y we get from Librosa directly represents the audio waveform. Plotting this array against time gives us a visual representation of the sound, which can be incredibly insightful. The x-axis represents time, and the y-axis represents the amplitude of the signal.
A visualization of an audio waveform. The signal's amplitude fluctuates around zero, with louder sections corresponding to higher absolute amplitude values.
This plot shows the raw, time-domain representation of the signal. In the next sections, we will explore how to transform this one-dimensional signal into a more informative two-dimensional representation, the spectrogram, which is better suited for speech recognition models.