Theory provides the map, but practical application is where we learn the terrain. In the preceding sections, you learned how an analog sound wave is converted into a digital signal and the difference between time-domain and frequency-domain representations. Now we will translate that knowledge into practice by writing Python code to load an audio file and visualize its two fundamental views: the waveform and the spectrogram.

This hands-on exercise is a foundational skill for any speech processing task. Visualizing audio data is not just for creating illustrations; it is an essential step for debugging, understanding your dataset, and building an intuition for how different sounds appear to a machine. We will use the Librosa library, the de facto standard for audio analysis in Python.

## Setting Up Your Environment

Before we begin, ensure you have Librosa and Matplotlib installed. Librosa will handle the audio loading and processing, while Matplotlib provides the foundation for visualization. You can install them using pip:

```bash
pip install librosa matplotlib numpy
```

## Loading an Audio File with Librosa

Our first step is to load an audio file from disk into a format we can work with. The `librosa.load()` function is the primary tool for this job. It returns two important items:

1. A NumPy array representing the audio time series. By convention, we'll call this `y`. This is the digital waveform, $x[n]$, that we discussed earlier. Librosa automatically converts the signal to mono (a single channel) and resamples it to a default rate of 22,050 Hz.
2. The sampling rate of the time series, which we'll call `sr`. This value tells us how many samples in the array `y` correspond to one second of audio.

Let's see it in action. The following code snippet loads an audio file and prints the shape of the resulting array and its sampling rate.

```python
import librosa

# Define the path to your audio file
# Replace this with a .wav or .mp3 file on your system
audio_path = 'path/to/your/audio.wav'

# Load the audio file
y, sr = librosa.load(audio_path)

print(f"Audio time series shape: {y.shape}")
print(f"Sampling rate: {sr} Hz")
```

If you run this code with a 5-second audio file, you might see output like `Audio time series shape: (110250,)` and `Sampling rate: 22050 Hz`. This makes sense: 5 seconds multiplied by 22,050 samples per second equals 110,250 total samples.
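That default resampling is convenient, but it is not always what you want. If you need to preserve a file's native sampling rate, pass `sr=None` to disable resampling. The following is a minimal sketch, reusing the placeholder path from above, that also double-checks the clip's length with `librosa.get_duration()`:

```python
import librosa

audio_path = 'path/to/your/audio.wav'  # placeholder path, as above

# sr=None disables the default resampling to 22,050 Hz and keeps
# the file's native sampling rate
y_native, sr_native = librosa.load(audio_path, sr=None)

# Duration in seconds = number of samples / samples per second
duration = librosa.get_duration(y=y_native, sr=sr_native)

print(f"Native sampling rate: {sr_native} Hz")
print(f"Duration: {duration:.2f} s")
```

Keeping track of the sampling rate matters because many downstream models expect a specific rate (16,000 Hz is common for ASR systems), and a mismatch will silently distort every feature you compute afterward.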
## Visualizing the Waveform in the Time Domain

The raw time series `y` contains the amplitude of the audio at each discrete time step. The most direct way to visualize this is by plotting amplitude versus time. This plot is the waveform. It gives us a sense of the audio's dynamics, showing us the loud and quiet passages.

While we could plot `y` with Matplotlib directly, `librosa.display.waveshow()` handles the conversion from sample indices to seconds for us. For a self-contained demonstration, the snippet below first synthesizes a dummy amplitude-modulated signal; if you loaded `y` and `sr` in the previous step, skip the synthesis and plot your own audio instead.

```python
import numpy as np
import librosa.display
import matplotlib.pyplot as plt

# Use the 'y' and 'sr' loaded in the previous step if available.
# For a standalone demo, synthesize 5 seconds of an amplitude-modulated tone:
sr = 22050
t = np.linspace(0.0, 5.0, int(sr * 5.0), endpoint=False)
envelope = 0.5 * (np.sin(2 * np.pi * 0.5 * t) + 1.0)  # slow 0.5 Hz loudness swell
y = 0.8 * envelope * np.sin(2 * np.pi * 220 * t)      # 220 Hz carrier tone

# --- Plotting the waveform ---
fig, ax = plt.subplots(figsize=(10, 4))
librosa.display.waveshow(y, sr=sr, ax=ax)
ax.set_title('Audio Waveform')
ax.set_xlabel('Time (s)')
ax.set_ylabel('Amplitude')
plt.tight_layout()
plt.show()
```

The code generates a plot where the x-axis represents time and the y-axis represents amplitude. Areas with high absolute amplitude correspond to louder parts of the sound, while areas near zero are silent or quiet.

*Figure: a time-domain waveform, showing the signal's amplitude over time. For a speech recording, the dense, high-amplitude sections represent spoken words, while the flat sections near zero represent silence.*

## From Time to Frequency: Generating a Spectrogram

The waveform is useful, but it doesn't tell us about the frequency content of the signal. To see which frequencies are present at each moment, we need to move to a frequency-domain representation. The standard way to do this for speech is to compute a spectrogram.

A spectrogram is created by applying the Short-Time Fourier Transform (STFT). The STFT breaks the audio signal into small, overlapping windows and computes the Fourier transform for each window. This gives us the frequency content for successive, short time intervals. The procedure has three steps:

1. **Compute the STFT.** We use `librosa.stft()` on our time series `y`. This returns a 2D complex-valued matrix where rows correspond to frequency bins and columns correspond to time frames.
2. **Get the magnitude.** The STFT output contains both magnitude and phase. For visualization and most ASR models, we only need the magnitude, which we get by taking the absolute value of the STFT matrix with `np.abs()`.
3. **Convert to decibels (dB).** The resulting magnitude spectrogram has a very wide dynamic range. Human hearing is roughly logarithmic, so it is standard practice to convert the amplitude to a decibel scale using `librosa.amplitude_to_db()`. This compresses the range and makes the resulting visualization much more informative.

Here is the code to perform these steps:

```python
import numpy as np

# Compute the Short-Time Fourier Transform (STFT)
D = librosa.stft(y)

# Convert the complex-valued STFT output to a magnitude spectrogram
S_mag = np.abs(D)

# Convert the magnitude spectrogram to a decibel (dB) scale
S_db = librosa.amplitude_to_db(S_mag, ref=np.max)

# --- Plotting the spectrogram ---
fig, ax = plt.subplots(figsize=(10, 4))
img = librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='log', ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB', label='Magnitude (dB)')
ax.set_title('Spectrogram')
plt.tight_layout()
plt.show()
```

The resulting plot, a spectrogram, is a heatmap. The x-axis is time, the y-axis is frequency, and the color intensity at any point (t, f) indicates the energy of frequency f at time t. Here we display it with a logarithmic frequency axis (`y_axis='log'`), which devotes more space to the lower frequencies where most speech energy is concentrated.

*Figure: a spectrogram heatmap (magnitude in dB), visualizing frequency content over time. Brighter areas indicate higher energy. The horizontal bands of energy are characteristic of speech and correspond to formants, which are acoustic resonances of the human vocal tract.*
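One variation worth previewing: speech systems more often work with the mel spectrogram, which warps the frequency axis to approximate human pitch perception. Librosa provides a built-in routine for this. The following is a minimal sketch, not part of the pipeline above; `n_mels=128` is a common default rather than a value prescribed in this text, and `librosa.power_to_db()` is used because `melspectrogram()` returns power rather than amplitude:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Assumes 'y' and 'sr' are available from the loading step above

# Compute a mel-scaled spectrogram; n_mels sets the number of mel bands
S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

# melspectrogram() returns power, so convert with power_to_db()
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)

fig, ax = plt.subplots(figsize=(10, 4))
img = librosa.display.specshow(S_mel_db, sr=sr, x_axis='time', y_axis='mel', ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB', label='Power (dB)')
ax.set_title('Mel Spectrogram')
plt.tight_layout()
plt.show()
```

Compared with the raw STFT view, the mel version compresses the high frequencies into fewer bands, which is one reason it is a popular starting point for the feature extraction techniques ahead.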
By completing this exercise, you have performed the most fundamental steps in any audio processing pipeline. You can now load any audio file and inspect its structure in both the time and frequency domains. These visualizations are the basis for the automated feature extraction techniques we will cover in the next chapter, where we will convert these rich visual representations into the compact numerical features that our deep learning models will consume.
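Before moving on, here is the whole pipeline from this section condensed into a single sketch you can adapt; the path is again a placeholder, and the script renders both views of the same file in one figure:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

audio_path = 'path/to/your/audio.wav'  # placeholder: point this at a real file

# Load the audio (mono, resampled to 22,050 Hz by default)
y, sr = librosa.load(audio_path)

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(10, 6))

# Time domain: amplitude over time
librosa.display.waveshow(y, sr=sr, ax=ax_wave)
ax_wave.set_title('Waveform')
ax_wave.set_ylabel('Amplitude')

# Frequency domain: STFT magnitude on a dB scale
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
img = librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='log', ax=ax_spec)
ax_spec.set_title('Spectrogram')
fig.colorbar(img, ax=ax_spec, format='%+2.0f dB')

plt.tight_layout()
plt.show()
```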