The sounds we hear, from speech to music, exist as continuous waves of air pressure. Computers, however, operate on discrete numerical data. To bridge this gap, we must convert the continuous analog audio signal, represented as a function of time $x(t)$, into a discrete sequence of numbers, $x[n]$. This conversion process is fundamental to all digital audio processing and involves two main steps: sampling and quantization.
The first step, sampling, involves measuring the amplitude of the analog audio wave at regular, fixed intervals. Think of it as taking a series of snapshots of the wave's height over time. The rate at which these snapshots are taken is called the sampling rate or sampling frequency, measured in Hertz (Hz).
A common sampling rate for speech recognition is 16,000 Hz (or 16 kHz), which means we capture 16,000 numerical values, or samples, for every second of audio. For CD-quality music, a higher rate of 44,100 Hz (44.1 kHz) is used.
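As a small illustration, the sketch below uses NumPy to sample one second of a sine tone at 16 kHz. The 440 Hz tone frequency and the one-second duration are arbitrary choices for the example, not anything prescribed by the text.

```python
import numpy as np

sampling_rate = 16_000   # samples per second (16 kHz, common for speech)
duration = 1.0           # seconds of audio
frequency = 440.0        # an arbitrary test tone (Hz)

# Discrete time instants at which we "snapshot" the analog wave.
t = np.arange(int(sampling_rate * duration)) / sampling_rate

# Sampled amplitudes: one real number per time instant.
samples = np.sin(2 * np.pi * frequency * t)

print(samples.shape)  # (16000,) -> 16,000 samples for one second of audio
```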
Why these specific numbers? The choice of sampling rate is dictated by the Nyquist-Shannon sampling theorem. This theorem states that to accurately reconstruct a signal, the sampling rate must be at least twice the highest frequency component present in the signal. This minimum rate is known as the Nyquist rate. Since the most significant frequencies in human speech are below 8 kHz, a 16 kHz sampling rate is sufficient to capture the necessary information without significant loss. Sampling below the Nyquist rate leads to an effect called aliasing, where high frequencies are incorrectly represented as lower ones, distorting the signal.
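To see aliasing numerically, the following sketch samples a 10 kHz cosine tone at 16 kHz, which is below that tone's Nyquist rate of 20 kHz. The resulting samples are indistinguishable from those of a 6 kHz tone, since |10 kHz − 16 kHz| = 6 kHz. The specific frequencies here are chosen only for illustration.

```python
import numpy as np

fs = 16_000            # sampling rate (Hz)
n = np.arange(fs)      # one second of sample indices

# A 10 kHz tone sampled below its Nyquist rate (20 kHz) ...
tone_10k = np.cos(2 * np.pi * 10_000 * n / fs)

# ... yields the same sample values as a 6 kHz tone.
tone_6k = np.cos(2 * np.pi * 6_000 * n / fs)

print(np.allclose(tone_10k, tone_6k))  # True: the 10 kHz tone has aliased to 6 kHz
```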
The continuous blue line represents the original analog sound wave. The dark blue dots are the discrete samples taken at regular time intervals.
After sampling, we no longer have a continuous function $x(t)$. Instead, we have a sequence of amplitude values at discrete time steps. However, these amplitude values are still real numbers and can take any value within their range.
The next step is quantization. This process maps the continuous amplitude values of each sample to a finite set of discrete levels. The number of available levels is determined by the bit depth of the audio.
Bit depth specifies how many bits are used to represent each sample's amplitude. A higher bit depth allows for more quantization levels, resulting in a more accurate representation of the original amplitude and lower quantization error, or noise. For example, 8-bit audio allows $2^8 = 256$ levels, while 16-bit audio, the standard for CDs and most speech corpora, allows $2^{16} = 65{,}536$ levels.
Think of quantization as placing a grid over the sampled amplitudes and "snapping" each sample's value to the nearest grid line. A finer grid (higher bit depth) means the snapped value is closer to the original, preserving more detail.
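As a rough sketch of this "snapping", the function below maps floating-point samples in [-1, 1] to signed integers at a given bit depth. Real converters and audio libraries handle scaling, rounding, and clipping with more care; this is only meant to make the grid idea concrete.

```python
import numpy as np

def quantize(samples, bit_depth=16):
    """Snap float samples in [-1.0, 1.0] to the nearest of 2**bit_depth levels."""
    n_levels = 2 ** bit_depth
    max_int = n_levels // 2 - 1          # e.g. 32767 for 16-bit audio

    # Scale to the integer range, round to the nearest level, and clip.
    quantized = np.round(samples * max_int)
    return np.clip(quantized, -max_int - 1, max_int).astype(np.int64)

# Three sampled amplitudes quantized at two different bit depths.
x = np.array([0.1234567, -0.5, 0.999999])
print(quantize(x, bit_depth=16))   # finer grid:   [ 4045 -16384  32767]
print(quantize(x, bit_depth=8))    # coarser grid: [ 16 -64 127]
```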
The original sample amplitudes (blue circles) are mapped to the nearest available discrete level (dashed lines), resulting in the quantized values (red crosses).
Once both sampling and quantization are complete, our audio signal has been fully digitized. It is now a sequence of integers, a format that a computer can store and process.
The raw sequence of numbers resulting from sampling and quantization is called Pulse-Code Modulation (PCM) data. This is the most direct digital representation of an audio signal. To store this data, we use an audio file format, often called a container.
The most common uncompressed format is WAV (.wav). A WAV file typically contains a header with metadata (sampling rate, bit depth, number of channels) followed by the raw PCM data. For ASR, working with uncompressed WAV files or files with lossless compression (like FLAC) is preferred because it ensures that no information is lost, which could otherwise harm model performance. Lossy compression formats like MP3 discard some audio information to achieve smaller file sizes and are generally less suitable for training high-fidelity ASR models.
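As an illustration of this layout, the sketch below writes one second of a synthetic tone to a WAV file and reads it back. It assumes SciPy is installed and uses `scipy.io.wavfile`, which stores 16-bit integer arrays as PCM data; the tone and the file name `tone.wav` are arbitrary choices for the example.

```python
import numpy as np
from scipy.io import wavfile   # assumes SciPy is installed

sampling_rate = 16_000
t = np.arange(sampling_rate) / sampling_rate          # 1 second of time steps
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)            # float samples in [-0.5, 0.5]

# Quantize to 16-bit integers (the PCM payload of the WAV file).
pcm = (tone * 32767).astype(np.int16)

# Write a WAV file: a small header (rate, bit depth, channels) + raw PCM data.
wavfile.write("tone.wav", sampling_rate, pcm)

# Reading it back recovers the sampling rate and the same integer samples.
rate, data = wavfile.read("tone.wav")
print(rate, data.dtype, data.shape)   # 16000 int16 (16000,)
```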
The entire digitization pipeline transforms a physical phenomenon into a structured numerical array, making it suitable for analysis and modeling.
The process of converting an analog signal to a digital one.
With the audio now represented as a sequence of numbers, $x[n]$, we have a format that our Python libraries can load and that our deep learning models can use as input. The next step is to transform this raw numerical sequence into features that better highlight the phonetic properties of speech.
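For example, a library such as librosa (assumed installed here) can load the WAV file from the earlier sketch back into a floating-point NumPy array at a chosen sampling rate, ready to be turned into features:

```python
import librosa   # assumes librosa is installed

# Load the file as a float32 array, resampling to 16 kHz if needed.
waveform, sr = librosa.load("tone.wav", sr=16_000)

print(sr)               # 16000
print(waveform.dtype)   # float32
print(waveform.shape)   # (16000,) -> one number per sample
```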