Before a computer can interpret what was said, it must first convert the physical phenomenon of sound into a language it understands: numbers. The sounds we make are continuous analog waves, moving through the air with infinite variations in pressure. Computers, however, operate on discrete, finite data. The process of converting analog sound to digital information is fundamental to all of speech recognition. This conversion involves two main steps: sampling and quantization.
Imagine an analog sound wave as a smooth, continuous line representing changes in sound pressure over time. To bring this into a digital format, we can't store the infinite number of points that make up that line. Instead, we must approximate it by taking measurements at regular, discrete intervals.
Figure: An analog wave is a continuous signal; digitization captures its amplitude at discrete points in time.
The first step, sampling, is the act of measuring the amplitude of the analog wave at fixed time intervals. Think of it like a film camera capturing a rapid sequence of still photos to create the illusion of motion. In audio, we capture a rapid sequence of amplitude "snapshots."
The number of samples taken per second is called the sampling rate, measured in Hertz (Hz). For example, a sampling rate of 16,000 Hz (or 16 kHz) means that we measure the sound wave's amplitude 16,000 times every second.
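To make this concrete, here is a minimal sketch of sampling using NumPy (the library choice, the 440 Hz test tone, and the one-second duration are illustrative assumptions, not something the text prescribes):

```python
import numpy as np

SAMPLE_RATE = 16_000   # samples per second (16 kHz)
DURATION = 1.0         # seconds of audio to "record"
FREQUENCY = 440.0      # Hz; an arbitrary test tone standing in for speech

# Time stamps of each measurement: 0, 1/16000, 2/16000, ...
t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE

# Amplitude "snapshots" of the analog wave at those instants.
samples = np.sin(2 * np.pi * FREQUENCY * t)

print(samples.shape)  # (16000,) -- one amplitude measurement per interval
```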
The sampling rate is important because it determines the range of frequencies that can be captured accurately. According to the Nyquist-Shannon sampling theorem, the sampling rate must be at least twice the highest frequency present in the signal. Since the frequencies that carry most of the information in human speech fall below 8 kHz, a sampling rate of 16 kHz is common for speech recognition: it satisfies the Nyquist criterion for that entire band.
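Violating the theorem produces aliasing: a frequency above half the sampling rate becomes indistinguishable from a lower one. The sketch below (the 12 kHz and 4 kHz tones are chosen purely for illustration) shows that a 12 kHz cosine sampled at 16 kHz yields exactly the same samples as a 4 kHz cosine:

```python
import numpy as np

SAMPLE_RATE = 16_000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # 1 second of sample times

# 12 kHz exceeds the Nyquist frequency (16 kHz / 2 = 8 kHz)...
above_nyquist = np.cos(2 * np.pi * 12_000 * t)

# ...so its samples coincide with those of a 4 kHz tone (16 kHz - 12 kHz).
alias = np.cos(2 * np.pi * 4_000 * t)

print(np.allclose(above_nyquist, alias))  # True: the tones are indistinguishable
```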
After sampling tells us when to measure, quantization tells us what value to assign to each measurement. The amplitude of an analog wave is still continuous, meaning it can be any value within a range. Quantization approximates this continuous amplitude by mapping it to the nearest value from a finite set of discrete levels.
The number of available levels is determined by the bit depth: a depth of b bits provides 2^b levels, so 16-bit audio offers 65,536 of them. A higher bit depth provides more levels, resulting in a more accurate representation of the amplitude.
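As an illustration, the following sketch quantizes continuous amplitudes in the range [-1.0, 1.0] to signed 16-bit integers (the scaling convention used here is one common choice, not the only one):

```python
import numpy as np

def quantize(samples: np.ndarray, bit_depth: int = 16) -> np.ndarray:
    """Map continuous amplitudes in [-1.0, 1.0] to 2**bit_depth discrete levels."""
    half_levels = 2 ** (bit_depth - 1)               # levels on each side of zero
    scaled = np.round(samples * (half_levels - 1))   # snap to the nearest level
    return scaled.astype(np.int16 if bit_depth == 16 else np.int32)

continuous = np.array([-1.0, -0.5, 0.0, 0.333, 1.0])
print(quantize(continuous))  # [-32767 -16384 0 10911 32767]
```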
Figure: Quantization maps a continuous amplitude value to the nearest available discrete level, as determined by the bit depth.
The combination of sampling and quantization transforms a continuous sound wave into a sequence of numbers. Each number in the sequence represents the quantized amplitude of the sound at a specific point in time. For a 1-second audio clip sampled at 16 kHz with a 16-bit depth, the result is an array of 16,000 integers, where each integer is a value between -32,768 and 32,767.
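Putting the two steps together, the sketch below produces exactly such an array: a 1-second, 16 kHz, 16-bit waveform (the 440 Hz tone and its 0.8 peak amplitude are illustrative assumptions):

```python
import numpy as np

SAMPLE_RATE = 16_000   # 16 kHz
BIT_DEPTH = 16

# Step 1 -- sampling: measure a 440 Hz tone at 16,000 instants per second.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
analog_amplitudes = 0.8 * np.sin(2 * np.pi * 440.0 * t)

# Step 2 -- quantization: snap each amplitude to one of 2**16 integer levels.
waveform = np.round(analog_amplitudes * (2 ** (BIT_DEPTH - 1) - 1)).astype(np.int16)

print(waveform.shape)                  # (16000,) -- one integer per sample
print(waveform.dtype)                  # int16 -- values in [-32768, 32767]
print(waveform.min(), waveform.max())  # within [-26214, 26214] for a 0.8 peak
```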
This sequence of numbers, often called a waveform, is the raw digital audio that a computer can store and process. It is the very first input to our ASR pipeline. While it's a faithful numerical representation of the sound, it's not yet in a suitable format for a machine learning model to find patterns. In the next chapter, we will learn how to process this raw waveform into more informative features.