Sound, as we experience it, is a continuous wave of pressure traveling through a medium like air. Computers, however, operate on discrete, numerical data. To bridge this gap, we must convert the analog sound wave into a digital format. This conversion process is fundamental to all digital audio processing and involves two main steps: sampling and quantization.
An analog signal is continuous in both time and amplitude. Think of a sound wave from a person speaking. At any given moment, the wave has a specific amplitude (related to its loudness), and it flows smoothly from one moment to the next without any breaks. To a computer, this smooth, infinite stream of information is unusable in its raw form. We need a method to capture a finite approximation of it.
The first step in this conversion is sampling. Sampling is the process of measuring the amplitude of the analog signal at regular, discrete intervals of time. It’s like creating a flip-book of the sound wave. Each page in the flip-book is a "sample," a snapshot of the wave's amplitude at a specific point in time.
The rate at which these snapshots are taken is called the sampling rate or sampling frequency, measured in Hertz (Hz). A sampling rate of 16,000 Hz (or 16 kHz) means that we measure the wave’s amplitude 16,000 times every second.
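As a quick illustration, here is a minimal sketch (using NumPy, with a hypothetical 440 Hz tone standing in for the analog wave) of what sampling at 16 kHz produces: a finite array of amplitude measurements taken every 1/16,000 of a second.

```python
import numpy as np

# Hypothetical example: sample a 440 Hz sine tone at 16 kHz for 10 ms.
sample_rate = 16_000          # samples per second (Hz)
duration = 0.01               # seconds of audio to capture
freq = 440.0                  # tone frequency in Hz

# Discrete sample times: 0, 1/16000, 2/16000, ...
t = np.arange(int(sample_rate * duration)) / sample_rate
samples = np.sin(2 * np.pi * freq * t)

print(len(samples))  # 160 samples: 16,000 per second x 0.01 s
```

Ten milliseconds of sound becomes just 160 numbers; a full second at this rate becomes 16,000.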
The choice of sampling rate is important. To accurately reconstruct a signal, the Nyquist-Shannon sampling theorem states that the sampling rate must be at least twice the highest frequency present in the signal. Since most of the information in human speech lies below 8 kHz, a sampling rate of 16 kHz is common for speech recognition: it captures frequencies up to exactly 8 kHz, enough to preserve the important characteristics of spoken language. For music, where higher frequencies matter, a rate like 44.1 kHz (used for CDs) is standard.
After sampling, we have a series of measurements at discrete time intervals, but the amplitude value of each sample is still a real number, which can have infinite precision. Quantization is the process of mapping these continuous amplitude values to a finite set of discrete levels.
This is essentially an act of rounding. We define a fixed number of possible amplitude values, and each sample’s true amplitude is rounded to the nearest available level. The number of levels is determined by the bit depth. A higher bit depth provides more levels, resulting in a more accurate approximation of the original amplitude.
For most ASR applications, a 16-bit depth is standard, giving 2^16 = 65,536 possible amplitude levels. It offers a good balance between audio fidelity and file size. Lower bit depths save space but can introduce audible distortion, because the quantization error (or quantization noise), the difference between each sample's true amplitude and its rounded, quantized value, grows as the levels become coarser.
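The rounding step and its error can be sketched directly. The helper below is a hypothetical illustration (not a production quantizer): it snaps samples in [-1, 1) to the nearest of 2^bits levels, and comparing 16-bit against 4-bit quantization shows how the error shrinks as bit depth grows.

```python
import numpy as np

# Illustrative sketch: round samples in [-1, 1) to the nearest of
# 2**bits evenly spaced levels (a simple uniform quantizer).
def quantize(samples, bits):
    levels = 2 ** bits
    step = 2.0 / levels                      # spacing between adjacent levels
    q = np.round(samples / step) * step      # round to the nearest level
    return np.clip(q, -1.0, 1.0 - step)      # keep values inside the range

x = np.sin(2 * np.pi * np.linspace(0, 1, 100, endpoint=False))
x16 = quantize(x, 16)    # 65,536 levels
x4 = quantize(x, 4)      # only 16 levels

err16 = np.max(np.abs(x - x16))
err4 = np.max(np.abs(x - x4))
print(err16 < err4)  # True: 16-bit error is orders of magnitude smaller
```

With 16 bits the worst-case error here is on the order of 10^-5, far below audibility, while 4 bits leaves errors large enough to hear as noise.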
The following diagram shows a continuous analog wave being converted into a digital signal through both sampling and quantization.
An analog wave (blue) is measured at regular time intervals (sampling). Each measurement's amplitude is then snapped to the nearest discrete level (gray lines), resulting in the final digital points (purple squares).
Together, sampling and quantization transform a continuous analog wave into a sequence of discrete numbers. This sequence is the digital representation of the audio, a format that a computer can easily store, manipulate, and analyze.
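Putting both steps together, the following sketch digitizes a tone into the form most ASR corpora actually store: 16 kHz, 16-bit signed-integer PCM. The 440 Hz tone and 0.5 amplitude are arbitrary illustrative choices.

```python
import numpy as np

# End-to-end sketch: sample a tone at 16 kHz, then quantize it to
# signed 16-bit integers (the storage format of 16-bit PCM audio).
fs, bits = 16_000, 16
t = np.arange(fs) / fs                        # one second of sample times
wave = 0.5 * np.sin(2 * np.pi * 440 * t)      # stand-in for the analog wave

# Scale to the int16 range [-32767, 32767] and round to whole numbers.
pcm = np.round(wave * (2 ** (bits - 1) - 1)).astype(np.int16)

print(pcm.dtype, len(pcm))  # one second of audio = 16,000 int16 values
```

The resulting array of integers is exactly the "sequence of discrete numbers" described above: one second of audio reduced to 16,000 small whole numbers.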
This stream of numbers is the input to the next stages of our ASR pipeline. In the following sections, you will learn how to take this raw digital audio and transform it into features that are even more useful for a machine learning model.