Digital audio, transformed from continuous sound waves through sampling and quantization, requires storage in specific file formats. Much like images have various formats (.jpg, .png, .gif), audio also uses several formats, each offering distinct trade-offs between file size, audio quality, and processing requirements.
For speech recognition, the choice of format is important because it dictates the quality of the data your model will see. Let's examine the three most common formats you will encounter: WAV, MP3, and FLAC.
The Waveform Audio File Format, or WAV (.wav), is the simplest and most direct way to store digital audio. Think of a WAV file as a raw container for the audio data you get after sampling and quantization. It stores the amplitude value for each sample sequentially, with a small header at the beginning of the file that specifies metadata like the sample rate and bit depth.
The relationship is straightforward:
File Size (WAV) = Sample Rate × Bit Depth × Number of Channels × Duration in Seconds

This gives the size in bits; divide by 8 to get bytes. An analogy for a WAV file is a bitmap image file (.bmp). It stores the color value for every single pixel, resulting in a perfect but very large image file.
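As a quick sanity check, here is a minimal sketch of that calculation in Python. The specific numbers (16 kHz sample rate, 16-bit depth, mono, 60 seconds) are illustrative values commonly used for speech, not figures from this section.

```python
# Estimate the size of an uncompressed WAV file (ignoring the small header).
sample_rate = 16_000   # samples per second, a common rate for speech
bit_depth = 16         # bits per sample
num_channels = 1       # mono
duration_s = 60        # one minute of audio

size_bits = sample_rate * bit_depth * num_channels * duration_s
size_mb = size_bits / 8 / 1_000_000  # bits -> bytes -> megabytes

print(f"~{size_mb:.2f} MB per minute")  # ~1.92 MB per minute
```

A minute of plain 16 kHz mono speech already takes about 2 MB, which is why compressed formats exist in the first place.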
The MP3 (.mp3) format was designed to solve the problem of large file sizes. It uses lossy compression, which means it reduces the file size by permanently discarding parts of the audio data.
This process is not random. MP3 encoders use principles of psychoacoustics, the study of how humans perceive sound. They remove audio frequencies and sounds that are difficult for the average person to hear, such as very high frequencies or quiet sounds that occur at the same time as loud sounds.
An MP3 file is like a JPEG image (.jpg). It achieves a small file size by discarding subtle visual details, resulting in an image that looks great to the human eye but is not a perfect copy of the original.
The Free Lossless Audio Codec, or FLAC (.flac), offers a middle ground between the massive files of WAV and the quality-reducing compression of MP3. It uses lossless compression.
This means that while a FLAC file is smaller than a WAV file, it retains every single bit of the original audio information. It achieves this by finding more efficient ways to represent the data, similar to how a ZIP file compresses a document without changing its contents. When you decompress a FLAC file, you get a bit-for-bit identical copy of the original raw audio.
A FLAC file is analogous to a ZIP file (.zip). It makes the contents smaller for storage and transfer, but when you "unzip" it, you get back the exact original files.
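If you want to verify the lossless claim yourself, the sketch below uses the soundfile library (an assumed choice; any library that reads and writes FLAC would work) to round-trip a waveform through FLAC and confirm the samples come back unchanged. The file paths are illustrative.

```python
import numpy as np
import soundfile as sf

# Read an existing WAV file (path is illustrative).
audio, sample_rate = sf.read("speech.wav", dtype="int16")

# Re-encode the same samples as FLAC, then decode them again.
sf.write("speech.flac", audio, sample_rate)
decoded, _ = sf.read("speech.flac", dtype="int16")

# Lossless compression means the decoded samples match bit for bit.
print(np.array_equal(audio, decoded))  # True
```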
The main takeaway is that all audio must be converted into a raw, uncompressed waveform before it can be processed by an ASR system. The file format, whether WAV, MP3, or FLAC, simply tells you how the audio was stored and whether any information was discarded along the way.
The following diagram illustrates how different formats are handled before ASR processing.
All audio, regardless of its storage format, is converted to a raw waveform before being used to create features for a speech recognition model.
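In practice, an audio loading library performs this decoding step for you. The sketch below assumes librosa (torchaudio or soundfile would work similarly): whatever the file's on-disk format, the result is the same kind of raw waveform array ready for feature extraction.

```python
import librosa

# Each file decodes to a 1-D NumPy array of samples plus a sample rate,
# regardless of how it was stored on disk. Paths are illustrative.
for path in ["clip.wav", "clip.mp3", "clip.flac"]:
    waveform, sr = librosa.load(path, sr=16_000, mono=True)
    print(path, waveform.shape, sr)
```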
Understanding these formats allows you to make informed decisions about storing data and building pipelines that can handle various types of audio inputs. In the next section, we will learn how to visualize these raw waveforms to see what speech looks like.