After a raw audio signal is digitized, we can’t just feed the entire stream of numbers into a machine learning model. Speech is a complex signal whose characteristics change continuously. To prepare it for analysis, we must first apply two important preprocessing steps: pre-emphasis and framing. These techniques help to balance the signal's properties and break it down into manageable, analyzable segments.
If you look at the frequency content of a typical speech signal, you will notice that most of the energy is concentrated in the lower frequencies. Higher-frequency components, which are also important for distinguishing between phonemes (like 's' vs. 'f'), often have much lower energy. This imbalance can be a problem for the algorithms used in feature extraction.
Pre-emphasis is a filtering technique that aims to solve this by boosting the energy of the high-frequency components. This serves two main purposes:

- It balances the frequency spectrum, so that the high-frequency detail that distinguishes phonemes is not dominated by low-frequency energy.
- It can improve the numerical behavior of later feature extraction steps, such as the Fourier transform, and may improve the signal-to-noise ratio.
The process is straightforward. We apply a simple high-pass filter that calculates a new signal, y(t), based on the original signal, x(t). The formula for pre-emphasis is:
y(t) = x(t) − α⋅x(t−1)

In this equation, x(t) is the value of the current sample, and x(t−1) is the value of the previous sample. The coefficient α (alpha) is the pre-emphasis factor; its value is typically between 0.95 and 0.97. By subtracting a fraction of the previous sample from the current one, we effectively amplify the differences between samples, which are more pronounced at higher frequencies.
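A minimal NumPy sketch of this filter might look like the following; the function name `pre_emphasis` and the default α of 0.97 are illustrative choices, not part of any standard API:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply the pre-emphasis filter y(t) = x(t) - alpha * x(t-1)."""
    # The first sample has no predecessor, so it passes through unchanged;
    # every later sample has a fraction of its predecessor subtracted.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```

For example, `pre_emphasis(np.array([1.0, 1.0, 1.0]))` returns `[1.0, 0.03, 0.03]`: a flat, low-frequency signal is strongly attenuated, while rapid sample-to-sample changes pass through almost untouched.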
A speech signal is not stationary; its properties change over time as we pronounce different words and sounds. For example, the sound "sh" in "she" has very different frequency characteristics from the "e" sound that follows it. Analyzing a whole sentence at once would average out all these important details.
However, over very short intervals, typically around 20 to 30 milliseconds, the speech signal can be considered "quasi-stationary," meaning its properties are relatively stable. This observation is the basis for framing.
Framing is the process of slicing the pre-emphasized signal into small, overlapping segments called frames. Each frame is short enough to be considered a stable acoustic unit. Two parameters define this process:

- The frame size: the duration of each frame, commonly 25 ms.
- The frame step (or stride): the interval between the start of one frame and the start of the next, commonly 10 ms.
Notice that the frame step (10 ms) is shorter than the frame size (25 ms), so the frames will overlap. In this example, each frame overlaps the previous one by 15 ms (25 ms − 10 ms). This overlap is important because it ensures a smooth transition between frames and prevents us from losing information that falls at a frame boundary. Without overlap, we might accidentally cut a phoneme in half, making it difficult to identify.
The audio signal is segmented into overlapping frames. Each frame has a fixed size (e.g., 25ms), and a new frame begins at a regular interval called the frame step (e.g., 10ms).
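The following sketch implements this slicing with NumPy, assuming a 16 kHz sample rate and the 25 ms / 10 ms values above; the function name, parameter names, and the decision to zero-pad the final frame are my own choices rather than a fixed convention:

```python
import numpy as np

def frame_signal(signal: np.ndarray,
                 sample_rate: int = 16000,
                 frame_size_ms: float = 25.0,
                 frame_step_ms: float = 10.0) -> np.ndarray:
    """Slice a 1-D signal into overlapping frames of shape (num_frames, frame_length)."""
    frame_length = int(round(sample_rate * frame_size_ms / 1000))  # 400 samples at 16 kHz
    frame_step = int(round(sample_rate * frame_step_ms / 1000))    # 160 samples at 16 kHz

    # Number of frames needed to cover the signal, zero-padding the
    # tail so the final frame is full length instead of being dropped.
    num_frames = 1 + int(np.ceil(max(len(signal) - frame_length, 0) / frame_step))
    pad_length = (num_frames - 1) * frame_step + frame_length
    padded = np.append(signal, np.zeros(pad_length - len(signal)))

    # Row i of this index matrix holds the sample positions of frame i:
    # [i*step, i*step + 1, ..., i*step + frame_length - 1].
    indices = (np.arange(frame_length)[None, :]
               + np.arange(num_frames)[:, None] * frame_step)
    return padded[indices]
```

For a one-second clip at 16 kHz (16,000 samples), this produces 99 frames of 400 samples each. Zero-padding keeps the last partial frame rather than discarding it; some implementations drop it instead, which is an equally valid choice.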
After pre-emphasis and framing, we are no longer dealing with a single, long audio signal. Instead, we have a sequence of short, overlapping frames. Each of these frames is now ready for the next step in our processing pipeline: applying a windowing function to prepare it for frequency analysis.