After a raw audio signal is digitized, we can’t just feed the entire stream of numbers into a machine learning model. Speech is a complex signal whose characteristics change continuously. To prepare it for analysis, we must first apply two important preprocessing steps: pre-emphasis and framing. These techniques help to balance the signal's properties and break it down into manageable, analyzable segments.
If you look at the frequency content of a typical speech signal, you will notice that most of the energy is concentrated in the lower frequencies. Higher-frequency components, which are also important for distinguishing between phonemes (like 's' vs. 'f'), often have much lower energy. This imbalance can be a problem for the algorithms used in feature extraction.
Pre-emphasis is a filtering technique that aims to solve this by boosting the energy of the high-frequency components. This serves two main purposes:

- It balances the frequency spectrum, so that the high-frequency detail that distinguishes phonemes is not dominated by low-frequency energy.
- It can improve the numerical behavior of later feature extraction steps, such as the Fourier transform, and may improve the signal-to-noise ratio.
The process is straightforward. We apply a simple high-pass filter that calculates a new signal, y(t), based on the original signal, x(t). The formula for pre-emphasis is:
y(t) = x(t) − α⋅x(t−1)

In this equation, x(t) is the value of the current sample, and x(t−1) is the value of the previous sample. The coefficient α (alpha) is the pre-emphasis factor; its value is typically between 0.95 and 0.97. By subtracting a fraction of the previous sample from the current one, we effectively amplify the differences between samples, which are more pronounced at higher frequencies.
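A minimal NumPy sketch of this filter might look like the following; the function name `pre_emphasis` and the default α of 0.97 are illustrative choices, not part of any standard API:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply the pre-emphasis filter y(t) = x(t) - alpha * x(t-1)."""
    # The first sample has no predecessor, so it passes through unchanged;
    # every later sample has a fraction of its predecessor subtracted.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```

For example, `pre_emphasis(np.array([1.0, 1.0, 1.0]))` returns `[1.0, 0.03, 0.03]`: a flat, low-frequency signal is strongly attenuated, while rapid sample-to-sample changes pass through almost untouched.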
A speech signal is not stationary; its properties change over time as we pronounce different words and sounds. For example, the sound "sh" in "she" has very different frequency characteristics from the "e" sound that follows it. Analyzing a whole sentence at once would average out all these important details.
However, over very short intervals, typically around 20 to 30 milliseconds, the speech signal can be considered "quasi-stationary," meaning its properties are relatively stable. This observation is the basis for framing.
Framing is the process of slicing the pre-emphasized signal into small, overlapping segments called frames. Each frame is short enough to be considered a stable acoustic unit. Two parameters define this process:

- The frame size: the duration of each frame, commonly 25 ms.
- The frame step (or stride): the interval between the start of one frame and the start of the next, commonly 10 ms.
Notice that the frame step (10 ms) is shorter than the frame size (25 ms), so the frames will overlap. In this example, each frame overlaps the previous one by 15 ms (25 ms − 10 ms). This overlap is important because it ensures a smooth transition between frames and prevents us from losing information that falls at a frame boundary. Without overlap, we might accidentally cut a phoneme in half, making it difficult to identify.
The audio signal is segmented into overlapping frames. Each frame has a fixed size (e.g., 25ms), and a new frame begins at a regular interval called the frame step (e.g., 10ms).
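The following sketch implements this slicing with NumPy, assuming a 16 kHz sample rate and the 25 ms / 10 ms values above; the function name, parameter names, and the decision to zero-pad the final frame are my own choices rather than a fixed convention:

```python
import numpy as np

def frame_signal(signal: np.ndarray,
                 sample_rate: int = 16000,
                 frame_size_ms: float = 25.0,
                 frame_step_ms: float = 10.0) -> np.ndarray:
    """Slice a 1-D signal into overlapping frames of shape (num_frames, frame_length)."""
    frame_length = int(round(sample_rate * frame_size_ms / 1000))  # 400 samples at 16 kHz
    frame_step = int(round(sample_rate * frame_step_ms / 1000))    # 160 samples at 16 kHz

    # Number of frames needed to cover the signal, zero-padding the
    # tail so the final frame is full length instead of being dropped.
    num_frames = 1 + int(np.ceil(max(len(signal) - frame_length, 0) / frame_step))
    pad_length = (num_frames - 1) * frame_step + frame_length
    padded = np.append(signal, np.zeros(pad_length - len(signal)))

    # Row i of this index matrix holds the sample positions of frame i:
    # [i*step, i*step + 1, ..., i*step + frame_length - 1].
    indices = (np.arange(frame_length)[None, :]
               + np.arange(num_frames)[:, None] * frame_step)
    return padded[indices]
```

For a one-second clip at 16 kHz (16,000 samples), this produces 99 frames of 400 samples each. Zero-padding keeps the last partial frame rather than discarding it; some implementations drop it instead, which is an equally valid choice.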
After pre-emphasis and framing, we are no longer dealing with a single, long audio signal. Instead, we have a sequence of short, overlapping frames. Each of these frames is now ready for the next step in our processing pipeline: applying a windowing function to prepare it for frequency analysis.