Before the development of deep learning-based approaches, vocoding relied primarily on signal processing techniques. These methods take an acoustic feature representation, typically a magnitude spectrogram (such as the mel-spectrograms generated by TTS front-ends), and attempt to reconstruct a time-domain audio waveform.
A prominent example is the Griffin-Lim algorithm (GLA). The core challenge these traditional methods face is the phase reconstruction problem: TTS acoustic models usually predict only the magnitude component of the spectrogram, discarding the phase information. While the magnitude captures much of the spectral content, the phase is essential for accurately reconstructing the waveform's temporal structure and preserving its perceived quality.
Think of the Short-Time Fourier Transform (STFT), which converts each signal segment into magnitude and phase components across frequencies. Reversing this process (the inverse STFT, or ISTFT) requires both. Given only the magnitude $|X(t,f)|$, how do we find the correct phase $\phi(t,f)$ to reconstruct the original signal $x(n)$?
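To make the decomposition concrete, here is a minimal sketch using NumPy and librosa (an assumed tooling choice; the file path and STFT settings are only illustrative). It shows that each STFT bin is a complex number that splits into magnitude and phase, and that the ISTFT needs both to recover the waveform.

```python
import numpy as np
import librosa

# Load any mono recording; "speech.wav" is a placeholder path.
y, sr = librosa.load("speech.wav", sr=None)

# STFT: every time-frequency bin is a complex number.
X = librosa.stft(y, n_fft=1024, hop_length=256)
magnitude = np.abs(X)    # |X(t, f)| -- what TTS acoustic models typically predict
phase = np.angle(X)      # phi(t, f) -- the component that gets discarded

# The ISTFT needs both parts; with the true phase, reconstruction is near-perfect.
y_rec = librosa.istft(magnitude * np.exp(1j * phase), hop_length=256)
```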
$$
x(n) \;\xrightarrow{\text{STFT}}\; X(t,f) = |X(t,f)|\, e^{j\phi(t,f)}
$$

$$
\hat{X}(t,f) = |X_{\text{target}}(t,f)|\, e^{j\hat{\phi}(t,f)} \;\xrightarrow{\text{ISTFT}}\; \hat{x}(n)
$$

The Griffin-Lim algorithm tackles this iteratively. It starts with the target magnitude spectrogram and an initial guess for the phase (often random noise or zero phase). It then alternates between two steps:

1. Apply the ISTFT to the current complex spectrogram (the target magnitude combined with the current phase estimate) to obtain a time-domain signal.
2. Apply the STFT to that signal, keep the resulting phase as the new estimate, and replace its magnitude with the target magnitude.
This process is repeated, aiming to find a signal whose STFT magnitude matches the target while satisfying the consistency constraints inherent in the STFT/ISTFT process. The underlying assumption is that enforcing consistency will implicitly guide the phase towards a reasonable estimate.
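A minimal implementation sketch of this loop, again assuming NumPy and librosa (librosa also ships its own `librosa.griffinlim`), could look like the following. The hyperparameters are illustrative, and a mel-spectrogram from a TTS front-end would first need to be mapped back to a linear-frequency magnitude (for example with `librosa.feature.inverse.mel_to_stft`) before being used as the target.

```python
import numpy as np
import librosa

def griffin_lim(target_magnitude, hop_length=256, n_iters=60, seed=0):
    """Estimate a waveform whose STFT magnitude matches `target_magnitude`."""
    rng = np.random.default_rng(seed)
    # Initial guess: target magnitude paired with random phase.
    complex_spec = target_magnitude * np.exp(2j * np.pi * rng.random(target_magnitude.shape))

    for _ in range(n_iters):
        # Step 1: go to the time domain with the current phase estimate.
        waveform = librosa.istft(complex_spec, hop_length=hop_length)
        # Step 2: re-analyze, keep only the resulting phase,
        # and re-impose the target magnitude.
        rebuilt = librosa.stft(waveform,
                               n_fft=2 * (target_magnitude.shape[0] - 1),
                               hop_length=hop_length)
        complex_spec = target_magnitude * np.exp(1j * np.angle(rebuilt))

    return librosa.istft(complex_spec, hop_length=hop_length)
```

In practice a few dozen iterations are common, and momentum-accelerated variants (exposed via the `momentum` argument of `librosa.griffinlim`) tend to converge faster.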
However, this process has significant drawbacks:

- Audible artifacts: the estimated phase is rarely fully consistent with the target magnitude, which gives Griffin-Lim output its characteristic metallic, robotic, or "phasey" sound.
- No optimality guarantee: the iteration only enforces STFT consistency, so it can settle on a poor phase estimate, particularly when the target magnitude comes from an imperfect acoustic model rather than from real audio.
- Iterative cost: tens of STFT/ISTFT passes are typically required, adding latency to synthesis.
Consider the spectrograms themselves. While a GLA-vocoded signal's magnitude spectrogram might closely match the target, its underlying phase structure leads to audible perceptual degradation.
Illustration: While the Griffin-Lim reconstructed magnitude (center) might closely match the target magnitude (left), the resulting audio quality suffers compared to audio derived from the original waveform (right, whose magnitude might differ slightly but whose phase is correct) because of suboptimal phase estimation.
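As a rough way to quantify this, the sketch below (still assuming librosa; the audio path and STFT settings are illustrative) reconstructs audio from magnitude alone with `librosa.griffinlim` and measures how closely the resulting magnitude tracks the target, even though the reconstructed audio typically sounds noticeably degraded.

```python
import numpy as np
import librosa

def spectral_convergence(target_mag, estimated_mag):
    # Relative Frobenius-norm distance between magnitude spectrograms (0 = identical).
    return np.linalg.norm(target_mag - estimated_mag) / np.linalg.norm(target_mag)

# "speech.wav" is a placeholder for any mono recording.
y, sr = librosa.load("speech.wav", sr=None)
target_mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Reconstruct audio from the magnitude alone.
y_gla = librosa.griffinlim(target_mag, n_iter=60, hop_length=256)
gla_mag = np.abs(librosa.stft(y_gla, n_fft=1024, hop_length=256))

# The magnitudes are close, yet y_gla usually sounds worse than y:
# this metric is blind to the phase errors that cause the degradation.
n_frames = min(target_mag.shape[1], gla_mag.shape[1])
print("spectral convergence:",
      spectral_convergence(target_mag[:, :n_frames], gla_mag[:, :n_frames]))
```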
These limitations motivated the development of neural vocoders. By learning complex mappings directly from acoustic features to waveforms, or by modeling the conditional distribution of audio samples, neural networks can implicitly or explicitly learn the correct phase relationships, leading to significantly more natural and higher-fidelity synthesized speech. We will examine these advanced techniques in the following sections.