While traditional vocoders operate primarily in the frequency domain and rely on signal processing heuristics, the first major breakthroughs in neural vocoding came from models that directly generate the raw audio waveform, sample by sample, in the time domain. These are known as autoregressive waveform models.
The core idea is elegant yet computationally demanding: predict the next audio sample based on all previously generated samples and, critically, the conditioning acoustic features (like mel-spectrograms) provided by the upstream TTS acoustic model. The probability distribution of the current sample $x_t$ is modeled conditioned on past samples $x_1, \ldots, x_{t-1}$ and the conditioning input $c$:

$$p(x_t \mid x_1, \ldots, x_{t-1}, c)$$

This sequential dependency allows the models to capture the complex, long-range temporal structures inherent in audio waveforms, leading to very high fidelity and naturalness. However, generating audio one sample at a time, especially at typical sampling rates (e.g., 16 kHz, 24 kHz, or even 48 kHz), makes inference inherently slow.
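To make this concrete, the loop below sketches what sample-by-sample inference looks like. The `model` interface (all previous samples and conditioning features in, logits over quantized sample values out) is an assumption made for illustration, not a specific library API:

```python
import torch

def autoregressive_generate(model, cond, num_samples):
    """Generate a waveform one sample at a time.

    `model` is a hypothetical callable mapping (all previous samples,
    conditioning features) to logits over quantized sample values.
    """
    samples = torch.zeros(1, dtype=torch.long)          # seed with a silent sample
    for _ in range(num_samples):
        logits = model(samples, cond)                    # models p(x_t | x_<t, c)
        probs = torch.softmax(logits, dim=-1)
        next_sample = torch.multinomial(probs, num_samples=1)
        samples = torch.cat([samples, next_sample])
    return samples[1:]                                   # drop the seed sample
```

Because step $t$ cannot begin until sample $t-1$ exists, the loop is strictly sequential, which is exactly the inference bottleneck described above.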
WaveNet, introduced by DeepMind, was a landmark achievement demonstrating the potential of deep learning for raw audio generation. It achieved state-of-the-art results for TTS, significantly outperforming existing parametric and concatenative systems in terms of naturalness.
The key innovations in WaveNet's architecture address the challenge of modeling long-range dependencies in high-resolution audio signals:
Causal Convolutions: To maintain the autoregressive property (predicting $x_t$ using only past samples $x_{<t}$), WaveNet uses causal convolutions. In a standard 1D convolution, the output at time $t$ might depend on inputs at $t-k, \ldots, t, \ldots, t+k$. A causal convolution ensures the output at time $t$ depends only on inputs at times $t, t-1, t-2, \ldots$. This is typically implemented by padding the input sequence appropriately on one side and ensuring the convolutional filter doesn't "look into the future".
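A causal convolution can be implemented by padding only on the left before a standard convolution. The PyTorch sketch below is illustrative rather than WaveNet's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution whose output at time t sees only inputs at times <= t."""

    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        # Pad only on the left so the filter never reaches future samples.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))      # pad the time dimension on the left
        return self.conv(x)                   # output has the same length as the input
```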
Dilated Convolutions: Modeling dependencies over thousands of past samples (e.g., several hundred milliseconds) with standard causal convolutions would require an extremely deep network or very large filters, making it computationally infeasible. WaveNet instead employs dilated convolutions. In these layers, the filter is applied over an input region larger than its own length by skipping input values with a certain step (the dilation rate). By stacking layers with exponentially increasing dilation rates (e.g., 1, 2, 4, 8, ..., 512), the network achieves a very large receptive field (the span of past inputs that can influence the output at step $t$) with relatively few layers, capturing dependencies across various timescales efficiently.
Stacking dilated causal convolutions allows the receptive field to grow exponentially, enabling the model to capture long-range dependencies in the audio signal efficiently. Each layer processes inputs with increasing gaps (dilation).
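The receptive field of such a stack is easy to compute: each layer with kernel size $k$ and dilation $d$ extends it by $(k-1) \cdot d$ samples. The kernel size and repeat count below are illustrative choices, not the exact published configuration:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Kernel size 2 with dilations 1, 2, 4, ..., 512, and the cycle repeated 3 times.
dilations = [2 ** i for i in range(10)] * 3
print(receptive_field(2, dilations))   # 3070 samples, roughly 192 ms at 16 kHz
```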
Gated Activation Units: Inspired by LSTMs and GRUs, WaveNet uses a gated activation mechanism within its residual blocks:
$$z = \tanh(W_{f,k} * x) \odot \sigma(W_{g,k} * x)$$

Here, $*$ denotes convolution, $k$ is the layer index, $W_{f,k}$ and $W_{g,k}$ are the filter and gate weights, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication. This gating allows the network to control the flow of information more effectively.
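In code, the gate itself is a simple element-wise combination of the two convolution branches (a minimal sketch):

```python
import torch

def gated_activation(filter_branch, gate_branch):
    """z = tanh(filter branch) * sigmoid(gate branch), applied element-wise."""
    return torch.tanh(filter_branch) * torch.sigmoid(gate_branch)
```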
Residual and Skip Connections: To facilitate the training of such deep networks, WaveNet employs both residual connections (adding the input of a block to its output) and skip connections (summing outputs from various blocks to form the final prediction), similar to ResNet architectures.
Conditioning: WaveNet needs to be conditioned on the acoustic features (e.g., mel-spectrograms) provided by the TTS model. This is achieved through local conditioning, where the conditioning features c (upsampled to match the audio resolution) influence the gating mechanism:
$$z = \tanh(W_{f,k} * x + V_{f,k} * c) \odot \sigma(W_{g,k} * x + V_{g,k} * c)$$

where $V_{f,k}$ and $V_{g,k}$ are learned linear projections applied to the conditioning input $c$. Global conditioning (e.g., speaker identity vectors) can be added in the same way to influence the entire utterance.
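The sketch below puts the gated activation, local conditioning, and residual/skip connections together in one block, reusing the `CausalConv1d` layer sketched earlier. Layer sizes and names are illustrative assumptions, not the original DeepMind implementation:

```python
import torch
import torch.nn as nn

class WaveNetResidualBlock(nn.Module):
    """One dilated block: conditioned gated activation plus residual and skip paths."""

    def __init__(self, channels, skip_channels, cond_channels, kernel_size=2, dilation=1):
        super().__init__()
        # CausalConv1d is the left-padded layer sketched earlier in this section.
        self.filter_conv = CausalConv1d(channels, kernel_size, dilation)
        self.gate_conv = CausalConv1d(channels, kernel_size, dilation)
        # 1x1 projections of the (upsampled) conditioning features: V_f and V_g.
        self.cond_filter = nn.Conv1d(cond_channels, channels, 1)
        self.cond_gate = nn.Conv1d(cond_channels, channels, 1)
        self.residual_proj = nn.Conv1d(channels, channels, 1)
        self.skip_proj = nn.Conv1d(channels, skip_channels, 1)

    def forward(self, x, c):                  # x and c share the same time resolution
        f = torch.tanh(self.filter_conv(x) + self.cond_filter(c))
        g = torch.sigmoid(self.gate_conv(x) + self.cond_gate(c))
        z = f * g
        skip = self.skip_proj(z)              # summed across blocks for the output head
        residual = self.residual_proj(z) + x  # residual connection into the next block
        return residual, skip
```

The skip outputs of all blocks are summed and passed through additional layers to produce the per-sample output distribution.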
Output Layer: The final output layer predicts the probability distribution for the next sample $x_t$. Since raw audio samples are often represented as 16-bit integers (65,536 possible values), a simple softmax over all values is computationally expensive. WaveNet originally used an 8-bit mu-law companding transformation ($\mu$-law encoding) to reduce the number of possible values to 256, followed by a softmax layer. Later work explored Mixture Density Network (MDN) outputs, specifically a discretized mixture of logistic distributions, to model the continuous waveform or 16-bit discrete values more effectively.
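The companding step itself is a fixed, invertible transformation. A sketch of 8-bit mu-law encoding and decoding, assuming waveform samples normalized to $[-1, 1]$:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand samples in [-1, 1] and quantize them to 256 integer bins (0..255)."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(bins, mu=255):
    """Invert the companding: integer bins back to approximate samples in [-1, 1]."""
    compressed = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu
```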
While WaveNet produces exceptionally high-quality audio, its sample-by-sample generation process makes inference extremely slow, often much slower than real-time on standard hardware, limiting its practical deployment in latency-sensitive applications.
WaveRNN was developed specifically to address the slow inference speed of WaveNet while retaining the benefits of autoregressive modeling. Instead of relying solely on computationally intensive dilated convolutions, WaveRNN employs a Recurrent Neural Network (RNN), typically a GRU or LSTM, to model the sequential dependencies.
The core WaveRNN update equation looks something like this:
$$o_t, h_t = \text{RNN}([x_{t-1}, c_t], h_{t-1})$$
$$P(x_t) = \text{OutputLayer}(o_t)$$

where $x_{t-1}$ is the previous audio sample, $c_t$ is the corresponding conditioning feature frame, $h_{t-1}$ is the previous hidden state of the RNN, $o_t$ is the RNN output, and $h_t$ is the updated hidden state. The OutputLayer (often composed of fully connected layers) then predicts the probability distribution for the current sample $x_t$, frequently using a softmax over mu-law quantized bins, similar to WaveNet.
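A single recurrence step might look like the sketch below. This is a simplified single-softmax version with assumed dimensions; the actual WaveRNN splits each sample into coarse and fine parts, as outlined next:

```python
import torch
import torch.nn as nn

class WaveRNNStepSketch(nn.Module):
    """Simplified single-softmax WaveRNN step; dimensions here are assumptions."""

    def __init__(self, cond_dim, hidden_dim=896, num_bins=256):
        super().__init__()
        self.rnn = nn.GRUCell(1 + cond_dim, hidden_dim)   # input is [x_{t-1}, c_t]
        self.output_layer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_bins),              # logits over quantized bins
        )

    def step(self, prev_sample, cond_frame, h):
        # prev_sample: (batch, 1), cond_frame: (batch, cond_dim), h: (batch, hidden_dim)
        h = self.rnn(torch.cat([prev_sample, cond_frame], dim=-1), h)
        return self.output_layer(h), h
```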
Key aspects and optimizations of WaveRNN include:
Dual Softmax (Coarse/Fine Split): Each 16-bit sample is split into a coarse part (the upper 8 bits) and a fine part (the lower 8 bits), each predicted with its own 256-way softmax, avoiding a single 65,536-way output while retaining full 16-bit resolution.
Sparse WaveRNN: Weight pruning during training yields large but sparse recurrent layers that preserve quality while greatly reducing the computation required per sample.
Subscale Generation: The waveform is folded into several sub-sequences that can be generated in batches, increasing throughput at the cost of some dependency structure.
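For reference, the coarse/fine split amounts to simple integer arithmetic on the quantized sample (a sketch assuming samples are already offset into the unsigned 0-65535 range):

```python
import numpy as np

def split_coarse_fine(samples_16bit):
    """Split unsigned 16-bit samples into coarse (high 8 bits) and fine (low 8 bits)."""
    coarse = samples_16bit // 256     # 0..255, predicted first
    fine = samples_16bit % 256        # 0..255, predicted conditioned on the coarse part
    return coarse, fine

x = np.array([40000], dtype=np.int64)
coarse, fine = split_coarse_fine(x)   # coarse == 156, fine == 64 (156 * 256 + 64 == 40000)
```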
WaveRNN and its optimized variants achieve generation speeds significantly faster than the original WaveNet, making real-time synthesis feasible on CPUs and mobile devices, although often with a slight trade-off in maximum achievable audio fidelity compared to the best WaveNet implementations.
Both WaveNet and WaveRNN represent foundational work in neural vocoding, demonstrating that modeling raw audio waveforms autoregressively can yield unprecedented quality. Their primary limitation, the sequential generation process, motivated the development of the parallel waveform generation models discussed next.