While autoregressive models like WaveNet generate high-fidelity audio sample by sample, and flow-based models like WaveGlow offer parallel synthesis through invertible transformations, Generative Adversarial Networks (GANs) present another powerful approach for waveform generation. GANs leverage a competitive training process between a generator and a discriminator network, proving highly effective at producing realistic-looking (and sounding) data. Adapting this paradigm for vocoding has led to models that offer an excellent trade-off between computational efficiency and audio quality.
In the context of vocoding, the GAN framework pairs two networks: a generator that maps a mel-spectrogram to a raw audio waveform, and a discriminator that receives either real recordings or generated waveforms and predicts which is which.
The training involves an adversarial game: the generator tries to fool the discriminator by producing increasingly realistic waveforms, while the discriminator improves its ability to distinguish real from generated audio. This dynamic pushes the generator towards producing waveforms that are perceptually indistinguishable from real recordings.
Mathematically, this is often formulated as a minimax game with a value function $V(G, D)$. For instance, using the standard GAN loss:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Here, $x$ represents real audio data, and $z$ represents the input conditioning (mel-spectrograms) for the generator $G$. In practice, variations like the least-squares GAN (LSGAN) loss are often used for more stable training.
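To make this objective concrete, here is a minimal PyTorch sketch of one alternating training step. The stand-in linear networks and tensor shapes are assumptions purely for illustration, and the generator update uses the common non-saturating variant (maximizing $\log D(G(z))$) rather than the literal minimax form.

```python
import torch
import torch.nn.functional as F

# Stand-in networks (assumed shapes): a real vocoder generator upsamples
# mel frames into a waveform; the discriminator scores waveform realism.
G = torch.nn.Linear(80, 256)   # mel frame (80 bins) -> waveform chunk
D = torch.nn.Linear(256, 1)    # waveform chunk -> realism logit

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

real = torch.randn(8, 256)     # batch of real waveform chunks
mel = torch.randn(8, 80)       # conditioning mel frames (the z above)

# Discriminator step: maximize log D(x) + log(1 - D(G(z))).
fake = G(mel).detach()         # detach blocks gradients into G
d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(8, 1))
          + F.binary_cross_entropy_with_logits(D(fake), torch.zeros(8, 1)))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step (non-saturating variant): maximize log D(G(z)).
g_loss = F.binary_cross_entropy_with_logits(D(G(mel)), torch.ones(8, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```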
MelGAN was one of the first successful GAN-based vocoders, demonstrating significantly faster inference than autoregressive models while maintaining good audio quality.
Simplified architecture of MelGAN, showing the generator producing a waveform from a mel-spectrogram and the multi-scale discriminator evaluating its realism at different resolutions.
MelGAN's training objective combines the adversarial loss (encouraging realistic outputs) with a feature matching loss. The feature matching loss helps stabilize training by penalizing discrepancies between the intermediate feature map activations within the discriminators for real and generated samples. This acts as a perceptual guide for the generator, beyond just fooling the final classification layer of the discriminator.
$$\mathcal{L}_{\text{FM}}(G) = \sum_{k=1}^{K} \mathbb{E}_{(x, z)}\left[\lVert D_k(G(z)) - D_k(x) \rVert_1\right] \quad \text{(Feature Matching Loss)}$$

$$\mathcal{L}_{\text{Adv}}(G; D) = \sum_{k=1}^{K} \mathbb{E}_{z}\left[(D_k(G(z)) - 1)^2\right] \quad \text{(Generator Adversarial Loss, LSGAN variant)}$$

$$\mathcal{L}_{\text{Adv}}(D; G) = \sum_{k=1}^{K} \left( \mathbb{E}_{x}\left[(D_k(x) - 1)^2\right] + \mathbb{E}_{z}\left[D_k(G(z))^2\right] \right) \quad \text{(Discriminator Adversarial Loss, LSGAN variant)}$$

Here $D_k$ denotes the $k$-th of the $K$ discriminators (one per scale). The total generator loss is a weighted sum of $\mathcal{L}_{\text{FM}}(G)$ and $\mathcal{L}_{\text{Adv}}(G; D)$.
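The sketch below shows how these terms might be computed in PyTorch. It assumes each discriminator $D_k$ returns a tuple of (final score, list of intermediate feature maps); the function names and the weight `lambda_fm` are illustrative assumptions rather than MelGAN's exact implementation.

```python
import torch

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between discriminator feature maps for real and
    generated audio, summed over layers (for one discriminator D_k)."""
    return sum(torch.mean(torch.abs(fr - ff))
               for fr, ff in zip(feats_real, feats_fake))

def generator_losses(discriminators, real, fake, lambda_fm=10.0):
    """LSGAN adversarial loss plus feature matching, summed over the
    K discriminators. lambda_fm is an assumed weighting coefficient."""
    adv, fm = 0.0, 0.0
    for d in discriminators:
        score_fake, feats_fake = d(fake)
        _, feats_real = d(real)
        adv += torch.mean((score_fake - 1.0) ** 2)   # (D_k(G(z)) - 1)^2
        fm += feature_matching_loss(feats_real, feats_fake)
    return adv + lambda_fm * fm

def discriminator_loss(discriminators, real, fake):
    """LSGAN discriminator loss: push real scores to 1, fake to 0."""
    loss = 0.0
    for d in discriminators:
        score_real, _ = d(real)
        score_fake, _ = d(fake.detach())             # freeze G this step
        loss += (torch.mean((score_real - 1.0) ** 2)
                 + torch.mean(score_fake ** 2))
    return loss
```

Detaching the generated waveform inside `discriminator_loss` keeps the generator frozen during the discriminator update, mirroring the alternating optimization shown earlier.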
MelGAN provides a substantial speed-up over autoregressive models, making it practical for near real-time synthesis, although its output can contain minor artifacts compared to the best autoregressive or flow-based models.
HiFi-GAN builds upon the success of MelGAN, aiming for higher audio fidelity and improved perceptual quality while retaining computational efficiency. It introduces refinements to both the generator and discriminator architectures.
Generator (Multi-Receptive Field Fusion - MRF): HiFi-GAN's generator employs a module called Multi-Receptive Field Fusion (MRF). Within the generator's main structure (which still uses transposed convolutions for upsampling), the MRF module processes the intermediate features using multiple parallel residual blocks. Each residual block uses dilated convolutions with different dilation rates and kernel sizes. The outputs of these parallel blocks are then fused (typically summed). This design allows the generator to capture audio patterns across various temporal resolutions simultaneously, improving its ability to model complex waveform structures and long-range dependencies without significantly increasing computational cost.
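A simplified PyTorch sketch of an MRF-style module follows. The kernel sizes (3, 7, 11) and dilations (1, 3, 5) mirror HiFi-GAN's published defaults, but each residual block is pared down to one convolution per dilation; the actual model uses deeper blocks with weight normalization.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Stack of dilated 1D convolutions with residual connections."""
    def __init__(self, channels, kernel_size, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=(kernel_size - 1) * d // 2)   # "same" length
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(torch.nn.functional.leaky_relu(x, 0.1))
        return x

class MRF(nn.Module):
    """Multi-Receptive Field Fusion: parallel residual blocks with
    different kernel sizes and dilations, fused by averaging."""
    def __init__(self, channels, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ResBlock(channels, k) for k in kernel_sizes])

    def forward(self, x):
        # Each branch sees a different receptive field; fusing them lets
        # the generator model short- and long-range structure at once.
        return sum(block(x) for block in self.blocks) / len(self.blocks)
```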
Discriminator (Multi-Period Discriminator, MPD + Multi-Scale Discriminator, MSD): HiFi-GAN uses a combination of two types of discriminators. The MPD is a set of sub-discriminators, each of which reshapes the 1D waveform into a 2D grid whose width equals a fixed period (the original paper uses the prime periods 2, 3, 5, 7, and 11) so that 2D convolutions can inspect periodic structure in the signal. The MSD, carried over from MelGAN, applies sub-discriminators to the raw waveform and to progressively average-pooled (downsampled) copies of it, capturing patterns at different temporal scales. A simplified sketch of the MPD's period reshaping follows.
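Below is a simplified PyTorch sketch of the MPD's central trick, folding the waveform by period. The channel counts and layer depth are illustrative assumptions; the published model is considerably deeper and uses weight normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodDiscriminator(nn.Module):
    """One MPD sub-discriminator: folds the waveform into a 2D grid of
    width `period`, so 2D convolutions see periodically spaced samples."""
    def __init__(self, period):
        super().__init__()
        self.period = period
        # Shallow illustrative stack; (k, 1) kernels convolve along time
        # within each periodic column, as in the original design.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, 32, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.Conv2d(32, 32, (5, 1), stride=(3, 1), padding=(2, 0)),
        ])
        self.out = nn.Conv2d(32, 1, (3, 1), padding=(1, 0))

    def forward(self, wav):                  # wav: (batch, 1, T)
        b, c, t = wav.shape
        if t % self.period:                  # pad so T divides by period
            wav = F.pad(wav, (0, self.period - t % self.period), "reflect")
            t = wav.shape[-1]
        x = wav.view(b, c, t // self.period, self.period)  # (B, 1, T/p, p)
        feats = []
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.1)
            feats.append(x)                  # kept for feature matching
        return self.out(x), feats

# Prime periods ensure the sub-discriminators observe distinct,
# non-redundant periodic structure in the waveform.
mpd = nn.ModuleList(PeriodDiscriminator(p) for p in (2, 3, 5, 7, 11))
```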
HiFi-GAN architecture overview, highlighting the Multi-Receptive Field Fusion (MRF) in the generator and the combination of Multi-Period (MPD) and Multi-Scale (MSD) discriminators.
The combination of MRF in the generator and the MPD/MSD discriminators allows HiFi-GAN to achieve state-of-the-art audio quality among GAN-based vocoders, often comparable to autoregressive models but with vastly superior inference speed. It effectively reduces the artifacts sometimes heard in earlier GAN vocoders, producing clean and natural-sounding speech.
Training GANs for audio can be challenging: the adversarial game is inherently unstable, and the generator and discriminator must remain roughly balanced to avoid mode collapse or vanishing gradients. Auxiliary losses, such as the feature matching loss above or HiFi-GAN's mel-spectrogram reconstruction loss, are usually needed to anchor training, and results are sensitive to hyperparameters such as learning rates and loss weights. Even then, subtle periodic or metallic artifacts can appear if the discriminators fail to penalize them.
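As one example of such an auxiliary objective, the sketch below computes a mel-spectrogram reconstruction loss with torchaudio. The transform settings (sample rate, FFT size, hop length, mel bins) are assumed values, not a specific model's configuration.

```python
import torch
import torchaudio

# Assumed analysis settings for 22.05 kHz speech.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def mel_loss(real_wav, fake_wav):
    """L1 distance between log-mel-spectrograms of real and generated
    audio: a reconstruction term that anchors adversarial training."""
    eps = 1e-5  # avoid log(0)
    mel_real = torch.log(mel_transform(real_wav) + eps)
    mel_fake = torch.log(mel_transform(fake_wav) + eps)
    return torch.mean(torch.abs(mel_real - mel_fake))
```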
GAN-based vocoders like MelGAN and HiFi-GAN represent a significant advancement in neural waveform synthesis. By leveraging adversarial training with carefully designed generator and discriminator architectures (incorporating multi-scale analysis, receptive field fusion, and periodic pattern detection), they achieve high-fidelity audio generation with remarkable computational efficiency. Their non-autoregressive nature makes them particularly well-suited for applications requiring low-latency speech synthesis, forming a cornerstone of many modern TTS pipelines. While training requires care, the resulting models offer a compelling combination of speed and quality.