You've now seen the theoretical underpinnings of various neural vocoder architectures, from autoregressive models like WaveNet to efficient GAN-based approaches like HiFi-GAN. These models are crucial for transforming the intermediate acoustic features (often mel-spectrograms) generated by TTS systems into the final, audible waveform. This practical section guides you through using a pre-trained neural vocoder to perform this synthesis step. We will focus on taking existing mel-spectrograms and converting them into audio, simulating the final stage of a modern TTS pipeline.
For this exercise, we'll use the TTS library, a popular open-source toolkit, and a pre-trained HiFi-GAN vocoder model. HiFi-GAN is known for its high fidelity and computational efficiency, making it a common choice in many TTS systems.
First, ensure you have the necessary library installed. We primarily need TTS, which bundles torch and other dependencies. You might also need soundfile for saving the audio.
# Install the TTS library (Coqui TTS) from PyPI
pip install TTS
# Install soundfile if you don't have it
pip install soundfile numpy
You'll also need a pre-computed mel-spectrogram file as input for the vocoder. For this example, let's assume you have a NumPy file named sample_mel_spectrogram.npy. This file would typically be the output of an acoustic model (such as Tacotron 2 or FastSpeech 2) from a preceding TTS stage, representing the acoustic features for a specific utterance.
Note: Generating this mel-spectrogram file itself involves running a separate TTS acoustic model, which was covered conceptually in Chapter 4. For this exercise, focus on the vocoder's role assuming the mel-spectrogram is already available.
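If you don't yet have an acoustic model's output available, you can create a placeholder file with the expected layout so the rest of the practical still runs. This is a minimal sketch: the values are random noise, not real acoustic features, so the resulting audio will not be intelligible speech, but the shape and file format match what the vocoder expects (80 mel bins by 250 frames, the same dummy shape used later in this section).
import numpy as np

# Placeholder only: random values shaped like an 80-band mel-spectrogram with
# 250 frames. A real file would come from an acoustic model such as Tacotron 2.
placeholder_mel = np.random.rand(80, 250).astype(np.float32)
np.save("sample_mel_spectrogram.npy", placeholder_mel)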
The TTS library provides a convenient interface to load various pre-trained models. We will load a HiFi-GAN model trained on the LJSpeech dataset.
import torch
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer
# Instantiate the model manager; downloaded models are stored in the
# library's default cache directory (e.g., ~/.local/share/tts/ on Linux)
manager = ModelManager()
# List available vocoder models (optional, for exploration)
# print(manager.list_models())
# Download and load a pre-trained HiFi-GAN vocoder model
# Example: the HiFi-GAN v2 model trained on LJSpeech
vocoder_model_name = "vocoder_models/en/ljspeech/hifigan_v2"
# Other vocoders are also available, for example:
# vocoder_model_name = "vocoder_models/universal/libri-tts/wavegrad"
try:
    vocoder_path, vocoder_config_path, _ = manager.download_model(vocoder_model_name)
except ValueError as e:
    print(f"Error downloading model: {e}")
    print("Please check the model name or your internet connection.")
    # Guidance on finding correct model names
    print("You can list available models using manager.list_models()")
    exit()  # Exit if model download fails
# Check if CUDA (GPU) is available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Initialize the Synthesizer with only the vocoder
# We don't need a TTS model here since we provide the mel-spectrogram directly
syn = Synthesizer(
    tts_checkpoint=None,   # No TTS model checkpoint
    tts_config_path=None,  # No TTS model config
    vocoder_checkpoint=vocoder_path,
    vocoder_config=vocoder_config_path,
    use_cuda=(device == "cuda"),
)
print("Neural Vocoder model loaded successfully.")
# The loaded vocoder is stored on the Synthesizer (as syn.vocoder_model in
# recent versions of the library; the attribute name can vary between releases).
This code snippet initializes the ModelManager to handle model downloads and then uses the Synthesizer class, configured only with the vocoder model details. The library downloads the specified HiFi-GAN model if it's not already present locally. It also checks whether a GPU is available for faster processing.
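If you want to explore what else ships with the toolkit, the manager can list its model catalog. The optional sketch below assumes the return behavior of list_models(), which varies by library version: some releases only print the catalog, while others also return the names as a list.
# Optional exploration of the model catalog
available = manager.list_models()
if available:
    vocoders = [name for name in available if name.startswith("vocoder_models")]
    print(f"Found {len(vocoders)} vocoder models, e.g. {vocoders[:3]}")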
Now, load the pre-computed mel-spectrogram from the .npy file. This file should contain a 2D NumPy array where one dimension represents the mel frequency bins and the other represents time frames.
import numpy as np
# Load the mel-spectrogram from a file
# Replace 'sample_mel_spectrogram.npy' with the actual path to your file
mel_file = 'sample_mel_spectrogram.npy'
try:
    mel_spectrogram = np.load(mel_file)
    print(f"Loaded mel-spectrogram from {mel_file}")
    print(f"Shape: {mel_spectrogram.shape}")  # Example shape: (80, 250) -> 80 mel bins, 250 frames
except FileNotFoundError:
    print(f"Error: Mel-spectrogram file not found at {mel_file}")
    print("Please ensure the file exists or provide the correct path.")
    # Create a dummy spectrogram for demonstration
    print("Creating a dummy mel-spectrogram for demonstration.")
    mel_spectrogram = np.random.rand(80, 250).astype(np.float32)  # 80 mel bins, 250 frames
except Exception as e:
    print(f"Error loading mel-spectrogram: {e}")
    exit()
# The TTS Synthesizer expects the mel-spectrogram as a Torch tensor
# It should also have a batch dimension added.
# Shape expected by many vocoders: [batch_size, num_mels, num_frames]
mel_tensor = torch.tensor(mel_spectrogram).unsqueeze(0).to(device)
print(f"Converted mel-spectrogram to tensor with shape: {mel_tensor.shape}")
Here, we load the NumPy array and convert it into a PyTorch tensor. Crucially, we add a batch dimension with unsqueeze(0), as most deep learning models expect batched input, even if the batch size is just one. We also move the tensor to the appropriate device (CPU or GPU).
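A common pitfall at this point is a transposed input: some tools save mel-spectrograms as (frames, mels) rather than (mels, frames). The optional check below is a sketch that assumes an 80-band model (as in the LJSpeech configuration) and swaps the last two dimensions if they look reversed.
# Optional sanity check (assumes the vocoder expects 80 mel bands)
expected_mels = 80
if mel_tensor.shape[1] != expected_mels and mel_tensor.shape[2] == expected_mels:
    mel_tensor = mel_tensor.transpose(1, 2)
    print(f"Transposed mel tensor to shape: {mel_tensor.shape}")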
With the vocoder loaded and the input mel-spectrogram prepared, we can perform the inference step. The Synthesizer object normally generates speech through its high-level tts method, but since we are bypassing the text-to-mel stage, we call the underlying vocoder's inference method directly.
# Use the synthesizer's vocoder to convert the mel-spectrogram to a waveform
# The vocoder's `inference` method handles this step
# Note: The attribute and method names can vary slightly between TTS library
# versions; check the documentation if needed. The Synthesizer often wraps this.
print("Generating waveform from mel-spectrogram...")
# We pass the mel_tensor directly to the vocoder's inference method
# Ensure the tensor is on the correct device
outputs = syn.vocoder_model.inference(mel_tensor)
# The output is usually a tensor containing the raw audio waveform samples.
# It might be on the GPU, so move it to CPU and convert to NumPy array.
# The output tensor shape might be [batch_size, 1, num_samples] or [batch_size, num_samples]
waveform = outputs.squeeze().cpu().numpy()
print(f"Generated waveform with shape: {waveform.shape}") # Example shape: (55125,) -> number of audio samples
print("Waveform generation complete.")
The inference method of the loaded vocoder model takes the mel-spectrogram tensor as input and outputs the corresponding audio waveform tensor. We then move this tensor back to the CPU and convert it to a NumPy array for easier handling and saving.
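GAN vocoder outputs can occasionally exceed the [-1, 1] range expected for floating-point audio, which causes audible clipping when the file is written. An optional peak-normalization step (a simple safeguard, not part of the library's API) keeps the waveform in range.
# Optional: peak-normalize only if the waveform exceeds the [-1, 1] range
peak = np.max(np.abs(waveform))
if peak > 1.0:
    waveform = waveform / peak
    print(f"Peak-normalized waveform (original peak: {peak:.2f})")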
Finally, save the generated waveform as a standard audio file (like WAV) and listen to it. You'll need the sample rate associated with the pre-trained vocoder model. This is usually stored in the model's configuration.
import soundfile as sf
# Get the sample rate associated with the vocoder so the audio is saved and
# played back correctly. Recent versions of the library expose it as
# syn.output_sample_rate; fall back to 22050 Hz (the LJSpeech rate) if absent.
output_sample_rate = getattr(syn, "output_sample_rate", 22050)
print(f"Using sample rate: {output_sample_rate} Hz")
# Define the output file path
output_wav_file = 'generated_audio_hifigan.wav'
# Save the waveform as a WAV file
try:
    sf.write(output_wav_file, waveform, output_sample_rate)
    print(f"Audio saved successfully to {output_wav_file}")
except Exception as e:
    print(f"Error saving audio file: {e}")
print("\nPractical complete. You can now listen to the generated audio file.")
This code retrieves the correct sample rate for the vocoder model (essential for correct playback speed) and uses the soundfile library to write the NumPy array containing the waveform samples into a .wav file.
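As a quick sanity check, the duration implied by the sample count should roughly equal the number of mel frames times the vocoder's hop length divided by the sample rate (a hop of 256 samples is typical for 22.05 kHz models, though this depends on the configuration).
# Sanity check: duration implied by the generated waveform
duration_s = len(waveform) / output_sample_rate
print(f"Generated {duration_s:.2f} seconds of audio ({len(waveform)} samples)")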
Listen to the generated_audio_hifigan.wav file. Compare its quality to examples you might have heard from traditional vocoders like Griffin-Lim. Does it sound natural? Are there noticeable artifacts (like buzzing or hissing)? This hands-on experience directly demonstrates the quality improvements offered by modern neural vocoders like HiFi-GAN, which you learned about earlier in the chapter. You can experiment further by obtaining mel-spectrograms for different sentences or using different pre-trained vocoder models (e.g., WaveGrad, MelGAN) if available through the toolkit to compare their outputs.
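If you want a concrete Griffin-Lim baseline for that comparison, librosa can invert a mel-spectrogram directly. The sketch below makes simplifying assumptions: it treats the array as linear-power mel values and uses typical LJSpeech STFT settings (n_fft=1024, hop_length=256). Acoustic models usually emit log-scaled, normalized mels, so you may need to undo that scaling to match your model's configuration; expect the result to sound noticeably buzzier than the HiFi-GAN output.
import librosa
import soundfile as sf

# Rough Griffin-Lim baseline (assumptions: linear-power mel values,
# n_fft=1024, hop_length=256 - adjust to match your acoustic model's settings)
gl_audio = librosa.feature.inverse.mel_to_audio(
    mel_spectrogram,          # shape (n_mels, n_frames)
    sr=output_sample_rate,
    n_fft=1024,
    hop_length=256,
)
sf.write('generated_audio_griffinlim.wav', gl_audio, output_sample_rate)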