Powerful architectures like Transformers and pre-trained models such as Wav2Vec 2.0 are highly effective. However, training such large models from scratch requires immense computational resources and massive datasets. A more practical and highly effective approach is to fine-tune a pre-trained model. This process adapts a general-purpose model, already trained on thousands of hours of speech, to your specific domain or dataset, often achieving excellent performance with a fraction of the data and compute.
In this hands-on section, you will use the Hugging Face transformers and datasets libraries to fine-tune a pre-trained Wav2Vec 2.0 model on a sample dataset. This workflow represents a standard and powerful method for building modern ASR systems.
First, ensure you have the necessary libraries installed. We will use datasets to handle data loading, transformers for the model and training pipeline, torchaudio for audio operations, and jiwer to calculate the Word Error Rate (WER).
pip install datasets transformers[torch] torchaudio jiwer
A Note on Hardware
Fine-tuning, even on a small scale, is computationally intensive. While this example can run on a CPU for a few steps, completing the training in a reasonable amount of time requires a GPU. If you are using a cloud-based notebook environment like Google Colab, make sure to select a GPU runtime.
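If you want to confirm that a GPU is actually visible before launching a long run, a quick check with PyTorch (installed above via transformers[torch]) looks like this:

import torch

# Verify that a CUDA-capable GPU is available before starting a long fine-tuning run
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))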
We will start by loading both the dataset and the pre-trained model's "processor." A processor in Hugging Face is a convenient object that bundles the feature extractor (for audio) and the tokenizer (for text) into one.
For this example, we'll use a small, prepared subset of the LibriSpeech dataset, which is readily available on the Hugging Face Hub. We'll also load the processor for facebook/wav2vec2-base-960h, a popular Wav2Vec 2.0 model pre-trained on 960 hours of English speech.
from datasets import load_dataset
from transformers import AutoProcessor
# Load a small sample dataset from the Hugging Face Hub
librispeech_sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# Load the processor which includes the feature extractor and tokenizer
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
The processor object now contains two main components:
processor.feature_extractor: Converts raw audio waveforms into the numeric input the model expects. For Wav2Vec 2.0 this is the normalized raw waveform itself, not a spectrogram.
processor.tokenizer: Converts the target text transcriptions into sequences of integer IDs.

The process of adapting a pre-trained model to a new dataset involves a few distinct steps. The base of the model, which learned general acoustic representations, is typically frozen or trained with a very low learning rate. A classification head on top of the base model maps the acoustic features to the characters in our dataset's vocabulary (if the vocabulary is new, this head starts from random initialization). The entire model is then trained on our specific data.
The fine-tuning process adapts a general-purpose base model by replacing its final layer and training it on a new, specific dataset.
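Before preparing the data, it can help to confirm what the processor provides. The values printed below assume the facebook/wav2vec2-base-960h checkpoint loaded above:

# The feature extractor operates on raw 16 kHz waveforms
print(processor.feature_extractor.sampling_rate)      # 16000

# The tokenizer maps characters to a small vocabulary of integer IDs
print(processor.tokenizer.vocab_size)                 # 32 for this checkpoint
print(processor.tokenizer("HELLO WORLD").input_ids)   # a short list of character IDs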
ASR models require the audio input and text labels to be in a specific format. The audio must have a consistent sampling rate, and the text must be tokenized. Our preprocessing function will handle this for every example in our dataset.
The Wav2Vec 2.0 model we are using was pre-trained on audio with a 16 kHz sampling rate. Our dataset might contain audio with different rates, so we must resample it first.
from datasets import Audio

# The model expects audio at a 16 kHz sampling rate
SAMPLING_RATE = 16_000

# Resample the audio in our dataset if needed
librispeech_sample = librispeech_sample.cast_column("audio", Audio(sampling_rate=SAMPLING_RATE))

# Define the preprocessing function
def prepare_dataset(batch):
    # Extract the audio array and its sampling rate
    audio = batch["audio"]

    # The feature extractor turns the waveform into the input_values the model expects
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]

    # The tokenizer turns the transcript into the integer label IDs used by the CTC loss
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch
Now, we apply this function to every example in our dataset using the .map() method, removing the original columns so that only the model-ready fields remain.
# Apply the preprocessing function to all examples
processed_dataset = librispeech_sample.map(
    prepare_dataset,
    remove_columns=librispeech_sample.column_names,
)
After this step, processed_dataset contains two columns: input_values (the numeric features for the acoustic model) and labels (the integer IDs for the CTC loss function).
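A quick sanity check on one processed example makes these two columns concrete (the exact lengths depend on the clip you inspect):

example = processed_dataset[0]

# input_values: one float per audio sample (a few seconds of 16 kHz audio is tens of thousands of values)
print(len(example["input_values"]))

# labels: integer character IDs that decode back to the transcript
print(example["labels"][:10])
print(processor.tokenizer.decode(example["labels"], group_tokens=False))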
The Hugging Face Trainer is a powerful class that abstracts away the complexities of the training loop. To use it, we need to configure three things: a data collator, a set of training arguments, and an evaluation metric.
A data collator takes a list of individual data points and groups them into a batch. For ASR with CTC, we need a collator that pads the input_values and the labels separately to the longest example in each batch. The input_values are padded with zeros, and the labels are padded with -100, which is the value the CTC loss function ignores. The transformers library does not ship a ready-made collator for this, so we define a small one using the processor's pad method, as shown below.
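This implementation follows the pattern used in Hugging Face's own CTC fine-tuning examples; the only assumption is that each feature dict carries the input_values and labels produced by prepare_dataset above.

from dataclasses import dataclass
from typing import Dict, List, Union

import torch

@dataclass
class DataCollatorCTCWithPadding:
    """Dynamically pads input_values and labels to the longest item in each batch."""

    processor: AutoProcessor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], List[float]]]]) -> Dict[str, torch.Tensor]:
        # Audio inputs and text labels need different padding, so split them apart
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        # Pad the audio inputs with zeros
        batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")

        # Pad the label sequences
        labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

        # Replace padding token IDs with -100 so the CTC loss ignores them
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch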
# Instantiate the collator with our processor
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
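To see the padding in action, you can collate two examples of different lengths by hand (this assumes the sample dataset contains at least two clips, which the dummy split does):

# Build a tiny batch manually and inspect the padded shapes
manual_batch = data_collator([processed_dataset[0], processed_dataset[1]])
print(manual_batch["input_values"].shape)  # (2, longest audio length in the pair)
print(manual_batch["labels"].shape)        # (2, longest label length in the pair)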
To monitor our model's performance during training, we need a function that calculates an evaluation metric. The standard metric for ASR is the Word Error Rate (WER). We will use the jiwer library to compute it. The model outputs token IDs, so our function must first decode these IDs back into text before comparing them to the reference labels.
import numpy as np
import jiwer

def compute_metrics(pred):
    # Decode the predicted IDs to text
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)
    pred_str = processor.batch_decode(pred_ids)

    # The labels are already in the correct format, but we need to replace -100
    # with the pad_token_id to decode them properly.
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    label_str = processor.tokenizer.batch_decode(label_ids, group_tokens=False)

    # Calculate WER
    wer = jiwer.wer(label_str, pred_str)
    return {"wer": wer}
Next, we load the pre-trained model itself. We use AutoModelForCTC since our model will be trained with CTC loss. We also define TrainingArguments, which specifies hyperparameters like the learning rate, batch size, and number of training epochs.
from transformers import AutoModelForCTC, TrainingArguments, Trainer
# Load the pre-trained model
model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
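As noted earlier, the lower layers that learned general acoustic representations are usually kept frozen. For Wav2Vec 2.0, the model class that AutoModelForCTC resolves to here exposes a helper that freezes the convolutional feature encoder:

# Freeze the convolutional feature encoder; only the Transformer layers and CTC head are updated
model.freeze_feature_encoder()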
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./wav2vec2-base-librispeech-demo",
    group_by_length=True,
    per_device_train_batch_size=8,
    evaluation_strategy="steps",
    num_train_epochs=3,
    fp16=True,  # Use mixed-precision training for speed
    save_steps=500,
    eval_steps=500,
    logging_steps=50,
    learning_rate=1e-4,
    weight_decay=0.005,
    warmup_steps=1000,
    save_total_limit=2,
)
With all the components prepared, we can now instantiate the Trainer and start the fine-tuning process. The Trainer will handle the entire loop: feeding batches to the model, calculating loss, performing backpropagation, updating weights, and periodically evaluating performance on the validation set.
# Instantiate the Trainer
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=processed_dataset,  # In a real scenario, this would be your training set
    eval_dataset=processed_dataset,   # This would be your validation set
    tokenizer=processor.feature_extractor,
)
# Start the training
trainer.train()
The training will now begin. You will see progress bars and logs indicating the training loss and the WER on the evaluation set. By the end of this process, the model in the output_dir will be adapted to your dataset, capable of transcribing speech more accurately than the original pre-trained model for this specific type of data.
Once training is complete, you can easily use your fine-tuned model for transcription. The simplest way is to use the pipeline abstraction, which handles all the preprocessing and postprocessing steps for you.
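One practical detail: the Trainer writes periodic checkpoints under output_dir, but for the pipeline to load everything from that directory it also needs the final model weights and the processor files (tokenizer vocabulary and feature extractor config). The two calls below are a minimal sketch, assuming the variables from the previous steps are still in scope:

# Save the final model weights and config to the output directory
trainer.save_model(training_args.output_dir)

# Save the tokenizer and feature extractor so the pipeline can rebuild the processor
processor.save_pretrained(training_args.output_dir)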
from transformers import pipeline

# Load the fine-tuned model using the pipeline
asr_pipeline = pipeline("automatic-speech-recognition", model="./wav2vec2-base-librispeech-demo")

# For this example, we pass a raw 16 kHz audio array from our dataset;
# a path to a 16 kHz mono WAV file would also work
sample_audio = librispeech_sample[0]["audio"]["array"]

# Get the transcription
transcription = asr_pipeline(sample_audio)
print(transcription)
This will output the model's transcription of the audio file. This hands-on exercise demonstrates the full, modern workflow for building a high-performance ASR system. By starting with a powerful pre-trained model, you can achieve excellent results on custom datasets with minimal code and resources compared to training from scratch.