This hands-on practice adapts a neural network training process from running solely on the CPU to employing the capabilities of an NVIDIA GPU using CUDA.jl and Flux.jl. The aim is to take a simple image classification task, train a model on the CPU, and then modify the script to accelerate training on a GPU, observing the performance difference.
For this exercise, you'll need:
- A working Julia installation with the Flux.jl, CUDA.jl, and MLDatasets.jl packages.
- An NVIDIA GPU with a recent CUDA driver for the GPU portion (the CPU portion runs anywhere).
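If any of these packages are missing, a one-time setup along the following lines installs them from the General registry (a minimal sketch; the versions you get depend on your project environment):
using Pkg
Pkg.add(["Flux", "CUDA", "MLDatasets"])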
Let's begin with a script that defines a simple Convolutional Neural Network (CNN), prepares the MNIST dataset, and trains the model on the CPU.
using Flux, MLDatasets, Statistics, Random
using Flux: onehotbatch, onecold, glorot_uniform
using Printf
# Set a seed for reproducibility
Random.seed!(123)
# 1. Load MNIST Data
println("Loading MNIST dataset...")
# Full dataset
# imgs_train_raw, labels_train_raw = MNIST.traindata();
# imgs_test_raw, labels_test_raw = MNIST.testdata();
# For faster demonstration, let's use a subset
train_n = 5000 # Number of training samples
test_n = 1000 # Number of test samples
imgs_train_raw, labels_train_raw = MNIST.traindata(1:train_n);
imgs_test_raw, labels_test_raw = MNIST.testdata(1:test_n);
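# Note: the calls above use the older MLDatasets API (0.5/0.6). On MLDatasets >= 0.7
# the equivalent is, e.g., MNIST(split=:train)[1:train_n], which returns a
# (features, targets) named tuple; adjust the destructuring for your installed version.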
# 2. Preprocess Data
println("Preprocessing data...")
# Reshape for CNN (WHCN format: Width, Height, Channels, N_samples)
# Convert to Float32; MNIST pixels load as fixed-point values, so the result is already in [0, 1]
preprocess_images(imgs) = reshape(Float32.(imgs), 28, 28, 1, :)
X_train = preprocess_images(imgs_train_raw);
X_test = preprocess_images(imgs_test_raw);
# One-hot encode labels
Y_train = onehotbatch(labels_train_raw, 0:9);
Y_test = onehotbatch(labels_test_raw, 0:9);
# 3. Define the CNN Model (CPU)
println("Defining CPU model...")
model_cpu = Chain(
    Conv((3, 3), 1=>16, relu, init=glorot_uniform),  # Output: 26x26x16
    MaxPool((2,2)),                                  # Output: 13x13x16
    Conv((3, 3), 16=>32, relu, init=glorot_uniform), # Output: 11x11x32
    MaxPool((2,2)),                                  # Output: 5x5x32
    Flux.flatten,                                    # Output: 800
    Dense(5*5*32, 128, relu, init=glorot_uniform),
    Dense(128, 10, init=glorot_uniform)              # Raw scores (logits);
    # softmax is often applied inside the loss, or afterwards for probabilities
)
# 4. Define Loss and Optimizer (CPU)
loss_cpu(x, y) = Flux.logitcrossentropy(model_cpu(x), y) # Use logitcrossentropy for numerical stability
opt_cpu = ADAM(0.001)
# 5. Prepare Minibatches (CPU)
batch_size = 128
# The data is already on the CPU
train_loader_cpu = Flux.DataLoader((X_train, Y_train), batchsize=batch_size, shuffle=true)
# 6. CPU Training Function
function train_epoch_cpu!(model, loader, opt, loss_fn)
    ps = Flux.params(model)
    for (x_batch, y_batch) in loader
        # x_batch and y_batch are already CPU arrays
        gs = gradient(() -> loss_fn(x_batch, y_batch), ps)
        Flux.update!(opt, ps, gs)
    end
end
# 7. Train and Time on CPU
println("Starting CPU training for one epoch...")
# For more accurate benchmarking, consider BenchmarkTools.jl's @btime
# Here, @time gives a simple timing.
# Warm-up (optional, but good for more stable @time results)
# train_epoch_cpu!(model_cpu, first(train_loader_cpu,1) , opt_cpu, loss_cpu)
cpu_training_time = @elapsed train_epoch_cpu!(model_cpu, train_loader_cpu, opt_cpu, loss_cpu)
@printf "CPU training epoch finished in %.2fs.\n" cpu_training_time
# Quick accuracy check on CPU
accuracy(x, y, model) = mean(onecold(model(x)) .== onecold(y))
acc_cpu = accuracy(X_test, Y_test, model_cpu)
@printf "CPU model accuracy after one epoch: %.2f%%\n" acc_cpu*100
This script sets up a standard training pipeline. The model, data, and computations all reside on the CPU. Note the use of Flux.DataLoader for convenient batching and shuffling.
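To confirm that the loader yields batches in the expected WHCN layout, you can inspect the first batch (a quick sanity check, not part of the pipeline; the sizes shown assume the settings above):
x_batch, y_batch = first(train_loader_cpu)
@show size(x_batch) # (28, 28, 1, 128): Width, Height, Channels, batch size
@show size(y_batch) # (10, 128): one-hot classes by batch size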
The gpu Functor
Now, let's bring CUDA.jl into the picture. The primary tool Flux provides for GPU acceleration is the gpu functor, which moves models and data to the active GPU. Conversely, the cpu functor moves them back to the CPU.
First, ensure CUDA is available and functional:
using CUDA
if !CUDA.functional()
    println("CUDA is not available or not functional on this system. GPU practice will be skipped.")
    # You might want to exit here, or guard the GPU-specific code paths below.
else
    println("CUDA is functional. GPU acceleration is available.")
    # Proceed with GPU-specific code
end
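Optionally, you can also report which device was selected; these are standard CUDA.jl queries:
if CUDA.functional()
    dev = CUDA.device()                    # the currently active CUDA device
    println("Active GPU: ", CUDA.name(dev)) # human-readable device name
end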
Assuming CUDA.functional() returns true:
Moving the Model to GPU:
To run our model on the GPU, we transfer its parameters and structure using gpu.
if CUDA.functional()
    # This creates a new model structure with parameters on the GPU
    model_gpu = gpu(model_cpu)
    # Equivalently: model_gpu = model_cpu |> gpu
    println("Model transferred to GPU.")
end
It's important to understand that model_gpu is now a new model instance. Its parameters (weights and biases) reside in GPU memory. Operations on model_gpu with GPU data will be executed on the GPU.
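You can verify the transfer by inspecting a parameter's array type, for example:
if CUDA.functional()
    W = first(Flux.params(model_gpu)) # first parameter array (a conv kernel)
    println(typeof(W))                # a CuArray type, e.g. CuArray{Float32, 4, ...}
    @assert W isa CUDA.CuArray
end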
Moving Data to GPU:
Similarly, input data (features x and labels y) must be on the GPU before being fed to the model_gpu. This is typically done batch by batch within the training loop.
# Example: moving a single batch (x_batch_cpu, y_batch_cpu from CPU loader)
# x_batch_gpu = gpu(x_batch_cpu)
# y_batch_gpu = gpu(y_batch_cpu)
With the model on the GPU, the training loop needs slight modification to ensure data batches are also moved to the GPU before each forward and backward pass.
if CUDA.functional()
    # 1. Model is already on GPU: model_gpu
    # 2. Define loss and optimizer for the GPU model
    # The loss function now uses model_gpu
    loss_gpu(x, y) = Flux.logitcrossentropy(model_gpu(x), y)
    opt_gpu = ADAM(0.001) # Optimizer for the GPU model parameters
    # 3. Data loading for the GPU
    # Flux.DataLoader could move data via a custom collate step, but for
    # simplicity we move each batch inside the training loop instead.
    # train_loader_cpu still yields CPU-based batches.
    # 4. GPU training function
    function train_epoch_gpu!(model, loader, opt, loss_fn)
        ps = Flux.params(model) # Parameters are already on the GPU
        for (x_batch_cpu, y_batch_cpu) in loader
            # Move the current batch to the GPU
            x_batch_gpu = gpu(x_batch_cpu)
            y_batch_gpu = gpu(y_batch_cpu)
            # Compute gradients on the GPU
            gs = gradient(() -> loss_fn(x_batch_gpu, y_batch_gpu), ps)
            Flux.update!(opt, ps, gs)
            # CUDA.synchronize() # Uncomment for precise per-step timing / debugging;
            # usually not needed for correctness of the training loop
        end
    end
println("Starting GPU training for one epoch...")
# Re-initialize model_gpu from model_cpu if you want a fair comparison from scratch state
# model_gpu = gpu(deepcopy(model_cpu)) # Or re-define model_cpu then gpu(model_cpu)
# To compare speed, ideally, we'd reset the model_cpu's weights or use a fresh copy
# For this script, we'll just train a new model_gpu created from original model_cpu structure.
# Let's re-create a fresh model structure for GPU to avoid using the already trained model_cpu parameters
    model_gpu_fresh = Chain(
        Conv((3, 3), 1=>16, relu, init=glorot_uniform),
        MaxPool((2,2)),
        Conv((3, 3), 16=>32, relu, init=glorot_uniform),
        MaxPool((2,2)),
        Flux.flatten,
        Dense(5*5*32, 128, relu, init=glorot_uniform),
        Dense(128, 10, init=glorot_uniform)
    ) |> gpu # Create and move to GPU
    loss_gpu_fresh(x, y) = Flux.logitcrossentropy(model_gpu_fresh(x), y)
    opt_gpu_fresh = ADAM(0.001)
    # Warm-up (compiles GPU kernels so the timed epoch is more representative)
    # train_epoch_gpu!(model_gpu_fresh, first(train_loader_cpu, 1), opt_gpu_fresh, loss_gpu_fresh)
    gpu_training_time = @elapsed train_epoch_gpu!(model_gpu_fresh, train_loader_cpu, opt_gpu_fresh, loss_gpu_fresh)
    @printf "GPU training epoch finished in %.2fs.\n" gpu_training_time
    # Quick accuracy check on the GPU model; test data must also be on the GPU
    X_test_gpu = gpu(X_test)
    Y_test_gpu = gpu(Y_test)
    # The accuracy function defined earlier works here: model(X_test_gpu) returns
    # a CuArray, and onecold handles CuArrays in recent Flux versions.
    acc_gpu = accuracy(X_test_gpu, Y_test_gpu, model_gpu_fresh)
    @printf "GPU model accuracy after one epoch: %.2f%%\n" acc_gpu*100
    # Moving results back to the CPU (example), e.g. to inspect predictions:
    # sample_preds_gpu = model_gpu_fresh(X_test_gpu[:, :, :, 1:5]) # Logits for the first 5 test images
    # sample_preds_cpu = cpu(sample_preds_gpu)
    # println("Sample predictions (logits) from GPU model, moved to CPU:")
    # display(sample_preds_cpu)
end
The main change is x_batch_gpu = gpu(x_batch_cpu) and y_batch_gpu = gpu(y_batch_cpu) inside the loop. Flux's optimizers and loss functions are generally designed to work transparently with CuArrays (CUDA arrays) produced by gpu().
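As an alternative to calling gpu on each batch by hand, CUDA.jl provides CuIterator, which wraps an existing batch iterator, uploads each batch to the GPU, and frees it eagerly after use. A minimal sketch using the names defined above (it assumes each batch element, including the one-hot labels, can be adapted to a GPU array, which holds for recent Flux versions; if not, convert the labels to dense arrays first):
if CUDA.functional()
    ps = Flux.params(model_gpu_fresh)
    # CuIterator moves each (x, y) batch to the GPU as it is consumed
    for (x_batch_gpu, y_batch_gpu) in CUDA.CuIterator(train_loader_cpu)
        gs = gradient(() -> loss_gpu_fresh(x_batch_gpu, y_batch_gpu), ps)
        Flux.update!(opt_gpu_fresh, ps, gs)
    end
end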
After running both the CPU and GPU training scripts (ensure CUDA.functional() is true for the GPU part), compare the reported times for one epoch.
Illustrative training times for one epoch on the MNIST task with a simple CNN: the GPU epoch completes in a fraction of the CPU time. Actual speed-up depends on the specific GPU, CPU, model complexity, and batch size.
You should observe a significant speed-up with the GPU, especially as model complexity and data size increase. For very small models or tiny batches, the overhead of transferring data to the GPU might diminish the gains, but for most deep learning workloads, GPU acceleration is substantial.
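One caveat when timing: GPU kernels launch asynchronously, so wrapping the timed call in CUDA.@sync ensures @elapsed measures completed work rather than just kernel launches. For example, reusing the names defined above:
if CUDA.functional()
    # CUDA.@sync blocks until all GPU work started by the expression has finished
    t = @elapsed CUDA.@sync train_epoch_gpu!(model_gpu_fresh, train_loader_cpu, opt_gpu_fresh, loss_gpu_fresh)
    @printf "Synchronized GPU epoch time: %.2fs\n" t
end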
Working with GPUs introduces a few more things to keep in mind:
- Memory management: GPU memory is limited. Use CUDA.memory_status() to check available GPU memory. If you hit an OutOfMemoryError, common remedies include reducing the batch_size, simplifying your model, or using techniques like gradient accumulation (a more advanced topic). Calling GC.gc() and then CUDA.reclaim() can sometimes free cached memory, but good code design minimizes reliance on these; see the sketch after this list.
- Synchronization: GPU operations run asynchronously. CUDA.synchronize() waits for all pending GPU tasks to complete, which is critical for accurate micro-benchmarking of specific GPU kernels. Flux's gradient computations and data transfers via cpu() implicitly synchronize when the result is needed.
- Data types: keep element types consistent (Float32 is the common choice for deep learning). Flux and CUDA.jl handle this well, but type instabilities can lead to performance degradation or errors.
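The memory-related calls mentioned above look like this in practice:
if CUDA.functional()
    CUDA.memory_status() # prints used/free GPU memory to stdout
    GC.gc()              # release unreferenced Julia objects, including CuArrays
    CUDA.reclaim()       # return cached pool memory to the CUDA driver
end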
If you run into issues:
- Confirm CUDA.functional() is true before executing GPU code paths.
- Check device placement: make sure the model and every data batch end up on the same device via matching gpu/cpu calls; mixing CPU arrays with a GPU model is a frequent source of errors.
- Unsupported operations on CuArray: most Flux layers and common operations support CuArrays, but custom layers or specific Julia functions might not. You may need to find GPU-compatible alternatives or implement custom CUDA kernels (an advanced topic).
- Watch for numerical problems (such as NaNs) appearing during training.

This practice session has equipped you with the fundamental skills to accelerate your Flux.jl model training using GPUs. As you tackle larger and more complex deep learning problems, GPU computing will become an indispensable part of your toolkit. Experiment with your own models and datasets to solidify your understanding and observe the performance benefits firsthand.