This hands-on practice adapts a neural network training process from running solely on the CPU to employing the capabilities of an NVIDIA GPU using CUDA.jl and Flux.jl. The aim is to take a simple image classification task, train a model on the CPU, and then modify the script to accelerate training on a GPU, observing the performance difference.
For this exercise, you'll need:
- A working Julia installation with the Flux.jl, CUDA.jl, and MLDatasets.jl packages.
- An NVIDIA GPU with a recent CUDA driver for the GPU portion (the CPU portion runs anywhere).
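If any of these packages are missing, a one-time setup along the following lines installs them from the General registry (a minimal sketch; the versions you get depend on your project environment):
using Pkg
Pkg.add(["Flux", "CUDA", "MLDatasets"])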
Let's begin with a script that defines a simple Convolutional Neural Network (CNN), prepares the MNIST dataset, and trains the model on the CPU.
using Flux, MLDatasets, Statistics, Random
using Flux: onehotbatch, onecold, glorot_uniform
using Printf
# Set a seed for reproducibility
Random.seed!(123)
# 1. Load MNIST Data
println("Loading MNIST dataset...")
# Full dataset
# imgs_train_raw, labels_train_raw = MNIST.traindata();
# imgs_test_raw, labels_test_raw = MNIST.testdata();
# For faster demonstration, let's use a subset
train_n = 5000 # Number of training samples
test_n = 1000 # Number of test samples
imgs_train_raw, labels_train_raw = MNIST.traindata(1:train_n);
imgs_test_raw, labels_test_raw = MNIST.testdata(1:test_n);
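# Note: the calls above use the older MLDatasets API (0.5/0.6). On MLDatasets >= 0.7
# the equivalent is, e.g., MNIST(split=:train)[1:train_n], which returns a
# (features, targets) named tuple; adjust the destructuring for your installed version.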
# 2. Preprocess Data
println("Preprocessing data...")
# Reshape for CNN (WHCN format: Width, Height, Channels, N_samples)
# Convert to Float32; MNIST pixels load as fixed-point values, so the result is already in [0, 1]
preprocess_images(imgs) = reshape(Float32.(imgs), 28, 28, 1, :)
X_train = preprocess_images(imgs_train_raw);
X_test = preprocess_images(imgs_test_raw);
# One-hot encode labels
Y_train = onehotbatch(labels_train_raw, 0:9);
Y_test = onehotbatch(labels_test_raw, 0:9);
# 3. Define the CNN Model (CPU)
println("Defining CPU model...")
model_cpu = Chain(
    Conv((3, 3), 1=>16, relu, init=glorot_uniform),  # Output: 26x26x16
    MaxPool((2,2)),                                  # Output: 13x13x16
    Conv((3, 3), 16=>32, relu, init=glorot_uniform), # Output: 11x11x32
    MaxPool((2,2)),                                  # Output: 5x5x32
    Flux.flatten,                                    # Output: 800
    Dense(5*5*32, 128, relu, init=glorot_uniform),
    Dense(128, 10, init=glorot_uniform)              # Raw scores (logits);
    # softmax is often applied inside the loss, or afterwards for probabilities
)
# 4. Define Loss and Optimizer (CPU)
loss_cpu(x, y) = Flux.logitcrossentropy(model_cpu(x), y) # Use logitcrossentropy for numerical stability
opt_cpu = ADAM(0.001)
# 5. Prepare Minibatches (CPU)
batch_size = 128
# The data is already on the CPU
train_loader_cpu = Flux.DataLoader((X_train, Y_train), batchsize=batch_size, shuffle=true)
# 6. CPU Training Function
function train_epoch_cpu!(model, loader, opt, loss_fn)
    ps = Flux.params(model)
    for (x_batch, y_batch) in loader
        # x_batch and y_batch are already CPU arrays
        gs = gradient(() -> loss_fn(x_batch, y_batch), ps)
        Flux.update!(opt, ps, gs)
    end
end
# 7. Train and Time on CPU
println("Starting CPU training for one epoch...")
# For more accurate benchmarking, consider BenchmarkTools.jl's @btime
# Here, @time gives a simple timing.
# Warm-up (optional, but good for more stable @time results)
# train_epoch_cpu!(model_cpu, first(train_loader_cpu,1) , opt_cpu, loss_cpu)
cpu_training_time = @elapsed train_epoch_cpu!(model_cpu, train_loader_cpu, opt_cpu, loss_cpu)
@printf "CPU training epoch finished in %.2fs.\n" cpu_training_time
# Quick accuracy check on CPU
accuracy(x, y, model) = mean(onecold(model(x)) .== onecold(y))
acc_cpu = accuracy(X_test, Y_test, model_cpu)
@printf "CPU model accuracy after one epoch: %.2f%%\n" acc_cpu*100
This script sets up a standard training pipeline. The model, data, and computations all reside on the CPU. Note the use of Flux.DataLoader for convenient batching and shuffling.
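To confirm that the loader yields batches in the expected WHCN layout, you can inspect the first batch (a quick sanity check, not part of the pipeline; the sizes shown assume the settings above):
x_batch, y_batch = first(train_loader_cpu)
@show size(x_batch) # (28, 28, 1, 128): Width, Height, Channels, batch size
@show size(y_batch) # (10, 128): one-hot classes by batch size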
The gpu Functor
Now, let's bring CUDA.jl into the picture. The primary tool Flux provides for GPU acceleration is the gpu functor, which moves models and data to the active GPU. Conversely, the cpu functor moves them back to the CPU.
First, ensure CUDA is available and functional:
using CUDA
if !CUDA.functional()
    println("CUDA is not available or not functional on this system. GPU practice will be skipped.")
    # You might want to exit here, or guard the GPU-specific code paths below.
else
    println("CUDA is functional. GPU acceleration is available.")
    # Proceed with GPU-specific code
end
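Optionally, you can also report which device was selected; these are standard CUDA.jl queries:
if CUDA.functional()
    dev = CUDA.device()                    # the currently active CUDA device
    println("Active GPU: ", CUDA.name(dev)) # human-readable device name
end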
Assuming CUDA.functional() returns true:
Moving the Model to GPU:
To run our model on the GPU, we transfer its parameters and structure using gpu.
if CUDA.functional()
    # This creates a new model structure with parameters on the GPU
    model_gpu = gpu(model_cpu)
    # Equivalently: model_gpu = model_cpu |> gpu
    println("Model transferred to GPU.")
end
It's important to understand that model_gpu is now a new model instance. Its parameters (weights and biases) reside in GPU memory. Operations on model_gpu with GPU data will be executed on the GPU.
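You can verify the transfer by inspecting a parameter's array type, for example:
if CUDA.functional()
    W = first(Flux.params(model_gpu)) # first parameter array (a conv kernel)
    println(typeof(W))                # a CuArray type, e.g. CuArray{Float32, 4, ...}
    @assert W isa CUDA.CuArray
end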
Moving Data to GPU:
Similarly, input data (features x and labels y) must be on the GPU before being fed to the model_gpu. This is typically done batch by batch within the training loop.
# Example: moving a single batch (x_batch_cpu, y_batch_cpu from CPU loader)
# x_batch_gpu = gpu(x_batch_cpu)
# y_batch_gpu = gpu(y_batch_cpu)
With the model on the GPU, the training loop needs slight modification to ensure data batches are also moved to the GPU before each forward and backward pass.
if CUDA.functional()
    # 1. Model is already on GPU: model_gpu
    # 2. Define loss and optimizer for the GPU model
    # The loss function now uses model_gpu
    loss_gpu(x, y) = Flux.logitcrossentropy(model_gpu(x), y)
    opt_gpu = ADAM(0.001) # Optimizer for the GPU model parameters
    # 3. Data loading for the GPU
    # Flux.DataLoader could move data via a custom collate step, but for
    # simplicity we move each batch inside the training loop instead.
    # train_loader_cpu still yields CPU-based batches.
    # 4. GPU training function
    function train_epoch_gpu!(model, loader, opt, loss_fn)
        ps = Flux.params(model) # Parameters are already on the GPU
        for (x_batch_cpu, y_batch_cpu) in loader
            # Move the current batch to the GPU
            x_batch_gpu = gpu(x_batch_cpu)
            y_batch_gpu = gpu(y_batch_cpu)
            # Compute gradients on the GPU
            gs = gradient(() -> loss_fn(x_batch_gpu, y_batch_gpu), ps)
            Flux.update!(opt, ps, gs)
            # CUDA.synchronize() # Uncomment for precise per-step timing / debugging;
            # usually not needed for correctness of the training loop
        end
    end
println("Starting GPU training for one epoch...")
# Re-initialize model_gpu from model_cpu if you want a fair comparison from scratch state
# model_gpu = gpu(deepcopy(model_cpu)) # Or re-define model_cpu then gpu(model_cpu)
# To compare speed, ideally, we'd reset the model_cpu's weights or use a fresh copy
# For this script, we'll just train a new model_gpu created from original model_cpu structure.
# Let's re-create a fresh model structure for GPU to avoid using the already trained model_cpu parameters
    model_gpu_fresh = Chain(
        Conv((3, 3), 1=>16, relu, init=glorot_uniform),
        MaxPool((2,2)),
        Conv((3, 3), 16=>32, relu, init=glorot_uniform),
        MaxPool((2,2)),
        Flux.flatten,
        Dense(5*5*32, 128, relu, init=glorot_uniform),
        Dense(128, 10, init=glorot_uniform)
    ) |> gpu # Create and move to GPU
    loss_gpu_fresh(x, y) = Flux.logitcrossentropy(model_gpu_fresh(x), y)
    opt_gpu_fresh = ADAM(0.001)
    # Warm-up (compiles GPU kernels so the timed epoch is more representative)
    # train_epoch_gpu!(model_gpu_fresh, first(train_loader_cpu, 1), opt_gpu_fresh, loss_gpu_fresh)
    gpu_training_time = @elapsed train_epoch_gpu!(model_gpu_fresh, train_loader_cpu, opt_gpu_fresh, loss_gpu_fresh)
    @printf "GPU training epoch finished in %.2fs.\n" gpu_training_time
    # Quick accuracy check on the GPU model; test data must also be on the GPU
    X_test_gpu = gpu(X_test)
    Y_test_gpu = gpu(Y_test)
    # The accuracy function defined earlier works here: model(X_test_gpu) returns
    # a CuArray, and onecold handles CuArrays in recent Flux versions.
    acc_gpu = accuracy(X_test_gpu, Y_test_gpu, model_gpu_fresh)
    @printf "GPU model accuracy after one epoch: %.2f%%\n" acc_gpu*100
    # Moving results back to the CPU (example), e.g. to inspect predictions:
    # sample_preds_gpu = model_gpu_fresh(X_test_gpu[:, :, :, 1:5]) # Logits for the first 5 test images
    # sample_preds_cpu = cpu(sample_preds_gpu)
    # println("Sample predictions (logits) from GPU model, moved to CPU:")
    # display(sample_preds_cpu)
end
The main change is x_batch_gpu = gpu(x_batch_cpu) and y_batch_gpu = gpu(y_batch_cpu) inside the loop. Flux's optimizers and loss functions are generally designed to work transparently with CuArrays (CUDA arrays) produced by gpu().
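As an alternative to calling gpu on each batch by hand, CUDA.jl provides CuIterator, which wraps an existing batch iterator, uploads each batch to the GPU, and frees it eagerly after use. A minimal sketch using the names defined above (it assumes each batch element, including the one-hot labels, can be adapted to a GPU array, which holds for recent Flux versions; if not, convert the labels to dense arrays first):
if CUDA.functional()
    ps = Flux.params(model_gpu_fresh)
    # CuIterator moves each (x, y) batch to the GPU as it is consumed
    for (x_batch_gpu, y_batch_gpu) in CUDA.CuIterator(train_loader_cpu)
        gs = gradient(() -> loss_gpu_fresh(x_batch_gpu, y_batch_gpu), ps)
        Flux.update!(opt_gpu_fresh, ps, gs)
    end
end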
After running both the CPU and GPU training scripts (ensure CUDA.functional() is true for the GPU part), compare the reported times for one epoch.
Illustrative training times for one epoch on the MNIST task with a simple CNN: the GPU epoch completes in a fraction of the CPU time. Actual speed-up depends on the specific GPU, CPU, model complexity, and batch size.
You should observe a significant speed-up with the GPU, especially as model complexity and data size increase. For very small models or tiny batches, the overhead of transferring data to the GPU might diminish the gains, but for most deep learning workloads, GPU acceleration is substantial.
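One caveat when timing: GPU kernels launch asynchronously, so wrapping the timed call in CUDA.@sync ensures @elapsed measures completed work rather than just kernel launches. For example, reusing the names defined above:
if CUDA.functional()
    # CUDA.@sync blocks until all GPU work started by the expression has finished
    t = @elapsed CUDA.@sync train_epoch_gpu!(model_gpu_fresh, train_loader_cpu, opt_gpu_fresh, loss_gpu_fresh)
    @printf "Synchronized GPU epoch time: %.2fs\n" t
end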
Working with GPUs introduces a few more things to keep in mind:
- Memory management: GPU memory is limited. Use CUDA.memory_status() to check available GPU memory. If you hit an OutOfMemoryError, common remedies include reducing the batch_size, simplifying your model, or using techniques like gradient accumulation (a more advanced topic). Calling GC.gc() and then CUDA.reclaim() can sometimes free cached memory, but good code design minimizes reliance on these; see the sketch after this list.
- Synchronization: GPU operations run asynchronously. CUDA.synchronize() waits for all pending GPU tasks to complete, which is critical for accurate micro-benchmarking of specific GPU kernels. Flux's gradient computations and data transfers via cpu() implicitly synchronize when the result is needed.
- Data types: keep element types consistent (Float32 is the common choice for deep learning). Flux and CUDA.jl handle this well, but type instabilities can lead to performance degradation or errors.
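The memory-related calls mentioned above look like this in practice:
if CUDA.functional()
    CUDA.memory_status() # prints used/free GPU memory to stdout
    GC.gc()              # release unreferenced Julia objects, including CuArrays
    CUDA.reclaim()       # return cached pool memory to the CUDA driver
end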
If you run into issues:
- Confirm CUDA.functional() is true before executing GPU code paths.
- Check device placement: make sure the model and every data batch end up on the same device via matching gpu/cpu calls; mixing CPU arrays with a GPU model is a frequent source of errors.
- Unsupported operations on CuArray: most Flux layers and common operations support CuArrays, but custom layers or specific Julia functions might not. You may need to find GPU-compatible alternatives or implement custom CUDA kernels (an advanced topic).
- Watch for numerical problems (such as NaNs) appearing during training.

This practice session has equipped you with the fundamental skills to accelerate your Flux.jl model training using GPUs. As you tackle larger and more complex deep learning problems, GPU computing will become an indispensable part of your toolkit. Experiment with your own models and datasets to solidify your understanding and observe the performance benefits firsthand.