When your model or dataset demands more resources than a single GPU on your machine can efficiently provide, but you haven't yet reached the scale requiring multiple machines, tf.distribute.MirroredStrategy
offers a straightforward way to leverage all the GPUs available on your single host. It's a synchronous data parallelism strategy, meaning it replicates your entire model on each available GPU, processes different portions of the input data on each replica, and ensures all model copies stay synchronized.
The core idea behind MirroredStrategy
is simple: duplicate the model, split the work, and synchronize the updates. Here’s a breakdown of the process:
1. Replication: Inside a MirroredStrategy scope, TensorFlow creates a complete copy (replica) of the model's variables on each specified GPU (by default, all visible GPUs).
2. Data distribution: Each input batch is split across the GPUs, so every replica processes a different slice of the data.
3. Gradient computation: Each GPU computes gradients independently on its slice of the batch.
4. Gradient aggregation: MirroredStrategy typically uses an efficient "all-reduce" algorithm (often implemented using NVIDIA's NCCL library for GPU-to-GPU communication) to sum the gradients across all replicas.
5. Synchronized update: The aggregated gradients are applied to every replica, so all model copies remain identical after each step.
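To make this split-compute-aggregate cycle concrete, here is a minimal, self-contained sketch that uses the strategy's low-level primitives directly. Keras's fit performs the equivalent steps for you, and the toy dataset and step_fn below are purely illustrative.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# A toy dataset batched with the GLOBAL batch size; the strategy splits
# each batch across the replicas.
dataset = tf.data.Dataset.range(8).batch(8)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

def step_fn(local_batch):
    # Runs independently on each replica with its own slice of the batch
    # (a stand-in for computing per-replica gradients).
    return tf.reduce_sum(local_batch)

@tf.function
def distributed_step(batch):
    per_replica_results = strategy.run(step_fn, args=(batch,))
    # All-reduce: combine the per-replica results into a single value.
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_results, axis=None)

for batch in dist_dataset:
    print(distributed_step(batch).numpy())  # 28, however many replicas are in sync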
This synchronous nature ensures consistency but introduces communication overhead during the all-reduce step. The efficiency gain depends on whether the computation time saved by parallel processing outweighs this communication cost.
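If the default communication path is unavailable or slow on your hardware, you can select the all-reduce implementation explicitly through the constructor's cross_device_ops argument. This is optional; the short sketch below only shows that the option exists.
import tensorflow as tf

# Optionally pick the cross-device communication implementation.
# NcclAllReduce is the usual choice on NVIDIA GPUs; HierarchicalCopyAllReduce
# is a fallback when NCCL is unavailable or performs poorly.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)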
Integrating MirroredStrategy
into your TensorFlow code, especially when using Keras, requires minimal changes. The primary mechanism is the strategy.scope()
context manager.
Instantiate the Strategy: Create an instance of MirroredStrategy
. You can optionally specify which devices to use, but by default, it uses all GPUs visible to TensorFlow.
import tensorflow as tf
# Check available GPUs
gpus = tf.config.list_physical_devices('GPU')
print(f"Num GPUs Available: {len(gpus)}")
# Instantiate the strategy (uses all available GPUs by default)
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices in Strategy: {strategy.num_replicas_in_sync}")
Use the Strategy Scope: The most significant change is to place your model building, optimizer instantiation, and model.compile()
calls inside the strategy.scope()
. This tells TensorFlow to manage the variables and operations according to the strategy's rules (i.e., mirror variables across replicas).
# Define batch size GLOBALLY
# The effective batch size processed per step is global_batch_size
# Each replica processes global_batch_size // num_replicas_in_sync
BUFFER_SIZE = 10000
GLOBAL_BATCH_SIZE = 64 * strategy.num_replicas_in_sync # Example: Scale batch size by num GPUs
PER_REPLICA_BATCH_SIZE = GLOBAL_BATCH_SIZE // strategy.num_replicas_in_sync
# Create the dataset (example using MNIST)
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., tf.newaxis] / 255.0
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE)
# IMPORTANT: Build the model, optimizer, and compile within the strategy scope
with strategy.scope():
    # Model Definition (Simple CNN example)
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    # Optimizer Definition
    optimizer = tf.keras.optimizers.Adam()
    # Compile the model
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=optimizer,
                  metrics=['accuracy'])
# Train the model - Keras handles the distribution automatically
print("Starting training with MirroredStrategy...")
model.fit(train_dataset, epochs=5)
print("Training finished.")
Notice that the training loop (model.fit
) itself doesn't need to be inside the scope. Keras's fit
method is strategy-aware and handles the distribution details automatically if the model was compiled within the strategy's scope.
When using tf.data.Dataset
with MirroredStrategy
, TensorFlow automatically handles splitting the batches across the replicas. You define your dataset pipeline as usual, specifying the global batch size (the total number of examples processed per step across all GPUs). TensorFlow divides this global batch by the number of replicas (strategy.num_replicas_in_sync
) to determine the per-replica batch size.
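If you want to see this split directly, you can distribute the dataset yourself and inspect the per-replica shapes. This is just a quick check built on the strategy and train_dataset defined above; model.fit does not require it.
# Inspect how one global batch is divided across the replicas.
dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
images, _ = next(iter(dist_dataset))

# With N replicas, `images` wraps N per-replica tensors, each holding
# roughly GLOBAL_BATCH_SIZE // N examples.
for local_images in strategy.experimental_local_results(images):
    print(local_images.shape)  # (GLOBAL_BATCH_SIZE // num_replicas, 28, 28, 1)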
It's generally recommended to scale your global batch size proportionally to the number of GPUs you are using to keep the workload per GPU consistent and potentially improve training speed and stability, although optimal batch size often requires experimentation.
Memory: MirroredStrategy places a full copy of the variables on each GPU replica, so ensure your model fits within the memory of a single GPU.
When to use it: MirroredStrategy is ideal for accelerating training on a single machine equipped with multiple GPUs. It provides significant speedups for compute-bound workloads with relatively low implementation complexity compared to multi-worker strategies.
By understanding how MirroredStrategy replicates models, distributes data, and synchronizes gradients, you can effectively leverage multiple GPUs on a single machine to train your TensorFlow models faster. Remember to define your model and optimizer within the strategy.scope() and adjust your global batch size appropriately.