When your model or dataset demands more resources than a single GPU on your machine can efficiently provide, but you haven't yet reached the scale requiring multiple machines, tf.distribute.MirroredStrategy
offers a straightforward way to leverage all the GPUs available on your single host. It's a synchronous data parallelism strategy, meaning it replicates your entire model on each available GPU, processes different portions of the input data on each replica, and ensures all model copies stay synchronized.
The core idea behind MirroredStrategy
is simple: duplicate the model, split the work, and synchronize the updates. Here’s a breakdown of the process:
1. Replication: Inside a MirroredStrategy scope, TensorFlow creates a complete copy (replica) of the model's variables on each specified GPU (by default, all visible GPUs).
2. Data distribution: Each input batch is split across the GPUs, so every replica processes a different slice of the data.
3. Gradient computation: Each GPU computes gradients independently on its slice of the batch.
4. Gradient aggregation: MirroredStrategy typically uses an efficient "all-reduce" algorithm (often implemented using NVIDIA's NCCL library for GPU-to-GPU communication) to sum the gradients across all replicas.
5. Synchronized update: The aggregated gradients are applied to every replica, so all model copies remain identical after each step.
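To make this split-compute-aggregate cycle concrete, here is a minimal, self-contained sketch that uses the strategy's low-level primitives directly. Keras's fit performs the equivalent steps for you, and the toy dataset and step_fn below are purely illustrative.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# A toy dataset batched with the GLOBAL batch size; the strategy splits
# each batch across the replicas.
dataset = tf.data.Dataset.range(8).batch(8)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

def step_fn(local_batch):
    # Runs independently on each replica with its own slice of the batch
    # (a stand-in for computing per-replica gradients).
    return tf.reduce_sum(local_batch)

@tf.function
def distributed_step(batch):
    per_replica_results = strategy.run(step_fn, args=(batch,))
    # All-reduce: combine the per-replica results into a single value.
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_results, axis=None)

for batch in dist_dataset:
    print(distributed_step(batch).numpy())  # 28, however many replicas are in sync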
This synchronous nature ensures consistency but introduces communication overhead during the all-reduce step. The efficiency gain depends on whether the computation time saved by parallel processing outweighs this communication cost.
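If the default communication path is unavailable or slow on your hardware, you can select the all-reduce implementation explicitly through the constructor's cross_device_ops argument. This is optional; the short sketch below only shows that the option exists.
import tensorflow as tf

# Optionally pick the cross-device communication implementation.
# NcclAllReduce is the usual choice on NVIDIA GPUs; HierarchicalCopyAllReduce
# is a fallback when NCCL is unavailable or performs poorly.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)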
Integrating MirroredStrategy
into your TensorFlow code, especially when using Keras, requires minimal changes. The primary mechanism is the strategy.scope()
context manager.
Instantiate the Strategy: Create an instance of MirroredStrategy
. You can optionally specify which devices to use, but by default, it uses all GPUs visible to TensorFlow.
import tensorflow as tf
# Check available GPUs
gpus = tf.config.list_physical_devices('GPU')
print(f"Num GPUs Available: {len(gpus)}")
# Instantiate the strategy (uses all available GPUs by default)
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices in Strategy: {strategy.num_replicas_in_sync}")
Use the Strategy Scope: The most significant change is to place your model building, optimizer instantiation, and model.compile()
calls inside the strategy.scope()
. This tells TensorFlow to manage the variables and operations according to the strategy's rules (i.e., mirror variables across replicas).
# Define batch size GLOBALLY
# The effective batch size processed per step is global_batch_size
# Each replica processes global_batch_size // num_replicas_in_sync
BUFFER_SIZE = 10000
GLOBAL_BATCH_SIZE = 64 * strategy.num_replicas_in_sync # Example: Scale batch size by num GPUs
PER_REPLICA_BATCH_SIZE = GLOBAL_BATCH_SIZE // strategy.num_replicas_in_sync
# Create the dataset (example using MNIST)
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., tf.newaxis] / 255.0
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE)
# IMPORTANT: Build the model, optimizer, and compile within the strategy scope
with strategy.scope():
    # Model Definition (Simple CNN example)
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    # Optimizer Definition
    optimizer = tf.keras.optimizers.Adam()
    # Compile the model
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=optimizer,
                  metrics=['accuracy'])
# Train the model - Keras handles the distribution automatically
print("Starting training with MirroredStrategy...")
model.fit(train_dataset, epochs=5)
print("Training finished.")
Notice that the training loop (model.fit
) itself doesn't need to be inside the scope. Keras's fit
method is strategy-aware and handles the distribution details automatically if the model was compiled within the strategy's scope.
When using tf.data.Dataset
with MirroredStrategy
, TensorFlow automatically handles splitting the batches across the replicas. You define your dataset pipeline as usual, specifying the global batch size (the total number of examples processed per step across all GPUs). TensorFlow divides this global batch by the number of replicas (strategy.num_replicas_in_sync
) to determine the per-replica batch size.
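If you want to see this split directly, you can distribute the dataset yourself and inspect the per-replica shapes. This is just a quick check built on the strategy and train_dataset defined above; model.fit does not require it.
# Inspect how one global batch is divided across the replicas.
dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
images, _ = next(iter(dist_dataset))

# With N replicas, `images` wraps N per-replica tensors, each holding
# roughly GLOBAL_BATCH_SIZE // N examples.
for local_images in strategy.experimental_local_results(images):
    print(local_images.shape)  # (GLOBAL_BATCH_SIZE // num_replicas, 28, 28, 1)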
It's generally recommended to scale your global batch size proportionally to the number of GPUs you are using to keep the workload per GPU consistent and potentially improve training speed and stability, although optimal batch size often requires experimentation.
Memory: MirroredStrategy places a full copy of the variables on each GPU replica, so ensure your model fits within the memory of a single GPU.
When to use it: MirroredStrategy is ideal for accelerating training on a single machine equipped with multiple GPUs. It provides significant speedups for compute-bound workloads with relatively low implementation complexity compared to multi-worker strategies.
By understanding how MirroredStrategy replicates models, distributes data, and synchronizes gradients, you can effectively leverage multiple GPUs on a single machine to train your TensorFlow models faster. Remember to define your model and optimizer within the strategy.scope() and adjust your global batch size appropriately.