Running deep learning models efficiently often requires GPU acceleration, but macOS users were long limited in their options. Unlike Windows and Linux, which support NVIDIA’s CUDA for GPU acceleration, Macs had no direct equivalent for running PyTorch on their built-in GPUs. That changed with Apple’s Metal Performance Shaders (MPS), a framework designed to leverage the GPU for high-performance computing, including machine learning tasks.
Metal is Apple’s low-level graphics and compute API, similar to DirectX and Vulkan but optimized for macOS, iOS, and iPadOS. It provides direct access to the GPU, enabling efficient execution of graphics and general-purpose computing workloads. Metal Performance Shaders (MPS) is a specialized component of Metal that accelerates matrix computations, convolutions, and other key operations used in deep learning. With recent PyTorch releases, you can use MPS to run neural networks and tensor operations directly on a Mac’s M-series chip or AMD GPU.
This guide walks through how to install and configure PyTorch to use Metal on macOS, what performance to expect, and the limitations of this approach.
To install PyTorch with MPS support, run the following:
pip install torch torchvision torchaudio
Verify the installation:
import torch
print(torch.__version__)
print(f"MPS available: {torch.backends.mps.is_available()}")
If torch.backends.mps.is_available() returns True, your GPU is supported. Otherwise, check that your macOS version and PyTorch installation are up to date.
To use Metal acceleration, set the device to "mps" if available:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
This ensures that computations run on the GPU when possible.
Let’s move tensors to the MPS device and perform basic computations:
import torch
# Check device
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")
# Example: Tensor operations
x = torch.rand(3, 3).to(device)
y = torch.rand(3, 3).to(device)
z = x + y
print(z)
This ensures that tensors are allocated to the GPU and that computations are performed using Metal.
PyTorch models and layers also work with the MPS backend. Here’s an example using a simple linear layer:
# Example: Neural network
model = torch.nn.Linear(3, 1).to(device)
input_tensor = torch.rand(1, 3).to(device)
output = model(input_tensor)
print(output)
The .to(device) method moves the model and input tensor to the MPS backend for accelerated computation.
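For completeness, training works the same way once everything lives on the MPS device; the backward pass and optimizer step run on the GPU as well. Here is a minimal sketch with arbitrary layer sizes and random data, just to illustrate the pattern:
# Example: One training step on MPS (toy data, arbitrary sizes)
model = torch.nn.Linear(3, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
inputs = torch.rand(8, 3, device=device)    # batch of 8 samples
targets = torch.rand(8, 1, device=device)
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()      # gradients are computed on the MPS device
optimizer.step()
print(f"Loss: {loss.item():.4f}")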
In my experience, the performance of Metal acceleration on macOS varies with the task. For smaller models and basic operations, MPS can come close to the performance of a consumer NVIDIA desktop GPU, and operations like matrix multiplications and convolutions can be significantly faster than running on the CPU.
However, other workloads may see only marginal speedups. In some cases the CPU performs nearly as well, particularly for tasks that are memory-bound rather than compute-bound. This variability makes it important to benchmark your specific workload to see whether MPS provides a meaningful speedup.
While MPS brings GPU acceleration to macOS, there are some limitations:
Not all PyTorch operations are fully optimized for Metal. Some tensor operations may fall back to the CPU, which can introduce overhead and unexpectedly slow down computations. Advanced models that rely on custom CUDA kernels may not work efficiently on MPS.
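One practical workaround, if your model hits an operator that is not implemented for MPS, is PyTorch’s opt-in CPU fallback. The sketch below shows the usual way to enable it; the environment variable has to be set before torch is imported (or exported in the shell), and the fallback adds CPU-GPU transfer overhead:
# Enable CPU fallback for operators not yet implemented on MPS.
# Must be set before importing torch, e.g.:
#   PYTORCH_ENABLE_MPS_FALLBACK=1 python train.py
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
import torch
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
# Unsupported operators now run on the CPU instead of raising an error.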
MPS may use lower-precision floating-point arithmetic to gain performance. This can sometimes introduce small numerical differences compared to CUDA or CPU execution, which could affect training stability in sensitive models.
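A quick sanity check is to run the same operation on the CPU and on MPS and compare the results within a tolerance; tiny differences are normal, but large ones are worth investigating:
# Example: Compare CPU and MPS results for the same operation
x = torch.rand(256, 256)
cpu_result = x @ x
mps_result = (x.to("mps") @ x.to("mps")).cpu()
# Loose tolerance; small floating-point differences are expected
print(torch.allclose(cpu_result, mps_result, atol=1e-5))
print((cpu_result - mps_result).abs().max())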
Unlike CUDA, MPS does not support training across multiple GPUs. If you have an external AMD GPU connected to your Mac, you cannot distribute training across multiple GPUs as you would with CUDA on Linux.
CUDA provides extensive support for custom operations through custom kernels. In contrast, MPS has limited support for writing and optimizing custom GPU kernels, making it less flexible for researchers who need low-level GPU optimizations.
Gradient computations (backpropagation) on MPS are not always as optimized as in CUDA. This means that while inference can be fast, training models—especially larger ones—may not always see the same level of speedup.
Some users report memory fragmentation issues with MPS, where GPU memory is not efficiently reused. This can cause out-of-memory errors when training larger models, even when memory usage seems relatively low.
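If you hit out-of-memory errors that do not match the size of your tensors, it can help to delete intermediate tensors and ask PyTorch to release its cached MPS memory. Recent PyTorch releases expose torch.mps.empty_cache() for this; the sketch below assumes such a release:
# Example: Releasing cached MPS memory between large allocations
x = torch.rand(4096, 4096, device="mps")
y = x @ x
del x, y
# Return cached, unused blocks to the system (recent PyTorch versions).
# Calling this too frequently can slow things down.
torch.mps.empty_cache()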
To get the most out of MPS, consider the following:
Reducing the precision of tensors can improve performance and reduce memory usage. PyTorch supports torch.float16 or torch.bfloat16 in some operations, which may speed up training.
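As a minimal illustration, you can cast a model and its inputs to float16 and run them on MPS; whether this helps, and which operators support it, depends on your model and PyTorch version:
# Example: Half-precision inference on MPS
# float16 support varies by operator, device, and PyTorch version
model = torch.nn.Linear(3, 1).to(device, dtype=torch.float16)
x = torch.rand(1, 3, device=device, dtype=torch.float16)
with torch.no_grad():
    print(model(x))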
Batching operations together can minimize overhead when running on MPS. Instead of processing single inputs, use batch sizes that maximize GPU utilization.
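As a sketch, compare looping over individual samples with a single call over the whole batch; the batched version launches far fewer GPU kernels (the model here is an arbitrary toy example):
# Example: Single-sample loop vs. one batched call
model = torch.nn.Linear(128, 64).to(device)
data = torch.rand(256, 128, device=device)
# Slow: one small GPU launch per sample
outputs = [model(sample.unsqueeze(0)) for sample in data]
# Faster: a single launch over the whole batch
outputs = model(data)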
Unlike CUDA’s nvidia-smi, macOS does not have a dedicated command-line tool for monitoring MPS usage. However, you can open Activity Monitor’s GPU History window (Window > GPU History) to get a rough idea of GPU utilization during training.
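In addition, recent PyTorch releases expose a couple of counters in torch.mps that give a rough, in-process view of memory use (availability depends on your PyTorch version):
# Example: Querying MPS memory use from PyTorch
x = torch.rand(2048, 2048, device="mps")
y = x @ x
# Bytes held by live tensors, and total bytes the Metal driver
# has allocated for this process.
print(torch.mps.current_allocated_memory())
print(torch.mps.driver_allocated_memory())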
To see how much MPS helps for a given operation, a simple matrix multiplication benchmark is a good starting point. MPS kernels are queued asynchronously, so synchronize before stopping the timer to measure the GPU work accurately:
import time
x = torch.rand(1000, 1000)
# CPU Benchmark
start = time.time()
for _ in range(100):
    y = x @ x
end = time.time()
print(f"CPU time: {end - start:.4f} sec")
# MPS Benchmark
x = x.to("mps")
y = x @ x  # warm-up run so one-time setup cost is not timed
torch.mps.synchronize()
start = time.time()
for _ in range(100):
    y = x @ x
torch.mps.synchronize()  # wait for queued GPU work to finish
end = time.time()
print(f"MPS time: {end - start:.4f} sec")
The speedup you get from MPS depends on the workload. Some models may run efficiently on the GPU, while others may see only minor improvements over CPU execution. Always test different configurations to see what works best for your specific model.
Running PyTorch on macOS with Metal acceleration lets you take advantage of the GPU in Macs with M-series chips and AMD GPUs. By setting the device to "mps", you can offload tensor computations and neural network inference to the GPU.
While MPS can substantially accelerate some tasks, it still has limitations, especially compared to NVIDIA’s CUDA ecosystem. The performance gain depends heavily on the workload, with some models running nearly as fast as on an NVIDIA desktop GPU, while others see only small improvements over CPU execution.