Horovod: Checkpointing, The Linux Foundation, 2024 (The Linux Foundation) - Official guide on implementing checkpointing strategies within Horovod, including best practices for saving and restoring model and optimizer states in a distributed setting.
torch.distributed.checkpoint documentation, PyTorch Authors, 2022 (PyTorch Foundation) - Official documentation detailing the API for saving and loading sharded model and optimizer states in distributed PyTorch, including models wrapped with FSDP (Fully Sharded Data Parallel).