Training large machine learning models often confronts limitations in accelerator memory and computation time. Mixed precision training is a widely adopted technique that significantly alleviates these constraints by strategically using lower-precision floating-point numbers (like 16-bit floats) for parts of the computation. This approach can nearly halve memory consumption for activations and gradients and considerably accelerate training on compatible hardware, without substantially compromising model accuracy.
The core idea is to perform the bulk of the computation, particularly the time-consuming matrix multiplications and convolutions in the forward and backward passes, using 16-bit floating-point numbers, while maintaining certain critical components, such as the master copy of the model weights and potentially the optimizer state, in standard 32-bit precision (float32) for numerical stability.
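As a minimal sketch of that split (the shapes and values are made up, and it uses bfloat16, one of the 16-bit formats introduced below), the float32 master copy of a weight can be cast down just for the expensive matmul:

```python
import jax.numpy as jnp

# A float32 "master" weight and some bfloat16 activations (illustrative shapes)
w_f32 = jnp.ones((4, 8), dtype=jnp.float32)
x_bf16 = jnp.ones((2, 4), dtype=jnp.bfloat16)

# Cast the weight down only for the expensive matmul; the float32 copy is untouched
y_bf16 = jnp.dot(x_bf16, w_f32.astype(jnp.bfloat16))
print(y_bf16.dtype)  # bfloat16
```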
JAX supports two primary 16-bit floating-point formats, both available through jax.numpy:

- jnp.float16 (Half Precision): This format adheres to the IEEE 754 standard for 16-bit floats. It uses 1 bit for the sign, 5 bits for the exponent, and 10 bits for the fraction (mantissa). It is fast on hardware with native float16 support (e.g., NVIDIA Tensor Cores) and halves memory usage compared to float32. However, its dynamic range is much narrower than float32, so it is susceptible to numerical underflow (gradients becoming zero) and overflow (gradients becoming infinity or NaN). It requires careful implementation, typically involving loss scaling.
- jnp.bfloat16 (Brain Floating Point): Developed by Google Brain, this format uses 1 bit for the sign, 8 bits for the exponent (same as float32), and 7 bits for the fraction. Its dynamic range matches float32, making it much less prone to overflow and underflow issues. It is often simpler to use than float16 as it typically doesn't require loss scaling, offers the same memory savings, and can provide speedups on compatible hardware (especially TPUs and newer GPUs). Its drawback is lower precision (fewer fraction bits) than float16; while often sufficient for deep learning, this might affect convergence for highly sensitive models.

Here's a conceptual comparison:
| Feature | float32 (Single) | float16 (Half) | bfloat16 (Brain) |
|---|---|---|---|
| Total Bits | 32 | 16 | 16 |
| Exponent Bits | 8 | 5 | 8 |
| Fraction Bits | 23 | 10 | 7 |
| Dynamic Range | Wide | Narrow | Wide (like float32) |
| Precision | High | Medium | Low |
| Loss Scaling? | No | Usually Required | Usually Not Required |
Given its wider dynamic range and simpler usability, bfloat16 is often the preferred choice for mixed precision when supported by the hardware (common on TPUs and recent NVIDIA GPUs, Ampere and later). If only float16 is efficiently supported, careful implementation with loss scaling is necessary.
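You can inspect these trade-offs directly with jnp.finfo, which reports the largest representable value (dynamic range) and machine epsilon (precision) for each dtype; a small sketch:

```python
import jax.numpy as jnp

for dtype in (jnp.float32, jnp.float16, jnp.bfloat16):
    info = jnp.finfo(dtype)
    # max: largest finite value (dynamic range)
    # eps: gap between 1.0 and the next representable number (precision)
    print(f"{str(info.dtype):>8}: max={float(info.max):.3e}  eps={float(info.eps):.3e}")
```

float16's maximum of roughly 6.5e4, versus roughly 3.4e38 for both float32 and bfloat16, is exactly the narrow dynamic range that loss scaling works around.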
The standard strategy involves maintaining a master copy of the model parameters in float32 while performing most computations using float16 or bfloat16. High-level neural network libraries built on JAX, such as Flax or Haiku, often provide convenient abstractions to manage this.
Flax, for instance, allows you to specify different data types for parameters (param_dtype) and computations (dtype). A common setup for bfloat16 mixed precision involves:

- Keeping the master weights in full precision by setting param_dtype=jnp.float32.
- Calling the model's apply method with inputs cast to bfloat16 and specifying dtype=jnp.bfloat16 for intermediate computations.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
# Assume model is defined using Flax
class SimpleDense(nn.Module):
    features: int
    param_dtype: jnp.dtype = jnp.float32  # Master weights in float32
    dtype: jnp.dtype = jnp.bfloat16       # Compute in bfloat16

    @nn.compact
    def __call__(self, x):
        # Input x is expected to be bfloat16, or is cast here
        x = x.astype(self.dtype)
        # The kernel is stored in float32; Flax casts it to the compute dtype
        # (bfloat16) for the matmul, so the result is bfloat16
        y = nn.Dense(features=self.features,
                     param_dtype=self.param_dtype,
                     dtype=self.dtype,  # Explicitly set the compute dtype
                     name='dense_layer')(x)
        # Perform subsequent ops in bfloat16
        return nn.relu(y)
# --- Initialization ---
key = jax.random.PRNGKey(0)
input_shape = (1, 10)
dummy_input = jnp.zeros(input_shape, dtype=jnp.bfloat16) # Input data type
model = SimpleDense(features=5)
# Initialize parameters in float32
params = model.init(key, dummy_input)['params']
# --- Forward pass ---
# Input should be cast to the compute dtype before calling apply
output = model.apply({'params': params}, dummy_input)
print(f"Input dtype: {dummy_input.dtype}")
print(f"Parameter dtype (example): {jax.tree.leaves(params)[0].dtype}")
print(f"Output dtype: {output.dtype}")
# Expected Output:
# Input dtype: bfloat16
# Parameter dtype (example): float32
# Output dtype: bfloat16
```

A conceptual example showing how Flax modules can handle different parameter and computation dtypes for bfloat16 mixed precision.
The library handles the details of casting inputs and ensuring computations use the specified dtype, while parameters remain in float32. Gradients computed will typically be in the computation dtype (bfloat16 in this case). The optimizer then uses these potentially lower-precision gradients to update the float32 master parameters.
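As a sketch of how that update step fits together with Optax (an assumption here; the section does not prescribe an optimizer), reusing model, params, and dummy_input from the example above together with a made-up squared-error loss and zero targets:

```python
import jax
import jax.numpy as jnp
import optax

# Hypothetical optimizer choice; any Optax optimizer works the same way here
tx = optax.adam(learning_rate=1e-3)
opt_state = tx.init(params)  # params: float32 master weights

def mse_loss(params, batch_bf16, targets_f32):
    preds = model.apply({'params': params}, batch_bf16)  # bfloat16 compute
    # Reduce the loss in float32 for a numerically stable accumulation
    return jnp.mean((preds.astype(jnp.float32) - targets_f32) ** 2)

grads = jax.grad(mse_loss)(params, dummy_input, jnp.zeros((1, 5)))

# Ensure gradients are float32 (a no-op if they already are) so the optimizer
# state and the update itself stay in full precision
grads = jax.tree.map(lambda g: g.astype(jnp.float32), grads)
updates, opt_state = tx.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)  # float32 master weights stay float32
```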
If not using a high-level library, you would need to manage the casting manually:
```python
# Conceptual example without libraries
import jax
import jax.numpy as jnp

def predict(params_f32, inputs_bf16):
    # Assume params_f32 is a list of (W, b) tuples holding float32 parameters
    # Assume inputs_bf16 are already bfloat16
    activations = inputs_bf16
    for W_f32, b_f32 in params_f32:  # Iterate through layers
        # Cast weights to bfloat16 for the computation
        W_bf16 = W_f32.astype(jnp.bfloat16)
        b_bf16 = b_f32.astype(jnp.bfloat16)
        # Perform the computation in bfloat16
        outputs = jnp.dot(activations, W_bf16) + b_bf16
        activations = jax.nn.relu(outputs)
    return activations  # Output is bfloat16

# During gradient computation and update:
# 1. Compute gradients (will likely be bfloat16)
# 2. Optimizer updates the float32 master parameters using these gradients
```
This requires more careful handling to ensure types are managed correctly throughout the model and training loop.
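Continuing that conceptual example, a hand-rolled update step might look like the following sketch; the squared-error loss, the targets argument, and the learning rate are illustrative assumptions, not part of the example above:

```python
import jax
import jax.numpy as jnp

LEARNING_RATE = 1e-3  # illustrative value

def loss_fn(params_f32, inputs_bf16, targets_f32):
    preds_bf16 = predict(params_f32, inputs_bf16)
    # Reduce the loss in float32 for a stable accumulation
    return jnp.mean((preds_bf16.astype(jnp.float32) - targets_f32) ** 2)

@jax.jit
def sgd_step(params_f32, inputs_bf16, targets_f32):
    grads = jax.grad(loss_fn)(params_f32, inputs_bf16, targets_f32)
    # Cast each gradient to float32 (a no-op if it already is) and apply a
    # plain SGD update to the float32 master weights
    return jax.tree.map(
        lambda p, g: p - LEARNING_RATE * g.astype(jnp.float32),
        params_f32, grads)
```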
When using jnp.float16, its limited dynamic range often causes gradients with small magnitudes to become zero (underflow). To prevent this, loss scaling is applied:

1. Scale the loss: multiply the loss by a factor S before differentiation. By the chain rule, ∇θ(S · loss) = S · ∇θ(loss), so every gradient is scaled up by S, shifting small values back into the representable range of float16.
2. Unscale the gradients: before updating the float32 master weights, divide the computed gradients by the same scaling factor S.
3. Update: apply the unscaled gradients to the float32 master weights.

The scaling factor S can be chosen statically or adjusted dynamically. Dynamic loss scaling involves starting with a large S and reducing it if overflow (NaN or Inf gradients) is detected during training, while potentially increasing it if gradients remain stable for a period. This adds complexity but can help find the optimal scale. Libraries like Flax and Optax often provide utilities for managing loss scaling.
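A minimal sketch of static loss scaling following these steps, assuming a scalar loss_fn like the one in the previous sketch (but running its heavy computations in float16) and a fixed, illustrative scale:

```python
import jax
import jax.numpy as jnp

LOSS_SCALE = 2.0 ** 15  # static scale S; a power of two avoids rounding error

def scaled_loss_fn(params_f32, inputs_f16, targets_f32):
    # Scale the scalar loss by S before differentiation
    return LOSS_SCALE * loss_fn(params_f32, inputs_f16, targets_f32)

def compute_unscaled_grads(params_f32, inputs_f16, targets_f32):
    scaled_grads = jax.grad(scaled_loss_fn)(params_f32, inputs_f16, targets_f32)
    # Unscale in float32 to recover the true gradients before they reach the
    # optimizer that updates the float32 master weights
    grads = jax.tree.map(
        lambda g: g.astype(jnp.float32) / LOSS_SCALE, scaled_grads)
    # Flag overflow (NaN or Inf anywhere in the gradients)
    grads_ok = jnp.all(jnp.asarray(
        [jnp.all(jnp.isfinite(g)) for g in jax.tree.leaves(grads)]))
    return grads, grads_ok
```

The grads_ok flag is where a dynamic scheme would hook in: skip the optimizer step and shrink S on overflow, and grow S again after a stretch of stable steps.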
A few practical points to keep in mind:

- Numerical stability: bfloat16 is generally robust, while float16 requires careful implementation and loss scaling. Ensure critical operations like variance calculations in normalization layers or the final loss computation remain in float32 if necessary.
- Accuracy: results should closely match full float32 training, but minor differences can occur. It's good practice to validate the final model's performance.
- Debugging: checking intermediate values for NaN or Inf and selectively moving suspect computations back to float32 (especially when using float16) are common debugging steps.

Mixed precision training is a powerful tool in the large-scale modeling toolkit. By judiciously combining float32 for stability with bfloat16 or float16 for memory and speed, you can train larger, more capable models more efficiently using JAX and its ecosystem.