In the previous section, we established that the forward diffusion process is a Markov chain in which we gradually inject noise into our data $x_0$ over $T$ discrete timesteps. A significant aspect of this process is controlling how much noise is added at each step. The amounts are not arbitrary; we follow a carefully defined plan known as the noise schedule.
The noise schedule is a sequence of variance values, denoted $\beta_1, \beta_2, \dots, \beta_T$. Each $\beta_t$ determines the variance of the Gaussian noise added when transitioning from state $x_{t-1}$ to $x_t$. These values are hyperparameters, meaning they are chosen before training the model; they are not learned during the training process itself.
Think of the noise schedule as setting the "intensity" of the noising process at each step. The values of $\beta_t$ are typically chosen to be small and non-decreasing, so that $0 < \beta_1 \le \beta_2 \le \dots \le \beta_T < 1$: early steps perturb the data only slightly, while later steps add progressively more noise.
The original Denoising Diffusion Probabilistic Models (DDPM) paper proposed a linear schedule, where $\beta_t$ increases linearly from a small starting value $\beta_{\text{start}}$ (e.g., $10^{-4}$) to a larger ending value $\beta_{\text{end}}$ (e.g., $0.02$) over $T$ steps (often $T = 1000$).
The formula for a linear schedule is:
$$\beta_t = \beta_{\text{start}} + \frac{t-1}{T-1}\left(\beta_{\text{end}} - \beta_{\text{start}}\right) \quad \text{for } t = 1, \dots, T$$
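As a concrete illustration, here is a minimal NumPy sketch of this formula. The function name `linear_beta_schedule` and its defaults are our own choices for illustration, not part of any particular library:

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return the T linearly spaced variances beta_1, ..., beta_T."""
    # np.linspace includes both endpoints, matching the formula above.
    return np.linspace(beta_start, beta_end, T)

betas = linear_beta_schedule()
print(betas[0], betas[-1])  # 0.0001 0.02
```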
While simple and effective, the linear schedule is not the only option. A popular alternative is the cosine schedule, introduced in the "Improved Denoising Diffusion Probabilistic Models" paper. This schedule changes more slowly near the beginning and end of the process, potentially leading to better performance and preventing the signal from being destroyed too quickly early on.

The cosine schedule is defined using the related quantities $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. It sets $\bar{\alpha}_t$ according to a cosine function and then derives the corresponding $\beta_t$:
$$\bar{\alpha}_t = \frac{f(t)}{f(0)} \quad \text{where} \quad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2$$

Here, $s$ is a small offset (e.g., $s = 0.008$) that prevents $\beta_t$ from being too small near $t = 0$. Once $\bar{\alpha}_t$ is calculated for all $t$, the individual $\beta_t$ values can be recovered:
$$\beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}$$

The choice of schedule impacts how quickly information from the original data $x_0$ is obscured. A schedule that adds noise too aggressively early on might make the reverse process harder to learn. Conversely, adding too little noise overall might not sufficiently transform the data into a simple prior distribution by step $T$.
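The same derivation in code, as a sketch: $\bar{\alpha}_0 = 1$ by construction (since $f(0)/f(0) = 1$), and following the paper, the resulting $\beta_t$ values are clipped to at most $0.999$ to avoid numerical issues near $t = T$:

```python
import numpy as np

def cosine_beta_schedule(T=1000, s=0.008, max_beta=0.999):
    """Derive beta_1, ..., beta_T from the cosine alpha-bar curve."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * (np.pi / 2)) ** 2
    alpha_bar = f / f[0]                        # alpha_bar_0 = 1 by construction
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)        # clip as in the paper
```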
Let's visualize how these two common schedules compare for $T = 1000$ steps, with $\beta_{\text{start}} = 10^{-4}$ and $\beta_{\text{end}} = 0.02$ for the linear schedule, and $s = 0.008$ for the cosine schedule, derived to approximately match the noise level at step $T$.
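A minimal matplotlib sketch that reproduces this comparison, reusing the two schedule functions defined above:

```python
import matplotlib.pyplot as plt

T = 1000
plt.plot(linear_beta_schedule(T), label="linear")
plt.plot(cosine_beta_schedule(T), label="cosine")
plt.xlabel("timestep t")
plt.ylabel("beta_t")
plt.legend()
plt.title("Linear vs. cosine noise schedules")
plt.show()
```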
Variance $\beta_t$ added at each timestep for linear and cosine schedules over $T = 1000$ steps. The cosine schedule adds noise more slowly initially and accelerates toward the end compared to the linear schedule.
Recall from the previous section that each forward process step is defined by the conditional probability $q(x_t \mid x_{t-1})$. This transition adds Gaussian noise with a specific mean and variance:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\right)$$

Here, the noise schedule value $\beta_t$ directly sets the variance of the Gaussian noise added at timestep $t$. The mean is scaled by $\sqrt{1 - \beta_t}$ to ensure the overall variance of the data doesn't explode. Since $\beta_t$ is small, $\sqrt{1 - \beta_t}$ is slightly less than 1, gradually shrinking the contribution of the previous state $x_{t-1}$.
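To make the transition concrete, here is a sketch of a single forward step. The helper `forward_step` is hypothetical, and the loop reuses `linear_beta_schedule` from the earlier sketch:

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))         # a toy 4x4 "image" standing in for x_0
for beta_t in linear_beta_schedule():   # run the full chain x_0 -> x_T
    x = forward_step(x, beta_t, rng)
# After T steps, x is approximately a sample from a standard Gaussian.
```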
In summary, the Gaussian noise schedule $\{\beta_t\}_{t=1}^{T}$ is a sequence of pre-defined hyperparameters controlling the magnitude of noise added at each step of the forward diffusion process. Its design (e.g., linear, cosine) and the range of its values are important choices that influence the trajectory from data to noise and affect the subsequent learning of the reverse process. We will see later how these $\beta_t$ values, and related quantities derived from them, appear in both the training objective and the sampling process.