As reviewed previously, the noise schedule, typically defined by the sequence of variance parameters $\beta_1, \dots, \beta_T$, governs the rate at which noise is added during the forward diffusion process $q(x_t \mid x_{t-1})$. Consequently, it significantly influences the nature of the denoising task that the model $p_\theta(x_{t-1} \mid x_t)$ must learn for the reverse process. While standard schedules like the linear and cosine schedules provide a reasonable starting point and have been used successfully, they possess inherent limitations that can hinder performance, especially for complex datasets or when aiming for state-of-the-art results. Understanding these drawbacks motivates the exploration of more sophisticated scheduling techniques.
Linear Schedule Issues
The linear schedule, where $\beta_t$ increases linearly from a small value $\beta_{\text{start}}$ (e.g., $10^{-4}$) to a larger value $\beta_{\text{end}}$ (e.g., $0.02$) over $T$ timesteps, is perhaps the simplest formulation. Its primary characteristic is a constant increment in added noise variance at each step:
$$\beta_t = \beta_{\text{start}} + \frac{t - 1}{T - 1}\left(\beta_{\text{end}} - \beta_{\text{start}}\right)$$
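As a concrete reference, here is a minimal NumPy sketch of this schedule (the function name and defaults are ours, matching the example values above):

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Variances beta_1, ..., beta_T spaced evenly between beta_start and beta_end."""
    return np.linspace(beta_start, beta_end, T)
```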
However, this simplicity comes at a cost:
- Inefficiency at Extremes: At the beginning of the forward process (small $t$), the added noise $\beta_t$ is very small. This means $x_t$ remains highly correlated with $x_{t-1}$ and $x_0$. While this preserves information, it also implies that many initial steps in the reverse process are dedicated to making very fine adjustments, potentially slowing down effective sampling. Conversely, towards the end of the forward process (large $t$), $\beta_t$ becomes relatively large, causing the signal-to-noise ratio (SNR) to drop rapidly. The distribution $q(x_T \mid x_0)$ effectively approaches a standard Gaussian, discarding almost all information about $x_0$. This makes the initial steps of the reverse process (starting from $t = T$) very challenging, as the model must recover significant structure from near-pure noise.
- SNR Decay: The cumulative noise level, often tracked via $\bar{\alpha}_t = \prod_{i=1}^{t}(1 - \beta_i)$, determines the overall SNR at step $t$. A linear schedule leads to a specific pattern of $\bar{\alpha}_t$ decay that might not be optimal: the SNR decreases very slowly at first and then collapses rapidly towards the end (the sketch after this list makes this concrete). This mismatch between the schedule and the actual difficulty of denoising at different noise levels can affect sample quality.
- Suboptimal for High-Resolution Images: For high-resolution images, fine details are often represented by high-frequency components. The linear schedule's rapid information destruction at later timesteps can make reconstructing these fine details difficult during the reverse process, sometimes leading to slightly blurry or less detailed results compared to schedules that manage the SNR decay more carefully.
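To make the decay pattern described above concrete, the following self-contained sketch tabulates $\bar{\alpha}_t$ and the log-SNR for a linear schedule (the checkpoints and print format are ours):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                  # linear schedule
alpha_bars = np.cumprod(1.0 - betas)                # cumulative signal level ᾱ_t
log_snr = np.log(alpha_bars / (1.0 - alpha_bars))   # log SNR(t) = log(ᾱ_t / (1 - ᾱ_t))

# ᾱ_t stays near 1 for small t, then collapses towards zero well before t = T.
for t in (1, 250, 500, 750, 1000):
    print(f"t={t:4d}  ᾱ_t={alpha_bars[t - 1]:.4f}  log-SNR={log_snr[t - 1]:+7.2f}")
```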
Cosine Schedule Issues
The cosine schedule was proposed to address some limitations of the linear schedule, particularly the rapid SNR drop near $t = T$. It defines the cumulative product $\bar{\alpha}_t$ directly using a cosine function, aiming for a smoother transition across timesteps:
$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad \text{where } f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$$
Here, $s$ is a small offset (e.g., $0.008$) to prevent $\beta_t$ from being too small near $t = 0$. From $\bar{\alpha}_t$, the individual variances are derived as $\beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}$.
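A minimal NumPy sketch of this construction (the function name is ours; the clip at $0.999$ follows the original improved-DDPM implementation, preventing $\beta_t \to 1$ at $t = T$):

```python
import numpy as np

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas derived from the cosine ᾱ_t schedule of Nichol & Dhariwal (2021)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                          # normalize so that ᾱ_0 = 1
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]  # β_t = 1 - ᾱ_t / ᾱ_{t-1}
    return np.clip(betas, 0.0, max_beta)          # cap β_T, which would otherwise hit 1
```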
This schedule generally yields better results than the linear schedule and has become a popular default. However, it's not without its own potential drawbacks:
- Still Heuristic: While motivated by improving SNR characteristics, the cosine schedule is still a fixed, pre-defined function that does not adapt to the specific properties of the data being modeled. It is possible, and often likely, that the optimal way to add noise depends on the dataset's intrinsic structure, which the cosine schedule cannot account for.
- Slow Start: The offset $s$ prevents $\beta_1$ from being zero, but the noise added at the very beginning ($t \ll T$) is still quite small compared to later steps. Similar to the linear schedule, this might mean that early reverse steps contribute less significantly to the overall generation quality, potentially requiring more steps than strictly necessary if noise were managed differently.
- Dependence on T: The shape of the cosine schedule is intrinsically tied to the total number of timesteps $T$. Because $f$ depends only on the fraction $t/T$, reducing $T$ significantly (e.g., for faster sampling) leaves the overall $\bar{\alpha}$ curve unchanged but forces each individual step to inject far more noise, altering the per-step noise injection profile in ways that can degrade performance compared to the regime the schedule was tuned for (often $T = 1000$ or $T = 4000$). The short sketch below illustrates this.
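A quick standalone demonstration of this $T$-dependence (the helper duplicates the cosine construction above so the snippet runs on its own):

```python
import numpy as np

def cosine_betas(T, s=0.008):
    # Same construction as cosine_beta_schedule above, inlined for a standalone demo.
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return np.clip(1.0 - f[1:] / f[:-1], 0.0, 0.999)

for T in (1000, 50):
    print(f"T={T:4d}  β at t=T/2: {cosine_betas(T)[T // 2]:.4f}")

# With 20x fewer steps, the mid-process β_t is roughly 20x larger: each step must
# inject far more noise, even though ᾱ at a given fraction t/T is unchanged.
```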
Visualizing Schedule Differences
The differences in how these schedules affect the signal level over time can be visualized by plotting $\bar{\alpha}_t$. A lower $\bar{\alpha}_t$ indicates more noise has been added, corrupting the original signal $x_0$.
Figure: Comparison of $\bar{\alpha}_t$ for linear and cosine schedules over $T = 1000$ timesteps. The cosine schedule preserves a higher signal level through the middle of the process and decays sharply only near the end, whereas the linear schedule drives $\bar{\alpha}_t$ towards zero considerably earlier. Note that the exact curves depend on $\beta_{\text{start}}$, $\beta_{\text{end}}$, and $s$.
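Such a plot can be reproduced in a few lines of NumPy and Matplotlib (a standalone sketch using the same parameter values as above):

```python
import numpy as np
import matplotlib.pyplot as plt

T, s = 1000, 0.008
t = np.arange(1, T + 1)

# Linear schedule: ᾱ_t via the cumulative product of (1 - β_t).
ab_linear = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

# Cosine schedule: ᾱ_t defined directly as f(t) / f(0).
f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
ab_cosine = f(t) / f(0)

plt.plot(t, ab_linear, label="linear")
plt.plot(t, ab_cosine, label="cosine")
plt.xlabel("timestep $t$")
plt.ylabel(r"$\bar{\alpha}_t$")
plt.legend()
plt.show()
```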
Consequences and Motivation for Advancement
The limitations of these fixed, standard schedules often manifest as:
- Suboptimal Sample Quality: Generated samples might lack fine details, appear slightly blurry, or fail to capture the full diversity of the training data.
- Inefficient Sampling: Achieving high quality may require a large number of sampling steps ($T$), increasing computational cost and latency during inference. Reducing $T$ with standard schedules often leads to a noticeable drop in quality.
- Training Challenges: In some cases, suboptimal schedules can contribute to training instability or slower convergence.
These shortcomings highlight the need for noise scheduling strategies that are more carefully designed or even learned from data. By tailoring the noise schedule, we can potentially improve sample fidelity, accelerate sampling, and enhance training stability. This motivates the techniques discussed in the following sections, such as designing custom schedules based on SNR analysis or implementing models that learn the variance schedule itself.