Not All Prediction Targets Keep Training-Free Diffusion Guidance on the Manifold

Abstract

Training-free guidance (TFG) steers a pretrained diffusion model toward a desired attribute at inference. To be effective, this guidance must be applied from the earliest, high-noise steps of sampling. Because its objective (a classifier or energy) is defined on clean images, ε- and v-prediction models must first estimate the clean image x̂ from the noisy state at each step, and the accuracy of that estimate determines how easily guidance drifts off the data manifold. x-prediction, a recent alternative, outputs the clean image directly, removing this source of error even at high noise. Motivated by this difference, we provide a theoretical analysis of how each prediction target shapes the accuracy of x̂, and introduce guided-class FID (Child FID), a metric that exposes the manifold damage that standard evaluation misses. Experiments on a new fine-grained bird benchmark and on style transfer confirm that x-prediction keeps guided samples on the manifold most reliably, making it the strongest foundation for training-free guidance.
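
The clean-image estimate mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a variance-preserving forward process x_t = α_t·x_0 + σ_t·ε with α_t² + σ_t² = 1 and the common v-target v = α_t·ε − σ_t·x_0. The cosine schedule and all function names are hypothetical choices made for this example.

```python
import math


def schedule(t):
    """Hypothetical variance-preserving cosine schedule (assumption, not from the paper).

    Returns (alpha, sigma) with alpha**2 + sigma**2 == 1; t in [0, 1), t -> 1 is high noise.
    """
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    return alpha, sigma


def x0_from_eps(x_t, eps_hat, alpha, sigma):
    """Clean-image estimate from an epsilon-prediction: invert x_t = alpha*x0 + sigma*eps."""
    return (x_t - sigma * eps_hat) / alpha


def x0_from_v(x_t, v_hat, alpha, sigma):
    """Clean-image estimate from a v-prediction, using v = alpha*eps - sigma*x0."""
    return alpha * x_t - sigma * v_hat


def x0_from_x(x_hat):
    """An x-prediction model outputs the clean image directly; no conversion is needed."""
    return x_hat


# Toy scalar "image" at a high-noise step.
x0, eps, t = 0.3, -1.2, 0.95
alpha, sigma = schedule(t)
x_t = alpha * x0 + sigma * eps          # noisy state
v = alpha * eps - sigma * x0            # exact v-target for this sample

# With exact network outputs, every parameterization recovers x0.
est_eps = x0_from_eps(x_t, eps, alpha, sigma)
est_v = x0_from_v(x_t, v, alpha, sigma)
est_x = x0_from_x(x0)

# A small error in the epsilon output is scaled by sigma/alpha in the estimate.
delta = 0.01
eps_err = abs(x0_from_eps(x_t, eps + delta, alpha, sigma) - x0)
amplification = sigma / alpha
```

Under these assumptions, an error in the ε output reaches x̂ multiplied by σ_t/α_t, which grows without bound as α_t approaches 0 at high noise. This shows only the algebra of the conversion, not the paper's theoretical analysis or its comparison between targets.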

Publication
European Conference on Computer Vision (ECCV)
Date