Text-to-image generative models, while achieving impressive performance, inherit biases from their training data, limiting their real-world applicability and potentially affecting downstream tasks. Existing methods for evaluating these biases often rely on external classification models, which are practical when model internals are unavailable. However, external tools carry their own biases and coverage limitations, and can be particularly unreliable for underrepresented groups, introducing uncertainty into downstream conclusions. For instance, we demonstrate that FairFace, a common tool for evaluating racial bias, struggles to correctly classify Native American faces generated by SDXL: 69% of faces go undetected and many others are misclassified. To address this limitation, we propose a novel approach that leverages the internal knowledge of the generative model itself (specifically SDXL and Stable Diffusion 1.5) to assess bias. We analyze the similarity of intermediate features produced during the diffusion process, across timesteps and layers, when generating images with varying attributes. Our findings reveal that trends in feature similarity at specific layers and timesteps correlate with the well-defined categories used by external classifiers, and can even provide insight into biases for attributes that classifier-based methods cannot easily evaluate. This work highlights the limitations of relying solely on external modules for bias evaluation and proposes a complementary internal diagnostic for settings where model internals are available, enabling a more comprehensive assessment of bias.
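
The core measurement described above, comparing intermediate diffusion features across timesteps for prompts that vary an attribute, can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the paper's exact procedure: it hooks the U-Net mid-block of Stable Diffusion 1.5 via Hugging Face diffusers, records one activation per denoising step for each prompt, and computes a per-timestep cosine similarity between the two runs. The checkpoint id, the choice of block, the prompts, and the number of steps are all illustrative assumptions.

```python
# Minimal sketch (not the paper's exact procedure): capture intermediate U-Net
# activations from Stable Diffusion 1.5 at every denoising step and compare the
# features produced for two attribute prompts with cosine similarity.
# Checkpoint id, hooked block, prompts, and step count are illustrative choices.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # assumed checkpoint
).to("cuda")

features = []  # one activation tensor per denoising step


def hook(_module, _inputs, output):
    # The mid-block returns a single feature tensor; keep a flattened copy per step.
    features.append(output.detach().float().flatten(start_dim=1).cpu())


handle = pipe.unet.mid_block.register_forward_hook(hook)


def run(prompt, seed=0):
    """Generate one image and return the stacked per-step mid-block features."""
    features.clear()
    gen = torch.Generator("cuda").manual_seed(seed)
    pipe(prompt, num_inference_steps=30, generator=gen)
    return torch.stack(features)  # shape: (steps, batch, feature_dim)


feats_a = run("a photo of a Native American person")  # illustrative prompts
feats_b = run("a photo of a person")
handle.remove()

# Per-timestep cosine similarity between the two runs' intermediate features;
# the trend of this trace across timesteps (and across layers, if more blocks
# are hooked) is the kind of signal the internal diagnostic examines.
sim = F.cosine_similarity(feats_a.flatten(1), feats_b.flatten(1), dim=1)
print(sim)
```

In practice, the same hook can be registered on several down-, mid-, and up-blocks to compare layers as well as timesteps; the single mid-block here keeps the sketch short.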