Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

ICLR 2025

1University of Illinois Urbana-Champaign, 2Carnegie Mellon University, 3Tsinghua University

A single, unified, diffusion-based model, Diff-2-in-1, bridging generative and discriminative learning

Abstract

Beyond high-fidelity image synthesis, diffusion models have recently exhibited promising results in dense visual perception tasks. However, most existing work treats diffusion models as a standalone component for perception tasks, employing them either solely for off-the-shelf data augmentation or as mere feature extractors.

In contrast to these isolated and thus sub-optimal efforts, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception, through a unique exploitation of the diffusion-denoising process. Within this framework, we further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set. Importantly, Diff-2-in-1 optimizes the utilization of the created diverse and faithful data by leveraging a novel self-improving learning mechanism.

Comprehensive experimental evaluations validate the effectiveness of our framework, showcasing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness.

Diff-2-in-1 Pipeline

Our Diff-2-in-1 framework is a unified diffusion model with two sets of parameters: the data-creation parameters θC, which generate synthetic samples, and the data-exploitation parameters θE, which perform discriminative learning on both the original and the synthetic samples. The creation parameters θC are self-improved during optimization.
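To make the two-parameter-set design concrete, below is a minimal PyTorch-style sketch of one training iteration. It is only an illustrative sketch, assuming the self-improving update of θC is realized as an exponential moving average of θE; DenoisingUNet, diffusion_generate, and all tensor shapes are hypothetical placeholders rather than the actual Diff-2-in-1 implementation.

# Minimal sketch of a Diff-2-in-1-style training loop.
# Assumptions: the shared diffusion backbone is a toy `DenoisingUNet`, and the
# self-improving mechanism is modeled as an exponential moving average (EMA);
# every name below is a hypothetical placeholder.
import copy
import torch
import torch.nn.functional as F

class DenoisingUNet(torch.nn.Module):
    """Stand-in for the shared denoising network with a dense-perception head."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Conv2d(3, dim, 3, padding=1)
        self.head = torch.nn.Conv2d(dim, 3, 3, padding=1)  # e.g. surface normals

    def forward(self, x):
        return self.head(self.net(x))

def diffusion_generate(model, num_samples):
    """Hypothetical sampler: stand-in for running the reverse diffusion process
    with theta_C to create synthetic images plus multi-modal pseudo-labels."""
    syn_images = torch.randn(num_samples, 3, 64, 64)  # placeholder for sampled RGB
    with torch.no_grad():
        pseudo_labels = model(syn_images)             # placeholder pseudo-labels
    return syn_images, pseudo_labels

# theta_E: exploitation parameters (trained); theta_C: creation parameters (EMA copy).
theta_E = DenoisingUNet()
theta_C = copy.deepcopy(theta_E).requires_grad_(False)
optimizer = torch.optim.AdamW(theta_E.parameters(), lr=1e-4)
ema_decay = 0.999

for step in range(100):
    real_x = torch.randn(4, 3, 64, 64)  # stand-in for real training images
    real_y = torch.randn(4, 3, 64, 64)  # stand-in for ground-truth dense labels

    # 1) Creation: theta_C synthesizes extra multi-modal training pairs.
    syn_x, syn_y = diffusion_generate(theta_C, num_samples=4)

    # 2) Exploitation: theta_E learns dense perception from real + synthetic data.
    loss = F.mse_loss(theta_E(real_x), real_y) + F.mse_loss(theta_E(syn_x), syn_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 3) Self-improvement: theta_C slowly tracks theta_E via an EMA update.
    with torch.no_grad():
        for p_c, p_e in zip(theta_C.parameters(), theta_E.parameters()):
            p_c.mul_(ema_decay).add_(p_e, alpha=1 - ema_decay)

Under this assumption, keeping θC as a slowly updated copy of θE lets the synthetic data improve as the discriminative model improves, without requiring a second full training run.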

Experimental Results

Comparison with Baselines

We show qualitative results on the surface normal estimation task on NYUv2 and on the multi-task benchmarks NYUD-MT and PASCAL-Context, demonstrating the performance boost obtained by building our Diff-2-in-1 on top of existing methods.


Qualitative comparison for surface normal on NYUv2 dataset:

Qualitative comparison for depth estimation, surface normal estimation, and semantic segmentation on NYUD-MT dataset:



Qualitative comparison for semantic segmentation, human parsing, saliency detection, and surface normal estimation on PASCAL-Context dataset:




Multi-modal Generation

Besides evaluating how powerful our Diff-2-in-1 is from the discriminative perspective, we also show below the high quality of the generated multi-modal data, which is not only realistic but also useful for applications such as data augmentation.



Multi-modal data generated by Diff-2-in-1 trained on the surface normal estimation task on NYUv2:



Multi-modal data generated by Diff-2-in-1 trained jointly on depth estimation, surface normal estimation, and semantic segmentation on NYUD-MT:



Multi-modal data generated by Diff-2-in-1 trained jointly on semantic segmentation, human parsing, saliency detection, and surface normal estimation on PASCAL-Context:



We also compare training with the generated samples from our Diff-2-in-1 against training with the same number of additional unlabeled real images from the NYUv2 dataset. Using the generated data achieves performance comparable to using the real captured data, demonstrating the good quality of the synthetic data produced by our Diff-2-in-1 framework.

BibTeX

@inproceedings{zheng2023diff2in1,
  title={Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models},
  author={Zheng, Shuhong and Bao, Zhipeng and Zhao, Ruoyu and Hebert, Martial and Wang, Yu-Xiong},
  booktitle={ICLR},
  year={2025}
}