Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Abstract

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation.

To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment this conditioning with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features for conditioning, and a multi-stage denoising strategy progressively balances the semantic information from the MLLM against the fine-detail identity from the VAE during inference.

Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation.

Benefits of using MLLMs for Subject-driven Generation

Benefits of leveraging MLLMs for subject-driven generation. MLLMs mitigate the copy-paste issue of VAE-based methods and enable multimodal understanding in the subject-driven generation pipeline by jointly modeling the input image and text, whereas VAE-based methods encode them separately.

Method

Overview of our framework. (a) The full model architecture, consisting of an MLLM for multimodal understanding, a VAE encoder for mapping images into latent space, a DiT backbone for diffusion denoising, and a Dual Layer Aggregator (DLA) that aligns MLLM embeddings for the DiT. (b) Details of the token attention pooling module inside the DLA module, whose layerwise attention and pooling operations form aggregated MLLM embeddings from all MLLM layers.
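To make the idea of layerwise attention pooling more concrete, the sketch below shows one plausible reading in plain Python: for each token, the hidden states from all layers are scored against a query vector, the scores are normalized with a softmax over layers, and the layer states are combined as a weighted sum. The function names, the single shared `query` vector, and the scaled dot-product scoring are all assumptions made for illustration; the actual DLA module may differ in every one of these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_token_over_layers(layer_states, query):
    """Attention-pool one token's hidden states across layers.

    layer_states: list of L vectors (one per layer), each of length d.
    query: hypothetical learned query vector of length d (an assumption).
    Returns (pooled_vector, layer_weights).
    """
    d = len(query)
    # Scaled dot-product score of each layer's state against the query.
    scores = [
        sum(h_i * q_i for h_i, q_i in zip(h, query)) / math.sqrt(d)
        for h in layer_states
    ]
    weights = softmax(scores)
    # Weighted sum of the layer states, one output dimension at a time.
    pooled = [
        sum(w * h[i] for w, h in zip(weights, layer_states))
        for i in range(d)
    ]
    return pooled, weights


def aggregate_layers(hidden_states, query):
    """Pool every token across layers.

    hidden_states[l][t] is the vector for token t at layer l.
    Returns one pooled vector per token.
    """
    num_tokens = len(hidden_states[0])
    return [
        pool_token_over_layers([layer[t] for layer in hidden_states], query)[0]
        for t in range(num_tokens)
    ]
```

With a zero query every layer receives the same weight, so the result reduces to a plain average over layers; a trained query would instead let the model emphasize the layers that are most useful for conditioning.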

Comparison with Existing Methods

Comparisons of our method with state-of-the-art subject-driven generation approaches. Red dashed lines indicate instances where other methods suffer from the copy-paste issue. In contrast, our method produces images that maintain strong subject identity, exhibit creative pose variations, and respect underlying physical constraints.

Copy-paste Alleviation

We propose a quantitative approach to evaluating the copy-paste issue: we adopt Orient Anything to estimate the azimuth and polar angles of subjects in both the reference and generated images, and compute their average orientation discrepancy. We further propose a Recall@k° metric, defined as the fraction of generated samples whose orientation errors in both azimuth and polar angles are below k°, and report the Average Recall Rate, computed by averaging Recall@k° over k° ∈ {5°, 10°, 15°, 20°}. A lower recall rate means that fewer generated subjects retain the reference orientation, indicating less copy-paste.

Method	Azimuth (↑)	Polar (↑)	Average “Recall” Rate (↓)
OmniGen2	22.6	7.0	0.486
DreamO	22.1	9.6	0.372
USO	20.8	9.6	0.401
Qwen-Image	17.6	7.8	0.460
Ours	25.7	10.4	0.349

Quantitative comparisons of subject variation between the reference and generated images, measuring the copy-paste issue. We evaluate differences in azimuth and polar angles to assess the subject pose diversity produced by the model, showing its ability to generate more varied poses and reduce copy-paste artifacts.
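As a rough illustration of the metrics described above, the following Python sketch computes per-sample orientation errors, the mean discrepancy, Recall@k°, and the Average Recall Rate from angles that have already been estimated. It does not call Orient Anything. The function names, the wrap-around handling of azimuth, and the choice to report recall as a fraction in [0, 1] are assumptions made for this example rather than details confirmed by the source.

```python
def angular_error(a_deg, b_deg, period=360.0):
    """Smallest absolute difference between two angles, with wrap-around."""
    diff = abs(a_deg - b_deg) % period
    return min(diff, period - diff)


def orientation_errors(ref, gen):
    """Per-sample (azimuth_error, polar_error) in degrees.

    ref and gen are equal-length lists of (azimuth, polar) pairs in degrees.
    Azimuth is treated as circular (assumption); polar uses a plain difference.
    """
    return [
        (angular_error(ra, ga), abs(rp - gp))
        for (ra, rp), (ga, gp) in zip(ref, gen)
    ]


def mean_discrepancy(errors):
    """Average azimuth and polar discrepancy over all samples."""
    n = len(errors)
    return (
        sum(az for az, _ in errors) / n,
        sum(po for _, po in errors) / n,
    )


def recall_at_k(errors, k_deg):
    """Fraction of samples whose azimuth AND polar errors are both below k_deg."""
    hits = sum(1 for az, po in errors if az < k_deg and po < k_deg)
    return hits / len(errors)


def average_recall_rate(errors, thresholds=(5.0, 10.0, 15.0, 20.0)):
    """Mean of Recall@k over the given thresholds (in degrees)."""
    return sum(recall_at_k(errors, k) for k in thresholds) / len(thresholds)
```

For example, four samples with (azimuth, polar) errors of (2, 2), (15, 10), (40, 5), and (0, 0) give a recall of 0.5 at 5°, 10°, and 15°, and 0.75 at 20°, so the Average Recall Rate is 0.5625. Under this metric a model that reproduces the reference pose exactly would score 1.0, which is why lower values indicate more pose variation.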

Reasoning Capability

Comparisons on reasoning capability show that VAE-based methods often fail on complex user prompts, producing copy-pasted subjects or incorrect concept binding. MLLM-DiT pipelines like Qwen-Image also struggle to understand these challenging user prompts, suggesting that the prior approach to connecting the MLLM and DiT is suboptimal. In contrast, our method, conditioned solely on MLLM signals, accurately interprets the prompts and associates concepts with the appropriate visual elements.

Analysis on the Roles of Different Layers

Qualitative results of zeroing out layers for the text and image modalities in our DLA module. First, when early layers of the image modality are dropped (e.g., zeroing out layers 0-19), the model struggles to preserve identity, whereas using only early layers achieves comparable ID consistency. Second, for the text modality, the later layers appear more critical, as using only layers 0-9 significantly weakens the model's ability to understand and follow the prompt.
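The ablation described above can be pictured as masking a range of layers for one modality before the layer features are aggregated. The sketch below is a minimal plain-Python illustration under stated assumptions: `zero_out_layers`, the `token_modalities` labels, and the nested-list layout `hidden_states[layer][token]` are hypothetical names and conventions chosen for this example, not the authors' implementation.

```python
def zero_out_layers(hidden_states, token_modalities, modality, first_layer, last_layer):
    """Zero the chosen modality's tokens in layers first_layer..last_layer (inclusive).

    hidden_states[l][t] is the feature vector of token t at layer l.
    token_modalities[t] is a label such as "text" or "image" (assumed labels).
    Returns a new nested list; the input is left unchanged.
    """
    result = []
    for layer_index, layer in enumerate(hidden_states):
        in_range = first_layer <= layer_index <= last_layer
        new_layer = []
        for vec, token_modality in zip(layer, token_modalities):
            if in_range and token_modality == modality:
                # Drop this token's contribution from this layer.
                new_layer.append([0.0] * len(vec))
            else:
                new_layer.append(list(vec))
        result.append(new_layer)
    return result
```

For instance, `zero_out_layers(states, modalities, "image", 0, 19)` would correspond to dropping image-token features from layers 0-19 while leaving text tokens and later layers untouched.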

Stress Testing

Stress-testing performance of our method on challenging scenarios, including attribute binding in long-context prompts and test cases that contain multiple instances of the subject. The results demonstrate the robustness of the reasoning capability within our MLLM-DiT system.

Acknowledgements

We thank the authors of UNO for their open-source code and dataset, along with many other inspiring works in the community!

BibTeX

@article{zheng2026squeezingcapacitymultimodallarge,
      title={Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation}, 
      author={Shuhong Zheng and Aashish Kumar Misraa and Yu-Teng Li and Yu-Jhe Li and Igor Gilitschenski},
      journal={arXiv preprint arXiv:2605.26111},
      year={2026}
}