1 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
2 Shanghai Collaborative Innovation Center on Intelligent Visual Computing
3 Minimax
We introduce ControlThinker, a novel framework that bridges the semantic gap in controllable image generation through enhanced visual reasoning. ControlThinker follows a "comprehend-then-generate" paradigm: a Multimodal Large Language Model (MLLM), enhanced via supervised and reinforcement fine-tuning, extracts latent semantics from control images and produces enriched prompts. These enriched prompts significantly improve the visual quality and semantic coherence of generated images without modifying the image generator itself. Extensive experiments across various control types confirm ControlThinker's effectiveness.
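Since the code has not yet been released, the following is only a minimal, hypothetical sketch of the "comprehend-then-generate" flow described above. The function names (`reason_about_control_image`, `generate_image`, `controlthinker_pipeline`) are placeholders invented for illustration, not the actual API:

```python
# Hypothetical sketch of the "comprehend-then-generate" paradigm.
# Both helper functions are stand-ins: the real system uses a fine-tuned
# MLLM for reasoning and an unmodified controllable image generator.

def reason_about_control_image(control_image: str, user_prompt: str) -> str:
    """Stand-in for the MLLM: infer latent semantics from the control
    image and expand the sparse user prompt with them."""
    inferred_semantics = f"scene details inferred from {control_image}"
    return f"{user_prompt}, {inferred_semantics}"

def generate_image(prompt: str, control_image: str) -> dict:
    """Stand-in for a frozen, unmodified image generator conditioned on
    both the (enriched) text prompt and the control image."""
    return {"prompt": prompt, "control": control_image}

def controlthinker_pipeline(control_image: str, user_prompt: str) -> dict:
    # Step 1 (comprehend): enrich the prompt via visual reasoning.
    enriched_prompt = reason_about_control_image(control_image, user_prompt)
    # Step 2 (generate): feed the enriched prompt to the untouched generator.
    return generate_image(enriched_prompt, control_image)

result = controlthinker_pipeline("edge_map.png", "a cat")
print(result["prompt"])
```

The key design point is that only the prompt is changed between the two stages; the generator stays frozen, which is why the method applies across different control types and backbone generators.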
📢 News
June 2, 2025: We have released the ControlThinker paper.
May 30, 2025: The code and models are coming soon.
📝 TODO
Release checkpoints and evaluation code
Release the training code along with an easy-to-follow tutorial
Release the ControlThinker paper
🧁 Results
Visualization of images generated by ControlThinker and other baselines.
💥 Quick Start
1️⃣ Code and tutorial will be available soon.
✍️ Citation
@article{han2025controlthinker,
title={ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning},
author={Han, Feng and Jiao, Yang and Chen, Shaoxiang and Xu, Junhao and Chen, Jingjing and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2506.03596},
year={2025}
}
📃 License
ControlThinker is licensed under the Apache License 2.0.