InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo
*Indicates Equal Contribution
†Indicates Corresponding Author
Abstract
We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built upon the diffusion process and is trained to predict pixels according to user instructions, such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion can handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and restoration). It even exhibits the ability to handle unseen tasks and outperforms prior methods on unseen datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in computer vision.
Keypoint Detection
(a) Mark the car logo with a blue circle.
(b) Put a blue circle on the nose of the white tiger and use the red color to draw a circle around the left shoulder of the white tiger.
(c) Create a yellow circle around the right eye of the whale.
(d) Use blue to encircle the right wrist of the person on the far left and draw a yellow circle over the left wrist of the person on the far right.
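The instructions above treat keypoint detection as an image edit: the expected output is the input image with a coloured circle drawn at the keypoint. The following is a minimal sketch of how such a target image could be built and how a coordinate could be read back from it. It is not the authors' pipeline. The helper names (`blank_image`, `draw_circle`, `locate_marker`), the nested-list image representation, and the centroid-based decoding are all assumptions made for illustration.

```python
# Illustrative sketch only; not the authors' implementation.
# Images are nested lists of (R, G, B) tuples to avoid external dependencies.

BLUE = (0, 0, 255)


def blank_image(width, height, color=(0, 0, 0)):
    """Create a width x height image filled with a single colour."""
    return [[color for _ in range(width)] for _ in range(height)]


def draw_circle(image, cx, cy, radius, color, thickness=1.0):
    """Return a copy of `image` with a circle outline centred on (cx, cy)."""
    out = [row[:] for row in image]
    for y, row in enumerate(out):
        for x in range(len(row)):
            dist = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            # Colour pixels whose distance from the centre is close to the radius.
            if abs(dist - radius) <= thickness / 2:
                row[x] = color
    return out


def locate_marker(image, color):
    """Estimate a keypoint as the centroid of pixels that exactly match `color`.

    Hypothetical decoding step: a real model output would be noisy and would
    need a tolerance or a learned decoder rather than exact matching.
    """
    hits = [
        (x, y)
        for y, row in enumerate(image)
        for x, pixel in enumerate(row)
        if pixel == color
    ]
    if not hits:
        return None
    n = len(hits)
    return (sum(x for x, _ in hits) / n, sum(y for _, y in hits) / n)


# Build a training-style target: the source image plus a blue circle
# around a (made-up) keypoint at x=20, y=12.
source = blank_image(32, 32)
target = draw_circle(source, cx=20, cy=12, radius=4, color=BLUE)
estimate = locate_marker(target, BLUE)
```

Because the drawn ring is symmetric about its centre, the centroid of the blue pixels coincides with the keypoint here. With real model outputs, exact colour matching would likely be too brittle.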
Segmentation
(a) Mark the pixels of cat in the mirror to blue and leave the rest unchanged.
(b) Fill in the pixels of neutrophil with yellow, retaining the existing colors of the remaining pixels.
(c) Modify the pixels of Oriental Pearl Tower to red without affecting any other pixels.
(d) Paint the pixels of shadow in blue and maintain the current appearance of the other pixels.
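The segmentation instructions above follow the same pattern: the target is the input image with the object's pixels recoloured and all other pixels left unchanged. The sketch below shows one way such a target could be constructed from a binary mask, and one way a mask could be recovered from an imperfect output by thresholding colour distance. The function names (`paint_mask`, `recover_mask`), the opaque fill, the tolerance value, and the simulated noise are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch only; not the authors' implementation.
# Images are nested lists of (R, G, B) tuples; masks are nested lists of bools.

YELLOW = (255, 255, 0)


def paint_mask(image, mask, color):
    """Return a copy of `image` where masked pixels are replaced by `color`."""
    return [
        [color if inside else pixel for pixel, inside in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


def recover_mask(edited, color, tolerance=40):
    """Mark pixels whose every channel lies within `tolerance` of `color`.

    Hypothetical decoding step: it is ambiguous wherever the source image
    already contains pixels near `color`.
    """
    def close(pixel):
        return all(abs(a - b) <= tolerance for a, b in zip(pixel, color))

    return [[close(pixel) for pixel in row] for row in edited]


# A tiny 4x3 grey image with a 2x2 object in the top middle.
gray = (90, 90, 90)
source = [[gray] * 4 for _ in range(3)]
mask = [
    [False, True, True, False],
    [False, True, True, False],
    [False, False, False, False],
]
target = paint_mask(source, mask, YELLOW)

# Simulate an imperfect generated image by shifting the colours slightly.
noisy = [
    [(max(0, r - 10), g, min(255, b + 10)) for r, g, b in row]
    for row in target
]
decoded = recover_mask(noisy, YELLOW)
```

The tolerance lets the decoded mask survive small colour deviations in the generated image, at the cost of possible false positives on scenes that naturally contain the fill colour.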
Low-Level Tasks
Image Editing
BibTeX
@article{Geng23instructdiff,
author = {Zigang Geng and
Binxin Yang and
Tiankai Hang and
Chen Li and
Shuyang Gu and
Ting Zhang and
Jianmin Bao and
Zheng Zhang and
Han Hu and
Dong Chen and
Baining Guo},
title = {InstructDiffusion: {A} Generalist Modeling Interface for Vision Tasks},
journal = {CoRR},
volume = {abs/2309.03895},
year = {2023},
url = {https://doi.org/10.48550/arXiv.2309.03895},
doi = {10.48550/arXiv.2309.03895},
}