- 2025.09.19: OneIG-Bench has been accepted to the NeurIPS 2025 DB Track.
- 2025.09.19: We updated the Seedream 4.0, Gemini-2.5-Flash-Image (Nano Banana), Step-3o Vision, and HunyuanImage-2.1 evaluation results on our leaderboard here.
- 2025.09.19: We updated the NextStep-1, Lumina-DiMOO, and IRG official evaluation results on our leaderboard here.
- 2025.08.13: We updated the Qwen-Image official evaluation results on our leaderboard here.
- 2025.08.13: We updated the fine-grained analysis script here.
- 2025.07.03: We updated the Ovis-U1 evaluation results on our leaderboard here.
- 2025.06.25: We updated the Show-o2 and OmniGen2 evaluation results on our leaderboard here.
- 2025.06.23: We released the T2I generation script here.
- 2025.06.10: We released the OneIG-Bench benchmark on 🤗 Hugging Face.
- 2025.06.10: We released the tech report and the project page.
- 2025.06.10: We released the evaluation scripts.
- Fine-grained Analysis Script
- Real-time Updating Leaderboard
- OneIG-Bench Release
- Evaluation Scripts, Technical Report & Project Page Release
We introduce OneIG-Bench, a meticulously designed, comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including subject-element alignment, text rendering precision, reasoning-generated content, stylization, and diversity. These dimensions can be flexibly selected for evaluation according to specific needs.
Key contributions:
- We present OneIG-Bench, which consists of six prompt sets: the first five (245 Anime and Stylization, 244 Portrait, 206 General Object, 200 Text Rendering, and 225 Knowledge and Reasoning prompts) are each provided in both English and Chinese, plus 200 Multilingualism prompts, designed for the comprehensive evaluation of current text-to-image models.
- A systematic quantitative evaluation is developed to facilitate objective capability ranking through standardized metrics, enabling direct comparability across models. Specifically, our evaluation framework allows a T2I model to generate images only for the prompts associated with a particular evaluation dimension and assesses its performance within that dimension.
- State-of-the-art open-source methods as well as proprietary models are evaluated on our proposed benchmark to facilitate the development of text-to-image research.
We test our benchmark with torch==2.6.0 and torchvision==0.21.0 under CUDA 11.8 and Python 3.10.
Install requirements:

`pip install -r requirements.txt`

The version of flash-attention is specified in the last line of requirements.txt.
To evaluate style performance, please download the CSD model and CLIP model, then put them under ./scripts/style/models.
Also, you can download the OneIG-StyleEncoder here.
For the diversity metrics, some models and packages (link1, link2, link3) need to be downloaded and saved in the `models` folder, which is a sibling of the `assets` and `scripts` folders.
You can use the script to generate images; you only need to set up the inference function in the script.
We recommend generating 4 images (with different seeds) for each prompt in OneIG-Bench. Each prompt's generated images should be saved into subfolders based on their category (Anime & Stylization, Portrait, General Object, Text Rendering, Knowledge & Reasoning, Multilingualism), corresponding to the folders anime, human, object, text, reasoning, and multilingualism. If any image cannot be generated, the script will save a black image with the specified filename.
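As a rough illustration of this layout, here is a minimal, hypothetical sketch of how one prompt's images could be generated and saved. It assumes the seeded generations for a prompt are tiled into a single grid file named after the prompt id (cf. the image_grid parameter used by the evaluation scripts); the generate callable, image size, and exact filename pattern are placeholders for your own setup, not taken from the official script.

```python
import os
from typing import Callable

from PIL import Image


def save_prompt_images(generate: Callable[[str, int], Image.Image],
                       prompt: str, prompt_id: str, category: str, model_name: str,
                       out_root: str = "images", grid: int = 2, size: int = 1024) -> None:
    """Tile grid*grid seeded generations for one prompt into a single .webp file."""
    out_dir = os.path.join(out_root, category, model_name)        # e.g. images/anime/gpt-4o
    os.makedirs(out_dir, exist_ok=True)
    canvas = Image.new("RGB", (grid * size, grid * size), (0, 0, 0))
    for i in range(grid * grid):
        try:
            img = generate(prompt, i).resize((size, size))         # your model call, seed = i
        except Exception:
            img = Image.new("RGB", (size, size), (0, 0, 0))        # black image if generation fails
        canvas.paste(img, ((i % grid) * size, (i // grid) * size))
    canvas.save(os.path.join(out_dir, f"{prompt_id}.webp"))
```

Looping such a routine over the rows of OneIG-Bench.csv (or OneIG-Bench-ZH.csv) then produces the directory structure shown below.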
The filename of each image should follow the id assigned to that prompt in OneIG-Bench.csv / OneIG-Bench-ZH.csv. The directory structure of the saved images should look like:
```
📁 images/
├── 📁 anime/
│   ├── 📁 gpt-4o/
│   │   ├── 000.webp
│   │   ├── 001.webp
│   │   └── ...
│   └── 📁 imagen4/
│       └── ...
├── 📁 human/
│   ├── 📁 gpt-4o/
│   ├── 📁 imagen4/
│   └── ...
├── 📁 object/
│   └── ...
├── 📁 text/
│   └── ...
├── 📁 reasoning/
│   └── ...
└── 📁 multilingualism/   # For OneIG-Bench-ZH
    └── ...
```

To run the evaluation, use

`./run_{overall, alignment, diversity, reasoning, style, text}.sh`

The run_overall.sh script runs all metrics; by running run_overall.sh, you obtain the results of all metrics in the results directory. You can also choose a single metric to evaluate by running the corresponding script, run_{metric_name}.sh.
To ensure that the generated images are correctly loaded for evaluation, you can modify the following parameters in each script:
- `mode`: Select EN or ZH to evaluate on OneIG-Bench or OneIG-Bench-ZH, respectively.
- `image_dir`: The directory where the images generated by your model are stored.
- `model_names`: The names or identifiers of the models you want to evaluate.
- `image_grid`: Corresponds to the number of images (with different seeds) generated by the model per prompt, where a value of 1 means 1 image, 2 means 4 images, and so on.
- `class_items`: The prompt categories (image sets) you want to evaluate.
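Once these parameters are set inside the scripts, you can execute the corresponding run_*.sh files directly from the shell. If you prefer to drive several evaluations from Python, a minimal sketch could look like the following; only the script names come from this repository, and the selection and working directory (the repository root) are assumptions:

```python
import subprocess

# Run a chosen subset of the evaluation scripts in sequence; use run_overall.sh for all metrics.
for script in ["run_alignment.sh", "run_text.sh", "run_style.sh"]:
    subprocess.run(["bash", script], check=True)  # raises if an evaluation script fails
```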
You can copy all the CSV files generated for each prompt dimension (in particular, for the style dimension, the files are named style_style*.csv) into a subfolder named after the model inside the RESULT_DIR directory. Then, in fine_grained_analysis.py, adjust the MODE, RESULT_DIR, and KEYS parameters as needed to perform the fine-grained analysis.
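For orientation, the settings you would adjust near the top of fine_grained_analysis.py might look roughly like the following; the concrete values, and the reading of KEYS as the model subfolders to analyze, are illustrative assumptions rather than excerpts from the script:

```python
# Illustrative values only; consult fine_grained_analysis.py for the actual definitions.
MODE = "EN"                    # assumed: "EN" for OneIG-Bench, "ZH" for OneIG-Bench-ZH
RESULT_DIR = "results"         # directory holding one subfolder of metric CSVs per model
KEYS = ["gpt-4o", "imagen4"]   # assumed: model-name subfolders under RESULT_DIR to analyze
```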
We define the sets of images generated from the OneIG-Bench prompt categories as follows: General Object (O), Portrait (P), Anime and Stylization, split into (A) for prompts without stylization and (S) for prompts with stylization, Text Rendering (T), Knowledge and Reasoning (KR), and Multilingualism (L).
The correspondence between the evaluation metrics and the evaluated image sets in OneIG-Bench and OneIG-Bench-ZH is presented in the table below.
- Metrics and Image Sets Correspondence
|  | Alignment | Text | Reasoning | Style | Diversity |
|---|---|---|---|---|---|
| OneIG-Bench | O, P, A, S | T | KR | S | O, P, A, S, T, KR |
| OneIG-Bench-ZH | O<sub>zh</sub>, P<sub>zh</sub>, A<sub>zh</sub>, S<sub>zh</sub>, L<sub>zh</sub> | T<sub>zh</sub> | KR<sub>zh</sub> | S<sub>zh</sub> | O<sub>zh</sub>, P<sub>zh</sub>, A<sub>zh</sub>, S<sub>zh</sub>, L<sub>zh</sub>, T<sub>zh</sub>, KR<sub>zh</sub> |
- Method Comparison on OneIG-Bench:
- Method Comparison on OneIG-Bench-ZH:
- Benchmark Comparison:
OneIG-Bench (also referred to as OneIG-Bench-EN) denotes the English benchmark set.
If you find our work helpful for your research, please consider citing it.
@article{chang2025oneig,
title={OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation},
author={Jingjing Chang and Yixiao Fang and Peng Xing and Shuhan Wu and Wei Cheng and Rui Wang and Xianfang Zeng and Gang Yu and Hai-Bao Chen},
journal={arXiv preprint arXiv:2506.07977},
year={2025}
}

We would like to express our sincere thanks to the contributors of Qwen, CLIP, CSD_Score, and DreamSim, and to the Hugging Face team, for their open research and exploration.



