Evaluating text-to-vision content hinges on two crucial aspects: visual quality and alignment. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models depends heavily on the scale and quality of human annotations. Following the scaling law, increasing the number of human-labeled instances predictably improves the performance of evaluation models. Therefore, we introduce a comprehensive dataset designed to Evaluate Visual quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for these two aspects. The Q-EVAL-100K dataset covers both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset together with a context prompt strategy, we propose Q-Eval-Score, a unified model capable of evaluating both visual quality and alignment, with specific improvements for handling long-text prompt alignment. Experimental results indicate that the proposed Q-Eval-Score achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the Q-EVAL-100K dataset.
Due to Meituan copyright policies, we are currently not permitted to release the original Q-Eval-Score model.
However, to support the research community, we have re-trained the model from scratch using only the public portion of the Q-Eval data. The weights of this fully open-source version are now available on Hugging Face:
We hope this model can serve as a starting point for building strong and explainable visual evaluators.
💻 Inference
We provide a Python script infer.py for running inference using the open-source Q-Eval-Score model.
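If you prefer to call the model directly rather than through infer.py, the sketch below shows one way this might look. It assumes the released checkpoint keeps a Qwen2-VL-style backbone and the standard Hugging Face transformers chat interface; the repository id, the prompt wording, and the placeholder image path are illustrative assumptions, so treat infer.py as the authoritative reference.

```python
# Minimal sketch of scoring one image with the open-source Q-Eval-Score weights.
# Assumptions (not confirmed by this repo): the checkpoint follows the Qwen2-VL
# chat interface, and "<open-source-Q-Eval-Score-repo>" stands in for the real
# Hugging Face repository id. The prompt text is illustrative only.
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "<open-source-Q-Eval-Score-repo>"  # placeholder for the Hugging Face repo id

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/generated_image.png"},
            {"type": "text", "text": "Please rate the visual quality of this image."},  # illustrative prompt
        ],
    }
]

# Standard Qwen2-VL preprocessing: build the chat prompt and collect vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate the model's rating response and decode it.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The actual score extraction (for example, mapping a rating word or token probabilities to a numerical score) follows whatever infer.py implements, which may differ from the plain generation shown here.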
| Task | PLCC | SRCC |
| --- | --- | --- |
| Image Alignment | 0.797 | 0.826 |
| Image Quality | 0.760 | 0.747 |
| Video Alignment | 0.613 | 0.614 |
| Video Quality | 0.700 | 0.673 |
The performance above was obtained by re-testing our open-source version of the Q-Eval-Score model, which is trained entirely on publicly available data and does not include the proprietary annotations or Meituan-internal data used in the original Q-Eval release.
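For reference, PLCC is the Pearson linear correlation coefficient and SRCC is the Spearman rank-order correlation coefficient between predicted scores and human MOS (higher is better, 1.0 is perfect). Below is a minimal sketch of how such metrics can be computed with SciPy; the two score arrays are made-up placeholders, and the exact evaluation protocol (e.g., any logistic fitting applied before PLCC) should be taken from the paper.

```python
# Compute PLCC and SRCC between model predictions and human MOS.
# The two arrays below are made-up placeholders, not real Q-EVAL-100K data.
from scipy.stats import pearsonr, spearmanr

mos       = [4.2, 3.1, 2.5, 4.8, 3.9]   # human Mean Opinion Scores
predicted = [4.0, 3.3, 2.2, 4.6, 4.1]   # scores predicted by the evaluator

plcc, _ = pearsonr(predicted, mos)    # linear correlation
srcc, _ = spearmanr(predicted, mos)   # rank-order correlation
print(f"PLCC: {plcc:.3f}  SRCC: {srcc:.3f}")
```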
Citation
If you find our work useful, please cite our paper as:
@misc{zhang2025qeval100kevaluatingvisualquality,
      title={Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content},
      author={Zicheng Zhang and Tengchuan Kou and Shushi Wang and Chunyi Li and Wei Sun and Wei Wang and Xiaoyu Li and Zongyu Wang and Xuezhi Cao and Xiongkuo Min and Xiaohong Liu and Guangtao Zhai},
      year={2025},
      eprint={2503.02357},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.02357},
}