Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation
Abstract
Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundation models to automatically generate a set of questions and answers from the prompt; output images are then scored by checking whether the answers extracted with a visual question answering (VQA) model are consistent with the prompt-based answers. This kind of evaluation is naturally dependent on the quality of the underlying QG and QA models. We identify and address several reliability challenges in existing QG/A work: (a) QG questions should respect the prompt (avoiding hallucinations, duplications, and omissions) and (b) VQA answers should be consistent (not asserting that there is no motorcycle in an image while also claiming the motorcycle is blue). We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics. DSG is an automatic, graph-based QG/A framework that is modularly implemented to be adaptable to any QG/A module. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts covering a wide range of fine-grained semantic categories with a balanced distribution. We will release the DSG-1k prompts and the corresponding DSG questions.
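The dependency-graph scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: vqa_answer stands in for whichever VQA model is plugged in, the question texts and graph structure are hypothetical, and treating a skipped child question as unsatisfied is one reasonable reading of how child answers are discounted when a parent fails.

from dataclasses import dataclass, field

@dataclass
class DSGQuestion:
    qid: str
    text: str                                      # atomic yes/no question derived from the prompt
    parents: list = field(default_factory=list)    # qids this question depends on

def score_image(questions, vqa_answer):
    """Score one image against a prompt's dependency graph of atomic questions.

    vqa_answer(question_text) -> bool is a placeholder for any VQA model.
    A child question is only asked if all of its parent questions were answered
    "yes"; otherwise it is counted as unsatisfied. This avoids inconsistent
    answers such as claiming the motorcycle is blue after denying there is a
    motorcycle at all.
    """
    answers = {}
    for q in questions:  # assumes questions are listed in topological order
        if all(answers.get(p, False) for p in q.parents):
            answers[q.qid] = vqa_answer(q.text)
        else:
            answers[q.qid] = False  # parent failed: skip and count as not satisfied
    return sum(answers.values()) / len(answers)

# Hypothetical DSG for the prompt "a blue motorcycle parked by paint-chipped doors":
questions = [
    DSGQuestion("q1", "Is there a motorcycle?"),
    DSGQuestion("q2", "Is the motorcycle blue?", parents=["q1"]),
    DSGQuestion("q3", "Are there doors?"),
    DSGQuestion("q4", "Is the paint on the doors chipped?", parents=["q3"]),
]

The returned value is the fraction of atomic questions satisfied by the image, so a higher score indicates closer alignment between the generated image and the prompt.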
QG/A: New Paradigm in T2I Alignment Eval
Reliability Issues in Existing QG/A Methods
DSG Solution to the Reliability Issues
Publication
@inproceedings{JaeminCho2024,
  author    = {Jaemin Cho and Yushi Hu and Roopal Garg and Peter Anderson and Ranjay Krishna and Jason Baldridge and Mohit Bansal and Jordi Pont-Tuset and Su Wang},
  title     = {{Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation}},
  booktitle = {ICLR},
  year      = {2024}
}