Project homepage: https://multiref.github.io/
MultiRef introduces the first comprehensive benchmark for evaluating image generation models' ability to combine multiple visual references.
This repository provides an automated data engine and evaluation pipeline for multi-condition image generation tasks. The evaluation pipeline supports various metrics, including FID, Aesthetics, Mask, Caption, Sketch, Subject, Depth, Canny, BBox, Semantic Map, Style, and Pose. Parallel processing is supported for efficient evaluation.
Before running any code, please create and activate the `multi` environment using the provided `environment.yml`:

```bash
conda env create -f environment.yml
conda activate multi
```

- Python 3.8+ is recommended.
- External dependencies (must be installed manually at the specified locations; see the layout sketch below):
  - Depth-Anything-V2 (`../Depth-Anything-V2/`)
  - Grounded-SAM-2 (`../Grounded-SAM-2/`)
  - Florence-2-large (downloaded automatically by `transformers`)
- Local scripts:
  - All required `.py` files in `../conditions/` (e.g., `to_depth.py`, `to_sketch_new.py`, `to_caption.py`, `to_extrapolation.py`, `to_ground_sam.py`, `to_pose_no_args.py`, etc.)
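For orientation, the external repositories described below are expected to sit next to the MultiRef-code directory. A rough sketch of the assumed sibling layout (not an official spec, just the relative paths above laid out):

```
parent-directory/
├── MultiRef-code/                        # this repository
├── Grounded-SAM-2/
├── Depth-Anything-V2/
├── informative-drawings/
└── HigherHRNet-Human-Pose-Estimation/
```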
To run the evaluation pipeline, you must install the following external models and code in the specified locations (relative to the MultiRef-code directory):
Grounded-SAM-2
- Path: `../Grounded-SAM-2/`
- Usage: generate semantic map, mask, and bounding boxes.
- Install: clone the repository as shown below, then follow the official instructions to install dependencies and download model weights.

```bash
git clone https://github.com/IDEA-Research/Grounded-SAM-2.git ../Grounded-SAM-2
```
Depth-Anything-V2
- Path: `../Depth-Anything-V2/`
- Usage: generate depth map.
- Install: clone the repository as shown below, then follow the official instructions to install dependencies and download model weights.

```bash
git clone https://github.com/DepthAnything/Depth-Anything-V2.git ../Depth-Anything-V2
```
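For reference, loading and running the depth model typically looks like the sketch below, adapted from the Depth-Anything-V2 README. The checkpoint path and encoder choice here are assumptions; `to_depth.py` in `../conditions/` is the authoritative usage in this pipeline.

```python
# Sketch adapted from the Depth-Anything-V2 README; paths are placeholders.
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2

# 'vitl' configuration from the official README; other encoders use other configs.
model = DepthAnythingV2(encoder='vitl', features=256, out_channels=[256, 512, 1024, 1024])
model.load_state_dict(torch.load(
    '../Depth-Anything-V2/checkpoints/depth_anything_v2_vitl.pth', map_location='cpu'))
model = model.eval()

raw_img = cv2.imread('example.jpg')   # BGR image, as in the official demo
depth = model.infer_image(raw_img)    # HxW numpy depth map
```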
informative-drawings
- Path: `../informative-drawings/`
- Usage: generate sketch reference.
- Install: clone the repository as shown below.

```bash
git clone https://github.com/carolineec/informative-drawings.git ../informative-drawings
```

- Check the code in `to_sketch_new.py` for the expected model file path.
- Download or copy the model weights to the correct location.
- Ensure all required Python packages are installed (see `requirements.txt`).
HigherHRNet-Human-Pose-Estimation
- Path: `../HigherHRNet-Human-Pose-Estimation/`
- Usage: generate pose reference.
- Install: clone the repository as shown below, then follow the official instructions in the HigherHRNet-Human-Pose-Estimation repository to install dependencies and download model weights.

```bash
git clone https://github.com/HRNet/HigherHRNet-Human-Pose-Estimation.git ../HigherHRNet-Human-Pose-Estimation
```
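Before running the data engine, a quick preflight check that every required sibling directory is in place can save a failed run. This minimal sketch only uses the relative paths listed above:

```python
# Preflight check: confirm the external dependencies described above exist
# at their expected locations relative to the MultiRef-code directory.
import os

REQUIRED_PATHS = [
    "../Grounded-SAM-2",
    "../Depth-Anything-V2",
    "../informative-drawings",
    "../HigherHRNet-Human-Pose-Estimation",
    "../conditions",
]

missing = [p for p in REQUIRED_PATHS if not os.path.isdir(p)]
if missing:
    print("Missing required directories:", ", ".join(missing))
else:
    print("All external dependency directories are in place.")
```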
The data engine pipeline generates reference images and condition data for each original image in your dataset. Below is an example for the StyleBooth dataset using `stylebooth2condition.py`:
- For each original image, generate a set of reference images (e.g., style, semantic map, caption, etc.) and save them in a specified output directory.
We use the StyleBooth dataset as example metadata; you can use your own data.
```bash
python stylebooth2condition.py
```

- The script will process all images in the dataset directory and generate the corresponding reference images and condition files.
- Input and output directories are specified at the bottom of the script:

```python
base_dataset_path = '../../MetaData/X2I-mm-instruction/stylebooth'
save_path = '../../Condition_stylebooth'
```
- You can modify these paths as needed.
- For each original image, the pipeline will generate reference images (e.g., style, semantic map, mask, etc.) in the output directory. In addition, it will generate one JSON file per image that records the paths of these references.
- The output JSON for each image follows this structure:
```json
{
  "original_image_path": "<path to original image>",
  "conditions": {
    "semantic_map_path": "<path to semantic map image>",
    "sketch_path": "<path to sketch image>",
    "canny_path": "<path to canny image>",
    "bbox_path": "<path to bbox image>",
    "depth_path": "<path to depth image>",
    "mask_path": "<path to mask image>",
    "caption": "<caption text>",
    "style_path": [
      "<path to style image 1>",
      "<path to style image 2>"
    ]
  }
}
```

Before moving on, you should judge the quality and alignment of the generated reference images. The judging entry should follow this structure:
```json
{
  "...": {},
  "judge": {
    "Semantic-Map Alignment": 5,
    "Semantic-Map Quality": 5,
    "Sketch Alignment": 5,
    "Sketch Quality": 5,
    "Canny Alignment": 5,
    "Canny Quality": 5,
    "Bounding-Box Accuracy": 5,
    "Depth Alignment": 5,
    "Depth Quality": 5,
    "Mask Alignment": 5,
    "Caption Alignment": 5
  }
}
```

To generate instructions for your dataset, use `generate_instructions_new.py`.
- At the top of `generate_instructions_new.py`, set the following variables (an example is shown after these steps):
  - `SELECT_INPUT_FILE_PATH`: path to the input JSON file (e.g., a JSON containing your data entries)
  - `ENHANCED_PROMPTS_OUTPUT_FILE_PATH`: path to the output JSON file where the enhanced instructions will be saved
- Then run:

```bash
python generate_instructions_new.py
```

The script will read the input file, process the data, and write the enhanced instructions to the output file you specified.
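For example, the two variables mentioned above might be set like this; both paths are placeholders, not files shipped with the repository:

```python
# Hypothetical values for the two variables at the top of
# generate_instructions_new.py; replace them with your own paths.
SELECT_INPUT_FILE_PATH = "../../Condition_stylebooth/selected_entries.json"
ENHANCED_PROMPTS_OUTPUT_FILE_PATH = "../../Condition_stylebooth/enhanced_instructions.json"
```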
This pipeline ensures that for each original image, you have a full set of reference images and instructions, supporting downstream evaluation and benchmarking tasks.
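As an optional sanity check, the sketch below walks the per-image JSON files (structure shown above) and reports condition files that are missing on disk or entries that lack judge scores. It is not part of the repository, and the output directory and path resolution are assumptions:

```python
# Hypothetical sanity check over the per-image condition JSON files.
import json
import os
from glob import glob

def check_condition_json(path):
    """Return a list of problems found in one per-image condition JSON."""
    with open(path) as f:
        entry = json.load(f)

    problems = []
    for key, value in entry.get("conditions", {}).items():
        if key == "caption":
            continue  # the caption is text, not a file path
        paths = value if isinstance(value, list) else [value]
        for p in paths:
            if not os.path.exists(p):
                problems.append(f"{key}: missing file {p}")
    if "judge" not in entry:
        problems.append("no 'judge' scores recorded for this image")
    return problems

# Example usage: scan every JSON under the (assumed) output directory.
for json_path in glob("../../Condition_stylebooth/**/*.json", recursive=True):
    for problem in check_condition_json(json_path):
        print(f"{json_path}: {problem}")
```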
The benchmark can be downloaded at: https://huggingface.co/datasets/wsnHowest/MultiRef
For details on the evaluation pipeline and metrics, see eval/README.md.
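As a rough illustration of how per-image metrics can be parallelized, here is a toy sketch; it is not the repository's implementation, and the Canny-style metric, directory names, and file pairing are assumptions made only for this example. The actual metrics live in `eval/`.

```python
# Toy sketch: parallel per-image scoring with a simple edge-overlap metric.
import os
from concurrent.futures import ProcessPoolExecutor

import cv2
import numpy as np

def canny_agreement(pair):
    """Edge-map IoU between a generated image and a stored Canny reference."""
    gen_path, ref_path = pair
    gen_img = cv2.imread(gen_path, cv2.IMREAD_GRAYSCALE)
    ref_img = cv2.imread(ref_path, cv2.IMREAD_GRAYSCALE)
    # The stored reference is assumed to already be an edge map.
    ref_img = cv2.resize(ref_img, (gen_img.shape[1], gen_img.shape[0]))
    gen_edges = cv2.Canny(gen_img, 100, 200) > 0
    ref_edges = ref_img > 127
    union = np.logical_or(gen_edges, ref_edges).sum()
    return float(np.logical_and(gen_edges, ref_edges).sum() / union) if union else 1.0

if __name__ == "__main__":
    # Hypothetical layout: generated images in outputs/, references in conditions/canny/.
    names = os.listdir("outputs")
    pairs = [(os.path.join("outputs", n), os.path.join("conditions/canny", n)) for n in names]
    with ProcessPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(canny_agreement, pairs))
    print("mean canny agreement:", float(np.mean(scores)))
```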
