ACM MM 2024 Grand Challenge: Visual Spatial Description
Leaderboard
Congratulations to the winners!
| Rank | Team | Score |
| 1 | Token | 63.2064 |
| 2 | USTC-IAT-United | 62.1149 |
| 3 | ppjj | 59.8638 |
| 4 | GXU-LIPE | 59.6864 |
| 5 | DMCV | 59.3998 |
Introduction
We propose the Visual Spatial Description (VSD) challenge at ACM MM 2024. The challenge falls within the research domain of visual spatial semantics understanding. In the VSD challenge, models and systems are expected to generate an accurate textual sentence that describes the spatial relationship between two given target objects in an input image. Alongside the challenge, a large-scale Visual Spatial Description dataset will be provided, consisting of 29,272 high-quality, manually annotated image-text pairs.
The challenge contains three subtasks, from easy to hard.
Challenge Task Definition and Metrics
Task 1: Classification of Visual Spatial Relationship.
Participants are required to construct models that extract the spatial relationship between two given objects O₁ and O₂ and output a triplet containing their spatial relationship. The relationship is chosen from nine labels: “on”, “in”, “next”, “under”, “above”, “behind”, “in front of”, “left”, and “right”. Since a pair of objects may hold more than one relationship, this subtask is evaluated with the F1 score of multi-label classification:
z1 = F1
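
For reference, the multi-label F1 could be computed along the lines of the following sketch (our own illustration with scikit-learn, not the official evaluation script; the micro averaging and the 0-100 scaling are assumptions):

from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# The nine predicate labels defined above.
LABELS = ["on", "in", "next", "under", "above",
          "behind", "in front of", "left", "right"]

def task1_f1(gold_labels, pred_labels):
    # gold_labels / pred_labels: one set of predicate labels per object pair.
    mlb = MultiLabelBinarizer(classes=LABELS)
    y_true = mlb.fit_transform(gold_labels)
    y_pred = mlb.transform(pred_labels)
    # Averaging mode is an assumption; check the official script for the exact setting.
    return 100 * f1_score(y_true, y_pred, average="micro")

print(task1_f1([{"on"}, {"left", "behind"}], [{"on"}, {"left"}]))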
Task 2: Description of Single Spatial Relationship.
Participants are required to build models that generate a textual description of the single spatial relationship between the two objects O₁ and O₂. Each data entry has one or more ground-truth sentences. We compute BLEU-4 and SPICE between the predicted sentence and each ground truth and keep the maximum score. Submitted models are ranked by a weighted sum z2 of the BLEU-4 and SPICE scores:
z2 = 0.4 × BLEU-4 + 0.6 × SPICE
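
The max-over-references step for BLEU-4 could look like the sketch below (our own illustration with NLTK, not the official script; SPICE is omitted here because it is typically computed with the separate Java-based SPICE tool):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def max_bleu4(prediction, references):
    # Score the prediction against each ground-truth sentence and keep the best BLEU-4.
    smooth = SmoothingFunction().method1
    hyp = prediction.lower().split()
    return 100 * max(
        sentence_bleu([ref.lower().split()], hyp,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
        for ref in references
    )

print(max_bleu4("the book is on the table",
                ["The book is on the table", "A book lies on the table"]))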
Task 3: Description of Open-Ended Spatial Relationship.
In this task, we provide a more challenging dataset that contains more complex spatial descriptions. Contestants are required to construct models that generate textual descriptions of the spatial relationship between O₁ and O₂. Unlike Task 2, in Task 3 the model should generate three diverse descriptions. The evaluation of Task 3 has two parts, correctness and diversity. For correctness we use SPICE, as in Task 2; for diversity, we use mBLEU-4:
z3 = 0.5 × (50 − mBLEU-4) + 0.5 × SPICE
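
mBLEU-4 is commonly computed by scoring each generated sentence with BLEU-4 against the other generated sentences and averaging, so lower values indicate more diverse outputs; the official script may differ in detail from this NLTK sketch:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def mbleu4(sentences):
    # Average BLEU-4 of each generated sentence measured against the other generated sentences.
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(sentences):
        refs = [s.lower().split() for j, s in enumerate(sentences) if j != i]
        scores.append(sentence_bleu(refs, hyp.lower().split(),
                                    weights=(0.25, 0.25, 0.25, 0.25),
                                    smoothing_function=smooth))
    return 100 * sum(scores) / len(scores)

print(mbleu4(["the book is on the table",
              "a book lies on the wooden table",
              "on the table there is a book"]))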
Finally, we use the following score for ranking:
overall = 0.2 × z1 + 0.3 × z2 + 0.5 × z3
We provide Python scripts for evaluation; please refer to the baseline code.
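
Putting the pieces together, the overall ranking score can be reproduced from the sub-task metrics as in the minimal sketch below (the function name is our own; all metrics are assumed to be on a 0-100 scale, matching the leaderboard values above):

def overall_score(f1, bleu4, spice_t2, mbleu4, spice_t3):
    z1 = f1                                    # Task 1: multi-label F1
    z2 = 0.4 * bleu4 + 0.6 * spice_t2          # Task 2: BLEU-4 / SPICE mix
    z3 = 0.5 * (50 - mbleu4) + 0.5 * spice_t3  # Task 3: diversity + correctness
    return 0.2 * z1 + 0.3 * z2 + 0.5 * z3

print(overall_score(f1=70.0, bleu4=55.0, spice_t2=60.0, mbleu4=20.0, spice_t3=65.0))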
Dataset
Our dataset comprises two versions, VSDv1 and VSDv2, which contain the same images, two target objects with bounding boxes, and spatial descriptions in English. VSDv2 contains more complicated sentences than VSDv1.
Example of a sample:
[
  {
    "img_id": "1.jpg", // image id
    "triple_list": [
      {
        "s": "book", // tag of the subject
        "o": "table", // tag of the object
        "p": "on", // predicate label, one of “on”, “in”, “next”, “under”, “above”, “behind”, “in front of”, “left”, and “right”
        "s_bbox": [ymin, ymax, xmin, xmax], // coordinates of the subject box; the origin is the upper-left corner of the image, and (xmin, ymin) is the upper-left corner of the box
        "o_bbox": [ymin, ymax, xmin, xmax] // coordinates of the object box
      }
    ],
    "description": "The book is on the table."
  }
]
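
A minimal sketch of reading the annotation file (the file name here is a placeholder; substitute the actual file from the release):

import json

with open("vsd_annotations.json", "r", encoding="utf-8") as f:  # placeholder file name
    samples = json.load(f)

for sample in samples:
    img_id = sample["img_id"]
    for triple in sample["triple_list"]:
        # Bounding boxes are [ymin, ymax, xmin, xmax] with the origin at the image's upper-left corner.
        ymin, ymax, xmin, xmax = triple["s_bbox"]
        print(img_id, triple["s"], triple["p"], triple["o"], (xmin, ymin, xmax, ymax))
    print(sample["description"])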
The datasets (with images) are available on Google Drive.
Submission
Please submit the predicted results as TWO JSON files: one for Task 1 and Task 2 (task1-2.json) and the other for Task 3 (task3.json).
For task1-2.json:
[
  {
    "img_id": "1.jpg", // image id
    "triple_list": [
      {
        "s": "book", // tag of the subject
        "o": "table", // tag of the object
        "p": "predicted label", // predicate label, one of “on”, “in”, “next”, “under”, “above”, “behind”, “in front of”, “left”, and “right”
        "s_bbox": [ymin, ymax, xmin, xmax], // coordinates of the subject box; the origin is the upper-left corner of the image, and (xmin, ymin) is the upper-left corner of the box
        "o_bbox": [ymin, ymax, xmin, xmax] // coordinates of the object box
      }
    ],
    "description": ["sentence1"] // generated sentence for Task 2
  }
]
For task3.json:
[
  {
    "img_id": "1.jpg", // image id
    "description": ["sentence1", "sentence2", "sentence3"] // generated sentences
  }
]
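
A minimal sketch of writing the two submission files in the formats above (the prediction entries are hard-coded placeholders, not real model outputs):

import json

task12_predictions = [
    {
        "img_id": "1.jpg",
        "triple_list": [
            {
                "s": "book",
                "o": "table",
                "p": "on",                    # predicted label
                "s_bbox": [10, 80, 20, 120],  # [ymin, ymax, xmin, xmax]
                "o_bbox": [60, 200, 5, 220],
            }
        ],
        "description": ["The book is on the table."],  # Task 2 sentence
    }
]
task3_predictions = [
    {"img_id": "1.jpg", "description": ["sentence1", "sentence2", "sentence3"]}
]

with open("task1-2.json", "w", encoding="utf-8") as f:
    json.dump(task12_predictions, f, indent=2)
with open("task3.json", "w", encoding="utf-8") as f:
    json.dump(task3_predictions, f, indent=2)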
Participants can submit on Codabench, or send the results in a Zip file to vsdchallenge@gmail.com. We will review the submissions and publish the ranking here.
Baseline
We use pre-trained Vision-Language Models as the baselines.
Link to the code: https://github.com/LLLogen/VSDcode
Registration
Welcome! Please apply for the VSD challenge via the form at this link.
Feel free to contact us at vsdchallenge@gmail.com.
Timeline
Please note: the submission deadline is 11:59 p.m. (Anywhere on Earth) on the stated deadline date.
| Registration open | May 24, 2024 |
| Release of all datasets | May 28, 2024 |
| Evaluation results and ranking open | June 10, 2024 |
| Results Submission Deadline | August 1, 2024 |
| Challenge Paper Submission Deadline (follows the MM 2024 Workshop dates) | August 19, 2024 |
Rewards
Top-ranked participants in this competition will receive a certificate of achievement and will be recommended to write a technical paper for submission to the ACM ToMM Special Issue.
Organizers
Yu Zhao. College of Intelligence and Computing, Tianjin University, China.
Hao Fei. Skywork AI, National University of Singapore, Singapore.
Bobo Li. School of Cyber Science and Engineering, Wuhan University, China.
Meishan Zhang. Harbin Institute of Technology (Shenzhen), Shenzhen, China.
Min Zhang. Harbin Institute of Technology (Shenzhen), Shenzhen, China.