The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving is a growing trend. However, while MLLMs excel at semantic understanding, their ability to perform precise, quantitative spatial-temporal reasoning in real-world applications remains largely unexamined. To address this gap, we introduce the Spatial-Temporal Intelligence Benchmark (STI-Bench), detailed in our paper “STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?”. STI-Bench evaluates MLLMs' spatial-temporal intelligence through challenging tasks on real-world video data, including estimating and predicting object appearance, pose, displacement, and motion. Our benchmark covers diverse robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that even state-of-the-art MLLMs struggle significantly with these tasks, particularly those requiring precise distance estimation and motion analysis, highlighting a critical area for future research and development.
This repository provides reference evaluation scripts, such as `openai_test.py` for the OpenAI API and `opensource_test.py` for open-source models like Qwen2.5-VL. These are intended as a starting point for running your own evaluations.
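For orientation, the sketch below shows the general shape of such an evaluation loop. It is only an illustration: the column names (`video`, `question`) and the `query_model` helper are hypothetical placeholders, not the actual schema of `qa.parquet` or the code in the provided scripts.

```python
import os
import pandas as pd

PARQUET_FILE = "/path/to/STI-Bench/qa.parquet"  # question-answer annotations
VIDEO_DIR = "/path/to/STI-Bench/videos/"        # extracted from video.zip

def query_model(video_path: str, question: str) -> str:
    # Placeholder: call your MLLM here (API request or local inference) and return its answer.
    return "A"

df = pd.read_parquet(PARQUET_FILE)
predictions = []
for idx, row in df.iterrows():
    # Column names are illustrative; check qa.parquet for the real schema.
    video_path = os.path.join(VIDEO_DIR, str(row["video"]))
    predictions.append({"index": idx, "prediction": query_model(video_path, row["question"])})

pd.DataFrame(predictions).to_csv("predictions.csv", index=False)
```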
Here is a brief guide to get you started:
First, you need both the evaluation code from this repository and the dataset from Hugging Face.
- **Clone the code repository:**

  ```bash
  git clone https://github.com/MINT-SJTU/STI-Bench.git
  cd STI-Bench
  ```

- **Download the dataset:** You will need `git-lfs` to handle the large video files.

  > **Note:** We recommend cloning the dataset into a separate parent directory to avoid folder name conflicts, since the code and dataset repositories are both named `STI-Bench`.

  ```bash
  # Make sure git-lfs is installed (https://git-lfs.com)
  git lfs install
  git clone https://huggingface.co/datasets/MINT-SJTU/STI-Bench
  ```

  This will download the `qa.parquet` file and `video.zip`.
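Once the download finishes, you can quickly peek at the annotations with `pandas` (plus `pyarrow` or `fastparquet` for Parquet support); the snippet below only prints the shape and column names, so it works regardless of the exact schema.

```python
import pandas as pd

# Inspect the downloaded question-answer annotations.
df = pd.read_parquet("STI-Bench/qa.parquet")
print(df.shape)             # number of QA pairs and columns
print(df.columns.tolist())  # column names in this dataset release
print(df.head())            # a few example rows
```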
Next, prepare the data and configure the script paths before running the evaluation.
- **Prepare Data:** Unzip the `video.zip` file located in the dataset directory you just cloned. This will create a `videos` folder (an optional helper script for this step is sketched after this list).

- **Update Paths:** Open the evaluation script you wish to use (e.g., `opensource_test.py`). Update the `PARQUET_FILE` and `VIDEO_DIR` variables to the absolute paths of your dataset files.

  ```python
  # Example paths to modify in the script
  PARQUET_FILE = "/path/to/your/dataset/STI-Bench/qa.parquet"
  VIDEO_DIR = "/path/to/your/dataset/STI-Bench/videos/"
  ```

- **Run Evaluation:** After installing the dependencies required by your chosen model, execute the script.

  ```bash
  python opensource_test.py
  ```
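If you prefer to script the data-preparation step, here is a small helper sketch. It assumes the archive unpacks into a top-level `videos/` folder as described above, and `DATASET_DIR` is a path you fill in yourself.

```python
import os
import zipfile

DATASET_DIR = "/path/to/your/dataset/STI-Bench"  # where the Hugging Face dataset was cloned
PARQUET_FILE = os.path.join(DATASET_DIR, "qa.parquet")
VIDEO_DIR = os.path.join(DATASET_DIR, "videos")

# Extract video.zip once, if the videos folder does not exist yet.
if not os.path.isdir(VIDEO_DIR):
    with zipfile.ZipFile(os.path.join(DATASET_DIR, "video.zip")) as zf:
        zf.extractall(DATASET_DIR)

# Sanity-check the paths before setting PARQUET_FILE and VIDEO_DIR in the evaluation script.
assert os.path.isfile(PARQUET_FILE), f"missing {PARQUET_FILE}"
assert os.path.isdir(VIDEO_DIR), f"missing {VIDEO_DIR}"
print("Dataset ready:", PARQUET_FILE, VIDEO_DIR)
```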
STI-Bench provides a comprehensive benchmark for evaluating MLLMs' spatial-temporal understanding. Our findings reveal significant limitations in current models, particularly on precise quantitative tasks, with recurring errors in spatial quantification, understanding of temporal dynamics, and cross-modal integration. There is a substantial gap between current capabilities and the reliability needed for real-world applications such as embodied AI and autonomous driving. STI-Bench serves as a valuable tool for driving progress toward MLLMs that can accurately perceive and reason about the physical world.
```bibtex
@article{li2025sti,
  title={STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?},
  author={Yun Li and Yiming Zhang and Tao Lin and XiangRui Liu and Wenxiao Cai and Zheng Liu and Bo Zhao},
  journal={arXiv preprint arXiv:2503.23765},
  year={2025},
}
```

