Eval-anything aims to track the performance of all-modality large models (any-to-any models) on safety tasks and to evaluate their true capabilities.
- Datasets:
  - Self-developed Dataset: a dataset specifically designed for assessing the all-modality safety of large models.
  - Integration of Over 50 Open-source Datasets: diverse data sources for comprehensive safety assessment.
  - Five Core Evaluation Dimensions with 35 sub-dimensions.
- Embodied Safety Evaluation Framework:
- Covering Various Modality Evaluations: Text, image, video, speech, and action.
- Defining Major Task Categories in Embodied Safety: Corner cases, blind spots, fragile collections, critical points, and dangerous equipment.
- Proposing Major Goals of Embodied Safety Evaluation: Execution safety, long-range trajectory safety, and hardware safety.
- Platform Integration
- Eval-anything seamlessly integrates with FlagEval to enhance assessment effectiveness.
Eval-anything integrates a variety of open-source and self-developed benchmarks on LM safety. See the benchmark document for more information.
Step1: Install eval-anything:

```bash
conda create -n eval-anything python==3.11
conda activate eval-anything
pip install -e .
```
Step2: Set up configuration files.

Step3: Run the evaluation task:

```bash
bash scripts/run.sh
```
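Before Step3, it can help to confirm that the active environment matches the Python 3.11 requirement from Step1. A minimal sketch, assuming a POSIX shell; `check_python` is an illustrative helper name, not part of eval-anything:

```shell
# Illustrative helper (not shipped with eval-anything): compare the active
# interpreter's major.minor version against the required one.
check_python() {
  required="$1"
  actual="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  if [ "$actual" = "$required" ]; then
    echo "python $actual: ok"
  else
    echo "python $actual found, $required required"
  fi
}

check_python 3.11
```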
- Configuring `objaverse`:

```bash
python -m objathor.dataset.download_annotations --version 2023_07_28 --path /path/to/objaverse_assets
python -m objathor.dataset.download_assets --version 2023_07_28 --path /path/to/objaverse_assets
```
- Configuring `house`:

```bash
python scripts/download_objaverse_houses.py --save_dir /path/to/objaverse_houses --subset val
```

or, for the training subset:

```bash
python scripts/download_objaverse_houses.py --save_dir /path/to/objaverse_houses --subset train
```
- Downloading Datasets:

```bash
python scripts/download_dataset.py --save_dir /path/to/dataset
```
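The download commands above all write to placeholder paths. A quick pre-flight sketch, assuming a POSIX shell, to verify those directories exist before launching a task; `check_dirs` is an illustrative name, not part of the repo:

```shell
# Illustrative pre-flight check (not part of eval-anything): report any
# missing asset/house/dataset directories; return non-zero if one is absent.
check_dirs() {
  status=0
  for d in "$@"; do
    if [ -d "$d" ]; then
      echo "found: $d"
    else
      echo "missing: $d" >&2
      status=1
    fi
  done
  return "$status"
}

# Substitute the paths you passed via --path / --save_dir above.
check_dirs /path/to/objaverse_assets /path/to/objaverse_houses /path/to/dataset \
  || echo "run the download commands above first"
```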
- Configuring Environments:

```bash
pip install -e .[vla]
pip install --extra-index-url https://ai2thor-pypi.allenai.org ai2thor==0+966bd7758586e05d18f6181f459c0e90ba318bec
pip install -e "git+https://github.com/allenai/allenact.git@d055fc9d4533f086e0340fe0a838ed42c28d932e#egg=allenact&subdirectory=allenact" --no-deps
pip install -e "git+https://github.com/allenai/allenact.git@d055fc9d4533f086e0340fe0a838ed42c28d932e#egg=allenact_plugins[all]&subdirectory=allenact_plugins" --no-deps
```
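The pinned installs above can be smoke-tested by importing each package. A sketch assuming the importable module names match the package names (`check_import` is an illustrative helper, not part of the repo):

```shell
# Illustrative smoke test: report whether a Python module is importable
# in the current environment.
check_import() {
  if python3 -c "import $1" 2>/dev/null; then
    echo "$1: ok"
  else
    echo "$1: not installed"
  fi
}

check_import ai2thor
check_import allenact
check_import allenact_plugins
```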
- Running tasks:

```bash
bash scripts/run_vla.sh
```
We are accepting PRs for new benchmarks. Please read the development document carefully before contributing your benchmark.
If you have any questions while using eval-anything, don't hesitate to raise them on the GitHub issues page; we will reply within 2-3 working days.
Eval-anything is released under Apache License 2.0.
This repository benefits from multiple open-source projects. Thanks for their wonderful work and their efforts in promoting LLM research.
This work is supported by the Beijing Academy of Artificial Intelligence, Peking University, and Beijing University of Posts and Telecommunications.