Eval-anything aims to track the performance of all-modality large models (any-to-any models) on safety tasks and to evaluate their true capabilities.
- Datasets:
  - Self-developed Dataset: a dataset specifically designed for assessing the all-modality safety of large models.
  - Integration of Over 50 Open-source Datasets: diverse data sources for comprehensive safety assessment.
  - Five Core Evaluation Dimensions with 35 sub-dimensions.
- Embodied Safety Evaluation Framework:
- Covering Various Modality Evaluations: Text, image, video, speech, and action.
- Defining Major Task Categories in Embodied Safety: Corner cases, blind spots, fragile collections, critical points, and dangerous equipment.
- Proposing Major Goals of Embodied Safety Evaluation: Execution safety, long-range trajectory safety, and hardware safety.
- Platform Integration
- Eval-anything seamlessly integrates with FlagEval to enhance assessment effectiveness.
Eval-anything integrates a variety of open-source and self-developed benchmarks on LM safety. See the benchmark document for more information.
Step1: Install eval-anything:

```bash
conda create -n eval-anything python==3.11
conda activate eval-anything
pip install -e .
```
Step2: Set up configuration files.

Step3: Run the evaluation task:

```bash
bash scripts/run.sh
```
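Before Step3, it can help to confirm that the active environment matches the Python 3.11 requirement from Step1. A minimal sketch, assuming a POSIX shell; `check_python` is an illustrative helper name, not part of eval-anything:

```shell
# Illustrative helper (not shipped with eval-anything): compare the active
# interpreter's major.minor version against the required one.
check_python() {
  required="$1"
  actual="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  if [ "$actual" = "$required" ]; then
    echo "python $actual: ok"
  else
    echo "python $actual found, $required required"
  fi
}

check_python 3.11
```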
- Configuring `objaverse`:

```bash
python -m objathor.dataset.download_annotations --version 2023_07_28 --path /path/to/objaverse_assets
python -m objathor.dataset.download_assets --version 2023_07_28 --path /path/to/objaverse_assets
```
- Configuring `house`:

```bash
python scripts/download_objaverse_houses.py --save_dir /path/to/objaverse_houses --subset val
```

or, for the training subset:

```bash
python scripts/download_objaverse_houses.py --save_dir /path/to/objaverse_houses --subset train
```
- Downloading Datasets:

```bash
python scripts/download_dataset.py --save_dir /path/to/dataset
```
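The download commands above all write to placeholder paths. A quick pre-flight sketch, assuming a POSIX shell, to verify those directories exist before launching a task; `check_dirs` is an illustrative name, not part of the repo:

```shell
# Illustrative pre-flight check (not part of eval-anything): report any
# missing asset/house/dataset directories; return non-zero if one is absent.
check_dirs() {
  status=0
  for d in "$@"; do
    if [ -d "$d" ]; then
      echo "found: $d"
    else
      echo "missing: $d" >&2
      status=1
    fi
  done
  return "$status"
}

# Substitute the paths you passed via --path / --save_dir above.
check_dirs /path/to/objaverse_assets /path/to/objaverse_houses /path/to/dataset \
  || echo "run the download commands above first"
```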
- Configuring Environments:

```bash
pip install -e .[vla]
pip install --extra-index-url https://ai2thor-pypi.allenai.org ai2thor==0+966bd7758586e05d18f6181f459c0e90ba318bec
pip install -e "git+https://github.com/allenai/allenact.git@d055fc9d4533f086e0340fe0a838ed42c28d932e#egg=allenact&subdirectory=allenact" --no-deps
pip install -e "git+https://github.com/allenai/allenact.git@d055fc9d4533f086e0340fe0a838ed42c28d932e#egg=allenact_plugins[all]&subdirectory=allenact_plugins" --no-deps
```
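The pinned installs above can be smoke-tested by importing each package. A sketch assuming the importable module names match the package names (`check_import` is an illustrative helper, not part of the repo):

```shell
# Illustrative smoke test: report whether a Python module is importable
# in the current environment.
check_import() {
  if python3 -c "import $1" 2>/dev/null; then
    echo "$1: ok"
  else
    echo "$1: not installed"
  fi
}

check_import ai2thor
check_import allenact
check_import allenact_plugins
```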
- Running tasks:

```bash
bash scripts/run_vla.sh
```
We are accepting PRs for new benchmarks. Please read the development document carefully before contributing your benchmark.
If you have any questions while using eval-anything, don't hesitate to raise them on the GitHub issues page; we will reply within 2-3 working days.
Eval-anything is released under Apache License 2.0.
This repository benefits from multiple open-source projects. Thanks for their wonderful work and their efforts in promoting LLM research.
This work is supported by the Beijing Academy of Artificial Intelligence, Peking University, and Beijing University of Posts and Telecommunications.