OpenCodeEval is a comprehensive framework for evaluating Large Language Models (LLMs) on code generation tasks. It provides standardized benchmarks, flexible model and prompt configurations, and robust evaluation metrics to assess performance across a range of programming challenges.
- Multiple benchmark dataset support:
  - HumanEval & HumanEvalPlus
  - MBPP & MBPPPlus
  - BigCodeBench & BigCodeBench-Hard
  - LeetCode
- Flexible model support:
  - Base models
  - Chat models
- Backend support:
  - vLLM acceleration
  - SGLang acceleration
  - OpenAI API integration
- Comprehensive evaluation tools:
  - Pass@k metrics (see the estimator sketch after this list)
  - Multiple-sample evaluation
  - Parallel processing
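Pass@k is conventionally computed with the unbiased estimator from the HumanEval paper: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k)/C(n, k). The sketch below is a minimal, self-contained version of that estimator; the function name and example values are illustrative and not taken from OpenCodeEval's internals.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    computed in a numerically stable product form.

    n: total samples generated for a problem
    c: samples that passed the tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # fewer than k failing samples, so every k-subset contains a pass
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem, 4 of them passing -> estimated pass@5.
print(pass_at_k(n=20, c=4, k=5))
```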
- Clone the repository:
git clone https://github.com/richardodliu/OpenCodeEval.git
cd OpenCodeEval
- Download benchmark datasets:
cd src/data
bash dataset.sh
- Install dependencies (from the repository root):
pip install -e .
- Run evaluation:
Basic usage:
OpenCodeEval --model_name <your_model_name> \
--save_path <output_directory> \
--num_gpus <number_of_gpus> \
--batch_size <batch_size> \
--task <benchmark_name>
Complete example:
OpenCodeEval --model_name '/path/to/your/model/checkpoint' \
--task 'LeetCodeTest' \
--save_path 'test/output' \
--num_gpus 1 \
--num_samples 1 \
--list_k '1' \
--temperature 0.0 \
--num_workers 10 \
--batch_size 200 \
--max_tokens 4096 \
--model_type 'Chat' \
--prompt_type 'Instruction' \
--prompt_prefix '' \
--prompt_suffix '' \
--trust_remote_code
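In the complete example above, `--temperature 0.0` together with `--num_samples 1` and `--list_k '1'` corresponds to greedy decoding and reports pass@1. To estimate pass@k for larger k, generate several samples per problem at a non-zero temperature. The variant below is illustrative only: the numeric values are placeholders, and it assumes `--list_k` accepts a comma-separated list of k values.
OpenCodeEval --model_name '/path/to/your/model/checkpoint' \
--task 'LeetCodeTest' \
--save_path 'test/output' \
--num_gpus 1 \
--num_samples 20 \
--list_k '1,5,10' \
--temperature 0.8 \
--num_workers 10 \
--batch_size 200 \
--max_tokens 4096 \
--model_type 'Chat' \
--prompt_type 'Instruction' \
--trust_remote_code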
- HumanEval & HumanEvalPlus (see the example record after this list):
  - Standard code generation benchmark
  - Function completion tasks
  - Python programming problems
  - Automated test cases
- MBPP & MBPPPlus:
  - Basic programming tasks
  - Few-shot learning support
  - Python implementation
  - Test-driven evaluation
- BigCodeBench & BigCodeBench-Hard:
  - Comprehensive coding tasks
  - Multiple difficulty levels
  - Various programming challenges
  - Extensive test coverage
- LeetCode:
  - Algorithm problems
  - Data structure challenges
  - Multiple difficulty levels
  - Real-world coding scenarios
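For reference, the upstream openai/human-eval release distributes each problem as a JSON record containing a prompt, an entry point, a canonical solution, and a test function. The record below is abridged and lightly simplified to show the shape of the data; OpenCodeEval's loaders may normalize these fields differently.

```python
# Abridged, illustrative HumanEval-style record (not a verbatim dataset entry).
problem = {
    "task_id": "HumanEval/0",
    "prompt": (
        "def has_close_elements(numbers, threshold):\n"
        '    """Return True if any two numbers are closer than threshold."""\n'
    ),
    "entry_point": "has_close_elements",
    "canonical_solution": (
        "    for i, a in enumerate(numbers):\n"
        "        for j, b in enumerate(numbers):\n"
        "            if i != j and abs(a - b) < threshold:\n"
        "                return True\n"
        "    return False\n"
    ),
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1.0, 2.0, 3.0], 0.5) == False\n"
        "    assert candidate([1.0, 2.8, 3.0], 0.3) == True\n"
    ),
}
```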
OpenCodeEval/
├── src/
│   ├── backend/      # Model backend implementations
│   ├── benchmark/    # Benchmark dataset implementations
│   ├── data/         # Dataset files
│   ├── eval/         # Evaluation utilities
│   └── main.py       # Main entry point
├── LICENSE           # Apache 2.0 license
└── README.md
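The backend/ directory wraps the supported generation engines (vLLM, SGLang, and the OpenAI API) so that benchmarks and evaluation code can stay engine-agnostic. The class below is a purely hypothetical sketch of that kind of abstraction; every name in it is invented for illustration and does not reflect OpenCodeEval's actual class hierarchy.

```python
from abc import ABC, abstractmethod

class GenerationBackend(ABC):
    """Hypothetical backend interface (illustrative only)."""

    @abstractmethod
    def generate(
        self,
        prompts: list[str],
        num_samples: int,
        temperature: float,
        max_tokens: int,
    ) -> list[list[str]]:
        """Return num_samples completions for each prompt."""

class EchoBackend(GenerationBackend):
    """Toy stand-in showing where a real engine (vLLM, SGLang, API) would plug in."""

    def generate(self, prompts, num_samples, temperature, max_tokens):
        # A real backend would batch the prompts and call the engine here.
        return [[p] * num_samples for p in prompts]

# Usage sketch: two "completions" per prompt from the toy backend.
print(EchoBackend().generate(["def add(a, b):"], num_samples=2,
                             temperature=0.0, max_tokens=16))
```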
The framework supports various configuration options (an illustrative command follows this list):
- Model configurations:
  - Model type (Base/Chat)
  - Number of GPUs
  - Batch size
  - Temperature
  - Max tokens
- Prompt configurations:
  - Prompt type (Completion/Instruction)
  - Prompt prefix/suffix
  - Stop words
- Evaluation configurations:
  - Number of samples
  - Number of workers
  - Timeout settings
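For instance, a base (non-chat) model is typically paired with completion-style prompts rather than instruction prompts. The command below is illustrative only: it uses the flags documented above, the numeric values are placeholders, and the exact task identifier should be checked against the implementations in src/benchmark/.
OpenCodeEval --model_name '/path/to/base/model' \
--task 'HumanEval' \
--save_path 'results/humaneval' \
--model_type 'Base' \
--prompt_type 'Completion' \
--num_gpus 1 \
--num_samples 1 \
--list_k '1' \
--temperature 0.0 \
--max_tokens 2048 \
--batch_size 100 \
--num_workers 10 \
--trust_remote_code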
- Fork the repository
- Create your feature branch
- Make your changes
- Submit a pull request
Please ensure your code follows the project's coding standards and includes appropriate tests.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you use OpenCodeEval in your research, please cite:
@software{OpenCodeEval,
  title  = {OpenCodeEval: An Extensible, Efficient, and Easy-to-use Evaluation Framework for Code Generation Tasks on Large Language Models},
  author = {Ren-Biao Liu and Yun-Hui Xia and Wei Shen and Tian-Hao Cheng and Chong-Han Liu},
  year   = {2024},
  url    = {https://github.com/richardodliu/OpenCodeEval}
}
We would like to thank the open-source projects and individuals whose work made OpenCodeEval possible.
For questions and feedback, please open an issue in the GitHub repository.