OpenCodeEval is a comprehensive framework for evaluating Large Language Models (LLMs) on code generation tasks. It provides standardized benchmarks, flexible model and prompt configurations, and robust evaluation metrics to assess performance across a range of programming challenges.
- Multiple benchmark dataset support:
  - HumanEval & HumanEvalPlus
  - MBPP & MBPPPlus
  - BigCodeBench & BigCodeBench-Hard
  - LeetCode
- Flexible model support:
  - Base models
  - Chat models
- Backend support:
  - vLLM acceleration
  - SGLang acceleration
  - OpenAI API integration
- Comprehensive evaluation tools:
  - Pass@k metrics (see the estimator sketch after this list)
  - Multiple-sample evaluation
  - Parallel processing
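Pass@k is conventionally computed with the unbiased estimator from the HumanEval paper: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k)/C(n, k). The sketch below is a minimal, self-contained version of that estimator; the function name and example values are illustrative and not taken from OpenCodeEval's internals.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    computed in a numerically stable product form.

    n: total samples generated for a problem
    c: samples that passed the tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # fewer than k failing samples, so every k-subset contains a pass
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem, 4 of them passing -> estimated pass@5.
print(pass_at_k(n=20, c=4, k=5))
```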
- Clone the repository:
git clone https://github.com/richardodliu/OpenCodeEval.git
cd OpenCodeEval
- Download benchmark datasets:
cd src/data
bash dataset.sh
- Install dependencies (from the repository root):
pip install -e .
- Run evaluation:
Basic usage:
OpenCodeEval --model_name <your_model_name> \
--save_path <output_directory> \
--num_gpus <number_of_gpus> \
--batch_size <batch_size> \
--task <benchmark_name>
Complete example:
OpenCodeEval --model_name '/path/to/your/model/checkpoint' \
--task 'LeetCodeTest' \
--save_path 'test/output' \
--num_gpus 1 \
--num_samples 1 \
--list_k '1' \
--temperature 0.0 \
--num_workers 10 \
--batch_size 200 \
--max_tokens 4096 \
--model_type 'Chat' \
--prompt_type 'Instruction' \
--prompt_prefix '' \
--prompt_suffix '' \
--trust_remote_code
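In the complete example above, `--temperature 0.0` together with `--num_samples 1` and `--list_k '1'` corresponds to greedy decoding and reports pass@1. To estimate pass@k for larger k, generate several samples per problem at a non-zero temperature. The variant below is illustrative only: the numeric values are placeholders, and it assumes `--list_k` accepts a comma-separated list of k values.
OpenCodeEval --model_name '/path/to/your/model/checkpoint' \
--task 'LeetCodeTest' \
--save_path 'test/output' \
--num_gpus 1 \
--num_samples 20 \
--list_k '1,5,10' \
--temperature 0.8 \
--num_workers 10 \
--batch_size 200 \
--max_tokens 4096 \
--model_type 'Chat' \
--prompt_type 'Instruction' \
--trust_remote_code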
- HumanEval & HumanEvalPlus (see the example record after this list):
  - Standard code generation benchmark
  - Function completion tasks
  - Python programming problems
  - Automated test cases
- MBPP & MBPPPlus:
  - Basic programming tasks
  - Few-shot learning support
  - Python implementation
  - Test-driven evaluation
- BigCodeBench & BigCodeBench-Hard:
  - Comprehensive coding tasks
  - Multiple difficulty levels
  - Various programming challenges
  - Extensive test coverage
- LeetCode:
  - Algorithm problems
  - Data structure challenges
  - Multiple difficulty levels
  - Real-world coding scenarios
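For reference, the upstream openai/human-eval release distributes each problem as a JSON record containing a prompt, an entry point, a canonical solution, and a test function. The record below is abridged and lightly simplified to show the shape of the data; OpenCodeEval's loaders may normalize these fields differently.

```python
# Abridged, illustrative HumanEval-style record (not a verbatim dataset entry).
problem = {
    "task_id": "HumanEval/0",
    "prompt": (
        "def has_close_elements(numbers, threshold):\n"
        '    """Return True if any two numbers are closer than threshold."""\n'
    ),
    "entry_point": "has_close_elements",
    "canonical_solution": (
        "    for i, a in enumerate(numbers):\n"
        "        for j, b in enumerate(numbers):\n"
        "            if i != j and abs(a - b) < threshold:\n"
        "                return True\n"
        "    return False\n"
    ),
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1.0, 2.0, 3.0], 0.5) == False\n"
        "    assert candidate([1.0, 2.8, 3.0], 0.3) == True\n"
    ),
}
```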
OpenCodeEval/
├── src/
│   ├── backend/      # Model backend implementations
│   ├── benchmark/    # Benchmark dataset implementations
│   ├── data/         # Dataset files
│   ├── eval/         # Evaluation utilities
│   └── main.py       # Main entry point
├── LICENSE           # Apache 2.0 license
└── README.md
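The backend/ directory wraps the supported generation engines (vLLM, SGLang, and the OpenAI API) so that benchmarks and evaluation code can stay engine-agnostic. The class below is a purely hypothetical sketch of that kind of abstraction; every name in it is invented for illustration and does not reflect OpenCodeEval's actual class hierarchy.

```python
from abc import ABC, abstractmethod

class GenerationBackend(ABC):
    """Hypothetical backend interface (illustrative only)."""

    @abstractmethod
    def generate(
        self,
        prompts: list[str],
        num_samples: int,
        temperature: float,
        max_tokens: int,
    ) -> list[list[str]]:
        """Return num_samples completions for each prompt."""

class EchoBackend(GenerationBackend):
    """Toy stand-in showing where a real engine (vLLM, SGLang, API) would plug in."""

    def generate(self, prompts, num_samples, temperature, max_tokens):
        # A real backend would batch the prompts and call the engine here.
        return [[p] * num_samples for p in prompts]

# Usage sketch: two "completions" per prompt from the toy backend.
print(EchoBackend().generate(["def add(a, b):"], num_samples=2,
                             temperature=0.0, max_tokens=16))
```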
The framework supports various configuration options (an illustrative command follows this list):
- Model configurations:
  - Model type (Base/Chat)
  - Number of GPUs
  - Batch size
  - Temperature
  - Max tokens
- Prompt configurations:
  - Prompt type (Completion/Instruction)
  - Prompt prefix/suffix
  - Stop words
- Evaluation configurations:
  - Number of samples
  - Number of workers
  - Timeout settings
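For instance, a base (non-chat) model is typically paired with completion-style prompts rather than instruction prompts. The command below is illustrative only: it uses the flags documented above, the numeric values are placeholders, and the exact task identifier should be checked against the implementations in src/benchmark/.
OpenCodeEval --model_name '/path/to/base/model' \
--task 'HumanEval' \
--save_path 'results/humaneval' \
--model_type 'Base' \
--prompt_type 'Completion' \
--num_gpus 1 \
--num_samples 1 \
--list_k '1' \
--temperature 0.0 \
--max_tokens 2048 \
--batch_size 100 \
--num_workers 10 \
--trust_remote_code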
- Fork the repository
- Create your feature branch
- Make your changes
- Submit a pull request
Please ensure your code follows the project's coding standards and includes appropriate tests.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you use OpenCodeEval in your research, please cite:
@software{OpenCodeEval,
  title  = {OpenCodeEval: An Extensible, Efficient, and Easy-to-use Evaluation Framework for Code Generation Tasks on Large Language Models},
  author = {Ren-Biao Liu and Yun-Hui Xia and Wei Shen and Tian-Hao Cheng and Chong-Han Liu},
  year   = {2024},
  url    = {https://github.com/richardodliu/OpenCodeEval}
}
We would like to thank the open-source projects and individuals whose work made OpenCodeEval possible.
For questions and feedback, please open an issue in the GitHub repository.