EvalPlus v0.2.0

🔥 Announcing MBPP+

MBPP is a dataset curated by Google. Its full set includes around 1000 crowd-sourced Python programming problems. However, some of these problems are noisy (e.g., prompts that make no sense or broken tests). Consequently, a subset of the data (~427 problems) has been hand-verified by the original authors: MBPP-sanitized.

MBPP+ improves MBPP based on its sanitized version (MBPP-sanitized):

- We further hand-verify the problems and trim ill-formed ones, keeping 399 problems
- We also fix the problems whose implementation is wrong (more details can be found here)
- We perform test augmentation to increase the number of tests by 35x (on average from 3.1 to 108.5 per problem)
- We maintain scripting compatibility with HumanEval+: simply pass --dataset mbpp to evalplus.evaluate, codegen/generate.py, tools/checker.py, and tools/sanitize.py
- An initial leaderboard is available at https://evalplus.github.io/leaderboard.html and we will keep updating it
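To illustrate why the extra tests matter, here is a minimal sketch built on a made-up task with hand-written inputs. Nothing in it is taken from MBPP or MBPP+: the task, the function names, and the inputs are all hypothetical. A buggy solution agrees with a reference implementation on a few base-style inputs but is caught once more inputs are checked.

```python
# Hypothetical task: return the second-smallest distinct value in a list.
def reference(nums):
    return sorted(set(nums))[1]


def buggy(nums):
    # Bug: duplicates are not removed, so [1, 1, 2] yields 1 instead of 2.
    return sorted(nums)[1]


def passes(candidate, inputs):
    """Return True if the candidate agrees with the reference on every input."""
    return all(candidate(list(x)) == reference(list(x)) for x in inputs)


# A handful of "base" inputs, then a larger set with extra edge cases added.
base_inputs = [[1, 2, 3], [5, 4, 9], [10, 20]]
plus_inputs = base_inputs + [[1, 1, 2], [3, 3, 3, 7], [0, -1, -1]]

print("base:", passes(buggy, base_inputs))         # base: True
print("base + plus:", passes(buggy, plus_inputs))  # base + plus: False
```

The same solution is counted as correct under the small test set and incorrect under the larger one, which is the gap that the "base" and "base + plus" scores are meant to expose.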

A typical workflow for using MBPP+:

# Step 1: Generate MBPP solutions
from evalplus.data import get_mbpp_plus, write_jsonl
def GEN_SOLUTION(prompt: str) -> str:
    # Placeholder: have the LLM produce the whole solution based on the prompt
    raise NotImplementedError
samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_mbpp_plus().items()
]
write_jsonl("samples.jsonl", samples)
# Optionally post-process the samples to sanitize the LLM-produced code
# e.g., https://github.com/evalplus/evalplus/blob/master/tools/sanitize.py

# Step 2: Evaluation on MBPP+
docker run -v $(pwd):/app ganler/evalplus:latest --dataset mbpp --samples samples.jsonl
# STDOUT will display the scores for "base" (with MBPP tests) and "base + plus" (with additional MBPP+ tests)
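
The post-processing mentioned in Step 1 often comes down to pulling the code out of a chatty model response. The sketch below is only an illustration of that idea and is not the logic of tools/sanitize.py; the helper name extract_code and the sample response are hypothetical. It keeps the first Markdown-fenced block if there is one and otherwise returns the response unchanged.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this snippet readable

# Match the first fenced block, with an optional "python"/"py" language tag.
FENCE_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the whole response if none is found."""
    match = FENCE_RE.search(response)
    code = match.group(1) if match else response
    return code.strip() + "\n"


response = (
    "Here is the solution:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\nHope this helps!"
)
print(extract_code(response))
```

A helper like this could be applied to each generated solution before the samples are written to samples.jsonl.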

🔥 HumanEval+ Maintenance

Leaderboard updates (now 41 models!): https://evalplus.github.io/leaderboard.html
- DeepSeek Coder series
- Phind-CodeLlama
- Mistral and Zephyr series
- Smaller StarCoders
HumanEval+ is upgraded from v0.1.6 to v0.1.9:
- Test-case fixes: 0, 3, 9, 148
- Prompt fixes: 114
- Contract fixes: 1, 2, 99, 35, 28, 32, 160

PyPI: https://pypi.org/project/evalplus/0.2.0/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.2.0/images/sha256-6f1b9bd13930abfb651a99d4c6a55273271f73e5b44c12dcd959a00828782dd6
