🌸 BigCodeBench Leaderboard
BigCodeBench evaluates LLMs with practical and challenging programming tasks.
📝 Notes
- Evaluated using BigCodeBench.
- Hard Set vs Full Set:
  - Hard Set: a subset of ~150 BigCodeBench tasks that are more user-facing and challenging.
  - Full Set: the full set of 1,140 BigCodeBench tasks.
- Models are ranked by (calibrated) Pass@1 using greedy decoding (see the pass@1 sketch after this list). Setup details can be found here.
- Complete vs Instruct:
  - Complete: code completion based on the structured, long-context docstring. This variant tests whether models are good at coding (see the prompt-style illustration after this list).
  - Instruct (🔥Vibe Check🔥): code generation based on brief NL-oriented instructions. This variant tests whether models truly understand human intent well enough to code.
- Wondering about the relative performance among models, or the current progress of task solve rates? Check out the 🤗 Hugging Face Leaderboard!
- 🧠 indicates an evaluation setup without response prefilling during generation, which may let the model produce a reasoning process before the final code.
- ✨ marks models evaluated in a chat setting, while others perform direct code completion. We note that some instruction-tuned models lack a chat template in their tokenizer configuration.
- Model providers are responsible for avoiding data contamination. Models trained on closed data can be affected by contamination.
- 💚 means open weights and open data. 💙 means open weights and open SFT data, but the base model is not data-open. What does this imply? 💚💙 models open-source their data, so one can concretely reason about contamination.
- "Size" here is the number of model parameters during inference.
🤗 More Leaderboards
In addition to the BigCodeBench leaderboard, we recommend assessing LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as:
- SWE Arena
- EvalPlus Leaderboard
- Spider 2.0
- BigCodeBench Leaderboard
- Chatbot Arena Leaderboard
- CrossCodeEval
- ClassEval
- CRUXEval
- Code Lingua
- Evo-Eval
- HumanEval.jl - Julia version of HumanEval with EvalPlus test cases
- InfiCoder-Eval
- LiveCodeBench
- NaturalCodeBench
- RepoBench
- SWE-bench
- TabbyML Leaderboard
- OOP
🙏 Acknowledgements
- We thank the EvalPlus team for providing the leaderboard template.
- We are grateful for the significant contributions from the BigCode community.