🎉 Our paper has been accepted to the 31st International Conference on Computational Linguistics (COLING 2025).
If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏
Introduction
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.
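As a rough illustration of what scoring a code-judging task can look like, the sketch below compares a model's predicted verdicts against reference verdicts and reports accuracy and macro-averaged F1. It is a minimal sketch, not the benchmark's actual evaluation script: the verdict labels in `VERDICTS` and the function names `accuracy` and `macro_f1` are assumptions made for this example, and the real label set and metrics used by CJ-Eval may differ.

```python
# Hypothetical verdict labels for illustration only; the benchmark's
# actual label set may differ.
VERDICTS = ("correct", "compile_error", "wrong_answer", "runtime_error", "time_limit")


def accuracy(gold, pred):
    """Fraction of solutions whose predicted verdict matches the reference."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=VERDICTS):
    """Unweighted mean of per-label F1, skipping labels absent from both lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        if tp + fp + fn == 0:
            continue  # label never appears, so it carries no signal
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: reference verdicts versus a model's judgments.
gold = ["correct", "wrong_answer", "compile_error", "correct"]
pred = ["correct", "correct", "compile_error", "wrong_answer"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Macro-averaging weights every verdict type equally, which keeps a model from scoring well simply by always predicting the most common verdict.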
If you find CodeJudge-Eval useful for your research and applications, please cite using this BibTeX:
@misc{zhao2024codejudgeevallargelanguagemodels,
      title={CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?},
      author={Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma},
      year={2024},
      eprint={2408.10718},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2408.10718},
}
About
[COLING 2025] CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?