CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

🎉 Our paper has been accepted to the 31st International Conference on Computational Linguistics (COLING 2025).

If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏

Introduction

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.
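To make the judging setup concrete, here is a minimal sketch of how a code-judging item could be posed as a multiple-choice question and scored. It is not the benchmark's actual harness: the verdict labels, the `JudgingItem` fields, the prompt wording, and the helper names `build_prompt`, `parse_choice`, and `accuracy` are all assumptions made for illustration. See the paper for the real task format and label set.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical verdict labels, for illustration only; the actual CJ-Eval
# label set is defined in the paper and may differ.
VERDICTS = ["Accepted", "Compilation Error", "Runtime Error", "Wrong Answer"]


@dataclass
class JudgingItem:
    """One judging question: a problem, a candidate solution, and the true verdict."""
    problem: str
    solution: str
    gold_verdict: str


def build_prompt(item: JudgingItem) -> str:
    """Render the item as a multiple-choice question with lettered options."""
    options = "\n".join(
        f"({chr(ord('A') + i)}) {verdict}" for i, verdict in enumerate(VERDICTS)
    )
    return (
        f"Problem:\n{item.problem}\n\n"
        f"Candidate solution:\n{item.solution}\n\n"
        f"What is the verdict for this solution?\n{options}\n"
        "Answer with a single letter."
    )


def parse_choice(reply: str) -> Optional[str]:
    """Map a reply such as 'B' or '(b)' to a verdict using its first letter.

    Returns None when the first letter is not a valid option or no letter exists.
    """
    for ch in reply.strip().upper():
        if ch.isalpha():
            idx = ord(ch) - ord("A")
            return VERDICTS[idx] if 0 <= idx < len(VERDICTS) else None
    return None


def accuracy(items: List[JudgingItem], replies: List[str]) -> float:
    """Fraction of items whose parsed reply matches the gold verdict."""
    if not items:
        return 0.0
    correct = sum(
        parse_choice(reply) == item.gold_verdict
        for item, reply in zip(items, replies)
    )
    return correct / len(items)
```

In this sketch the model never writes code. It only has to classify a given solution, which is the judging perspective the benchmark takes. Replies would come from whichever LLM is being evaluated; the parser here is deliberately simple and treats any unparseable reply as wrong.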

Experiment Results

More Details

More details can be found in our paper.

📑 Citation

If you find CodeJudge-Eval useful for your research and applications, please cite using this BibTeX:

@misc{zhao2024codejudgeevallargelanguagemodels,
      title={CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?}, 
      author={Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma},
      year={2024},
      eprint={2408.10718},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2408.10718}, 
}
