CodeHalu: Code Hallucinations in LLMs Driven by Execution-based Verification
Dataset Description
CodeHaluEval is a comprehensive benchmark for assessing the performance of Large Language Models (LLMs) in code generation tasks. It contains 8,883 samples drawn from 699 diverse programming tasks and is specifically designed to quantify and characterize the tendency of LLMs to produce code hallucinations and other errors during code generation. Using our CodeHalu dynamic detection algorithm, which verifies generated code by executing it, researchers can identify and categorize different types of code hallucinations, helping to improve the reliability of LLMs in real-world programming environments.
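The core idea of execution-based verification is to run a model's generated code against test cases and classify the outcome, rather than judging the code statically. The sketch below illustrates this under stated assumptions: the function name `verify_by_execution` and the outcome labels (`pass`, `runtime_error`, `wrong_output`, `timeout`) are illustrative and not taken from the CodeHalu codebase.

```python
import subprocess
import sys

def verify_by_execution(code: str, stdin_text: str, expected: str,
                        timeout: float = 5.0) -> str:
    """Execute candidate code on one test case and classify the outcome.

    A minimal sketch of execution-based verification: run the code in a
    subprocess, feed it the test input, and compare stdout to the expected
    output. Labels here are illustrative, not CodeHalu's taxonomy.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"          # non-terminating or too slow
    if result.returncode != 0:
        return "runtime_error"    # raised an exception, e.g. NameError
    if result.stdout.strip() != expected.strip():
        return "wrong_output"     # ran to completion but output is wrong
    return "pass"

# Correct code passes; hallucinated code (undefined name) is flagged
print(verify_by_execution("print(int(input()) * 2)", "21", "42"))  # pass
print(verify_by_execution("print(undefined_var)", "", ""))         # runtime_error
```

Aggregating such per-sample outcomes over many tasks is what turns individual execution failures into a measurable hallucination rate for a model.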
If you find our work useful, please consider citing:
@misc{tian2024codehaluinvestigatingcodehallucinations,
      title={CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification},
      author={Yuchen Tian and Weixiang Yan and Qian Yang and Xuandong Zhao and Qian Chen and Wen Wang and Ziyang Luo and Lei Ma and Dawn Song},
      year={2024},
      eprint={2405.00253},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.00253},
}