🚨 WIP: Artifacts for the leaderboard are expected to be finished soon 🚨
The Code Lingua leaderboard evaluates LLMs on programming language translation. While other leaderboards assess the ability of LLMs to understand natural language (NL) for code synthesis, the ultimate way to assess whether LLMs understand code syntax and semantics is code translation. Code Lingua serves as such a leaderboard: it compares the ability of LLMs to understand what code implements in a source language and to translate the same semantics into a target language.
Execute the following to install all requirements:

```bash
pip3 install -r requirements.txt
```
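Installing into a virtual environment is optional but recommended; the setup below is an assumption, as this README does not pin a Python version:

```bash
# Optional: isolate dependencies in a virtual environment (assumed setup)
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
```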
To create a Docker image, execute the following:

```bash
docker build -t codetlingua .
```
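Once built, the image can be started interactively. The invocation below assumes the image provides a shell entrypoint, which this README does not specify:

```bash
# Start an interactive container from the image built above (assumed invocation)
docker run -it codetlingua bash
```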
The datasets used in this study are available on HuggingFace. The current version of the leaderboard consists of the following datasets:

| Dataset | PLs | # Samples / Language | # Tests / Sample |
|---------|-----|----------------------|------------------|
| CodeNet | C, C++, Go, Java, Python | 200 | 1 |
| AVATAR  | Java, Python | 250 | ~50 |
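To inspect the data directly, it can typically be loaded with the HuggingFace `datasets` library. The repository ID and split below are placeholders, since this README does not give the exact HuggingFace path:

```python
# Minimal sketch for pulling the benchmark from HuggingFace.
# "ORG/DATASET" and the split name are placeholders; substitute the real values.
from datasets import load_dataset

dataset = load_dataset("ORG/DATASET", split="train")
print(dataset[0])  # inspect one sample
```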
In order to use GPT, Claude, and Gemini models, the following environment variables must be set before running the code:
- GPT: OPENAI_API_KEY
- Claude: ANTHROPIC_KEY
- Gemini: GEMINI_KEY
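For example, in a POSIX shell (replace the placeholder values with your actual keys):

```bash
export OPENAI_API_KEY="sk-..."   # for GPT models
export ANTHROPIC_KEY="..."       # for Claude models
export GEMINI_KEY="..."          # for Gemini models
```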
The current version has been tested with gpt-3.5-turbo, gpt-4, gpt-4-0125, gemini-pro (1.0), and claude-3-opus-20240229.
The Code Lingua artifacts consist of multiple modules that can be used to evaluate new LLMs on our benchmarks. You can either use our artifacts to evaluate your model yourself, or file a request so we can evaluate your model and add it to our leaderboard.
The first step is to use the model to generate raw translations. Please see the translate.sh script for how to generate translations. A sample translation command is provided below:

```bash
bash scripts/translate.sh deepseek-coder-1.3b-instruct codenet Java Python deepseek 0.2 10 16 1024 3 0
```
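The arguments to translate.sh are positional. The annotations below are best-guess interpretations, not documented behavior; verify them against scripts/translate.sh before relying on them:

```bash
# Same sample command as above, with assumed argument meanings:
#   deepseek-coder-1.3b-instruct  -> model to evaluate
#   codenet                       -> dataset
#   Java / Python                 -> source / target language
#   deepseek                      -> prompt/template family (assumed)
#   0.2                           -> sampling temperature (assumed)
#   10 16 1024 3 0                -> remaining decoding/runtime settings (see the script)
bash scripts/translate.sh deepseek-coder-1.3b-instruct codenet Java Python deepseek 0.2 10 16 1024 3 0
```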
The raw translations generated by LLMs contain extra template-related tokens and natural language. Please see the sanitize.sh script for how to sanitize the generated translations. A sample sanitization command is provided below:

```bash
bash scripts/sanitize.sh translations deepseek-coder-1.3b-instruct codenet Java Python 0.2 remove_prompt
```
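To illustrate the kind of extraction sanitization performs (an illustrative sketch only, not the implementation inside sanitize.sh):

````python
import re

# Illustrative sketch: keep only the code inside the first fenced block of a raw
# LLM response, dropping the surrounding natural language. NOT the repo's code.
def extract_code(raw_output: str) -> str:
    match = re.search(r"```(?:\w+)?\n(.*?)```", raw_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    return raw_output.strip()  # fall back to the raw text if no fence is found

raw = "Here is the translation:\n```python\nprint('hello')\n```\nHope this helps!"
print(extract_code(raw))  # -> print('hello')
````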
The final step is to evaluate the correctness of the sanitized translations. Please check the evaluate.sh script for how to run the test suites against the translations. A sample evaluation command is given below:

```bash
bash scripts/evaluate.sh translations deepseek-coder-1.3b-instruct codenet Java Python 0.2 8
```
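Conceptually, evaluation runs each translated program against its test suite and compares actual to expected output. The sketch below illustrates that idea; the file layout and test format are assumptions, not the logic of evaluate.sh:

```python
import subprocess

# Illustrative output-based correctness check (assumed test format: stdin/stdout pairs)
def passes_test(cmd: list[str], stdin_text: str, expected_stdout: str, timeout: int = 10) -> bool:
    try:
        result = subprocess.run(cmd, input=stdin_text, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # treat timeouts as failures
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()

# Hypothetical usage: run one translated Python program on one test case
print(passes_test(["python3", "translated.py"], "1 2\n", "3\n"))
```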
The artifacts of the Code Lingua leaderboard are continuously being improved. If you notice any inconsistencies, please feel free to open a PR or contact Ali (alirezai@illinois.edu).