This repository contains a benchmarking toolkit for evaluating Large Language Models (LLMs) on competitive programming tasks. The toolkit provides a standardized way to test your LLM's code generation capabilities across a diverse set of problems.
LiveCodeBench Pro evaluates LLMs on their ability to generate solutions for programming problems. The benchmark includes problems of varying difficulty levels from different competitive programming platforms.
- Ubuntu 20.04 or higher (or another distro with kernel version >= 3.10 and cgroup support; refer to go-judge for more details)
- Python 3.12 or higher
- pip package manager
- Docker (for running the judge server); make sure your user has permission to run Docker commands
Install the required dependencies:

    pip install -r requirements.txt

Or install directly using uv:

    uv sync
Ensure Docker is installed and running:

    docker --version

Make sure your user has permission to run Docker commands. On Linux, you may need to add your user to the docker group:

    sudo usermod -aG docker $USER

Then log out and back in for the changes to take effect.
Create your own LLM class by extending the abstract LLMInterface class in api_interface.py. Your implementation needs to override the call_llm method.
Example:

    from api_interface import LLMInterface


    class YourLLM(LLMInterface):
        def __init__(self):
            super().__init__()
            # Initialize your LLM client or resources here

        def call_llm(self, user_prompt: str):
            # Implement your logic to call your LLM with user_prompt
            # Return a tuple containing (response_text, metadata)
            # Example:
            response = your_llm_client.generate(user_prompt)
            return response.text, response.metadata

You can use the ExampleLLM class as a reference, which shows how to integrate with OpenAI's API.
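For concreteness, an OpenAI-backed implementation might look roughly like the sketch below (assumptions: the openai Python SDK is installed and OPENAI_API_KEY is set; ExampleLLM in api_interface.py remains the authoritative reference):

    import os

    from openai import OpenAI

    from api_interface import LLMInterface


    class OpenAIBackedLLM(LLMInterface):
        """Illustrative sketch only; see ExampleLLM in api_interface.py for the real integration."""

        def __init__(self, model: str = "gpt-4o"):
            super().__init__()
            # The client reads OPENAI_API_KEY from the environment.
            self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
            self.model = model

        def call_llm(self, user_prompt: str):
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": user_prompt}],
            )
            # Return (response_text, metadata) as the interface requires.
            text = response.choices[0].message.content
            metadata = {
                "model": self.model,
                "usage": response.usage.model_dump() if response.usage else None,
            }
            return text, metadata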
Edit the benchmark.py file to use your LLM implementation:

    from your_module import YourLLM

    # Replace this line:
    llm_instance = YourLLM()  # Update with your LLM class

Also change the number of judge workers; keeping it at or below the number of physical CPU cores is recommended (see the sketch below).
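As a rough sketch, the relevant edits inside benchmark.py could look like the following (the worker-count variable name is an assumption; match it to whatever benchmark.py actually uses):

    # Illustrative edits inside benchmark.py; exact names may differ in your copy.
    import os

    from your_module import YourLLM  # hypothetical module holding your implementation

    llm_instance = YourLLM()  # point the benchmark at your own LLM class

    # Keep the judge worker count at or below the number of physical CPU cores.
    # os.cpu_count() reports logical cores, so halve it on hyperthreaded machines.
    num_judge_workers = min(8, os.cpu_count() or 1)  # assumed variable name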
Execute the benchmark script:

    python benchmark.py

The script will:
- Load the LiveCodeBench-Pro dataset from Hugging Face
- Process each problem with your LLM
- Extract C++ code from LLM responses automatically
- Submit solutions to the integrated judge system for evaluation
- Collect judge results and generate comprehensive statistics
- Save the results to benchmark_result.json
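After a run finishes, you can inspect the output file programmatically. Since the exact schema of benchmark_result.json is defined by the toolkit, the sketch below only probes its top-level structure rather than assuming specific keys:

    import json

    # benchmark_result.json is written by benchmark.py into the working directory.
    with open("benchmark_result.json", "r", encoding="utf-8") as f:
        results = json.load(f)

    # The schema is defined by the toolkit, so start by inspecting the top level.
    if isinstance(results, dict):
        print("Top-level keys:", list(results.keys()))
    elif isinstance(results, list):
        print("Number of entries:", len(results))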
Email your benchmark_result.json file to zz4242@nyu.edu to have it displayed on the leaderboard.
Please include the following information in your submission:
- LLM name and version
- Any other relevant details about your setup (e.g., prompting or inference configuration)
- Contact information
api_interface.py defines the abstract interface for LLM integration:
- LLMInterface: Abstract base class with methods for LLM interaction
- ExampleLLM: Example implementation using OpenAI's GPT-4o
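Conceptually, the interface implied by the usage above looks something like this (a sketch reconstructed from this README, not the verbatim contents of api_interface.py):

    from abc import ABC, abstractmethod


    class LLMInterface(ABC):  # sketch of the interface implied by this README
        @abstractmethod
        def call_llm(self, user_prompt: str):
            """Send user_prompt to the model and return (response_text, metadata)."""
            raise NotImplementedError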
benchmark.py is the main benchmarking script, which:
- Loads the dataset
- Processes each problem through your LLM
- Extracts C++ code from responses
- Submits solutions to the judge system
- Collects results and generates statistics
- Saves comprehensive results with judge verdicts
The judge integration module provides:
- Judge: Abstract base class for judge implementations
- LightCPVerifierJudge: LightCPVerifier integration for local solution evaluation
- Automatic problem data downloading from Hugging Face
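The judge module follows the same abstract-base-class pattern as the LLM interface. Purely to illustrate that pattern (the class and method names below are hypothetical, not the toolkit's actual API):

    from abc import ABC, abstractmethod


    class HypotheticalJudge(ABC):
        """Hypothetical illustration only; the real Judge / LightCPVerifierJudge API may differ."""

        @abstractmethod
        def evaluate(self, problem_id: str, cpp_source: str) -> dict:
            """Run a C++ solution against the problem's test cases and return a verdict."""
            raise NotImplementedError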
Utility functions for code processing:
- extract_longest_cpp_code(): Intelligent C++ code extraction from LLM responses
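A typical call might look like the following (a sketch; the module path and exact return value of extract_longest_cpp_code() are assumptions based on the description above):

    # Hypothetical usage; the module path ("utils") is an assumption.
    from utils import extract_longest_cpp_code

    llm_response = (
        "Here is my solution:\n\n"
        "```cpp\n#include <bits/stdc++.h>\nint main() { return 0; }\n```\n"
    )

    # Per the description above, the longest C++ snippet in the response is returned.
    cpp_source = extract_longest_cpp_code(llm_response)
    print(cpp_source)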
The benchmark uses the QAQAQAQAQ/LiveCodeBench-Pro and QAQAQAQAQ/LiveCodeBench-Pro-Testcase datasets from Hugging Face, which contain competitive programming problems of varying difficulty levels together with their test cases.
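If you want to browse the problems outside of benchmark.py, the datasets can also be loaded directly with the Hugging Face datasets library (a sketch; split names and field layout are best inspected after loading rather than assumed):

    from datasets import load_dataset

    # Dataset names are taken from this README; print the objects to see splits and columns.
    problems = load_dataset("QAQAQAQAQ/LiveCodeBench-Pro")
    testcases = load_dataset("QAQAQAQAQ/LiveCodeBench-Pro-Testcase")

    print(problems)
    print(testcases)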
For questions or support, please contact us at zz4242@nyu.edu.
