📎 Huggingface - Problems This dataset contains the problem descriptions and oracle programs for each problem.
📎 Huggingface - Test Cases This dataset contains the test cases generated by our HARDTESTGEN pipeline for each problem.
Please refer to the Huggingface dataset pages for a detailed description of the data structure.
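Both datasets can be loaded with the Huggingface `datasets` library. A minimal sketch (the problems dataset name is the one referenced later in this README; load the test cases dataset by the name shown on its Huggingface page):

```python
from datasets import load_dataset

# sigcp/hardtests_problems is the problems dataset referenced later in
# this README; substitute the test cases dataset name from its page.
problems = load_dataset("sigcp/hardtests_problems")
print(problems)  # inspect available splits, fields, and record counts
```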
```bash
pip install -r requirements.txt
```

Bubblewrap is an open-source sandbox tool that allows you to create and manage sandbox environments on Linux systems without requiring root privileges.
Install with root:

```bash
sudo apt-get update
sudo apt install bubblewrap
```

Install without root (build from source):
```bash
pip install meson ninja
export PATH="$HOME/.local/bin:$PATH"
git clone https://github.com/containers/bubblewrap.git
cd bubblewrap
meson setup build --prefix=$HOME/.local
meson compile -C build
meson install -C build
~/.local/bin/bwrap --version  # check installation
bwrap --version               # check installation (works once PATH is set)
```

Set up your API keys for the LLM services you plan to use (e.g., OpenAI, DeepSeek, Anthropic). You can set the API keys in HardTestGen/hard_tests_gen/llm_api.py.
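If you prefer not to hard-code keys, a pattern like the following can be adapted inside llm_api.py; the variable names here are assumptions for illustration, not the file's actual contents:

```python
import os

# Hypothetical key variables; match them to whatever llm_api.py actually uses.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
DEEPSEEK_API_KEY = os.environ.get("DEEPSEEK_API_KEY", "")
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")
```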
Run test_cases_kit_generation.py to generate a test cases kit for each problem. This step uses LLMs to synthesize input validators (IV), output judging functions (OJF), and input generators (IG).
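To make the three kit components concrete, here is an illustrative sketch for a toy problem ("given an integer n, print n*n"). The signatures are assumptions for exposition, not the exact interface the pipeline synthesizes:

```python
import random

def input_validator(inp: str) -> bool:
    """IV: accept only well-formed, in-range inputs."""
    try:
        n = int(inp.strip())
    except ValueError:
        return False
    return 1 <= n <= 10**9

def output_judging_function(inp: str, expected: str, actual: str) -> bool:
    """OJF: decide whether a candidate output is acceptable."""
    return actual.strip() == expected.strip()

def input_generator(seed: int) -> str:
    """IG: produce a random test input."""
    rng = random.Random(seed)
    return f"{rng.randint(1, 10**9)}\n"
```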
Example command:
```bash
python hard_tests_gen/test_cases_kit_generation.py \
    --dataset_name_or_path <huggingface_dataset_name> \
    --target_pids_path <target_pids.json> \
    --iv_and_ojf_gen_prompt_template test_cases_kit_prompt_iv_and_ojf \
    --ig_gen_prompt_template test_cases_kit_prompt_ig \
    --num_LLMGen_input 10 \
    --iv_and_ojf_gen_responses_save_path ./iv_and_ojf_gen_responses.jsonl \
    --ig_gen_responses_save_path ./ig_gen_responses.jsonl \
    --test_cases_kit_save_path ./test_cases_kits.jsonl \
    --model_name gpt-4o \
    --temperature 0.1 \
    --max_tokens 5120 \
    --num_parallel 10
```

Arguments:
- `--dataset_name_or_path`: Name of the Huggingface dataset to use (e.g., `sigcp/hardtests_problems`), or path to a local dataset in JSON/JSONL format.
- `--target_pids_path`: (Optional) Path to a JSON file containing a list of target problem IDs (see the example after this list). If not provided, all problems in the dataset will be used.
- `--iv_and_ojf_gen_prompt_template`: Prompt template for generating the Input Validator and Output Judging Function (default: `test_cases_kit_prompt_iv_and_ojf`).
- `--ig_gen_prompt_template`: Prompt template for Input Generation (default: `test_cases_kit_prompt_ig`).
- `--num_LLMGen_input`: Number of LLMGen inputs to generate for each problem (default: 10).
- `--iv_and_ojf_gen_responses_save_path`: Path to save LLM responses for IV/OJF generation.
- `--ig_gen_responses_save_path`: Path to save LLM responses for IG generation.
- `--test_cases_kit_save_path`: Path to save the generated test cases kits.
- `--model_name`: LLM model name (e.g., `gpt-4o`, `deepseek`).
- `--temperature`: Sampling temperature for the LLM (default: 0.1).
- `--max_tokens`: Maximum tokens for LLM output (default: 5120).
- `--num_parallel`: Number of parallel LLM requests (default: 10).
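If you only want a subset of problems, target_pids.json is just a JSON list of problem IDs, as described above. A minimal sketch for producing it (the IDs shown are placeholders):

```python
import json

# Placeholder problem IDs; use IDs that actually occur in your dataset.
with open("target_pids.json", "w") as f:
    json.dump(["pid_00001", "pid_00002"], f)
```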
Run test_cases_generation.py to generate concrete test cases (input-output pairs) for each problem, based on the previously generated test cases kit.
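Conceptually, this step runs each problem's kit end-to-end: the IG proposes inputs, the IV filters out malformed ones, and the oracle program is executed on the survivors to produce reference outputs. A rough sketch of that loop, with assumed names for the kit fields (the real script additionally handles compilation, sandboxing, and error reporting):

```python
import subprocess

def generate_test_cases(oracle_cmd, kit, num_inputs=10):
    """Sketch of the per-problem loop; kit field names are assumptions."""
    cases = []
    for seed in range(num_inputs):
        inp = kit["input_generator"](seed)   # IG: propose an input
        if not kit["input_validator"](inp):  # IV: drop malformed inputs
            continue
        # The real pipeline runs the oracle inside Bubblewrap; see the
        # sandboxing sketch after the argument list below.
        out = subprocess.run(oracle_cmd, input=inp, capture_output=True,
                             text=True, timeout=10).stdout
        cases.append({"input": inp, "output": out})
    return cases
```

Example command: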
```bash
python hard_tests_gen/test_cases_generation.py \
    --problem_data_path <problem_data_path> \
    --test_cases_kit_path ./test_cases_kits.jsonl \
    --test_cases_save_path ./test_cases.jsonl \
    --test_cases_related_contents_save_path ./test_cases_related_contents.jsonl \
    --bwrap_path bwrap \
    --cpp_compiler_path g++ \
    --python_interpreter_path python3 \
    --code_exec_temp_dir ./tmp \
    --max_workers 3 \
    --save_steps 10 \
    --log_steps 10 \
    --start 0 \
    --end 1000000
```

Arguments:
- `--problem_data_path`: Name of the Huggingface dataset to use (e.g., `sigcp/hardtests_problems`), or path to a local dataset in JSONL format.
- `--test_cases_kit_path`: Path to the test cases kit file generated in Step 1.
- `--test_cases_save_path`: Path to save the generated test cases (JSONL format).
- `--test_cases_related_contents_save_path`: Path to save related contents for each test case (JSONL format).
- `--bwrap_path`: Path to the Bubblewrap executable (default: `bwrap`); see the sandboxing sketch after this list.
- `--cpp_compiler_path`: Path to the C++ compiler (default: `g++`).
- `--python_interpreter_path`: Path to the Python interpreter (default: `python3`).
- `--code_exec_temp_dir`: Directory for temporary code execution files.
- `--max_workers`: Number of parallel workers (default: 3).
- `--save_steps`: Save results every N problems (default: 10).
- `--log_steps`: Log progress every N problems (default: 10).
- `--start`: Start index for problems to process (default: 0).
- `--end`: End index for problems to process (default: 1000000).
- Other advanced arguments are available; see the code for details.
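For reference, sandboxed execution with Bubblewrap amounts to prefixing the oracle (or Python interpreter) command with `bwrap` and a set of isolation flags. A minimal sketch, assuming a read-only root filesystem and no network; the exact flags used by test_cases_generation.py may differ:

```python
import subprocess

def run_in_sandbox(cmd, stdin_text, timeout=10):
    """Run `cmd` inside a Bubblewrap sandbox and return the completed process."""
    bwrap_cmd = [
        "bwrap",
        "--ro-bind", "/", "/",   # read-only view of the host filesystem
        "--dev", "/dev",         # fresh /dev
        "--proc", "/proc",       # fresh /proc
        "--tmpfs", "/tmp",       # writable scratch space
        "--unshare-all",         # isolate network, PID, and other namespaces
        "--die-with-parent",     # kill the sandbox if the parent dies
    ] + list(cmd)
    return subprocess.run(bwrap_cmd, input=stdin_text, capture_output=True,
                          text=True, timeout=timeout)

# e.g., run_in_sandbox(["python3", "oracle.py"], "5\n")
```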
Note:
- You must run Step 1 before Step 2, as Step 2 depends on the test cases kit generated in Step 1.
- Make sure all paths are set correctly and required files exist before running each step.
You may use the following command to filter out the problems whose test cases were not successfully generated and convert the remaining problems into the same format as our Huggingface dataset.
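Conceptually, the conversion keeps only problems with at least one successfully generated test case. A rough sketch (the `test_cases` field name is an assumption about the structure of test_cases.jsonl):

```python
import json

# Rough sketch of the filtering step; the "test_cases" field name is an
# assumption, not necessarily the actual schema of test_cases.jsonl.
with open("test_cases.jsonl") as fin, \
     open("filtered_test_cases.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        if record.get("test_cases"):  # keep problems with generated cases
            fout.write(json.dumps(record) + "\n")
```

Example command: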
```bash
python HardTestGen/hard_tests_gen/convert_format.py \
    --input_file ./test_cases.jsonl \
    --output_file ./filtered_test_cases.jsonl
```

Please cite us if you find this work useful, using the following BibTeX:
```bibtex
@misc{he2025hardtests,
  title={HardTests: Synthesizing High-Quality Test Cases for LLM Coding},
  author={Zhongmou He and Yee Man Choi and Kexun Zhang and Jiabao Ji and Junting Zhou and Dejia Xu and Ivan Bercovich and Aidan Zhang and Lei Li},
  year={2025},
  month={May}
}
```