📎 Huggingface - Problems This dataset contains the problem descriptions and oracle programs for each problem.
📎 Huggingface - Test Cases This dataset contains the test cases generated by our HARDTESTGEN pipeline for each problem.
Please refer to the Huggingface dataset pages for a detailed description of the data structure.
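Both datasets can be loaded with the Huggingface `datasets` library. A minimal sketch (the problems dataset name is the one referenced later in this README; load the test cases dataset by the name shown on its Huggingface page):

```python
from datasets import load_dataset

# sigcp/hardtests_problems is the problems dataset referenced later in
# this README; substitute the test cases dataset name from its page.
problems = load_dataset("sigcp/hardtests_problems")
print(problems)  # inspect available splits, fields, and record counts
```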
```bash
pip install -r requirements.txt
```

Bubblewrap is an open-source sandbox tool that allows you to create and manage sandbox environments on Linux systems without requiring root privileges.
Install with root:

```bash
sudo apt-get update
sudo apt install bubblewrap
```

Install without root (build from source):
```bash
pip install meson ninja
export PATH="$HOME/.local/bin:$PATH"
git clone https://github.com/containers/bubblewrap.git
cd bubblewrap
meson setup build --prefix=$HOME/.local
meson compile -C build
meson install -C build
~/.local/bin/bwrap --version  # check installation
bwrap --version               # check installation (works once PATH is set)
```

Set up your API keys for the LLM services you plan to use (e.g., OpenAI, DeepSeek, Anthropic). You can set the API keys in HardTestGen/hard_tests_gen/llm_api.py.
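If you prefer not to hard-code keys, a pattern like the following can be adapted inside llm_api.py; the variable names here are assumptions for illustration, not the file's actual contents:

```python
import os

# Hypothetical key variables; match them to whatever llm_api.py actually uses.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
DEEPSEEK_API_KEY = os.environ.get("DEEPSEEK_API_KEY", "")
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")
```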
Run test_cases_kit_generation.py to generate a test cases kit for each problem. This step uses LLMs to synthesize input validators (IV), output judging functions (OJF), and input generators (IG).
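To make the three kit components concrete, here is an illustrative sketch for a toy problem ("given an integer n, print n*n"). The signatures are assumptions for exposition, not the exact interface the pipeline synthesizes:

```python
import random

def input_validator(inp: str) -> bool:
    """IV: accept only well-formed, in-range inputs."""
    try:
        n = int(inp.strip())
    except ValueError:
        return False
    return 1 <= n <= 10**9

def output_judging_function(inp: str, expected: str, actual: str) -> bool:
    """OJF: decide whether a candidate output is acceptable."""
    return actual.strip() == expected.strip()

def input_generator(seed: int) -> str:
    """IG: produce a random test input."""
    rng = random.Random(seed)
    return f"{rng.randint(1, 10**9)}\n"
```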
Example command:
```bash
python hard_tests_gen/test_cases_kit_generation.py \
    --dataset_name_or_path <huggingface_dataset_name> \
    --target_pids_path <target_pids.json> \
    --iv_and_ojf_gen_prompt_template test_cases_kit_prompt_iv_and_ojf \
    --ig_gen_prompt_template test_cases_kit_prompt_ig \
    --num_LLMGen_input 10 \
    --iv_and_ojf_gen_responses_save_path ./iv_and_ojf_gen_responses.jsonl \
    --ig_gen_responses_save_path ./ig_gen_responses.jsonl \
    --test_cases_kit_save_path ./test_cases_kits.jsonl \
    --model_name gpt-4o \
    --temperature 0.1 \
    --max_tokens 5120 \
    --num_parallel 10
```

Arguments:
- `--dataset_name_or_path`: Name of the Huggingface dataset to use (e.g., `sigcp/hardtests_problems`), or path to a local dataset in JSON/JSONL format.
- `--target_pids_path`: (Optional) Path to a JSON file containing a list of target problem IDs (see the example after this list). If not provided, all problems in the dataset will be used.
- `--iv_and_ojf_gen_prompt_template`: Prompt template for generating the Input Validator and Output Judging Function (default: `test_cases_kit_prompt_iv_and_ojf`).
- `--ig_gen_prompt_template`: Prompt template for Input Generation (default: `test_cases_kit_prompt_ig`).
- `--num_LLMGen_input`: Number of LLMGen inputs to generate for each problem (default: 10).
- `--iv_and_ojf_gen_responses_save_path`: Path to save LLM responses for IV/OJF generation.
- `--ig_gen_responses_save_path`: Path to save LLM responses for IG generation.
- `--test_cases_kit_save_path`: Path to save the generated test cases kits.
- `--model_name`: LLM model name (e.g., `gpt-4o`, `deepseek`).
- `--temperature`: Sampling temperature for the LLM (default: 0.1).
- `--max_tokens`: Maximum tokens for LLM output (default: 5120).
- `--num_parallel`: Number of parallel LLM requests (default: 10).
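If you only want a subset of problems, target_pids.json is just a JSON list of problem IDs, as described above. A minimal sketch for producing it (the IDs shown are placeholders):

```python
import json

# Placeholder problem IDs; use IDs that actually occur in your dataset.
with open("target_pids.json", "w") as f:
    json.dump(["pid_00001", "pid_00002"], f)
```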
Run test_cases_generation.py to generate concrete test cases (input-output pairs) for each problem, based on the previously generated test cases kit.
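Conceptually, this step runs each problem's kit end-to-end: the IG proposes inputs, the IV filters out malformed ones, and the oracle program is executed on the survivors to produce reference outputs. A rough sketch of that loop, with assumed names for the kit fields (the real script additionally handles compilation, sandboxing, and error reporting):

```python
import subprocess

def generate_test_cases(oracle_cmd, kit, num_inputs=10):
    """Sketch of the per-problem loop; kit field names are assumptions."""
    cases = []
    for seed in range(num_inputs):
        inp = kit["input_generator"](seed)   # IG: propose an input
        if not kit["input_validator"](inp):  # IV: drop malformed inputs
            continue
        # The real pipeline runs the oracle inside Bubblewrap; see the
        # sandboxing sketch after the argument list below.
        out = subprocess.run(oracle_cmd, input=inp, capture_output=True,
                             text=True, timeout=10).stdout
        cases.append({"input": inp, "output": out})
    return cases
```

Example command: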
```bash
python hard_tests_gen/test_cases_generation.py \
    --problem_data_path <problem_data_path> \
    --test_cases_kit_path ./test_cases_kits.jsonl \
    --test_cases_save_path ./test_cases.jsonl \
    --test_cases_related_contents_save_path ./test_cases_related_contents.jsonl \
    --bwrap_path bwrap \
    --cpp_compiler_path g++ \
    --python_interpreter_path python3 \
    --code_exec_temp_dir ./tmp \
    --max_workers 3 \
    --save_steps 10 \
    --log_steps 10 \
    --start 0 \
    --end 1000000
```

Arguments:
- `--problem_data_path`: Name of the Huggingface dataset to use (e.g., `sigcp/hardtests_problems`), or path to a local dataset in JSONL format.
- `--test_cases_kit_path`: Path to the test cases kit file generated in Step 1.
- `--test_cases_save_path`: Path to save the generated test cases (JSONL format).
- `--test_cases_related_contents_save_path`: Path to save related contents for each test case (JSONL format).
- `--bwrap_path`: Path to the Bubblewrap executable (default: `bwrap`); see the sandboxing sketch after this list.
- `--cpp_compiler_path`: Path to the C++ compiler (default: `g++`).
- `--python_interpreter_path`: Path to the Python interpreter (default: `python3`).
- `--code_exec_temp_dir`: Directory for temporary code execution files.
- `--max_workers`: Number of parallel workers (default: 3).
- `--save_steps`: Save results every N problems (default: 10).
- `--log_steps`: Log progress every N problems (default: 10).
- `--start`: Start index for problems to process (default: 0).
- `--end`: End index for problems to process (default: 1000000).
- Other advanced arguments are available; see the code for details.
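For reference, sandboxed execution with Bubblewrap amounts to prefixing the oracle (or Python interpreter) command with `bwrap` and a set of isolation flags. A minimal sketch, assuming a read-only root filesystem and no network; the exact flags used by test_cases_generation.py may differ:

```python
import subprocess

def run_in_sandbox(cmd, stdin_text, timeout=10):
    """Run `cmd` inside a Bubblewrap sandbox and return the completed process."""
    bwrap_cmd = [
        "bwrap",
        "--ro-bind", "/", "/",   # read-only view of the host filesystem
        "--dev", "/dev",         # fresh /dev
        "--proc", "/proc",       # fresh /proc
        "--tmpfs", "/tmp",       # writable scratch space
        "--unshare-all",         # isolate network, PID, and other namespaces
        "--die-with-parent",     # kill the sandbox if the parent dies
    ] + list(cmd)
    return subprocess.run(bwrap_cmd, input=stdin_text, capture_output=True,
                          text=True, timeout=timeout)

# e.g., run_in_sandbox(["python3", "oracle.py"], "5\n")
```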
Note:
- You must run Step 1 before Step 2, as Step 2 depends on the test cases kit generated in Step 1.
- Make sure all paths are set correctly and required files exist before running each step.
You may use the following command to filter out the problems whose test cases were not successfully generated and convert the remaining problems into the same format as our Huggingface dataset.
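Conceptually, the conversion keeps only problems with at least one successfully generated test case. A rough sketch (the `test_cases` field name is an assumption about the structure of test_cases.jsonl):

```python
import json

# Rough sketch of the filtering step; the "test_cases" field name is an
# assumption, not necessarily the actual schema of test_cases.jsonl.
with open("test_cases.jsonl") as fin, \
     open("filtered_test_cases.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        if record.get("test_cases"):  # keep problems with generated cases
            fout.write(json.dumps(record) + "\n")
```

Example command: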
```bash
python HardTestGen/hard_tests_gen/convert_format.py \
    --input_file ./test_cases.jsonl \
    --output_file ./filtered_test_cases.jsonl
```

Please cite us if you find this work useful, using the following BibTeX:
```bibtex
@misc{he2025hardtests,
  title={HardTests: Synthesizing High-Quality Test Cases for LLM Coding},
  author={Zhongmou He and Yee Man Choi and Kexun Zhang and Jiabao Ji and Junting Zhou and Dejia Xu and Ivan Bercovich and Aidan Zhang and Lei Li},
  year={2025},
  month={May}
}
```