DSDBench: Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors
• Introduction • News • DSDBench • Methodology • Getting Started • Configuration Details • Experiment Results • Citation • Paper
Debugging data science code presents significant challenges, especially when multiple logical errors interact in intricate ways. Existing benchmarks often focus on simple, isolated error scenarios, leaving the debugging of multi-hop, multi-bug errors largely unexplored. DSDBench fills this critical gap by offering a comprehensive dataset and evaluation framework designed to assess and improve large language models (LLMs) in debugging complex, real-world data science code problems.
- September 15, 2025: DSDBench has been accepted to EMNLP 2025 as an Oral presentation!
- March 21, 2025: DSDBench dataset and evaluation framework officially released!
DSDBench is the first systematic benchmark explicitly created for data science code debugging, featuring:
- Realistic Errors: Logical and runtime errors that mirror real-world data science workflows.
- Multi-Hop Debugging: Scenarios where error identification requires tracing back through multiple code execution steps.
- Multi-Bug Scenarios: Cases involving concurrent errors within a single code snippet.
- Comprehensive Annotations: Includes 1,117 meticulously annotated examples, clearly labeling cause-effect error lines and runtime error messages (see the sketch below).
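To make the annotation format concrete, here is a minimal sketch of loading the single-bug split. The field names (`cause_line`, `effect_line`, `error_type`, `error_message`) mirror the four evaluation dimensions but are illustrative; consult the dataset files for the exact schema.

```python
import json

# Illustrative reader for the single-bug annotations.
# Field names are assumptions based on the four evaluation dimensions.
with open("workspace/benchmark_evaluation/bench_final_annotation_single_error.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        print(entry.get("cause_line"), entry.get("effect_line"),
              entry.get("error_type"), entry.get("error_message"))
```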
Our contributions include:
- Automated Error Injection: Leveraging advanced LLM techniques to systematically introduce realistic runtime errors.
- Dynamic Error Annotation: Utilizing runtime tracing (with tools like `snoop`) to accurately capture cause-effect relationships in errors (see the sketch after this list).
- Rigorous Evaluation Protocols: Employing a four-dimensional evaluation approach covering cause lines, effect lines, error types, and error messages.
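As a toy illustration of the dynamic tracing that `snoop` enables (a sketch, not the benchmark's actual annotation pipeline): decorating a buggy function logs every executed line and variable change, which makes it possible to separate the line that silently introduces a bad value (the cause) from the line that eventually raises (the effect).

```python
import snoop  # pip install snoop

@snoop  # logs each executed line and variable change to stderr
def average(values):
    total = sum(values)
    count = len(values) - 1  # cause: off-by-one bug, count silently becomes 0
    return total / count     # effect: ZeroDivisionError is raised here

average([42])
```

The trace shows `count` turning 0 on the cause line before the effect line fails, which is exactly the cause-effect distinction the annotations capture.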
To start using DSDBench, follow these installation and execution steps:
You can install DSDBench and its dependencies using one of the following methods:

1. Using pip with the requirements file:

   ```bash
   pip install -r requirements.txt
   ```

2. Installing as a package (development mode):

   ```bash
   pip install -e .
   ```
To use DSDBench with language models that require API access (like GPT-4o), you need to configure your API credentials:
1. Copy the example environment file:

   ```bash
   cp env.example .env
   ```

2. Edit the `.env` file and add your API credentials:

   ```bash
   # OpenAI API Configuration
   OPENAI_API_KEY=your-api-key-here
   OPENAI_BASE_URL=https://api.openai.com/v1

   # Or for other providers (e.g., Azure OpenAI)
   OPENAI_BASE_URL=https://your-endpoint.openai.azure.com/
   ```

3. Alternatively, you can directly modify the configuration files in the `agents/config/` directory.
Note: If you're using a different model provider (like Azure OpenAI), set the appropriate base URL according to your provider's documentation.
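To sanity-check that the credentials are visible from Python, here is a minimal sketch using `python-dotenv` (an assumption; DSDBench's own configuration loading may work differently):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the current directory into os.environ

api_key = os.getenv("OPENAI_API_KEY")
base_url = os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
assert api_key, "OPENAI_API_KEY is not set - check your .env file"
print(f"Using endpoint: {base_url}")
```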
The DSDBench repository has the following structure:
```
DSDBench/
├── 📁 agents/                              # Agent implementations
│   ├── error_verifier_agent/               # Error verification and evaluation
│   │   ├── agent.py                        # Main evaluation agent
│   │   ├── exact_match_evaluator.py        # Exact match evaluation logic
│   │   └── prompt.py                       # Evaluation prompts
│   ├── data_analysis_agent/                # Data analysis agent
│   ├── error_suggest_agent/                # Error suggestion agent
│   ├── agent_environment/                  # Agent environment setup
│   ├── openai_chatComplete.py              # OpenAI API client
│   └── vllm_client.py                      # vLLM client for local inference
├── 📁 config/                              # Configuration files
│   ├── single_bug_eval_agent_config.py     # Single-bug evaluation config
│   ├── multi_bug_eval_agent_config.py      # Multi-bug evaluation config
│   ├── vllm_single_bug_eval_agent_config.py # vLLM single-bug config
│   ├── vllm_multi_bug_eval_agent_config.py # vLLM multi-bug config
│   └── ...                                 # Other configuration files
├── 📁 workspace/                           # Workspace directory
│   └── 📁 benchmark_evaluation/            # Benchmark evaluation directory
│       ├── bench_final_annotation_single_error.jsonl # Single-bug dataset
│       ├── bench_final_annotation_multi_errors.jsonl # Multi-bug dataset
│       ├── compute_single_eval_results.py  # Single-bug evaluation script
│       └── compute_multi_eval_results.py   # Multi-bug evaluation script
├── 📁 assets/                              # Assets and figures
├── run_single_bug_eval.py                  # Single-bug evaluation runner
├── run_multi_bug_eval.py                   # Multi-bug evaluation runner
├── run_vllm_single_bug_eval.py             # vLLM single-bug evaluation runner
├── workflow_generic.py                     # Generic workflow execution
├── requirements.txt                        # Python dependencies
└── setup.py                                # Package setup
```
DSDBench provides helper scripts to simplify the evaluation process:
For single-bug scenarios:

```bash
python run_single_bug_eval.py
```

This command automatically runs the workflow using the single-bug configuration and computes the evaluation results.

For multi-bug scenarios:

```bash
python run_multi_bug_eval.py
```

This command executes the multi-bug workflow and calculates the multi-error evaluation metrics.

Using vLLM for local inference:

```bash
# Single-bug evaluation with vLLM
python run_vllm_single_bug_eval.py

# Multi-bug evaluation with vLLM
python workflow_generic.py --config config/vllm_multi_bug_eval_agent_config.py
```

These commands use vLLM for high-performance local model inference. See VLLM_README.md for detailed setup instructions.
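Since vLLM serves an OpenAI-compatible endpoint, a quick connectivity check can reuse the standard `openai` client. The port and model name below are assumptions; match them to however you launched the server (see VLLM_README.md):

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running vLLM server.
# base_url and model are assumptions - adjust to your vLLM launch command.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```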
For more control, you can run individual workflow components manually:
For single-bug evaluation:

```bash
python workflow_generic.py --config config/single_bug_eval_agent_config.py
cd workspace/benchmark_evaluation
python compute_single_eval_results.py
```

For multi-bug evaluation:

```bash
python workflow_generic.py --config config/multi_bug_eval_agent_config.py
cd workspace/benchmark_evaluation
python compute_multi_eval_results.py
```

The evaluation scripts will generate detailed metrics, including:
- Overall Scores: Percentage scores for cause lines, effect lines, error types, and error messages
- Dimension-wise Metrics: Precision, Recall, F1-score, and Accuracy for each evaluation dimension
- Confusion Matrix: True Positives (TP), False Positives (FP), and False Negatives (FN) for each dimension
Example output:
```
Overall Cause Line Score: 31.25%
Overall Effect Line Score: 100.00%
Overall Error Type Score: 0.00%
Overall Error Message Score: 82.81%

Dimension-wise Metrics:
{
  "cause_line": {
    "precision": 0.3125,
    "recall": 0.3125,
    "f1_score": 0.3125,
    "accuracy": 1.0,
    "TP": 5,
    "FP": 11,
    "FN": 0
  },
  ...
}
```
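For intuition, the per-dimension scores reduce to exact-match comparisons between predicted and annotated values. Here is a minimal sketch with illustrative field names and aggregation (the repository's `exact_match_evaluator.py` is the authoritative implementation):

```python
def exact_match_score(predictions, references, key):
    """Fraction of examples whose predicted value for `key` exactly
    matches the annotation (illustrative, not the repo's exact logic)."""
    hits = sum(1 for p, r in zip(predictions, references) if p.get(key) == r.get(key))
    return hits / len(references) if references else 0.0

# Hypothetical usage over the four evaluation dimensions:
# for dim in ("cause_line", "effect_line", "error_type", "error_message"):
#     print(dim, exact_match_score(preds, golds, dim))
```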
To generate datasets from scratch, execute the pipeline steps in the following order:
```bash
# First, run the initial data generation workflows
python workflow_generic.py --config config/data_annotate_agent_config.py
python workflow_generic.py --config config/library_error_inject_agent_config.py
python workflow_generic.py --config config/error_snoop_agent_config.py
python workflow_generic.py --config config/weak_llm_direct_analysis_config.py

# Then process the data with our improved utilities
cd workspace

# Filter for executable errors
python filter_non_executable_data.py --input path/to/monitored_errors.jsonl --output path/to/filtered_errors.jsonl

# Find multi-hop errors
python find_multi_hop_data.py --input path/to/filtered_errors.jsonl --output path/to/annotated_errors.jsonl

# Merge annotations from multiple sources
python merge_final_annotation.py --input path/to/file1.jsonl path/to/file2.jsonl --output path/to/bench_final_annotation_single_error.jsonl

# Generate multi-bug scenarios
python merge_multiple_errors.py --input path/to/bench_final_annotation_single_error.jsonl --output path/to/bench_final_annotation_multi_errors.jsonl --samples_per_entry 5
```

Each utility script supports command-line arguments for flexible input/output path configuration:
- `filter_non_executable_data.py`: Filters the data to keep only error versions with valid traceback information.
- `find_multi_hop_data.py`: Identifies cause and effect error lines in traceback output.
- `merge_final_annotation.py`: Merges multiple JSONL annotation files into a single dataset (sketched below).
- `merge_multiple_errors.py`: Generates multi-bug scenarios by combining single-bug errors.
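To give a feel for the JSONL plumbing these utilities perform, here is a minimal merge sketch in the spirit of `merge_final_annotation.py` (an illustration under assumed behavior, not the script's actual code):

```python
import json
import sys

def merge_jsonl(input_paths, output_path):
    """Concatenate several JSONL annotation files into one dataset,
    failing fast on any line that does not parse as JSON."""
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line:              # skip blank lines
                        json.loads(line)  # validate before writing
                        out.write(line + "\n")

if __name__ == "__main__":
    *inputs, output = sys.argv[1:]
    merge_jsonl(inputs, output)
```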
The configuration files in the `config/` directory manage different aspects of the benchmark. Here's a brief overview:
- `single_bug_eval_agent_config.py`: Configuration for single-bug evaluation scenarios.
- `multi_bug_eval_agent_config.py`: Configuration for multi-bug evaluation scenarios.
- `data_annotate_agent_config.py`: Configuration for the data annotation process.
- `library_error_inject_agent_config.py`: Configuration for error injection in libraries.
- `error_snoop_agent_config.py`: Configuration for error monitoring.
- `weak_llm_direct_analysis_config.py`: Configuration for weak-LLM error analysis.
To use a specific configuration file when running the workflow, pass it via the `--config` argument:

```bash
python workflow_generic.py --config config/your_chosen_config.py
```

Each configuration file adheres to a standard structure:
```python
AGENT_CONFIG = {
    'workspace': './workspace/path',  # Base workspace directory
    'agents': [
        {
            'name': 'agent_name',     # Name of the agent
            'class': AgentClass,      # The agent class to instantiate
            'prompts': {              # Prompts used by the agent
                'system': SYSTEM_PROMPT,
                'user': USER_PROMPT,
                'eval': EVAL_PROMPT,
                # Other prompts as needed
            },
            'kwargs': {               # Additional agent parameters
                'query': 'Default query',
                # Other parameters as needed
            }
        },
        # Additional agents as needed
    ]
}

WORKFLOW = [
    {
        'agent': 'agent_name',        # Name of the agent to run
        'method': 'method_name',      # Agent method to execute
        'args': {                     # Arguments for the method
            'model_type': 'gpt-4o',   # LLM model to use
            'eval_folder': 'workspace/results'  # Output location
        },
        'input': {'data': 'path/to/input.jsonl'},  # Input data source
        'data_ids': [1, 2, 3],        # Specific data IDs to process
        'data_range': [1, 50],        # Mutually exclusive with 'data_ids'; a range of data IDs to process
        'output': 'result_name',      # Name for the output
        'output_type': 'analysis'     # Type of output
    },
    # Additional workflow steps as needed
]
```

The `model_type` parameter in workflow steps specifies the LLM to use for evaluation:
- `openai/gpt-4o`: OpenAI GPT-4o model
- `openai/gpt-oss-120b`: OpenAI GPT-OSS-120B model
- `Qwen/Qwen2.5-72B-Instruct`: Qwen 2.5 model
- `deepseek/deepseek-v3`: DeepSeek V3 model
- Other models supported by your API provider
Agents can be customized by modifying the `kwargs` dictionary within their configuration, as in the sketch below.
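For example, a custom configuration module can reuse an existing one and override a single agent's defaults before running the workflow (a sketch built on the `AGENT_CONFIG` structure shown above; the file name and overridden key are illustrative):

```python
# config/my_custom_config.py - hypothetical custom configuration.
from config.single_bug_eval_agent_config import AGENT_CONFIG, WORKFLOW

# Override the first agent's default query; the available kwargs depend
# on the specific agent class being instantiated.
AGENT_CONFIG["agents"][0]["kwargs"]["query"] = "Debug this analysis script"
```

You would then run `python workflow_generic.py --config config/my_custom_config.py` as usual.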
Evaluations of state-of-the-art LLMs reveal significant challenges in multi-bug debugging scenarios. Key results are summarized below:
| Model | Cause Line Acc. | Effect Line Acc. | Error Type Acc. | Error Message Acc. |
|---|---|---|---|---|
| GPT-4o | 39.0% | 34.3% | 30.6% | 31.4% |
| Claude 3.5 | 43.7% | 35.2% | 36.3% | 34.0% |
| Deepseek-V3 | 48.3% | 34.5% | 35.9% | 34.7% |
Detailed analysis and ablation studies further emphasize the benchmark's complexity and its value in diagnosing model limitations.
If DSDBench is helpful in your research, please cite our work using the following BibTeX entry:
```bibtex
@misc{yang2025stoperrorbenchmarkingllms,
      title={Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors},
      author={Zhiyu Yang and Shuo Wang and Yukun Yan and Yang Deng},
      year={2025},
      eprint={2503.22388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.22388},
}
```