From Conversation to Evaluation: Benchmarking LLMs on Development Knowledge via SimpleDevQA
SimpleDevQA is a multilingual Development Knowledge QA benchmark derived from large-scale real user dialogues via a rigorous three-phase pipeline.
The data pipeline is as follows:
🔍 Dataset Overview
📊 2,740 Dev Knowledge QA pairs
🌍 Trilingual support (EN/CN/RU)
💻 Diverse development topics
🔗 Verifiable answers with web references
🚀 Key Features
Pipeline Implementation
The code/generate directory contains code that collects web documents related to real user conversations, then feeds the documents and the original dialogues jointly into LLMs to regenerate Q&A pairs.
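The joint-input step described above can be sketched as a single prompt that combines the retrieved web document with the real conversation. This is an illustrative sketch only: the function name `build_regeneration_prompt` and the prompt wording are assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch of the regeneration step in code/generate:
# a retrieved web document and the original dialogue are merged into
# one prompt, and an LLM is asked to produce a verifiable QA pair.

def build_regeneration_prompt(web_document: str, dialogue: str) -> str:
    """Join the web evidence and the real conversation into one prompt."""
    return (
        "You are given a web document and a real developer conversation.\n"
        "Using only facts supported by the document, rewrite the exchange\n"
        "as a single question-answer pair about development knowledge.\n\n"
        f"### Web document\n{web_document}\n\n"
        f"### Conversation\n{dialogue}\n\n"
        'Return the result as JSON: {"question": ..., "answer": ...}'
    )

prompt = build_regeneration_prompt(
    web_document="Python's GIL allows only one thread to execute bytecode at a time.",
    dialogue="User: Why doesn't threading speed up my CPU-bound Python code?",
)
```

In the real pipeline the prompt would be sent to an LLM; here it is only constructed, since the model call is repository-specific.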
The code/filter directory implements a series of rigorous filtering steps to ensure data quality.
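A filtering pass of this kind typically applies simple per-record checks. The concrete rules in code/filter are not documented here, so the checks below (minimum question length, non-empty answer, at least one supporting reference) are assumptions chosen only to illustrate the shape of such a filter.

```python
# Illustrative quality filter; the actual rules in code/filter may differ.
def keep(qa: dict) -> bool:
    return (
        len(qa.get("question", "")) >= 10   # drop trivially short questions
        and bool(qa.get("answer"))          # answer must be non-empty
        and bool(qa.get("references"))      # answer must be verifiable
    )

raw = [
    {"question": "What port does HTTPS use by default?",
     "answer": "443", "references": ["https://example.com"]},
    {"question": "hi", "answer": "", "references": []},
]
filtered = [qa for qa in raw if keep(qa)]
```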
Reference Support
All generated Q&A pairs come with corresponding reference URLs stored in data/reference, enabling verification of answer accuracy.
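The exact on-disk format of data/reference is not specified here, so the record layout below (a JSON list of objects with `question`, `answer`, and `references` fields) is an assumption used only to show how reference-backed QA pairs might be loaded and checked.

```python
import json

# Hypothetical loader for reference-backed QA records; the field names
# and JSON layout are assumptions, not the repository's documented schema.
sample = json.dumps([
    {
        "question": "What does HTTP status 404 mean?",
        "answer": "Not Found",
        "references": ["https://developer.mozilla.org/docs/Web/HTTP/Status/404"],
    }
])

records = json.loads(sample)
for record in records:
    # Every pair should carry at least one URL so the answer can be verified.
    assert record["references"], "QA pair is missing its reference URLs"
```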
Efficient Evaluation Framework
The code/eval directory provides ready-to-use code for efficiently evaluating LLM performance on the SimpleDevQA benchmark.
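An evaluation of this kind boils down to scoring model predictions against gold answers. The sketch below uses normalized exact match as the metric; this is a simplifying assumption, and code/eval may well use a different scoring rule.

```python
# Minimal sketch of a SimpleDevQA-style evaluation loop.
# The exact-match metric is an assumption, not the repository's actual scorer.

def exact_match(prediction: str, gold: str) -> bool:
    """Compare after trimming whitespace and lower-casing."""
    return prediction.strip().lower() == gold.strip().lower()

predictions = {"q1": "Not Found", "q2": "405"}
gold_answers = {"q1": "not found", "q2": "404"}

accuracy = sum(
    exact_match(predictions[q], gold_answers[q]) for q in gold_answers
) / len(gold_answers)
```

With one of the two hypothetical answers matching, this yields an accuracy of 0.5.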
🛠️ Implementation
⚙️ Environment
Create the environment and install the required packages:
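A typical setup would look like the following; the environment name, Python version, and requirements file are assumptions, so check the repository for the exact commands and versions.

```shell
# Assumed setup; names and versions are illustrative.
conda create -n simpledevqa python=3.10 -y
conda activate simpledevqa
pip install -r requirements.txt
```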