
If you like our project, please give us a star ⭐ on GitHub for the latest update.
[2025-06-28] We're excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.
DataFlow is a data preparation and training system designed to parse, generate, process, and evaluate high-quality data from noisy sources (PDF, plain-text, low-quality QA), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (Pre-training, Supervised Fine-tuning, RL training) or RAG using knowledge base cleaning. DataFlow has been empirically validated to improve the performance of domain-oriented LLMs in fields such as healthcare, finance, and law.
Specifically, we construct diverse operators leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct pipelines, collectively forming the comprehensive DataFlow system. Additionally, we develop an intelligent DataFlow-agent capable of dynamically assembling new pipelines by recombining existing operators on demand.
DataFlow adopts a modular operator design philosophy, building flexible data processing pipelines by combining different types of operators. As the basic unit of data processing, an operator can receive structured data input (such as in json/jsonl/csv format) and, after intelligent processing, output high-quality data results. For a detailed guide on using operators, please refer to the Operator Documentation.
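The operator-and-pipeline idea described above can be illustrated with a minimal sketch. This is not DataFlow's actual API: the names `load_jsonl`, `drop_short_answers`, `add_length_score`, and `run_pipeline`, as well as the `question`/`answer` record fields, are hypothetical and chosen only for illustration. Each operator takes a list of structured records and returns a new list, so operators can be chained in any order.

```python
import json
from typing import Callable, Iterable

Record = dict
# Hypothetical operator type: records in, records out.
Operator = Callable[[list[Record]], list[Record]]


def load_jsonl(lines: Iterable[str]) -> list[Record]:
    """Parse JSONL text (one JSON object per line) into records."""
    return [json.loads(line) for line in lines if line.strip()]


def drop_short_answers(records: list[Record], min_chars: int = 10) -> list[Record]:
    """Filtering operator: keep records whose answer has at least min_chars characters."""
    return [r for r in records if len(r.get("answer", "")) >= min_chars]


def add_length_score(records: list[Record]) -> list[Record]:
    """Scoring operator: annotate each record with the length of its answer."""
    return [{**r, "answer_len": len(r["answer"])} for r in records]


def run_pipeline(records: list[Record], operators: list[Operator]) -> list[Record]:
    """Apply the operators one after another, feeding each output to the next."""
    for op in operators:
        records = op(records)
    return records


raw = [
    '{"question": "What is SQL?", "answer": "A language for querying relational databases."}',
    '{"question": "Hi?", "answer": "ok"}',
]
cleaned = run_pipeline(load_jsonl(raw), [drop_short_answers, add_length_score])
print(json.dumps(cleaned, indent=2))
```

Keeping every operator to the same records-in, records-out shape is what makes recombination cheap: a new pipeline is just a different list of operators.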
In the DataFlow framework, operators are divided into three core categories based on their functional characteristics:
Operator Type | Quantity | Main Function |
---|---|---|
Generic Operators | 80+ | Covers general functions for text evaluation, processing, and synthesis |
Domain-Specific Operators | 40+ | Specialized processing for specific domains (e.g., medical, financial, legal) |
Evaluation Operators | 20+ | Comprehensively evaluates data quality from 6 dimensions |
The current pipelines in DataFlow are as follows:
- Text Pipeline: Mines question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.
- Reasoning Pipeline: Enhances existing question-answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.
- Text2SQL Pipeline: Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.
- Knowledge Base Cleaning Pipeline: Extracts and structures knowledge from unorganized sources such as tables, PDFs, and Word documents into usable entries for downstream RAG or QA pair generation.
- Agentic RAG Pipeline: Identifies and extracts QA pairs from existing QA datasets or knowledge bases that require external knowledge to answer, for use in downstream training of Agentic RAG tasks.
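To make the Reasoning Pipeline item above concrete, here is a rough sketch of what enriching a question-answer pair with a chain-of-thought, a category, and a difficulty estimate could look like as data. The `QAPair` class, its field names, and the `enrich` helper are assumptions made for illustration, and the three callables are trivial stand-ins for the models or LLM calls a real pipeline would use.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass
class QAPair:
    """Hypothetical record shape; the field names are illustrative only."""
    question: str
    answer: str
    chain_of_thought: Optional[str] = None
    category: Optional[str] = None
    difficulty: Optional[float] = None


def enrich(
    pair: QAPair,
    reason: Callable[[str, str], str],
    classify: Callable[[str], str],
    estimate_difficulty: Callable[[str], float],
) -> QAPair:
    """Return a copy of the pair with the three extra annotations filled in."""
    return QAPair(
        question=pair.question,
        answer=pair.answer,
        chain_of_thought=reason(pair.question, pair.answer),
        category=classify(pair.question),
        difficulty=estimate_difficulty(pair.question),
    )


# Stand-in annotators (assumptions): a real pipeline would call a model here.
pair = QAPair(question="What is 12 * 12?", answer="144")
enriched = enrich(
    pair,
    reason=lambda q, a: f"Compute the product step by step to reach {a}.",
    classify=lambda q: "arithmetic",
    estimate_difficulty=lambda q: 0.1,
)
print(asdict(enriched))
```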
In this framework, operators are categorized into Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, among others, which together support data processing and evaluation. Please refer to the documentation for details.
- DataFlow Agent: An intelligent assistant that performs data analysis, writes custom operators, and automatically orchestrates them into pipelines based on specific task objectives.
For environment setup and installation, please use the following commands:
conda create -n dataflow python=3.10
conda activate dataflow
pip install open-dataflow
If you want to run inference locally on your own GPU, please use:
pip install open-dataflow[vllm]
DataFlow supports Python >= 3.10.
You can use the following command to check whether it is installed correctly:
dataflow -v
You should see output similar to the following:
open-dataflow codebase version: 1.0.0
Checking for updates...
Local version: 1.0.0
PyPI newest version: 1.0.0
You are using the latest version: 1.0.0.
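If you prefer to check the installation from Python rather than the CLI, the standard library's `importlib.metadata` can report the installed version of the `open-dataflow` distribution. This is a small sketch that only reads local package metadata; unlike `dataflow -v`, it does not query PyPI for updates. The helper name `installed_version` is ours, not part of DataFlow.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "open-dataflow") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"open-dataflow: {found or 'not installed'}")
```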
You can quickly launch a Gradio-based interface to test DataFlow operators with the following command:
dataflow webui
This will start an interactive web UI, allowing you to visualize all operators seamlessly.
For the quick-start guide, please visit our documentation.
For detailed experimental settings, please visit our documentation.
The pre-training data processing pipeline was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using QuratingScorer are shown in the figure. The filtered pre-training data significantly outperforms the original data across four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of DataFlow's pre-training data processing.
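For readers unfamiliar with the term, the retention rate is simply the fraction of input records that survive all filtering stages. The sketch below tracks it across two toy filters; the sample documents and filter names are made up for illustration and are unrelated to the RedPajama experiment or to DataFlow's actual filters.

```python
def retention_rate(kept: int, total: int) -> float:
    """Fraction of the original records that remain after filtering."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total


# Toy corpus and filters (assumptions for illustration only).
docs = [
    "short",
    "a much longer document about databases",
    "",
    "another useful passage of text",
]
filters = [
    ("non_empty", lambda d: bool(d.strip())),
    ("min_length", lambda d: len(d) >= 20),
]

remaining = docs
for name, keep in filters:
    remaining = [d for d in remaining if keep(d)]
    # The rate is always measured against the original corpus size.
    print(f"after {name}: {retention_rate(len(remaining), len(docs)):.2%} retained")
```

Measuring every stage against the original corpus size, rather than against the previous stage, gives a cumulative figure that is directly comparable to a final rate such as the 13.65% reported above.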
We filtered 3k records from the alpaca dataset and compared them with 3k randomly selected records from the same dataset by training Qwen2.5-7B on each subset. Results are:
We verify our Reasoning Pipeline by performing SFT on Qwen2.5-32B-Instruct with data synthesized by the pipeline. We generated 1k and 5k SFT data pairs. Results are:
We fine-tuned the Qwen2.5-Coder-7B-Instruct model using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:
Our team has published the following papers that form core components of the DataFlow system:
Paper Title | DataFlow Component | Venue | Year |
---|---|---|---|
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification | Multimodal reasoning verification framework for data processing and evaluation | ACL | 2025 |
Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration | Multi-actor collaborative data selection mechanism for enhanced data filtering and processing | ACL | 2025 |
We sincerely appreciate MinerU's outstanding contribution, particularly its robust text extraction capabilities from PDFs and documents, which greatly facilitates data loading.
Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!
- GitHub Issues: Report bugs or suggest features
- GitHub Pull Requests: Contribute code improvements
- Join our community groups to connect with us and other contributors!
If you use DataFlow in your research, please cite it as follows:
@misc{dataflow2025,
author = {DataFlow Develop Team},
title = {DataFlow: A Unified Framework for Data-Centric AI},
year = {2025},
howpublished = {\url{https://github.com/OpenDCAI/DataFlow}},
note = {Accessed: 2025-07-08}
}