This repository presents the first heuristic filtering framework tailored to large-scale code pretraining corpora, containing over 100 filtering rules that account for the unique characteristics of different programming languages. Built on RedPajamaV2, the framework extends and refines the existing rules from StarCoder to better align with the properties of code datasets, resulting in more precise, higher-quality data cleansing.
- Flexibility
- Separation of rule properties and rule thresholds. Different types of files can use specific sets of filtering rules or share the same rule properties with customized thresholds.
- Thanks to the registration mechanism implemented with Python decorators (see base.py for details), rules for different types of files can easily be reused within the same code implementation.
- Extensibility
- In document.py, we implement a class dedicated to loading code documents. It automatically computes and parses common attributes of code documents, making it easier to create new rules for them.
- This framework also supports custom implementations for filtering other types of corpora (e.g., general text files, math-related files, etc.). By setting the `spec` parameter of the corresponding corpus rule registries, it is easy to filter multiple types of corpora within the same repository.
- Transferability
- This framework is implemented as a class interface, making it very easy to migrate to various distributed systems such as Hadoop, Spark, MaxCompute, and others.
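The decorator-based registration described above can be sketched roughly as follows. This is an illustrative assumption, not the actual implementation in base.py: the registry dict and decorator internals are hypothetical, and only the `register_quality_signal` name comes from this repository.

```python
# Illustrative sketch of a decorator-based rule registry; the internals
# are assumptions, not the actual implementation in base.py.
RULE_REGISTRY = {}

def register_quality_signal(name, spec):
    """Register a rule class under `name`, grouped by corpus type `spec`."""
    def decorator(cls):
        RULE_REGISTRY.setdefault(spec, {})[name] = cls
        return cls
    return decorator

# The same rule class can be registered for several corpus types,
# which is how rules are reused across file types.
@register_quality_signal('qsc_doc_num_lines', 'codedocument')
class NumLines:
    def __call__(self, text: str) -> int:
        return text.count('\n') + 1

signal = RULE_REGISTRY['codedocument']['qsc_doc_num_lines']()
print(signal('a\nb\nc'))  # 3
```

Because registration keys on both the rule name and the corpus type, the same pipeline code can look up whichever rule set applies to the file at hand.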
```
📦 opc_data_filtering
├── 📂 artifacts (resource files)
├── 📂 examples
│   └── data_cleaning_example.ipynb (an example of the data filtering workflow)
├── 📂 pipeline
│   ├── code_filter_config.py (configuration of filtering thresholds)
│   ├── compute_filtering.py (pipeline class for the filtering process)
│   └── compute_quality_signals.py (pipeline class for computing quality signals)
├── 📂 quality_signals (implementation of all the quality signals)
├── 📂 redpajama (copied from RedPajama with minor modifications)
├── 📂 test_data (test data used for the usage example)
├── 📂 utils
│   ├── base.py (implementation of the quality signal register)
│   └── document.py (implementation of the code document class)
├── README.md
└── requirements.txt
```
- Python version: `python>=3.7`

You can install the required packages with one of the following commands:

```
pip install -r requirements.txt
```

or

```
conda env create -f environment.yml
```
You also need to download our FastText model `lang_predictor.bin`, which is used to predict the language of each file, and put it into `./artifacts/`.
We have set up a simple data filtering workflow in data_cleaning_example.ipynb, which processes the sample data in `./test_data/raw_code/`, for users to use as a reference.
We developed the following three categories of filtering rules:
- Natural Language Filtering Rules: These rules filter data based on common properties for all text files, such as file size, number of lines, and other general metrics. Both text and code files share these filtering rules.
- General Code Filtering Rules: These rules apply to all code files by filtering data based on general code characteristics, such as the number of variables, average function length, and other common features.
- Language-Specific Filtering Rules: These rules are designed around the unique characteristics of specific programming languages, such as the frequency of "pass" statements in Python or the use of "goto" statements in C. We have developed these rules for the following eight commonly used programming languages: Python, C, C++, C#, Java, JavaScript, Go, and HTML.
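As an illustration of the language-specific category, the sketch below computes the fraction of bare `pass` statements in a Python file. The function name and logic are hypothetical, inferred from the description above rather than taken from this repository's rule implementations.

```python
import re

def frac_pass_statements(code: str) -> float:
    """Hypothetical language-specific signal: fraction of non-empty lines
    that are bare `pass` statements in a Python file (illustrative only)."""
    lines = [line for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    n_pass = sum(1 for line in lines if re.fullmatch(r'\s*pass', line))
    return n_pass / len(lines)

sample = "def f():\n    pass\n\ndef g():\n    return 1\n"
print(frac_pass_statements(sample))  # 0.25 (1 of 4 non-empty lines)
```

A file dominated by `pass` statements is likely auto-generated or stub code, which is why a threshold on such a ratio makes a useful quality signal.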
Filter names follow this pattern:

```
qsc_[type]_[metric]_[unit]_[description]
```
- `qsc`: code quality signal flag, which is fixed
- `[type]`: category of the quality signal, e.g., doc, code, codec, codepython ...
- `[metric]`: measurement metric, e.g., num, frac, score, cate ...
- `[unit]`: unit used for statistics, e.g., character, word, line ...
- `[description]`: brief description of the quality signal
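A small helper can check that a new signal name conforms to this convention. The regex below is a hypothetical sketch inferred from the pattern description, not code from this repository; note that the trailing `[unit]_[description]` parts are treated as optional, since names like `qsc_code_num_chars` omit the description.

```python
import re

# Hypothetical validator for the qsc_[type]_[metric]_[unit]_[description]
# convention; the pattern is inferred from the naming rules, not taken
# from the repository's code.
NAME_RE = re.compile(
    r'^qsc_(?P<type>[a-z]+)_(?P<metric>[a-z]+)(?:_(?P<unit_desc>[a-z_]+))?$'
)

def parse_signal_name(name: str):
    """Split a signal name into its parts, or return None if malformed."""
    m = NAME_RE.match(name)
    return m.groupdict() if m else None

print(parse_signal_name('qsc_code_num_chars'))
# {'type': 'code', 'metric': 'num', 'unit_desc': 'chars'}
```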
Taking the filter `qsc_code_num_chars` as an example:
- Add an implementation of the quality signal in `./quality_signals/`. Please follow the filter naming rules.

```python
@register_quality_signal('qsc_code_num_chars', 'codedocument')
class QSC_Doc_Num_Chars(QSCodeBase):
    """
    The number of characters.
    """
    def __call__(self, document: QSCodeDocument) -> SignalType:
        return [(0, len(document), float(len(document)))]
```
- Add the corresponding filtering thresholds to `./pipeline/code_filter_config.py`. Note that different types of files can set specific thresholds.

```python
...
code_filter_config['others'] = {
    ...
    'qsc_code_num_chars': 'lambda x: x < 50',
    ...
}
code_filter_config['data'] = {
    ...
    'qsc_code_num_chars': 'lambda x: x < 50 or x > 5000',
    ...
}
...
```
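Downstream, the pipeline evaluates these string-encoded lambdas against the computed signal values. A minimal sketch of that logic, assuming the config layout above (the actual implementation in `./pipeline/compute_filtering.py` may differ in detail):

```python
# Sketch of how string-encoded thresholds could be applied; the helper
# name `should_filter` is hypothetical, not part of this repository.
code_filter_config = {
    'others': {'qsc_code_num_chars': 'lambda x: x < 50'},
    'data': {'qsc_code_num_chars': 'lambda x: x < 50 or x > 5000'},
}

def should_filter(file_type, signal_name, value, config):
    """Return True if `value` violates the threshold for this file type."""
    rule = config.get(file_type, {}).get(signal_name)
    if rule is None:
        return False          # no threshold configured for this signal
    return eval(rule)(value)  # the config stores thresholds as string lambdas

print(should_filter('data', 'qsc_code_num_chars', 30, code_filter_config))   # True
print(should_filter('data', 'qsc_code_num_chars', 300, code_filter_config))  # False
```

Storing thresholds as strings keeps the config serializable while still allowing arbitrary per-type predicates; the file type selects which threshold set applies.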