This repository presents the first heuristic filtering framework tailored to large-scale code pretraining corpora, containing over 100 filtering rules that account for the unique characteristics of different programming languages. Built on RedPajamaV2, the framework extends and refines the existing rules from StarCoder to better align with the properties of code datasets, resulting in more precise, higher-quality data cleansing.
- Flexibility
- Separation of rule properties and rule thresholds. Different types of files can use specific sets of filtering rules or share the same rule properties with customized thresholds.
- Thanks to the registration mechanism implemented with Python decorators (see base.py for details), rules for different types of files can easily be reused within the same code implementation.
- Extensibility
- In document.py, we implement a class dedicated to loading code documents. It automatically computes and parses common attributes of code documents, making it easier to create new rules for them.
- This framework also supports custom implementations for filtering other types of corpora (e.g., general text files, math-related files, etc.). By setting the `spec` parameter of the corresponding corpus rule registries, it is easy to filter multiple types of corpora within the same repository.
- Transferability
- This framework is implemented as a class interface, making it very easy to migrate to various distributed systems such as Hadoop, Spark, MaxCompute, and others.
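The decorator-based registration described above can be sketched roughly as follows. This is an illustrative assumption, not the actual implementation in base.py: the registry dict and decorator internals are hypothetical, and only the `register_quality_signal` name comes from this repository.

```python
# Illustrative sketch of a decorator-based rule registry; the internals
# are assumptions, not the actual implementation in base.py.
RULE_REGISTRY = {}

def register_quality_signal(name, spec):
    """Register a rule class under `name`, grouped by corpus type `spec`."""
    def decorator(cls):
        RULE_REGISTRY.setdefault(spec, {})[name] = cls
        return cls
    return decorator

# The same rule class can be registered for several corpus types,
# which is how rules are reused across file types.
@register_quality_signal('qsc_doc_num_lines', 'codedocument')
class NumLines:
    def __call__(self, text: str) -> int:
        return text.count('\n') + 1

signal = RULE_REGISTRY['codedocument']['qsc_doc_num_lines']()
print(signal('a\nb\nc'))  # 3
```

Because registration keys on both the rule name and the corpus type, the same pipeline code can look up whichever rule set applies to the file at hand.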
```
📦 opc_data_filtering
├── 📂 artifacts (resource files)
├── 📂 examples
│   └── data_cleaning_example.ipynb (an example of the data filtering workflow)
├── 📂 pipeline
│   ├── code_filter_config.py (configuration of filtering thresholds)
│   ├── compute_filtering.py (pipeline class for the filtering process)
│   └── compute_quality_signals.py (pipeline class for computing quality signals)
├── 📂 quality_signals (implementation of all the quality signals)
├── 📂 redpajama (copied from RedPajama with minor modifications)
├── 📂 test_data (test data used for the usage example)
├── 📂 utils
│   ├── base.py (implementation of the quality signal register)
│   └── document.py (implementation of the code document class)
├── README.md
└── requirements.txt
```
- Python version: `python>=3.7`

You can install the required packages with one of the following commands:

```
pip install -r requirements.txt
```

or

```
conda env create -f environment.yml
```
You also need to download our FastText model `lang_predictor.bin`, which is used to predict the language of each file, and put it into `./artifacts/`.
We have set up a simple data filtering workflow in data_cleaning_example.ipynb, which processes the sample data in `./test_data/raw_code/`, for users to use as a reference.
We developed the following three categories of filtering rules:
- Natural Language Filtering Rules: These rules filter data based on common properties for all text files, such as file size, number of lines, and other general metrics. Both text and code files share these filtering rules.
- General Code Filtering Rules: These rules apply to all code files by filtering data based on general code characteristics, such as the number of variables, average function length, and other common features.
- Language-Specific Filtering Rules: These rules are designed around the unique characteristics of specific programming languages, such as the frequency of "pass" statements in Python or the use of "goto" statements in C. We have developed these rules for the following eight commonly used programming languages: Python, C, C++, C#, Java, JavaScript, Go, and HTML.
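As an illustration of the language-specific category, the sketch below computes the fraction of bare `pass` statements in a Python file. The function name and logic are hypothetical, inferred from the description above rather than taken from this repository's rule implementations.

```python
import re

def frac_pass_statements(code: str) -> float:
    """Hypothetical language-specific signal: fraction of non-empty lines
    that are bare `pass` statements in a Python file (illustrative only)."""
    lines = [line for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    n_pass = sum(1 for line in lines if re.fullmatch(r'\s*pass', line))
    return n_pass / len(lines)

sample = "def f():\n    pass\n\ndef g():\n    return 1\n"
print(frac_pass_statements(sample))  # 0.25 (1 of 4 non-empty lines)
```

A file dominated by `pass` statements is likely auto-generated or stub code, which is why a threshold on such a ratio makes a useful quality signal.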
Filter names follow this pattern:

```
qsc_[type]_[metric]_[unit]_[description]
```
- `qsc`: code quality signal flag, which is fixed
- `[type]`: category of the quality signal, e.g., doc, code, codec, codepython ...
- `[metric]`: measurement metric, e.g., num, frac, score, cate ...
- `[unit]`: unit used for statistics, e.g., character, word, line ...
- `[description]`: brief description of the quality signal
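A small helper can check that a new signal name conforms to this convention. The regex below is a hypothetical sketch inferred from the pattern description, not code from this repository; note that the trailing `[unit]_[description]` parts are treated as optional, since names like `qsc_code_num_chars` omit the description.

```python
import re

# Hypothetical validator for the qsc_[type]_[metric]_[unit]_[description]
# convention; the pattern is inferred from the naming rules, not taken
# from the repository's code.
NAME_RE = re.compile(
    r'^qsc_(?P<type>[a-z]+)_(?P<metric>[a-z]+)(?:_(?P<unit_desc>[a-z_]+))?$'
)

def parse_signal_name(name: str):
    """Split a signal name into its parts, or return None if malformed."""
    m = NAME_RE.match(name)
    return m.groupdict() if m else None

print(parse_signal_name('qsc_code_num_chars'))
# {'type': 'code', 'metric': 'num', 'unit_desc': 'chars'}
```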
Taking the filter `qsc_code_num_chars` as an example:
- Add an implementation of the quality signal in `./quality_signals/`. Please follow the filter naming rules.

```python
@register_quality_signal('qsc_code_num_chars', 'codedocument')
class QSC_Doc_Num_Chars(QSCodeBase):
    """
    The number of characters.
    """
    def __call__(self, document: QSCodeDocument) -> SignalType:
        return [(0, len(document), float(len(document)))]
```
- Add the corresponding filtering thresholds to `./pipeline/code_filter_config.py`. Note that different types of files can set specific thresholds.

```python
...
code_filter_config['others'] = {
    ...
    'qsc_code_num_chars': 'lambda x: x < 50',
    ...
}
code_filter_config['data'] = {
    ...
    'qsc_code_num_chars': 'lambda x: x < 50 or x > 5000',
    ...
}
...
```
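Downstream, the pipeline evaluates these string-encoded lambdas against the computed signal values. A minimal sketch of that logic, assuming the config layout above (the actual implementation in `./pipeline/compute_filtering.py` may differ in detail):

```python
# Sketch of how string-encoded thresholds could be applied; the helper
# name `should_filter` is hypothetical, not part of this repository.
code_filter_config = {
    'others': {'qsc_code_num_chars': 'lambda x: x < 50'},
    'data': {'qsc_code_num_chars': 'lambda x: x < 50 or x > 5000'},
}

def should_filter(file_type, signal_name, value, config):
    """Return True if `value` violates the threshold for this file type."""
    rule = config.get(file_type, {}).get(signal_name)
    if rule is None:
        return False          # no threshold configured for this signal
    return eval(rule)(value)  # the config stores thresholds as string lambdas

print(should_filter('data', 'qsc_code_num_chars', 30, code_filter_config))   # True
print(should_filter('data', 'qsc_code_num_chars', 300, code_filter_config))  # False
```

Storing thresholds as strings keeps the config serializable while still allowing arbitrary per-type predicates; the file type selects which threshold set applies.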