A training dataset generator for Guesslang's deep learning model.
Description
The purpose of GuesslangTools is to find and download a million source code files.
These files are used to train, evaluate and test
Guesslang,
a deep learning programming language detection tool.
The files are retrieved from more than 100k public open source
GitHub repositories.
Workflow
The million source code files used to feed Guesslang are generated as follows:
Randomly select the repositories that will be used to create
Guesslang's training, validation and test datasets.
Download each selected repository.
Extract some source code files from the downloaded repositories.
This workflow is fully automated but takes several hours to complete,
especially the download part.
Fortunately, it can be stopped and resumed at any moment.
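The first step of the workflow, the random selection of repositories, can be illustrated with a minimal sketch. This is not GuesslangTools' actual implementation: the function name `split_repositories`, the ratio parameters and the fixed seed are assumptions made for this example.

```python
import random


def split_repositories(repos, valid_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle repositories and assign each one to exactly one dataset.

    Hypothetical helper: the ratios and the seed are illustrative defaults,
    not values taken from GuesslangTools.
    """
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(repos)
    random.Random(seed).shuffle(shuffled)

    nb_valid = int(len(shuffled) * valid_ratio)
    nb_test = int(len(shuffled) * test_ratio)

    # Slices of one shuffled list never overlap, so no repository
    # can end up in two datasets.
    return {
        "validation": shuffled[:nb_valid],
        "test": shuffled[nb_valid:nb_valid + nb_test],
        "training": shuffled[nb_valid + nb_test:],
    }
```

Using a fixed seed makes the selection reproducible, which would help when a run is stopped and resumed later.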
Constraints
GuesslangTools ensures that:
Each source code file in the datasets is unique.
There are no empty files.
Only text files are retrieved; binary files are skipped.
All the files are converted to UTF-8 encoding.
Each selected repository is associated with only one dataset
(training, validation or test),
therefore files from a training repository can only be in
the training dataset. Same for the validation and test datasets.
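The file-level constraints above can be sketched as a single filter function. This is only an illustration under stated assumptions: the name `clean_file`, the null-byte test for binary content, the Latin-1 fallback and the SHA-1 based duplicate check are choices made for this example, not necessarily what GuesslangTools does.

```python
import hashlib


def clean_file(raw: bytes, seen_hashes: set):
    """Return the file content as text, or None if the file must be skipped.

    Hypothetical helper: every heuristic here is an assumption.
    """
    # Constraint: no empty files (whitespace-only content counts as empty here).
    if not raw.strip():
        return None

    # Constraint: text files only. A null byte is a common binary heuristic.
    if b"\0" in raw:
        return None

    # Constraint: everything ends up as UTF-8. Try UTF-8 first, then fall
    # back to Latin-1, which can decode any byte sequence.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")

    # Constraint: each file is unique. Compare content hashes.
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return None
    seen_hashes.add(digest)
    return text
```

The caller keeps one `seen_hashes` set for the whole run, so a file that appears in several repositories is kept only once.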
Usage
Prerequisites
GuesslangTools requires Python 3.7 or later.
At least 16GB of total system memory is recommended.
At least 150GB of free storage space is recommended.
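Two of these prerequisites can be checked before starting a long run. The following is a minimal sketch using only the Python standard library; `check_prerequisites` is a hypothetical helper that is not part of GuesslangTools, and it does not check system memory.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)   # from the prerequisites above
MIN_FREE_GB = 150     # recommended free storage space


def check_prerequisites(path="."):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []

    if sys.version_info < MIN_PYTHON:
        problems.append("Python 3.7 or later is required")

    # Free space on the filesystem that contains `path`.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < MIN_FREE_GB:
        problems.append(f"only {free_gb:.0f}GB free, {MIN_FREE_GB}GB recommended")

    return problems
```

Passing the directory that will hold the generated datasets as `path` checks the filesystem that will actually receive the files.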
Installation
You can install GuesslangTools from the source code by running:
pip install .
Execution
You can run GuesslangTools in a terminal as follows:
gltool /path/to/generated_datasets/
Several options and hacks are available to fine-tune the size and
the diversity of the generated datasets. To list all the options, please run: