This is the official code for our paper: Efficient Contrastive Learning via Novel Data Augmentation and Curriculum Learning (accepted as a short paper at EMNLP 2021).
Our implementation is based on the paper DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations and its code (https://github.com/JohnGiorgi/DeCLUTR).
This repository requires Python 3.6.1 or later.
Before installing, you should create and activate a Python virtual environment. See here for detailed instructions.
git clone https://github.com/vano1205/EfficientCL
cd EfficientCL
pip install -r requirements.txt

A dataset is simply a file containing one item of text (a document, a scientific paper, etc.) per line. For demonstration purposes, we have provided a script that will download the WikiText-103 dataset and match our minimal preprocessing:
python scripts/preprocess_wikitext_103.py path/to/output/wikitext-103/train.txt --min-length 2048

See scripts/preprocess_openwebtext.py for a script that can be used to recreate the (much larger) dataset used in our paper.
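As a quick sanity check, you can confirm that the preprocessed file really contains one document per line. This is a minimal sketch: the file path is the assumed output of the command above, and whitespace tokens are only a rough proxy for the tokenizer-based --min-length.

from pathlib import Path

# Assumed output path from the preprocessing command above.
train_file = Path("path/to/output/wikitext-103/train.txt")

lengths = []
with train_file.open(encoding="utf-8") as f:
    for line in f:  # one document per line
        if line.strip():
            lengths.append(len(line.split()))

print(f"{len(lengths)} documents; shortest has {min(lengths)} whitespace tokens")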
You can specify the path to the training set in the config under "train_data_path".
To train the model, use the allennlp train command with our efficientcl.jsonnet config. Run the following:
allennlp train "training_config/efficientcl.jsonnet" \
--serialization-dir "output" \
--overrides "{'train_data_path': 'path/to/your/dataset/train.txt'}" \
--include-package "efficientcl"

The --overrides flag allows you to override any field in the config with a JSON-formatted string, but you can equivalently update the config itself if you prefer. During training, models, vocabulary, configuration, and log files will be saved to the directory provided by --serialization-dir. This can be changed to any directory you like.
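Because --overrides takes a JSON-formatted string, it can be convenient to build that string programmatically. A minimal sketch (the dataset path is a placeholder):

import json

# Any field in the config can be overridden; nested fields use dotted keys,
# e.g. "distributed.cuda_devices".
overrides = {"train_data_path": "path/to/your/dataset/train.txt"}

# Pass the resulting string to --overrides.
print(json.dumps(overrides))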
To train on more than one GPU, provide a list of CUDA devices in your call to allennlp train. For example, to train with four CUDA devices with IDs 0, 1, 2, 3:
--overrides "{'distributed.cuda_devices': [0, 1, 2, 3]}"If your GPU supports it, mixed-precision will be used automatically during training and inference.
We have provided a simple script to export a trained model so that it can be loaded with Hugging Face Transformers:

python scripts/save_pretrained_hf.py "output" "pretrained"
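Once exported, the model can be loaded like any other Transformers checkpoint. A minimal sketch ("pretrained" is the output directory of the export script above; the mean pooling shown here is illustrative, not necessarily the pooling used in our paper):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pretrained")
model = AutoModel.from_pretrained("pretrained")

inputs = tokenizer(["An example sentence."], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state

# Mean-pool over non-padding tokens to get one embedding per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)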
The jiant package is used to evaluate on the GLUE benchmark. First, download the specific dataset (CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST, STSB) with the following commands:

cd jiant
export PYTHONPATH=/path/to/jiant:$PYTHONPATH
python jiant/scripts/download_data/runscript.py \
download \
--tasks mrpc \
--output_path data

Evaluate each dataset by loading the exported model from above:
python jiant/proj/simple/runscript.py run --run_name simple --data_dir data --hf_pretrained_model_name_or_path ../pretrained --tasks mrpc --train_batch_size 16 --num_train_epochs 10 --exp_dir roberta
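If you prefer Python over the CLI, jiant also exposes a simple API that mirrors the command above. A sketch (argument values are copied from the command; the exact API may differ across jiant versions):

import jiant.proj.simple.runscript as simple_run

# Mirrors the CLI call above: fine-tune and evaluate the exported model on MRPC.
args = simple_run.RunConfiguration(
    run_name="simple",
    exp_dir="roberta",
    data_dir="data",
    hf_pretrained_model_name_or_path="../pretrained",
    tasks="mrpc",
    train_batch_size=16,
    num_train_epochs=10,
)
simple_run.run_simple(args)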
If you find our work useful, please cite our paper:

@article{ye2021efficient,
title={Efficient Contrastive Learning via Novel Data Augmentation and Curriculum Learning},
author={Ye, Seonghyeon and Kim, Jiseon and Oh, Alice},
journal={arXiv preprint arXiv:2109.05941},
year={2021}
}