PyTorch implementation of BERT, as described in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (https://arxiv.org/abs/1810.04805).
First things first, you need to prepare your data in an appropriate format.
Your corpus is assumed to satisfy the following constraints:

- Each line is a document.
- A document consists of sentences, separated by a vertical bar (|).
- Each sentence is assumed to be already tokenized, with tokens separated by spaces.
- A sentence has no more than 256 tokens.
- A document has at least 2 sentences.

You need two separate data files, one for training data and the other for validation data.
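As a sanity check before pretraining, the following is a minimal sketch that verifies a corpus file against the constraints above. It is not part of this repo, and the function name `validate_corpus` is hypothetical; only the file path `data/example/train.txt` comes from the example data described below.

```python
def validate_corpus(path, max_tokens=256, min_sentences=2):
    """Check that each line (document) has enough sentences and no over-long sentence."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            # Each line is one document; sentences are separated by "|".
            sentences = [s.strip() for s in line.strip().split("|") if s.strip()]
            if len(sentences) < min_sentences:
                raise ValueError(f"line {line_no}: document has fewer than {min_sentences} sentences")
            for sentence in sentences:
                # The text is pre-tokenized, so tokens are simply space-separated.
                tokens = sentence.split()
                if len(tokens) > max_tokens:
                    raise ValueError(f"line {line_no}: sentence exceeds {max_tokens} tokens")

# Example usage on the bundled example data (hypothetical helper, shown for illustration):
validate_corpus("data/example/train.txt")
```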
This repo comes with example data for pretraining in data/example directory.
Here is the content of the data/example/train.txt file:

    One, two, three, four, five,|Once I caught a fish alive,|Six, seven, eight, nine, ten,|Then I let go again.
    I’m a little teapot|Short and stout|Here is my handle|Here is my spout.
    Jack and Jill went up the hill|To fetch a pail of water.|Jack fell down and broke his crown,|And Jill came tumbling after.
This repo also includes SST-2 data in the data/SST-2 directory for sentiment classification.