This page gives step-by-step instructions for training a fastText model on the NoReC (Norwegian Review Corpus) dataset.
The dataset contains reviews in Norwegian, each annotated with a rating from 1 to 6 stars.
The trained model predicts a rating of 1 to 6 stars for any given text.
Setup
git clone https://github.com/web64/norec-fasttext.git
cd norec-fasttext
Prepare training data
Download and extract the NoReC dataset
wget https://folk.uio.no/eivinabe/norec-1.0.1.tar.gz
tar -xvzf norec-1.0.1.tar.gz
tar -xvzf norec/conllu.tar.gz
Convert .conllu files to fastText format
php convert.php test
php convert.php dev
php convert.php train
This will create the fastText training files:
norec_test.txt
norec_dev.txt
norec_train.txt
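How convert.php reads the .conllu files is not shown on this page. As a rough illustration of the input side only, the sketch below pulls the word forms out of CoNLL-U text, where each token line has tab-separated fields and the second field is the word form. The function name `conllu_forms` is hypothetical, and the real script may select or clean tokens differently.

```python
from typing import List


def conllu_forms(conllu_text: str) -> List[str]:
    """Return the word forms (2nd column) from CoNLL-U formatted text."""
    forms = []
    for line in conllu_text.splitlines():
        # Blank lines separate sentences; lines starting with '#' are comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        token_id, form = fields[0], fields[1]
        # Skip multiword ranges such as "1-2" and empty nodes such as "1.1",
        # so that each word is counted once.
        if not token_id.isdigit():
            continue
        forms.append(form)
    return forms
```

Joining the returned forms with spaces gives the raw review text that would then be labeled and cleaned.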
The training text files are in this format:
__label__6 et episk eventyr et episk eventyr arkitektens læregutt er en storslagen roman ...
__label__1 tåpelig og flau kosebamse-reprise tåpelig og flau kosebamse-reprise komedien...
__label__5 test av mercedes c-klasse c350te...
The training texts have been lowercased and cleaned to reduce the number of distinct tokens.
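The line format above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of convert.php: the function name `to_fasttext_line` is hypothetical, and the cleaning rule (keep letters, digits, and hyphens, drop everything else) is an assumption chosen to match the sample lines.

```python
import re


def to_fasttext_line(rating: int, text: str) -> str:
    """Build one fastText supervised line: '__label__<rating> <text>'."""
    if not 1 <= rating <= 6:
        raise ValueError("NoReC ratings range from 1 to 6")
    cleaned = text.lower()
    # Assumed cleaning rule: replace anything that is not a word character
    # or a hyphen with a space. \w is Unicode-aware, so æ, ø and å survive.
    cleaned = re.sub(r"[^\w\-]+", " ", cleaned)
    # Collapse runs of whitespace and trim the ends.
    cleaned = " ".join(cleaned.split())
    return f"__label__{rating} {cleaned}"


print(to_fasttext_line(5, "Test av Mercedes C-klasse C350TE"))
```

Lowercasing and stripping punctuation means that "Eventyr", "eventyr" and "eventyr!" all map to the same token, which keeps the vocabulary smaller.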
# Test model
fasttext test model_norec.bin norec_test.txt
Precision (P@1) is around 0.561 (this value might change each time the model is trained).
Recall (R@1) can be ignored: it only differs from precision when texts have more than one label, and here each text has exactly one.
N 3517
P@1 0.561
R@1 0.561
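To make the P@1 figure concrete: with one gold label per text and one predicted label per text, P@1 is simply the fraction of texts whose top prediction matches the gold label, and R@1 comes out identical. The sketch below computes it on a made-up four-example list; the function name `precision_at_1` and the sample labels are illustrative and not taken from the NoReC test set.

```python
from typing import Sequence


def precision_at_1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of examples whose top predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Illustrative labels only: three of the four predictions match.
gold = ["__label__6", "__label__1", "__label__5", "__label__3"]
predicted = ["__label__6", "__label__2", "__label__5", "__label__3"]
print(precision_at_1(gold, predicted))
```

Applied to the output above, a P@1 of 0.561 over N = 3517 texts corresponds to roughly 1,970 exactly matching ratings. Note that this metric counts a prediction of 5 for a 6-star review as just as wrong as a prediction of 1.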
Prediction
Run this command to try the interactive predictor.
Enter some text and it will return a predicted rating between 1 and 6.
# Predictions
>> fastText/fasttext predict model_norec.bin -
>> sjelden har så mange gode skuespillere gitt så mye for et så bedritent manus og en så flau film
>> __label__3
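If the predicted label needs to be used as a number, the `__label__` prefix has to be stripped. A minimal sketch, assuming the predictor prints the label as the first whitespace-separated token on the line; the function name `parse_rating` is hypothetical.

```python
LABEL_PREFIX = "__label__"


def parse_rating(prediction_line: str) -> int:
    """Turn a prediction line such as '__label__3' into the integer 3."""
    tokens = prediction_line.split()
    if not tokens or not tokens[0].startswith(LABEL_PREFIX):
        raise ValueError(f"not a fastText label line: {prediction_line!r}")
    # int() raises ValueError if the part after the prefix is not a number.
    rating = int(tokens[0][len(LABEL_PREFIX):])
    if not 1 <= rating <= 6:
        raise ValueError(f"rating out of range: {rating}")
    return rating


print(parse_rating("__label__3"))
```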