This is the implementation for the following paper, to appear at EMNLP 2016:
Morphological Priors for Probabilistic Neural Word Embeddings. Parminder Bhatia, Robert Guthrie, Jacob Eisenstein.
Install Blocks. Please see the Blocks documentation for more information.
Install Fuel. Please see the Fuel documentation for more information.
Install the Morfessor Python package.
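One possible installation route, assuming pip is available and that Blocks and Fuel are installed from their GitHub repositories (consult each project's documentation for the recommended versions and pins):

pip install Morfessor
pip install git+https://github.com/mila-udem/fuel.git
pip install git+https://github.com/mila-udem/blocks.git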
Usage
The input can be any raw, pre-tokenized text. This section walks through how to generate the Morfessor model, preprocess and package the data as NDArrays, and train the model.
You will need to train a Morfessor model on your data. A script for this has been provided. It will output a serialized Morfessor model for later use.
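The provided script is the intended route. As a fallback sketch, the Morfessor 2.0 package also installs a morfessor-train command that trains and serializes a model (the output filename here is hypothetical):

morfessor-train -s morfessor_model.bin <textfile>.trn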
The data set needs to be preprocessed and formatted using preprocess_data.py and make_dataset.py.
The -h flag on each script will list the arguments it needs.
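For example:

python preprocess_data.py -h
python make_dataset.py -h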
Preprocessing downcases the text, so that capitalization does not affect Morfessor, and replaces all but the top N most frequent words with an unknown-word token.
python preprocess_data.py <textfile>.trn -o <output_file> -n <unks all but top N words>
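For example, to keep only the 10,000 most frequent words (the input and output filenames here are hypothetical):

python preprocess_data.py corpus.trn -o corpus_preprocessed.txt -n 10000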
Next, run train.py to train the model.
It will print statistics after each mini-batch.
python train.py <filename>.hdf5
Parameters like batch size, embedding dimension, and the number of epochs can be changed in the config.py file.
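A sketch of what such settings might look like in config.py; the variable names below are hypothetical, so check the file itself for the identifiers it actually defines:

# Hypothetical config.py values, for illustration only.
BATCH_SIZE = 64        # sequences per mini-batch
EMBEDDING_DIM = 128    # dimensionality of the learned word embeddings
NUM_EPOCHS = 10        # passes over the training data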
Last, word vectors can be output in the format word dim1 dim2 ..., with one word per line, via the output_word_vectors.py script.
Provide it with a vocabulary of words whose vectors should be output, as well as a serialized network from training.
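For illustration, output in this format looks like the following (the values are invented, and only 3 dimensions are shown for brevity):

the 0.118 -0.045 0.872
of -0.201 0.334 0.015

A minimal sketch for reading such a file back into a dictionary of NumPy arrays (the filename word_vectors.txt is hypothetical):

import numpy as np

def load_vectors(path):
    # Each line is "word dim1 dim2 ...": map each word to its vector.
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

vectors = load_vectors('word_vectors.txt')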