Herodotos Project NER Annotation and Tagger

This repository includes texts annotated for named entities as part of the Herodotos Project (Ohio State University / Ghent University) as well as a BiLSTM-CRF (Lample et al., 2016) NER tagger pre-trained on said annotation. Please check out the Humanities Entity Recognizer for more details on how it was trained.

Annotation

All texts are in Latin and are taken from the Latin Library Collection (collected by CLTK) or the Perseus Latin Collection. Greek will be added soon.

The data files in the Annotation directory were annotated for named entities by a team of Classics experts at Ohio State University. Texts presently included are excerpts from Caesar's Wars, both Gallic (GW) and Civil (CW), the Plinies' writings, both Elder and Younger, and Ovid's Ars Amatoria.

Names of peoples are annotated as GRP; names of persons are annotated as PRS; and names of geographical places are annotated as GEO in the BIO scheme.
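To make the labeling concrete, the sketch below shows what a BIO-tagged token sequence could look like and checks that it is well formed. The `B-`/`I-` prefix spelling, the example sentence and its labels, and the helper name `is_valid_bio` are assumptions for illustration; the annotation files remain the authority on the exact tag strings.

```python
LABELS = {"GRP", "PRS", "GEO"}  # label inventory described above


def is_valid_bio(tags):
    """Return True if every I-X tag continues a B-X or I-X tag of the same label."""
    previous = "O"
    for tag in tags:
        if tag == "O":
            previous = tag
            continue
        prefix, _, label = tag.partition("-")
        if prefix not in {"B", "I"} or label not in LABELS:
            return False  # unknown prefix or label
        if prefix == "I" and previous not in {"B-" + label, "I-" + label}:
            return False  # an entity cannot start with I-
        previous = tag
    return True


# Hypothetical example (not copied from the annotation files).
tokens = ["Gallia", "est", "omnis", "divisa", "in", "partes", "tres"]
tags = ["B-GEO", "O", "O", "O", "O", "O", "O"]
print(list(zip(tokens, tags)))
print(is_valid_bio(tags))
```

A check of this kind can be useful before training or evaluating on the data, since a stray `I-` tag usually points to an annotation or conversion slip.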

Further information on the corpus, including splits for training and testing, can be found in Erdmann et al. (2016), "Challenges and Solutions for Latin Named Entity Recognition." For citation purposes, however, please see the Acknowledgments section below for the more recent and relevant publication to cite.

Tagger

The Herodotos Project Latin NER Tagger is trained on the entire set of Latin data included in this repository using the BiLSTM-CRF architecture of Lample et al. (2016), "Neural Architectures for Named Entity Recognition".

Prerequisites

To run the tagger, make sure the below packages and any dependencies have been installed.

Usage

The tagger can be called with the following commands:

cd Herodotos_Project_Latin_NER_tagger
python tagger.py --input sample.in.tok > sample.out.tags

The input should already be tokenized, with clitics separated and one sentence per line, as in sample.in.tok. Near-optimal performance can be achieved with simple punctuation-and-white-space or even just white-space tokenization, owing to the relative infrequency of Latin cliticization and the tagger's robust handling of character-level features. The output lists all named entities identified in each line as triples. Each triple contains (1) the character offset within the corresponding line where the named entity starts, (2) the full span of the named entity, and (3) the label of the named entity.
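The input preparation and the triple structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tagger's own code: the regular expression, the function names `tokenize` and `to_triples`, zero-based offsets counted on the space-joined line, and the hand-written tags are all hypothetical, and the exact way tagger.py serializes its triples is not shown here.

```python
import re


def tokenize(sentence):
    """Punctuation-and-white-space tokenization: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def to_triples(tokens, tags):
    """Collect (character offset, entity span, label) triples from BIO tags.

    Offsets are zero-based positions in the line formed by joining the
    tokens with single spaces (an assumed convention).
    """
    triples = []
    offset = 0
    current = None  # [start offset, tokens so far, label] of the open entity
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current is not None and current[2] == tag[2:]:
            current[1].append(token)  # continue the open entity
        else:
            if current is not None:
                triples.append((current[0], " ".join(current[1]), current[2]))
            current = [offset, [token], tag[2:]] if tag != "O" else None
        offset += len(token) + 1  # token plus the separating space
    if current is not None:
        triples.append((current[0], " ".join(current[1]), current[2]))
    return triples


tokens = tokenize("Caesar in Galliam venit.")
print(" ".join(tokens))  # one tokenized sentence per line
tags = ["B-PRS", "O", "B-GEO", "O", "O"]  # hypothetical labels, not tagger output
print(to_triples(tokens, tags))
```

The tokenizer only splits off punctuation; it does not separate clitics such as -que, which the paragraph above suggests is an acceptable simplification for Latin.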

Alternative input formats can be specified with the --inputFormat option; the supported formats are conll and crfsuite. The conll format has one token per line, followed by a tab and then its label (because the label is what is being predicted, the value you put there does not matter). Sentence breaks are indicated by a blank line. See sample.in.conll for an example.

python tagger.py --input sample.in.conll --inputFormat conll > sample.out.tags
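If your text is already in the one-sentence-per-line layout, a conll-style file can be produced with a few lines of Python. The sketch below assumes a placeholder label of `O` and a blank line after every sentence; the function name `tok_to_conll` is hypothetical, and sample.in.conll remains the reference for the exact layout.

```python
def tok_to_conll(lines, placeholder="O"):
    """Convert tokenized sentences (one per line) to token<TAB>label lines."""
    out = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip empty input lines
        out.extend(f"{token}\t{placeholder}" for token in tokens)
        out.append("")  # blank line marks the sentence break
    return "\n".join(out)


# Illustrative sentences, already tokenized.
print(tok_to_conll(["Caesar in Galliam venit .", "Galli fugerunt ."]))
```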

Crfsuite formatting is the same as conll but the token-label order is reversed. See sample.in.crf for an example.

python tagger.py --input sample.in.crf --inputFormat crf > sample.out.tags
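Because the two formats differ only in column order, converting between them amounts to swapping the columns. The following sketch assumes exactly two tab-separated columns per non-blank line, as described above; the function name `conll_to_crf` is hypothetical.

```python
def conll_to_crf(conll_text):
    """Swap the token and label columns of tab-separated conll-style text."""
    out = []
    for line in conll_text.split("\n"):
        if not line.strip():
            out.append("")  # keep sentence breaks as blank lines
            continue
        token, label = line.split("\t")
        out.append(f"{label}\t{token}")
    return "\n".join(out)


# Illustrative input with hypothetical labels.
print(conll_to_crf("Caesar\tB-PRS\nvenit\tO\n\nGalli\tB-GRP"))
```

Applying the same function to crfsuite-style text swaps the columns back, so it works in both directions.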

You can also request different output formats via the --outputFormat option. The following example will output to crfsuite format:

python tagger.py --input sample.in.tok --outputFormat crf > sample.out.crf

And this will output to conll format:

python tagger.py --input sample.in.tok --outputFormat conll > sample.out.conll

Alternatively, you can print out a list of all unique entities identified by label with the list option:

python tagger.py --input sample.in.tok --outputFormat list > sample.out.list
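The idea behind the list output, unique entities grouped by label, can be sketched from entity triples like this. The layout of the real sample.out.list file is not described here, so the dictionary structure, the function name `unique_entities_by_label`, and the example triples are assumptions.

```python
from collections import defaultdict


def unique_entities_by_label(triples):
    """Group entity spans by label, dropping duplicates and sorting the result."""
    entities = defaultdict(set)
    for _offset, span, label in triples:
        entities[label].add(span)
    return {label: sorted(spans) for label, spans in sorted(entities.items())}


# Hypothetical (offset, span, label) triples gathered from several lines.
triples = [(0, "Caesar", "PRS"), (10, "Gallia", "GEO"), (4, "Caesar", "PRS")]
print(unique_entities_by_label(triples))
```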

And of course, any combination of input and output formats is supported.

Acknowledgments

If you find either the tagger or the data useful in any way, please cite our forthcoming publication:

Alexander Erdmann, David Joseph Wrisley, Benjamin Allen, Christopher Brown, Sophie Cohen-Bodénès, Micha Elsner, Yukun Feng, Brian Joseph, Béatrice Joyeux-Prunel and Marie-Catherine de Marneffe. 2019. “Practical, Efficient, and Customizable Active Learning for Named Entity Recognition in the Digital Humanities.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019). Minneapolis, Minnesota.

Contact ae1541@nyu.edu or any of the co-authors with questions regarding this repository.
