This repository includes texts annotated for named entities as part of the Herodotos Project (Ohio State University / Ghent University), as well as a BiLSTM-CRF (Lample et al., 2016) NER tagger pre-trained on those annotations. See the Humanities Entity Recognizer for more details on how it was trained.
The data files in the Annotation directory were annotated for named entities by a team of Classics experts at Ohio State University. Texts presently included are excerpts from Caesar's Wars, both Gallic (GW) and Civil (CW), the Plinies' writings, both Elder and Younger, and Ovid's Ars Amatoria.
Annotations follow the BIO scheme: names of peoples are annotated as GRP, names of persons as PRS, and names of geographical places as GEO.
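As an illustration of how BIO-tagged tokens map onto entities, here is a minimal Python sketch that groups tokens into labeled spans. It assumes tags are written as `B-PRS`, `I-PRS`, `O`, and so on; that tag spelling, the function name `bio_to_spans`, and the example sentence are assumptions for illustration, not taken from the annotation files.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs.

    Assumes tags look like "B-PRS", "I-PRS", or "O" (hypothetical spelling).
    """
    spans = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts (a stray I- tag is treated like B-).
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O": outside any entity.
            flush()
            current_tokens, current_label = [], None
    flush()
    return spans


tokens = ["Gaius", "Iulius", "Caesar", "in", "Galliam", "contendit"]
tags = ["B-PRS", "I-PRS", "I-PRS", "O", "B-GEO", "O"]
print(bio_to_spans(tokens, tags))
```

Treating a stray `I-` tag as the start of a new entity is one common convention; a stricter reader could raise an error instead.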
Further information on the corpus, including splits for training and testing, can be found in Erdmann et al. (2016), "Challenges and Solutions for Latin Named Entity Recognition." For citation purposes, however, please see the Acknowledgments section below for the more recent and relevant publication to cite.
Tagger
The Herodotos Project Latin NER Tagger is trained on the entire set of Latin data included in this repository using the BiLSTM-CRF architecture of Lample et al. (2016), "Neural Architectures for Named Entity Recognition".
Prerequisites
To run the tagger, make sure the below packages and any dependencies have been installed.
The tagger can be called with the following commands:
cd Herodotos_Project_Latin_NER_tagger
python tagger.py --input sample.in.tok > sample.out.tags
The input should already be tokenized, with clitics separated and one sentence per line, as in sample.in.tok. Near-optimal performance can be achieved with simple punctuation-and-white-space tokenization, or even white-space tokenization alone, because Latin cliticization is relatively infrequent and the tagger handles character-level features robustly.
The output lists all named entities identified on each line as triples. Each triple contains: (1) the character offset within the corresponding line where the named entity starts; (2) the full span of the named entity; and (3) the label of the named entity.
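The punctuation-and-white-space tokenization mentioned above could be approximated as follows. This is a rough Python 3 sketch, not the preprocessing used by the authors: the function name `tokenize_line` and the regular expression are assumptions, and the sketch makes no attempt to separate clitics such as -que.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_line(line):
    """Return the sentence with tokens separated by single spaces."""
    return " ".join(TOKEN_PATTERN.findall(line))


sentence = "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae."
print(tokenize_line(sentence))
```

Applying such a function to each sentence and writing one result per line would produce a file in the same general shape as sample.in.tok.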
Alternative input formats can be specified with the --inputFormat option. Supported alternatives include the conll and crfsuite formats. The CoNLL format has one token per line, followed by a tab and then its label; because the tagger predicts the labels, the label column can hold any placeholder. Sentence breaks are indicated by a blank line. See sample.in.conll for an example.
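To make the CoNLL layout concrete, here is a small Python sketch that serializes tokenized sentences with a placeholder label. The function name `to_conll` and the choice of `O` as the placeholder are assumptions; sample.in.conll remains the authoritative reference for the exact layout.

```python
def to_conll(sentences, placeholder="O"):
    """Serialize tokenized sentences as token<TAB>label lines.

    Sentences are separated by one blank line. The placeholder label is
    arbitrary, since the tagger predicts the labels itself.
    """
    blocks = []
    for sentence in sentences:
        lines = [f"{token}\t{placeholder}" for token in sentence]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


sentences = [["Caesar", "in", "Galliam", "contendit", "."],
             ["Helvetii", "legatos", "mittunt", "."]]
print(to_conll(sentences), end="")
```

The returned string can be written to a file and passed to the tagger with the conll input format selected.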