HanTa is a pure Python package for lemmatization and POS tagging of Dutch, English and German sentences. The approach is to some extent language-independent, and language models for more languages will be added in the future.
Lemmatization and POS tagging are based on the morphological analysis of a word. The morphological analysis is done by a Hidden Markov Model that tries to find the most probable sequence of morphemes underlying each word.
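To make the idea of a "best sequence of morphemes" concrete, here is a minimal, self-contained sketch of a Viterbi-style search over the segmentations of a single word. It is not HanTa's implementation: the tag names, the `EMISSION` and `TRANSITION` tables, and all probabilities are invented for illustration, and the function `best_segmentation` is a hypothetical helper.

```python
import math

# Hypothetical toy model: P(morpheme | tag). All numbers are made up.
EMISSION = {
    "NN": {"fach": 0.3, "markt": 0.4, "märkt": 0.1},
    "SUF_NN": {"e": 0.6, "en": 0.4},
}
# Hypothetical transition probabilities P(next tag | previous tag).
TRANSITION = {
    ("START", "NN"): 1.0,
    ("NN", "NN"): 0.3,
    ("NN", "SUF_NN"): 0.4,
    ("NN", "END"): 0.3,
    ("SUF_NN", "END"): 1.0,
}


def best_segmentation(word):
    """Return the most probable list of (morpheme, tag) pairs, or None."""
    # best[(i, tag)] = (log probability, backpointer) of the best analysis
    # of word[:i] whose last morpheme carries `tag`.
    best = {(0, "START"): (0.0, None)}
    for end in range(1, len(word) + 1):
        for start in range(end):
            morph = word[start:end]
            for tag, emissions in EMISSION.items():
                if morph not in emissions:
                    continue
                # Extend every analysis that ends exactly where `morph` starts.
                for (pos, prev), (score, _) in list(best.items()):
                    trans = TRANSITION.get((prev, tag), 0.0)
                    if pos != start or trans == 0.0:
                        continue
                    cand = score + math.log(trans) + math.log(emissions[morph])
                    if (end, tag) not in best or cand > best[(end, tag)][0]:
                        best[(end, tag)] = (cand, (start, prev, morph))

    # Pick the best analysis covering the whole word that may also end there.
    final = None
    for (pos, tag), (score, _) in best.items():
        trans = TRANSITION.get((tag, "END"), 0.0)
        if pos == len(word) and trans > 0.0:
            total = score + math.log(trans)
            if final is None or total > final[0]:
                final = (total, tag)
    if final is None:
        return None

    # Follow the backpointers from the end of the word to its start.
    morphemes = []
    pos, tag = len(word), final[1]
    while best[(pos, tag)][1] is not None:
        start, prev, morph = best[(pos, tag)][1]
        morphemes.append((morph, tag))
        pos, tag = start, prev
    return morphemes[::-1]


print(best_segmentation("fachmärkte"))
```

With this toy model, `best_segmentation("fachmärkte")` yields `[('fach', 'NN'), ('märkt', 'NN'), ('e', 'SUF_NN')]`, and a word that cannot be covered by known morphemes yields `None`. A real model would also need a strategy for unseen stems, which this sketch leaves out.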
The package also contains a simple trigram-based POS tagger that uses the probabilities from the morphological analysis for unknown words (and for infrequent words from the training data).
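The following sketch illustrates the general shape of such a tagger: a trigram Viterbi search over tags that looks up known words in a lexicon and falls back to a separate guesser for unknown words. Everything here is an assumption made for illustration: the tag set, the `TRIGRAM` and `LEXICON` tables, the `FLOOR` smoothing constant, and the capitalization heuristic in `guess_unknown`, which merely stands in for the morphological analysis and is not how HanTa scores unknown words.

```python
import math

TAGS = ["ART", "NN", "VV"]

# Hypothetical trigram scores P(tag | two previous tags); unseen ones get FLOOR.
TRIGRAM = {
    ("<s>", "<s>", "ART"): 0.6,
    ("<s>", "<s>", "NN"): 0.3,
    ("<s>", "ART", "NN"): 0.9,
    ("<s>", "NN", "VV"): 0.6,
    ("ART", "NN", "VV"): 0.7,
}
# Hypothetical emission scores for words seen in training.
LEXICON = {
    "die": {"ART": 0.3},
    "findet": {"VV": 0.01},
}
FLOOR = 1e-6  # crude smoothing for unseen events


def guess_unknown(word):
    """Stand-in for the morphological analysis (a made-up heuristic)."""
    if word[:1].isupper():
        return {"NN": 0.9, "VV": 0.05, "ART": 0.05}
    return {"VV": 0.6, "NN": 0.3, "ART": 0.1}


def tag_sentence(words):
    """Return the best tag sequence for `words` under the toy trigram model."""
    # scores[(t1, t2)] = (log score, tag path) of the best sequence whose
    # last two tags are t1 and t2.
    scores = {("<s>", "<s>"): (0.0, [])}
    for word in words:
        # Known words use the lexicon; unknown words use the fallback guesser.
        emissions = LEXICON.get(word) or guess_unknown(word)
        new_scores = {}
        for (t1, t2), (score, path) in scores.items():
            for tag in TAGS:
                cand = (score
                        + math.log(TRIGRAM.get((t1, t2, tag), FLOOR))
                        + math.log(emissions.get(tag, FLOOR)))
                key = (t2, tag)
                if key not in new_scores or cand > new_scores[key][0]:
                    new_scores[key] = (cand, path + [tag])
        scores = new_scores
    return max(scores.values(), key=lambda item: item[0])[1]


print(tag_sentence(["die", "Europawahl", "findet"]))
```

Here "Europawahl" is not in the toy lexicon, so its emission scores come from `guess_unknown`, and the sentence is tagged `['ART', 'NN', 'VV']`.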
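import nltk
from pprint import pprint
from HanTa import HanoverTagger as ht

# tagger_de is assumed to be a tagger loaded with the German model shipped with HanTa
tagger_de = ht.HanoverTagger('morphmodel_ger.pgz')

sent = "Die Europawahl in den Niederlanden findet immer donnerstags statt."

# Tokenize the sentence, then lemmatize and tag every token
words = nltk.word_tokenize(sent)
lemmata = tagger_de.tag_sent(words)
pprint(lemmata)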
Further reading
For more information refer to the following resources:
The main documentation: The Hanover Tagger (Version 1.1.0) - Lemmatization, Morphological Analysis and POS Tagging in Python. Hannover, 2023 Online Available
Original publication: Christian Wartena. A probabilistic morphology model for German lemmatization. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 40–49, Erlangen, Germany, 2019. German Society for Computational Linguistics & Language Technology. Online Available
About
The Hanover Tagger - A simple approach to lemmatization and POS tagging based on heuristics and Hidden Markov Models of German morphology