This project is a direct port of Pragmatic Segmenter, which provides rule-based sentence
boundary detection.
Usage
The Segmenter class provides the Segment method, which in its simplest form takes a single string:
using PragmaticSegmenterNet;
IReadOnlyList<string> result = Segmenter.Segment("One Sentence. And another sentence.");
// ["One Sentence.", "And another sentence."]
IReadOnlyList<string> result2 = Segmenter.Segment("Anything.", Language.Italian);
// ["Anything."]
The Segment method has a number of optional parameters:
IReadOnlyList<string> Segment(string text, Language language = Language.English, bool cleanText = true, DocumentType documentType = DocumentType.Any)
language - An enum value selecting one of the supported languages. Defaults to Language.English. See the Languages list below for the currently supported values.
cleanText - A boolean indicating whether the input text should be cleaned prior to segmentation. Cleaning removes extra newlines and whitespace. Defaults to true.
documentType - Used by the text cleaning process to determine which reformatting to apply. For PDFs this handles newlines in the middle of a sentence, whereas for HTML documents it handles HTML tags. Defaults to DocumentType.Any, which does not apply any special formatting.
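As a minimal sketch of the optional parameters, the call below uses C# named arguments to override the defaults explicitly. It relies only on the signature shown above and assumes a project that allows top-level statements (C# 9 or later). The sample text is made up for illustration, and the exact sentences returned depend on the library's rules, so no output is shown.

```csharp
using System;
using System.Collections.Generic;
using PragmaticSegmenterNet;

// Illustrative input; any German text would do.
string text = "Das ist ein Satz. Hier ist noch einer.";

// Named arguments make it clear which defaults are being overridden.
IReadOnlyList<string> sentences = Segmenter.Segment(
    text,
    language: Language.German,
    cleanText: false,                 // skip the cleaning step
    documentType: DocumentType.Any);  // the default; no special reformatting

foreach (string sentence in sentences)
{
    Console.WriteLine(sentence);
}
```

Because every parameter after text has a default, any of the named arguments can be dropped without changing the others.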
Languages
English = 0 (default)
Amharic = 1
Arabic = 2
Armenian = 3
Bulgarian = 4
Burmese = 5
Chinese = 6
Danish = 7
Dutch = 8
French = 9
German = 10
Greek = 11
Hindi = 12
Italian = 13
Japanese = 14
Kazakh = 15 (partial support, potentially only for the Cyrillic form of the alphabet)
Persian = 16
Polish = 17
Russian = 18
Spanish = 19
Urdu = 20
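When the language comes from user input or configuration rather than being hard-coded, one possible approach is to map its name onto the Language enum with the standard Enum.TryParse method. The sketch below assumes top-level statements (C# 9 or later); the languageName variable and the fallback to English are illustrative choices, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using PragmaticSegmenterNet;

// Hypothetical input, e.g. read from a configuration file or the command line.
string languageName = "italian";

// Enum.TryParse matches enum member names; ignoreCase accepts "italian" or "Italian".
if (!Enum.TryParse(languageName, ignoreCase: true, out Language language))
{
    // Fall back to the documented default when the name is not recognised.
    language = Language.English;
}

IReadOnlyList<string> sentences = Segmenter.Segment("Anything.", language);
Console.WriteLine($"{language}: {sentences.Count} sentence(s)");
```

Note that Enum.TryParse also accepts numeric strings such as "13", including numbers that match no enum member, so stricter validation may be needed for untrusted input.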
Credit
This project wouldn't be possible without the work done by the Pragmatic Segmenter team.