Package for aligning two (similar) tokenized versions of a text.
The package provides the following functionality:
(1) Token alignment for two texts
(2) Splitting texts at positions where they align well
(3) Creating sentence alignments from the tokenized alignments
The alignment of longer texts proceeds in four steps: (1) the documents are split at positions where they align well; (2) the resulting shorter text sequences are aligned token-wise; (3) the preliminary alignments are joined back together and cleaned; (4) optionally, the text is serialized back to sentence strings, yielding aligned sentence pairs.
`textalign.AlignmentPipeline` provides all of the above functionality. The individual steps are provided by other modules of the package.
Python 3.10+; see `requirements.txt`.
Set up a virtual environment like this:

```shell
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt
```
The script `align_short_demo.py` contains the code for aligning a short(ish) text without splitting it first. For aligning two versions of a longer text, refer to `test_alignment_pipeline.py` for now. There we use the `AlignmentPipeline` (see below), which performs steps (1)-(4) described above.
Class to create a token-wise alignment (e.g. with the Needleman-Wunsch algorithm).
Prerequisite for a correct alignment: the input must have been created by a sliding window over norm/orig. For this to work, there must not be overly large passages missing from orig or norm that are present in the other text.
Post-correction of alignments:
- The Needleman-Wunsch algorithm generates a 1:1 (or 0:1/1:0) alignment of tokens. However, there are also n:1 and 1:n assignments (e.g. `(in, folge, dessen)` <-> `infolgedessen`, "in consequence of which"). These must be derived from the 1:1 assignments.
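The post-correction idea can be pictured with a small, self-contained sketch (the pair representation and the `merge_splits` helper below are hypothetical, not the package's API): consecutive indels on the `a` side that follow a match are greedily absorbed into one n:1 group as long as the concatenated `a` tokens remain a prefix of the matched `b` token.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def merge_splits(tokens_a: List[str], tokens_b: List[str],
                 pairs: List[Pair]) -> List[Tuple[List[int], List[int]]]:
    """Greedily group a 1:1 match plus following a-side indels into an
    n:1 assignment when the concatenated a-tokens stay a prefix of the
    b-token. Hypothetical helper for illustration only."""
    groups: List[Tuple[List[int], List[int]]] = []
    i = 0
    while i < len(pairs):
        a, b = pairs[i]
        if a is not None and b is not None:
            a_idxs, b_idxs = [a], [b]
            j = i + 1
            # absorb following a-side indels (split case: n:1)
            while (j < len(pairs) and pairs[j][1] is None
                   and pairs[j][0] is not None
                   and tokens_b[b].startswith(
                       "".join(tokens_a[k] for k in a_idxs + [pairs[j][0]]))):
                a_idxs.append(pairs[j][0])
                j += 1
            groups.append((a_idxs, b_idxs))
            i = j
        else:
            # plain indel: one side stays empty
            groups.append(([a] if a is not None else [],
                           [b] if b is not None else []))
            i += 1
    return groups
```

Only the n:1 direction is shown; the 1:n case (`gehts` <-> `(geht, es)`) would be handled symmetrically on the `b` side.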
`tokens_a`: `List[str]`
`tokens_b`: `List[str]`
`tok_alignments` (before applying `clean_alignments`): `List[textalign.AlignedPair]`, e.g. `[(a=0, b=0), (a=1, b=None), (a=2, b=1), (a=None, b=2), ...]`.
- `aligned_tokidxs` has the length of the produced alignment: at most `len(tokens_a) + len(tokens_b)` (if there were no matches at all and the cheapest alignment consisted solely of inserts/deletions, "indels"), and at least `max(len(tokens_a), len(tokens_b))`, i.e. the length of the longer token sequence (e.g. when one sequence is a substring of the other: only keep operations, plus indels up to the length of the longer sequence).
- The ints in the pairs are token indices into `tokens_a` and `tokens_b`. If an index is `None`, an indel assignment was made, i.e. the token either has no counterpart in the other text (e.g. a missing heading, page number, editor's note, ...) or belongs to a split/merge case such as `(in, folge, dessen)` <-> `infolgedessen` or `gehts` <-> `(geht, es)`.
`tok_alignments` (after applying `clean_alignments`): the `None` mappings have been handled, i.e. there is no `None` left on either side of the pairs; each was either resolved or deleted.
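The length bounds described above hold for any 1:1/indel alignment. They can be checked with a self-contained sketch that uses the standard library's `difflib` as a stand-in aligner (this is not the package's Needleman-Wunsch implementation):

```python
import difflib
from typing import List, Optional, Tuple


def align_1to1(tokens_a: List[str], tokens_b: List[str]
               ) -> List[Tuple[Optional[int], Optional[int]]]:
    """Produce a 1:1 / 0:1 / 1:0 token alignment from difflib opcodes.
    Illustrative stand-in for a Needleman-Wunsch aligner."""
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    sm = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            # matched tokens: pair up the two index ranges
            pairs += [(i, j) for i, j in zip(range(i1, i2), range(j1, j2))]
        else:
            # replace/delete/insert: emit indels (None on one side)
            pairs += [(i, None) for i in range(i1, i2)]
            pairs += [(None, j) for j in range(j1, j2)]
    return pairs
```

For any input, `max(len(tokens_a), len(tokens_b)) <= len(pairs) <= len(tokens_a) + len(tokens_b)`, matching the bounds stated for `aligned_tokidxs`.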
Utilities for splitting documents at positions where they match well. Useful for creating well-alignable parts of two documents: this way, the token-wise alignment can be computed on shorter passages, saving time and memory.
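One simple heuristic for finding such split positions (a hypothetical sketch, not the package's actual method): use tokens that occur exactly once in both texts as anchors, and keep only those anchors that appear in the same relative order on both sides.

```python
from collections import Counter
from typing import List, Tuple


def split_positions(tokens_a: List[str], tokens_b: List[str]
                    ) -> List[Tuple[int, int]]:
    """Return (idx_a, idx_b) pairs at which both texts can be cut.
    Heuristic sketch: tokens unique in both texts serve as anchors;
    only order-consistent anchors are kept."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    unique = {t for t in count_a if count_a[t] == 1 and count_b.get(t) == 1}
    pos_b = {t: i for i, t in enumerate(tokens_b) if t in unique}
    anchors: List[Tuple[int, int]] = []
    last_b = -1
    for i, t in enumerate(tokens_a):
        if t in unique and pos_b[t] > last_b:  # keep monotone anchors only
            anchors.append((i, pos_b[t]))
            last_b = pos_b[t]
    return anchors
```

Each returned pair marks a position where both token sequences can be cut, so the stretches between anchors can be aligned independently.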
Re-serialise the sentences (insert spaces where they were in the original).
- We need both texts in tokenised form (`tokens_orig: List[str]` and `tokens_hist: List[str]`) and the sentence boundaries of the orig text in a list `sent_start_idxs: List[int]`, e.g. `[1, 10, 28]`, where each int is the token index at which a sentence starts.
- We need a token alignment, created with `Aligner`.
- Make aligned sentences from the sentence indices and the token alignment.
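The steps above can be sketched with a self-contained helper (hypothetical, not the package's API; for simplicity it joins tokens with plain spaces instead of restoring the original spacing):

```python
from typing import List, Optional, Tuple


def aligned_sentence_pairs(
    tokens_orig: List[str],
    tokens_hist: List[str],
    sent_start_idxs: List[int],
    pairs: List[Tuple[Optional[int], Optional[int]]],
) -> List[Tuple[str, str]]:
    """For each orig sentence, collect the hist tokens aligned to its
    token range and serialise both sides with plain spaces.
    Illustrative sketch only."""
    bounds = sent_start_idxs + [len(tokens_orig)]
    out: List[Tuple[str, str]] = []
    for start, end in zip(bounds, bounds[1:]):
        # hist indices aligned to any orig token of this sentence
        hist_idxs = sorted(
            b for a, b in pairs
            if a is not None and b is not None and start <= a < end)
        out.append((" ".join(tokens_orig[start:end]),
                    " ".join(tokens_hist[i] for i in hist_idxs)))
    return out
```

A real implementation would also attach unaligned intervening hist tokens to a sentence; this sketch only keeps tokens with an explicit alignment.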
Initialize a pipeline with a config file in YAML. Refer to the `config.yaml` in the root directory of this project and to the test script `test_alignment_pipeline.py` for examples.
Call the pipeline object with two files containing tokenized documents. TODO: Currently, only WASTE-style input files are supported.
WASTE format:

```
Solchen 1091 7
naͤrꝛiſchen 1099 15
Leuten 1115 6
nun 1122 3
/ 1125 1 [$(]
mag 1127 3
ich 1131 3
mich 1135 4
nicht 1140 5
```
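Reading such lines can be sketched as follows (a hypothetical parser inferred only from the sample above; the field names and the optional trailing tag are assumptions, and the package's actual reader may differ):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WasteToken:
    text: str
    offset: int                # character offset in the source document (assumed)
    length: int                # character length of the token (assumed)
    tag: Optional[str] = None  # optional trailing tag, e.g. "[$(]"


def parse_waste(lines: List[str]) -> List[WasteToken]:
    """Parse WASTE-style lines of the form `token offset length [tag]`."""
    tokens: List[WasteToken] = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        text, offset, length = parts[0], int(parts[1]), int(parts[2])
        tag = parts[3] if len(parts) > 3 else None
        tokens.append(WasteToken(text, offset, length, tag))
    return tokens
```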
A pipeline call returns an object of type `List[AlignedSentences]`, based on the sentences in the source text. For each sentence in the source text, we get the tokens from the source text, the aligned or unaligned intervening tokens from the target text, and the alignment mapping between the two.
This can be used to create a representation of the document's sentence alignment with serialized sentences, e.g. in JSON format, like:

```json
{
  "orig": "(du moͤgſt wol Eſelsleben ſagen) in welchem man ſich auch nichts umb die Medicin bekuͤmmert.",
  "norm": "(du mögst wohl Eselsleben sagen) in welchem man sich auch nichts um die Medizin bekümmert."
}
```