This data is in conllup format, which adds user-defined columns to the original 10 columns of the CoNLL-U format (from UD). Our data consists of four columns: the original ID column, plus three additional columns: UP:PRED, UP:ARGHEADS, and UP:ARGSPANS.
ID (column 1) is the token id, consistent with the corresponding UD sentence.
UP:PRED (column 11) contains the predicate sense label for this predicate. This sense provides roleset-specific meanings for each of its arguments, as defined in the English PropBank.
UP:ARGHEADS (column 12) contains the argument heads for the arguments of this predicate. Each argument is in the format label:token_id. Arguments are separated by the pipe character (|).
UP:ARGSPANS (column 13) contains the argument spans for the arguments of this predicate. Each argument is in the format label:start_token_id-end_token_id. Arguments are separated by the pipe character (|).
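To make the two argument formats concrete, here is a minimal Python sketch of how the UP:ARGHEADS and UP:ARGSPANS fields could be parsed. It is not part of the released tools. The function names are hypothetical, the sample values are invented for illustration, and it assumes that an empty field is written as `_` (as in CoNLL-U), that token ids are plain integers, and that a span without a hyphen covers a single token.

```python
from typing import List, Tuple

EMPTY = "_"  # assumption: CoNLL-U style placeholder for an empty field


def parse_arg_heads(field: str) -> List[Tuple[str, int]]:
    """Parse a UP:ARGHEADS value such as 'ARG0:2|ARG1:5' (hypothetical helper)."""
    if field in ("", EMPTY):
        return []
    heads = []
    for item in field.split("|"):
        # Split on the last colon, so the label is kept whole.
        label, _, token_id = item.rpartition(":")
        heads.append((label, int(token_id)))
    return heads


def parse_arg_spans(field: str) -> List[Tuple[str, int, int]]:
    """Parse a UP:ARGSPANS value such as 'ARG0:1-2|ARG1:4-6' (hypothetical helper)."""
    if field in ("", EMPTY):
        return []
    spans = []
    for item in field.split("|"):
        label, _, span = item.rpartition(":")
        start, _, end = span.partition("-")
        # Assumption: a span with no hyphen covers a single token.
        spans.append((label, int(start), int(end or start)))
    return spans


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    print(parse_arg_heads("ARG0:2|ARGM-TMP:7"))
    print(parse_arg_spans("ARG0:1-2|ARGM-TMP:6-7"))
```

Splitting on the last colon with `rpartition` keeps labels that contain hyphens (for example modifier labels) intact, since only the part after the colon is interpreted as token ids.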
We provide a Python script to combine such a UP file with its corresponding UD file to produce the desired 13-column .conllup file. The script is available in the tools repository: up2/merge_ud_up.py. Follow the procedure:
Because the underlying parser sometimes identifies the wrong lemma for certain verbs, and because frame files are named after the lemma in the target language, some frame filenames may not make sense in that language.
Language peculiarities
In languages where the subject or object can be omitted, role labels may be transferred incorrectly. One potential cause is incorrect word alignment.
AUX (be, have, do) in EN is likely to be misaligned with tokens in other languages. In EN, these auxiliaries are used to construct tenses (perfect perfective), polarity, etc., but other languages represent tense and polarity differently.