This repository contains the accompanying code for the paper:
"CDLM: Cross-Document Language Modeling." Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E Peters, Arie Cattan
and Ido Dagan. In EMNLP Findings, 2021.
[PDF]
Structure
The repository contains:
Implementation of the CDLM pretraining, based on the Huggingface code (in the pretraining dir).
Code for finetuning on cross-document coreference resolution (in the cross_encoder dir).
Code for finetuning on multi-document classification tasks (in the CDA dir).
Code for the attention analysis over the sampled ECB+ dataset (in the attention_analysis dir).
Code for finetuning on the multi-hop question answering task over the HotpotQA dataset, including instructions, appears here.
Pretrained Model Usage
You can either pretrain the model yourself or use the pretrained CDLM model weights and tokenizer files, which are available on HuggingFace.
Then, use:

from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/cdlm')
model = AutoModel.from_pretrained('biu-nlp/cdlm')
Please note that during pretraining we used document and sentence separator tokens, which you might want to add to your data. The document separators are <doc-s> and </doc-s> (the last two tokens in the vocabulary), and the sentence separators are <s> and </s>.
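For illustration only, here is a minimal sketch of encoding two documents with these separator tokens. The example sentences are hypothetical, and the exact input formatting (for instance, whether to also wrap individual sentences with <s> and </s>) should follow the finetuning scripts in this repository.

from transformers import AutoTokenizer, AutoModel

# load the pretrained CDLM model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/cdlm')
model = AutoModel.from_pretrained('biu-nlp/cdlm')

# two toy documents (hypothetical example text)
doc1 = "Obama visited Paris last week."
doc2 = "The former president arrived in France on Monday."

# wrap each document with the document separator tokens and concatenate
text = f"<doc-s> {doc1} </doc-s> <doc-s> {doc2} </doc-s>"

# encode the concatenated input and run a forward pass
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)

# contextualized token representations: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)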
Citation:
If you find our work useful, please cite the paper as:
@article{caciularu2021CDLM,
title={CDLM: Cross-Document Language Modeling},
author={Caciularu, Avi and Cohan, Arman and Beltagy, Iz and Peters, Matthew E and Cattan, Arie and Dagan, Ido},
journal={Findings of the Association for Computational Linguistics: EMNLP 2021},
year={2021}
}