CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models
- EMNLP-Findings 2020 Accepted.
- AI-Open 2021 Accepted.
- CokeBert-1.0 provides the original code and details to reproduce the results in the paper.
- CokeBert-2.0-latest refactors CokeBert-1.0 and provides a more user-friendly codebase. In this README.md, we mainly demonstrate the usage of CokeBert-2.0-latest.
- python==3.8
Please install all required packages by running:
bash requirements.sh
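If running requirements.sh is not convenient in your environment, the core dependencies can also be installed manually. The snippet below is only a minimal sketch assuming the usual PyTorch + Huggingface Transformers stack; the exact pinned versions are the ones listed in requirements.sh.
# Manual install of the assumed core dependencies (see requirements.sh for the exact pinned versions)
pip install torch transformers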
If you want to use our pre-trained Coke models directly, you can skip the pre-training steps below and go directly to the fine-tuning part.

Go to CokeBert-2.0-latest:
cd CokeBert-2.0-latest

Please follow the ERNIE pipeline to pre-process your pre-training data. Note that you need to decide on the backbone model and use its corresponding tokenizer to process the data; the Coke framework currently supports two series of models (BERT and RoBERTa). You will then obtain merge.bin and merge.idx; move them to the corresponding directory as follows.
# BACKBONE can be `bert-base-uncased`, `roberta-base`, `bert-large-uncased`, `roberta-large`
export BACKBONE=bert-base-uncased
export HOP=2
mkdir data/pretrain/$BACKBONE
mv merge.bin data/pretrain/$BACKBONE
mv merge.idx data/pretrain/$BACKBONE

Download the backbone model checkpoints from Huggingface and move them to the corresponding checkpoint folder for pre-training. Note that you should not download config.json, since we create a new config for Coke.
# BACKBONE can be `bert-base-uncased`, `roberta-base`, `bert-large-uncased`, `roberta-large`
BACKBONE=bert-base-uncased
mkdir -p checkpoint/coke-$BACKBONE
wget https://huggingface.co/$BACKBONE/resolve/main/vocab.txt -O checkpoint/coke-$BACKBONE/vocab.txt
wget https://huggingface.co/$BACKBONE/resolve/main/pytorch_model.bin -O checkpoint/coke-$BACKBONE/pytorch_model.bin

Download the knowledge embedding (including the entity-to-id and relation-to-id information) and the knowledge graph neighbor information from here1 or here2. Move them to the data/pretrain folder and unzip them.
cd data/pretrain
tar zxvf kg_embed.tar.gz
rm -rf kg_embed.tar.gz
tar zxvf kg_neighbor.tar.gz
rm -rf kg_neighbor.tar.gz
cd ../..
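As a quick sanity check, you can load the unpacked knowledge embeddings in Python. The file names below (entity2id.txt, entity2vec.vec) follow the ERNIE-style kg_embed layout and are assumptions here; adjust them to whatever the archive actually contains.
# Sketch: inspect the downloaded knowledge embeddings (file names are assumptions).
import numpy as np

entity2id = {}
with open('data/pretrain/kg_embed/entity2id.txt') as f:
    num_entities = int(f.readline())       # first line: number of entities
    for line in f:
        parts = line.split()
        if len(parts) == 2:                 # "Q123 0" -> {'Q123': 0}
            entity2id[parts[0]] = int(parts[1])

entity_vecs = np.loadtxt('data/pretrain/kg_embed/entity2vec.vec')  # one embedding vector per line
print(num_entities, entity_vecs.shape)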
(Optional) If you want to generate the knowledge graph neighbors yourself, you can run the following code to get the new kg_neighbor data.
cd data/pretrain
python preprocess_n.py
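For intuition, the kg_neighbor data is essentially an adjacency map from each entity to its (relation, neighbor) pairs, which Coke queries when selecting contextual knowledge. The sketch below only illustrates that idea; it is not the actual preprocess_n.py, and the triple file name and format are assumptions.
# Illustrative sketch only (NOT the real preprocess_n.py): build an entity -> [(relation, neighbor), ...] map.
from collections import defaultdict

neighbors = defaultdict(list)
with open('triples.txt') as f:               # hypothetical file: one "head relation tail" triple per line
    for line in f:
        head, rel, tail = line.split()
        neighbors[head].append((rel, tail))
        neighbors[tail].append((rel, head))  # also keep the reverse direction

# 1-hop neighbors of an entity; 2-hop neighbors are the neighbors of these, and so on.
print(neighbors['Q42'][:5])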
Go to the example folder and run run_pretrain.sh:
cd example
bash run_pretrain.sh

You can set BACKBONE (the backbone model) and HOP (the number of hops) in run_pretrain.sh:
# BACKBONE can be `bert-base-uncased`, `roberta-base`, `bert-large-uncased`, `roberta-large`
export BACKBONE=bert-base-uncased
export HOP=2
export PYTHONPATH=../src:$PYTHONPATH
rm outputs/pretrain_coke-$BACKBONE-$HOP/*
python run_pretrain.py \
--output_dir outputs \
--data_dir ../data/pretrain \
--backbone $BACKBONE \
--neighbor_hop $HOP \
--do_train \
--max_seq_length 256 \
--K_V_dim 100 \
--Q_dim 768 \
--train_batch_size 32 \
--self_att

This will write logs and checkpoints to ./outputs. Check CokeBert-2.0-latest/src/coke/training_args.py for more arguments.
We download the fine-tuning datasets and the corresponding annotations from here1 or here2. Then, please unzip and save them to the corresponding directory.
cd CokeBert-2.0-latest/data
wget https://cloud.tsinghua.edu.cn/f/3036fa28168c4fb7a320/?dl=1
mv 'index.html?dl=1' data.zip
tar -xvf data.zip finetune
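After extraction, there should be one folder per downstream dataset under the finetune directory, matching the DATASET choices used later in run_finetune.sh. A quick check (the exact folder layout is an assumption; adjust if the archive differs):
# Still inside CokeBert-2.0-latest/data; expected folders (assumed): FIGER, OpenEntity, fewrel, tacred
ls finetune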
(Option 1: Load from Huggingface) You can load the pre-trained Coke checkpoints from here, use them in Python, and start fine-tuning. For example, the following code demonstrates how to load a 2-hop Coke bert-base model.
from coke import CokeBertModel
model = CokeBertModel.from_pretrained('yushengsu/coke-bert-base-uncased-2hop')
# You can use this model to start fine-tuning.
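The checkpoint above only covers the model; the input text still has to be tokenized with the matching backbone tokenizer from Huggingface Transformers. The snippet below is a minimal sketch of the text side only; CokeBertModel's forward pass additionally expects entity/knowledge-graph inputs, which run_finetune.py in the example folder prepares for you.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')   # match the backbone of the Coke checkpoint
inputs = tokenizer('Steve Jobs founded Apple.', return_tensors='pt')
# Note: this only prepares the token ids; the entity/KG inputs are built in run_finetune.py.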
(Option 2: Load from local) You can also download the pre-trained Coke checkpoints from here and run the following script to fine-tune. Note that you need to move the pre-trained Coke model checkpoint pytorch_model.bin to the corresponding directory, such as DKPLM/data/DKPLM_BERTbase_2layer for the 2-hop bert-base-uncased model and DKPLM/data/DKPLM_RoBERTabase_2layer for the 2-hop roberta-base model.
# $BACKBONE=BACKBONE (`bert-base-uncased`, `roberta-base`, etc.)
# $HOP=HOP (1 or 2)
mv outputs/pretrain_coke-$BACKBONE-$HOP/pytorch_model.bin ../checkpoint/coke-$BACKBONE/pytorch_model.bin

Then you can start fine-tuning by running the following commands (refer to CokeBert-2.0-latest/example/run_finetune.sh).
cd CokeBert-2.0-latest/example
bash run_finetune.sh

The script run_finetune.sh is as follows:
# BACKBONE can be `bert-base-uncased`, `roberta-base`, `bert-large-uncased`, `roberta-large`
export BACKBONE=bert-base-uncased
export HOP=2
export PYTHONPATH=../src:$PYTHONPATH
# DATASET can be `FIGER`, `OpenEntity`, `fewrel`, `tacred`
DATASET=DATASET
python3 run_finetune.py \
--output_dir outputs \
--do_train \
--do_lower_case \
--data_dir ../data/finetune/$DATASET/ \
--backbone $BACKBONE \
--neighbor_hop $HOP \
--max_seq_length 256 \
--train_batch_size 64 \
--learning_rate 2e-5 \
--num_train_epochs 16 \
--loss_scale 128 \
--K_V_dim 100 \
--Q_dim 768 \
--self_att

Please cite our paper if you use CokeBert in your work:
@article{SU2021,
title = {CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models},
author = {Yusheng Su and Xu Han and Zhengyan Zhang and Yankai Lin and Peng Li and Zhiyuan Liu and Jie Zhou and Maosong Sun},
journal = {AI Open},
year = {2021},
issn = {2666-6510},
doi = {10.1016/j.aiopen.2021.06.004},
url = {https://arxiv.org/abs/2009.13964},
}
