```bash
# $MODEL_PATH is where you saved the fine-tuned model.
# $DATASET_NAME is FUNSD or INV-CDIP.
bash reproduce_results.sh $MODEL_PATH $DATA_DIR/$DATASET_NAME
```
You should get the following results:

| Dataset  | Precision | Recall | F1   |
|----------|-----------|--------|------|
| FUNSD    | 60.4      | 60.9   | 60.7 |
| INV-CDIP | 50.5      | 47.6   | 49.0 |
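
As a quick sanity check (not part of the repo), F1 is the harmonic mean of precision and recall, so the reported numbers can be verified directly:

```python
# Sanity check (not repo code): F1 = 2 * P * R / (P + R).
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1(60.4, 60.9), 2))  # 60.65 -- consistent with the reported 60.7
                                 # (the paper rounds P/R before printing)
print(round(f1(50.5, 47.6), 2))  # 49.01 -- matches the reported 49.0
```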
## Pre-training
You can skip the following steps by downloading our pre-trained SimpleDLM model here.
```bash
# $NUM_GPUS is the number of GPUs to use for pre-training. To reproduce
# the paper's results, we recommend using 8 GPUs.
# $MODEL_PATH is where you saved the LayoutLM model.
# $PRETRAIN_DATA_FOLDER is the folder containing the IIT-CDIP hOCR files.
python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS pretraining.py \
    --model_name_or_path $MODEL_PATH --data_dir $PRETRAIN_DATA_FOLDER \
    --output_dir $OUTPUT_DIR
```
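
For reference, hOCR files are plain HTML in which each recognized word is a `span` of class `ocrx_word` whose `title` attribute carries the bounding box. A minimal sketch of pulling (word, bbox) pairs out of one IIT-CDIP hOCR file; the `read_hocr` helper and the BeautifulSoup dependency are illustrative assumptions, not code from this repo:

```python
# Minimal sketch (not repo code): extract (word, bbox) pairs from an hOCR
# file. Standard hOCR stores each word as
#   <span class="ocrx_word" title="bbox x0 y0 x1 y1; ...">word</span>
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def read_hocr(path):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    words = []
    for span in soup.find_all("span", class_="ocrx_word"):
        match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", span.get("title", ""))
        text = span.get_text(strip=True)
        if match and text:
            words.append((text, tuple(map(int, match.groups()))))
    return words

# read_hocr("page.hocr") -> [("Invoice", (82, 40, 210, 64)), ...]
```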
## Fine-tuning

Run fine-tuning as follows:
```bash
# $MODEL_PATH is where you saved the pre-trained SimpleDLM model.
CUDA_VISIBLE_DEVICES=0 python run_query_value_retrieval.py --model_type simpledlm --model_name_or_path $MODEL_PATH \
    --data_dir $DATA_DIR/FUNSD/ --output_dir $OUTPUT_DIR --do_train --evaluate_during_training
```
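
After training, the checkpoint in `$OUTPUT_DIR` can be reloaded for inspection or inference. The sketch below assumes the checkpoint follows the Hugging Face `transformers` save format that LayoutLM-style models use; the repo's actual SimpleDLM class may differ from `LayoutLMModel`:

```python
# Sketch under the assumption that the checkpoint is Hugging Face-compatible
# with LayoutLM. Replace the path with your $OUTPUT_DIR.
import torch
from transformers import LayoutLMModel, LayoutLMTokenizer

model = LayoutLMModel.from_pretrained("/path/to/output_dir")
tokenizer = LayoutLMTokenizer.from_pretrained("/path/to/output_dir")
model.eval()

# LayoutLM-style models expect a normalized (0-1000) bounding box per token
# in addition to the token ids; a full-page box is used here as a dummy.
inputs = tokenizer("Total Amount", return_tensors="pt")
bbox = torch.tensor([[[0, 0, 1000, 1000]] * inputs.input_ids.shape[1]])
with torch.no_grad():
    outputs = model(**inputs, bbox=bbox)
print(outputs.last_hidden_state.shape)
```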
## Citation
If you find this codebase useful, please cite our paper:
```bibtex
@article{gao2021value,
  title={Value Retrieval with Arbitrary Queries for Form-like Documents},
  author={Gao, Mingfei and Xue, Le and Ramaiah, Chetan and Xing, Chen and Xu, Ran and Xiong, Caiming},
  journal={arXiv preprint arXiv:2112.07820},
  year={2021}
}
```