We introduce OCRVerse, which advances traditional document OCR toward next-generation holistic OCR through comprehensive data and methodological practices. OCRVerse not only recognizes traditional optical characters, but also parses complex visual symbols into code-level representations, enabling broad applications across domains including statistics, office documents, mathematics, chemistry, and physics. To this end, we construct a large-scale interdisciplinary dataset spanning heterogeneous data sources, with innovative practices in data rendering and model-based synthesis. Based on this, we develop an end-to-end lightweight vision-language model (built on Qwen3-VL 4B) with two specialized variants: OCRVerse-text, dedicated to character-level output, and OCRVerse-code, specialized in code-level output. We conduct extensive experiments to validate the effectiveness of our approach and reveal the potential of holistic OCR. Experimental results show that our method achieves an overall score of 87.9 on OmniDocBench, competitive with state-of-the-art end-to-end VLMs. Beyond documents, our method demonstrates comprehensive advances on a wider range of charts, web pages, SVGs, molecular formulas, and circuit diagrams, taking a key step towards holistic OCR applications.
- 2025.11.03: We upload our model weights OCRVerse-code to HuggingFace.
- 2025.10.27: We upload our model weights OCRVerse-text to HuggingFace.
| Model | Download Link |
|---|---|
| OCRVerse-text | DocTron/OCRVerse-text |
| OCRVerse-code | DocTron/OCRVerse-code |
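The weights can also be fetched programmatically. The snippet below is a minimal sketch using `huggingface_hub`; the `local_dir` paths are arbitrary examples, not required locations.

```python
# Minimal sketch: download the released weights from HuggingFace.
# Requires `pip install huggingface_hub`; local_dir values are example paths.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="DocTron/OCRVerse-text", local_dir="./OCRVerse-text")
snapshot_download(repo_id="DocTron/OCRVerse-code", local_dir="./OCRVerse-code")
```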
OCRVerse encompasses both text-level and code-level data sources, comprehensively supporting the data requirements of holistic OCR.
- The text-level data sources span nine scenario types: natural scenes, books, magazines, papers, reports, slides, exam papers, notes, and newspapers. These categories cover high-frequency, everyday text carriers, fulfill fundamental OCR needs, and avoid both scenario redundancy and coverage gaps.
- The code-level data sources comprise six scenario types: charts, webpages, icons, geometry, circuits, and molecules. These focus on professional structured scenarios and address gaps not covered by text-level categories.
Our training dataset is constructed through a systematic multi-stage pipeline that integrates both text-level and code-level data sources to ensure comprehensive coverage and high quality.
Text-level data construction. To build a multi-scenario, multi-type document OCR dataset, we combine open-source and self-built data to balance scale and quality.
- Open-source data provides low-cost, large-scale coverage but suffers from uneven quality due to scattered sources and the lack of unified annotation standards; we employ VLMs for quality optimization to improve usability.
- To address gaps in real-world scenarios, self-built data serves as a key supplement (a minimal rendering sketch follows this list):
  - We collect real PDF documents that match practical layouts, fonts, colors, and resolutions, with VLM-powered precise annotation.
  - We crawl public, high-quality online documents and convert them to images via browser rendering to enrich data types and expand scenario coverage.
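The exact rendering stack is not specified here; the snippet below is only an illustrative sketch of the browser-rendering step described above, using Playwright as an example choice. The URL, viewport, and output path are placeholders.

```python
# Illustrative sketch: render a crawled HTML document to a training image.
# Assumes `pip install playwright` and `playwright install chromium`;
# URL and output path are placeholders, not part of the released pipeline.
from playwright.sync_api import sync_playwright

def render_page_to_image(url: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1656})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_path, full_page=True)  # save the rendered page as an image
        browser.close()

render_page_to_image("https://example.com/doc.html", "./rendered_doc.png")
```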
Code-level data construction. We begin by curating a diverse corpus from open-source datasets through rigorous filtering and diversity-aware sampling. Subsequently, we employ specialized VLMs for high-quality re-annotation to ensure label accuracy and consistency. Finally, we enhance the data through execution validation and rendering processes to generate executable code-image pairs.
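As a rough illustration of the execution-validation step (the actual pipeline is not released), one can run each candidate code snippet in a subprocess and keep only samples that execute cleanly and render an image. The helper below is a hypothetical sketch for matplotlib-style code.

```python
# Illustrative sketch of execution validation for code-image pairs.
# A candidate matplotlib script is kept only if it runs and writes an image file.
import os
import subprocess
import tempfile

def validate_and_render(code: str, out_image: str, timeout: int = 30) -> bool:
    """Return True if `code` executes successfully and produces `out_image`."""
    wrapped = code + f"\nimport matplotlib.pyplot as plt\nplt.savefig({out_image!r})\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(wrapped)
        script = f.name
    try:
        result = subprocess.run(["python", script], capture_output=True, timeout=timeout)
        return result.returncode == 0 and os.path.exists(out_image)
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(script)
```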
OCRVerse-text is evaluated on OmniDocBench v1.5, a comprehensive document OCR benchmark covering diverse real-world scenarios (e.g., office documents, academic papers, scanned materials). Results show OCRVerse-text delivers competitive performance, demonstrating strong adaptability to practical document OCR demands.
End-to-end evaluation assesses the model's accuracy in parsing full PDF pages: the model's Markdown output for the entire page is used as the prediction. The Overall score aggregates the text, formula, table, and reading-order metrics reported in the table below.
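The text and reading-order columns are normalized edit distances (lower is better). The snippet below is a minimal, self-contained sketch of such a metric; the official OmniDocBench implementation may differ in preprocessing and normalization details.

```python
# Minimal sketch of a normalized edit distance: Levenshtein distance divided by
# the longer sequence length. The official benchmark code may normalize differently.
def normalized_edit_distance(pred: str, ref: str) -> float:
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (pred[i - 1] != ref[j - 1]))   # substitution
            prev = cur
    return dp[n] / max(m, n)

print(normalized_edit_distance("hello world", "hallo world"))  # ~0.09
```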
| Model Type | Methods | Release Date | End-to-End | Parameters | Overall↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Reading Order Edit↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | 2025 | ❌ | - | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
| | MinerU2-pipeline | 2025 | ❌ | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
| | PP-StructureV3 | 2024 | ❌ | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| General VLMs | GPT-4o | 2024 | ✅ | - | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
| | InternVL3-76B | 2025 | ✅ | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| | InternVL3.5-241B | 2025 | ✅ | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| | Qwen2.5-VL-72B | 2025 | ✅ | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| | Gemini-2.5 Pro | 2025 | ✅ | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| Specialized VLMs | Dolphin | 2025.05 | ❌ | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| | MinerU2-VLM | 2025.06 | ❌ | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| | MonkeyOCR-pro-1.2B | 2025.07 | ❌ | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| | MonkeyOCR-3B | 2025.06 | ❌ | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| | MonkeyOCR-pro-3B | 2025.07 | ❌ | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| | MinerU2.5 | 2025.09 | ❌ | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| | PaddleOCR-VL | 2025.10 | ❌ | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 |
| | OCRFlux-3B | 2025.06 | ✅ | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| | Mistral OCR | 2025.03 | ✅ | - | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| | POINTS-Reader | 2025.08 | ✅ | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| | olmOCR-7B | 2025.02 | ✅ | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| | Nanonets-OCR-s | 2025.06 | ✅ | 3B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| | Deepseek-OCR | 2025.10 | ✅ | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.80 | 0.086 |
| | dots.ocr | 2025.07 | ✅ | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| | OCRVerse | 2025.10 | ✅ | 4B | 88.65 | 0.051 | 88.38 | 82.67 | 86.63 | 0.062 |
The following table reports the text recognition performance (Edit Distance) of OCRVerse across nine document types, offering deeper insight into its capabilities and limitations in different real-world document scenarios.
| Model Type | Models | End-to-End | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | ❌ | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
| | MinerU2-pipeline | ❌ | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.395 | 0.0736 | 0.2603 | 0.0411 |
| | PP-StructureV3 | ❌ | 0.0794 | 0.0236 | 0.0415 | 0.1107 | 0.0945 | 0.0722 | 0.0617 | 0.1236 | 0.0181 |
| General VLMs | GPT-4o | ✅ | 0.1019 | 0.1203 | 0.1288 | 0.1599 | 0.1939 | 0.142 | 0.6254 | 0.2611 | 0.3343 |
| | InternVL3-76B | ✅ | 0.0349 | 0.1052 | 0.0629 | 0.0827 | 0.1007 | 0.0406 | 0.5826 | 0.0924 | 0.0665 |
| | InternVL3.5-241B | ✅ | 0.0475 | 0.0857 | 0.0237 | 0.1061 | 0.0933 | 0.0577 | 0.6403 | 0.1357 | 0.1117 |
| | Qwen2.5-VL-72B | ✅ | 0.0422 | 0.0801 | 0.0586 | 0.1146 | 0.0681 | 0.0964 | 0.238 | 0.1232 | 0.0264 |
| | Gemini-2.5 Pro | ✅ | 0.0326 | 0.0182 | 0.0694 | 0.1618 | 0.0937 | 0.0161 | 0.1347 | 0.1169 | 0.0169 |
| Specialized VLMs | Dolphin | ❌ | 0.0957 | 0.0453 | 0.0616 | 0.1333 | 0.1684 | 0.0702 | 0.2388 | 0.2561 | 0.0186 |
| | MinerU2-VLM | ❌ | 0.0745 | 0.0104 | 0.0357 | 0.1276 | 0.0698 | 0.0652 | 0.1831 | 0.0803 | 0.0236 |
| | MonkeyOCR-pro-1.2B | ❌ | 0.0961 | 0.0354 | 0.053 | 0.111 | 0.0887 | 0.0494 | 0.0995 | 0.1686 | 0.0198 |
| | MonkeyOCR-pro-3B | ❌ | 0.0904 | 0.0362 | 0.0489 | 0.1072 | 0.0745 | 0.0475 | 0.0962 | 0.1165 | 0.0196 |
| | MinerU2.5 | ❌ | 0.0294 | 0.0235 | 0.0332 | 0.0499 | 0.0681 | 0.0316 | 0.054 | 0.1161 | 0.0104 |
| | OCRFlux | ✅ | 0.0870 | 0.0867 | 0.0818 | 0.1843 | 0.2072 | 0.1048 | 0.7304 | 0.1567 | 0.0193 |
| | Mistral-OCR | ✅ | 0.0917 | 0.0531 | 0.0610 | 0.1341 | 0.1341 | 0.0581 | 0.5643 | 0.3097 | 0.0523 |
| | POINTS-Reader | ✅ | 0.0334 | 0.0779 | 0.0671 | 0.1372 | 0.1901 | 0.1343 | 0.3789 | 0.0937 | 0.0951 |
| | olmOCR-7B | ✅ | 0.0497 | 0.0365 | 0.0539 | 0.1204 | 0.0728 | 0.0697 | 0.2916 | 0.122 | 0.0459 |
| | Nanonets-OCR-s | ✅ | 0.0551 | 0.0578 | 0.0606 | 0.0931 | 0.0834 | 0.0917 | 0.1965 | 0.1606 | 0.0395 |
| | dots.ocr | ✅ | 0.0290 | 0.0231 | 0.0433 | 0.0788 | 0.0467 | 0.0221 | 0.0667 | 0.1116 | 0.0076 |
| | OCRVerse | ✅ | 0.0260 | 0.0427 | 0.0412 | 0.0921 | 0.0507 | 0.0303 | 0.0982 | 0.0695 | 0.0064 |
End-to-end reading order evaluation on OmniDocBench: results across different column layout types using Normalized Edit Distance.
| Model | Single Column | Double Column | Three Column | Other Layout |
|---|---|---|---|---|
| OCRVerse | 0.022 | 0.042 | 0.09 | 0.16 |
The following table reports the text recognition performance (Edit Distance) of OCRVerse across diverse text attributes, including language, background, and rotation, offering deeper insight into its capabilities and limitations under different text properties.
| Model | Language: EN | Language: ZH | Language: Mixed | Background: White | Background: Single | Background: Multi | Rotation: Normal | Rotation: 270 | Rotation: Horizontal |
|---|---|---|---|---|---|---|---|---|---|
| OCRVerse | 0.077 | 0.084 | 0.062 | 0.081 | 0.068 | 0.080 | 0.078 | 0.968 | 0.232 |
OCRVerse-code is evaluated across key technical document and code generation benchmarks, including ChartMimic (direct v2), UniSVG-ISVGEN, Design2Code, Image2Latex (plot), and ChemDraw. The evaluation focuses on its ability to recognize, parse, and convert specialized content, such as charts, SVG graphics, design layouts, LaTeX plots, and chemical structures, into accurate, executable code or structured formats. Results demonstrate OCRVerse-code's strong versatility and reliability in visual-to-code conversion across diverse professional scenarios.
| Model | Parameters | ChartMimic Exec. Rate | ChartMimic Low-Level | ChartMimic High-Level | UniSVG Low-Level | UniSVG High-Level | UniSVG Score | Design2Code Low-Level | Design2Code High-Level | Image2Latex Ren. Succ. | Image2Latex EMS | ChemDraw Exec. Rate | ChemDraw Tani. Sim. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-Source Models | | | | | | | | | | | | | |
| Gemini-2.5-Pro | - | 97.3 | 88.7 | 83.8 | 53.6 | 80.3 | 69.6 | 90.8 | 91.4 | 74.3 | 52.5 | 77.3 | 2.8 |
| Claude-4.5-Sonnet | - | 97.8 | 89.6 | 82.9 | 61.0 | 83.4 | 74.6 | 90.4 | 90.8 | 72.7 | 50.2 | 95.3 | 41.7 |
| GPT-5 | - | 94.8 | 81.9 | 78.3 | 60.8 | 88.3 | 77.3 | 90.6 | 91.0 | 78.7 | 57.4 | 93.8 | 52.1 |
| Open-Source Models | | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 7B | 68.7 | 42.2 | 40.1 | 47.5 | 73.8 | 63.3 | 83.4 | 87.6 | 42.7 | 25.5 | 21.1 | 11.7 |
| Qwen3-VL-8B | 8B | 78.3 | 62.5 | 67.8 | 53.0 | 77.0 | 67.4 | 85.5 | 87.2 | 47.7 | 33.0 | 78.9 | 41.2 |
| InternVL3.5-8B | 8B | 66.7 | 46.0 | 48.3 | 55.0 | 78.0 | 68.6 | 85.8 | 87.3 | 58.3 | 40.5 | 49.2 | 7.8 |
| InternVL3.5-14B | 14B | 73.2 | 52.8 | 55.4 | 52.0 | 75.0 | 65.9 | 86.1 | 87.8 | 73.0 | 50.2 | 71.9 | 39.3 |
| Qwen3-VL-32B | 32B | 83.0 | 66.9 | 77.5 | 68.0 | 86.0 | 78.8 | 88.6 | 89.8 | 75.7 | 53.3 | 37.5 | 48.8 |
| InternVL3.5-38B | 38B | 79.0 | 60.0 | 71.8 | 51.9 | 77.3 | 67.1 | 87.8 | 88.4 | 72.6 | 49.5 | 55.5 | 31.4 |
| Qwen2.5-VL-72B | 72B | 88.5 | 72.7 | 79.1 | 47.7 | 76.0 | 64.7 | 86.9 | 88.7 | 62.0 | 41.7 | 75.8 | 28.0 |
| OCRVerse | 4B | 82.0 | 65.7 | 74.3 | 82.1 | 93.4 | 88.8 | 83.6 | 86.1 | 71.0 | 50.4 | 85.2 | 60.4 |
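For the ChemDraw columns, Tani. Sim. presumably denotes Tanimoto similarity between the predicted and reference molecular structures. The snippet below is a hedged sketch of such a computation with RDKit; the fingerprint radius and size are common defaults, not the benchmark's confirmed settings.

```python
# Illustrative sketch: Tanimoto similarity between predicted and reference molecules.
# Requires `pip install rdkit`; fingerprint parameters are generic defaults.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_similarity(pred_smiles: str, ref_smiles: str) -> float:
    pred, ref = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:  # an unparsable prediction scores 0
        return 0.0
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, 2, nBits=2048)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_ref)

print(tanimoto_similarity("CCO", "CCO"))  # identical molecules -> 1.0
```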
Below is a simple example of how to use OCRVerse-text for document parsing tasks.
Please first install transformers using the following command:
```bash
pip install "transformers>=4.57.0"
```

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
# Load model
model_path = 'DocTron/OCRVerse-text'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Prepare input with image and text
image_path = "./assets/ocrverse-text_test.jpg"
# We recommend using the following prompt for better performance, since it is used throughout the training process.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]
# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# $$
# r = \frac{\alpha}{\beta} \sin \beta (\sigma_1 \pm \sigma_2)
# $$
```

Below is a simple example of how to use OCRVerse-code for chart-to-code generation tasks. We also recommend utilizing SGLang for inference.

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
# Load model
model_path = 'DocTron/OCRVerse-code'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Prepare input with image and text
image_path = "./assets/chart2code_example.png"
prompt = "You are an expert Python developer who specializes in writing matplotlib code based on a given picture. I found a very nice picture in a STEM paper, but there is no corresponding source code available. I need your help to generate the Python code that can reproduce the picture based on the picture I provide.\nNote that it is necessary to use figsize=(7.0, 5.0) to set the image size to match the original size.\nNow, please give me the matplotlib code that reproduces the picture below."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]
# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

Example scripts for launching the SGLang server:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m sglang.launch_server \
--model-path DocTron/OCRVerse-code \
--host 0.0.0.0 \
--dist-init-addr 127.0.0.1:10002 \
--tp 4 \
  --port 6002
```
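Once the server is up, it can be queried through SGLang's OpenAI-compatible endpoint. The snippet below is a minimal sketch; the host, port, image path, and prompt must be adapted to your deployment.

```python
# Minimal sketch: query the SGLang server via its OpenAI-compatible API.
# Host/port follow the launch command above; the image path is a placeholder.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:6002/v1", api_key="EMPTY")

with open("./assets/chart2code_example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="DocTron/OCRVerse-code",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Generate matplotlib code that reproduces the chart in the image."},
        ],
    }],
    max_tokens=4096,
    temperature=0.0,
)
print(response.choices[0].message.content)
```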
If you want to continue training based on our model, you can use LLaMA-Factory. For installation and usage of LLaMA-Factory, please refer to its official documentation. A reference fine-tuning script with pre-specified parameters is provided below:

```bash
PROJECT_DIR=/path/to/llama_factory
cd ${PROJECT_DIR}
# Set parameters
GPUS_PER_NODE=8 # Number of GPUs per node
NNODES=1 # Total number of nodes
NODE_RANK=0 # Rank of the current node (starts from 0)
MASTER_ADDR=localhost # IP address of the master node
MASTER_PORT=12345 # Port for communication between nodes
MODEL_DIR=/path/to/ocrverse_text_model # Path to the pre-trained OCRVerse model
DATA=/name/of/your/dataset # Name/path of your custom dataset
OUTPUT_DIR=/path/to/output # Directory to save fine-tuned results
# Llama Factory-based fine-tuning script
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
src/train.py \
--model_name_or_path "$MODEL_DIR" \
--stage sft \
--do_train True \
--finetuning_type full \
--dataset "$DATA" \
--template qwen3_vl_nothink \
--cutoff_len 8192 \
--preprocessing_num_workers 128 \
--preprocessing_batch_size 256 \
--dataloader_num_workers 128 \
--output_dir "$OUTPUT_DIR" \
--logging_steps 1 \
--save_steps 5000 \
--plot_loss True \
--save_only_model False \
--report_to none \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--warmup_ratio 0.1 \
  --bf16 True
```

We sincerely appreciate LLaMA-Factory for providing the reference training framework.

