We introduce OCRVerse, which advances traditional document OCR toward next-generation holistic OCR through comprehensive data and methodological practices. OCRVerse not only recognizes traditional optical characters, but also parses complex visual symbols into code-level representations, enabling broad applications across domains including statistics, office documents, mathematics, chemistry, and physics. To this end, we construct a large-scale interdisciplinary dataset spanning heterogeneous data sources, with innovative practices in data rendering and model-based synthesis. Based on this, we develop an end-to-end lightweight vision-language model (built on Qwen3-VL 4B) with two specialized variants: OCRVerse-text, dedicated to character-level output, and OCRVerse-code, specialized in code-level output. We conduct extensive experiments to validate the effectiveness of our approach and reveal the potential of holistic OCR. Experimental results show that our method achieves an overall score of 87.9 on OmniDocBench, competitive with state-of-the-art end-to-end VLMs. Beyond documents, our method demonstrates comprehensive advances on a wider range of charts, web pages, SVGs, molecular formulas, and circuit diagrams, taking a key step towards holistic OCR applications.
- 2025.11.03: We upload our model weights OCRVerse-code to HuggingFace.
- 2025.10.27: We upload our model weights OCRVerse-text to HuggingFace.
| Model | Download Link |
|---|---|
| OCRVerse-text | DocTron/OCRVerse-text |
| OCRVerse-code | DocTron/OCRVerse-code |
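The weights can also be fetched programmatically. The snippet below is a minimal sketch using `huggingface_hub`; the `local_dir` paths are arbitrary examples, not required locations.

```python
# Minimal sketch: download the released weights from HuggingFace.
# Requires `pip install huggingface_hub`; local_dir values are example paths.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="DocTron/OCRVerse-text", local_dir="./OCRVerse-text")
snapshot_download(repo_id="DocTron/OCRVerse-code", local_dir="./OCRVerse-code")
```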
OCRVerse encompasses both text-level and code-level data sources, comprehensively supporting the data requirements of holistic OCR.
- The text-level data sources span nine scenario types: natural scenes, books, magazines, papers, reports, slides, exam papers, notes, and newspapers. These categories cover high-frequency, everyday text carriers, fulfill fundamental OCR needs, and avoid both scenario redundancy and coverage gaps.
- The code-level data sources comprise six scenario types: charts, webpages, icons, geometry, circuits, and molecules. These focus on professional structured scenarios and address gaps not covered by text-level categories.
Our training dataset is constructed through a systematic multi-stage pipeline that integrates both text-level and code-level data sources to ensure comprehensive coverage and high quality.
Text-level data construction. To build a multi-scenario, multi-type document OCR dataset, we combine open-source and self-built data to balance scale and quality.
- Open-source data provides low-cost, large-scale coverage but suffers from uneven quality due to scattered sources and the lack of unified annotation standards; we employ VLMs for quality optimization to improve usability.
- To address gaps in real-world scenarios, self-built data serves as a key supplement (a minimal rendering sketch follows this list):
  - We collect real PDF documents that match practical layouts, fonts, colors, and resolutions, with VLM-powered precise annotation.
  - We crawl public, high-quality online documents and convert them to images via browser rendering to enrich data types and expand scenario coverage.
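The exact rendering stack is not specified here; the snippet below is only an illustrative sketch of the browser-rendering step described above, using Playwright as an example choice. The URL, viewport, and output path are placeholders.

```python
# Illustrative sketch: render a crawled HTML document to a training image.
# Assumes `pip install playwright` and `playwright install chromium`;
# URL and output path are placeholders, not part of the released pipeline.
from playwright.sync_api import sync_playwright

def render_page_to_image(url: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1656})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_path, full_page=True)  # save the rendered page as an image
        browser.close()

render_page_to_image("https://example.com/doc.html", "./rendered_doc.png")
```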
Code-level data construction. We begin by curating a diverse corpus from open-source datasets through rigorous filtering and diversity-aware sampling. Subsequently, we employ specialized VLMs for high-quality re-annotation to ensure label accuracy and consistency. Finally, we enhance the data through execution validation and rendering processes to generate executable code-image pairs.
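As a rough illustration of the execution-validation step (the actual pipeline is not released), one can run each candidate code snippet in a subprocess and keep only samples that execute cleanly and render an image. The helper below is a hypothetical sketch for matplotlib-style code.

```python
# Illustrative sketch of execution validation for code-image pairs.
# A candidate matplotlib script is kept only if it runs and writes an image file.
import os
import subprocess
import tempfile

def validate_and_render(code: str, out_image: str, timeout: int = 30) -> bool:
    """Return True if `code` executes successfully and produces `out_image`."""
    wrapped = code + f"\nimport matplotlib.pyplot as plt\nplt.savefig({out_image!r})\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(wrapped)
        script = f.name
    try:
        result = subprocess.run(["python", script], capture_output=True, timeout=timeout)
        return result.returncode == 0 and os.path.exists(out_image)
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(script)
```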
OCRVerse-text is evaluated on OmniDocBench v1.5, a comprehensive document OCR benchmark covering diverse real-world scenarios (e.g., office documents, academic papers, scanned materials). Results show OCRVerse-text delivers competitive performance, demonstrating strong adaptability to practical document OCR demands.
End-to-end evaluation assesses the model's accuracy in parsing full PDF pages: the model's Markdown output for the entire page is used as the prediction. The Overall score aggregates the text, formula, table, and reading-order metrics reported in the table below.
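The text and reading-order columns are normalized edit distances (lower is better). The snippet below is a minimal, self-contained sketch of such a metric; the official OmniDocBench implementation may differ in preprocessing and normalization details.

```python
# Minimal sketch of a normalized edit distance: Levenshtein distance divided by
# the longer sequence length. The official benchmark code may normalize differently.
def normalized_edit_distance(pred: str, ref: str) -> float:
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (pred[i - 1] != ref[j - 1]))   # substitution
            prev = cur
    return dp[n] / max(m, n)

print(normalized_edit_distance("hello world", "hallo world"))  # ~0.09
```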
| Model Type | Methods | Release Date | End-to-End | Parameters | Overall↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Reading Order Edit↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | 2025 | ❌ | - | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
| | MinerU2-pipeline | 2025 | ❌ | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
| | PP-StructureV3 | 2024 | ❌ | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| General VLMs | GPT-4o | 2024 | ✅ | - | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
| | InternVL3-76B | 2025 | ✅ | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| | InternVL3.5-241B | 2025 | ✅ | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| | Qwen2.5-VL-72B | 2025 | ✅ | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| | Gemini-2.5 Pro | 2025 | ✅ | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| Specialized VLMs | Dolphin | 2025.05 | ❌ | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| | MinerU2-VLM | 2025.06 | ❌ | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| | MonkeyOCR-pro-1.2B | 2025.07 | ❌ | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| | MonkeyOCR-3B | 2025.06 | ❌ | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| | MonkeyOCR-pro-3B | 2025.07 | ❌ | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| | MinerU2.5 | 2025.09 | ❌ | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| | PaddleOCR-VL | 2025.10 | ❌ | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 |
| | OCRFlux-3B | 2025.06 | ✅ | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| | Mistral OCR | 2025.03 | ✅ | - | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| | POINTS-Reader | 2025.08 | ✅ | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| | olmOCR-7B | 2025.02 | ✅ | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| | Nanonets-OCR-s | 2025.06 | ✅ | 3B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| | Deepseek-OCR | 2025.10 | ✅ | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.80 | 0.086 |
| | dots.ocr | 2025.07 | ✅ | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| | OCRVerse | 2025.10 | ✅ | 4B | 88.65 | 0.051 | 88.38 | 82.67 | 86.63 | 0.062 |
The following table reports the text recognition performance (Edit Distance) of OCRVerse across nine document types, offering deeper insight into its capabilities and limitations in different real-world document scenarios.
| Model Type | Models | End-to-End | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | ❌ | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
| | MinerU2-pipeline | ❌ | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.395 | 0.0736 | 0.2603 | 0.0411 |
| | PP-StructureV3 | ❌ | 0.0794 | 0.0236 | 0.0415 | 0.1107 | 0.0945 | 0.0722 | 0.0617 | 0.1236 | 0.0181 |
| General VLMs | GPT-4o | ✅ | 0.1019 | 0.1203 | 0.1288 | 0.1599 | 0.1939 | 0.142 | 0.6254 | 0.2611 | 0.3343 |
| | InternVL3-76B | ✅ | 0.0349 | 0.1052 | 0.0629 | 0.0827 | 0.1007 | 0.0406 | 0.5826 | 0.0924 | 0.0665 |
| | InternVL3.5-241B | ✅ | 0.0475 | 0.0857 | 0.0237 | 0.1061 | 0.0933 | 0.0577 | 0.6403 | 0.1357 | 0.1117 |
| | Qwen2.5-VL-72B | ✅ | 0.0422 | 0.0801 | 0.0586 | 0.1146 | 0.0681 | 0.0964 | 0.238 | 0.1232 | 0.0264 |
| | Gemini-2.5 Pro | ✅ | 0.0326 | 0.0182 | 0.0694 | 0.1618 | 0.0937 | 0.0161 | 0.1347 | 0.1169 | 0.0169 |
| Specialized VLMs | Dolphin | ❌ | 0.0957 | 0.0453 | 0.0616 | 0.1333 | 0.1684 | 0.0702 | 0.2388 | 0.2561 | 0.0186 |
| | MinerU2-VLM | ❌ | 0.0745 | 0.0104 | 0.0357 | 0.1276 | 0.0698 | 0.0652 | 0.1831 | 0.0803 | 0.0236 |
| | MonkeyOCR-pro-1.2B | ❌ | 0.0961 | 0.0354 | 0.053 | 0.111 | 0.0887 | 0.0494 | 0.0995 | 0.1686 | 0.0198 |
| | MonkeyOCR-pro-3B | ❌ | 0.0904 | 0.0362 | 0.0489 | 0.1072 | 0.0745 | 0.0475 | 0.0962 | 0.1165 | 0.0196 |
| | MinerU2.5 | ❌ | 0.0294 | 0.0235 | 0.0332 | 0.0499 | 0.0681 | 0.0316 | 0.054 | 0.1161 | 0.0104 |
| | OCRFlux | ✅ | 0.0870 | 0.0867 | 0.0818 | 0.1843 | 0.2072 | 0.1048 | 0.7304 | 0.1567 | 0.0193 |
| | Mistral-OCR | ✅ | 0.0917 | 0.0531 | 0.0610 | 0.1341 | 0.1341 | 0.0581 | 0.5643 | 0.3097 | 0.0523 |
| | POINTS-Reader | ✅ | 0.0334 | 0.0779 | 0.0671 | 0.1372 | 0.1901 | 0.1343 | 0.3789 | 0.0937 | 0.0951 |
| | olmOCR-7B | ✅ | 0.0497 | 0.0365 | 0.0539 | 0.1204 | 0.0728 | 0.0697 | 0.2916 | 0.122 | 0.0459 |
| | Nanonets-OCR-s | ✅ | 0.0551 | 0.0578 | 0.0606 | 0.0931 | 0.0834 | 0.0917 | 0.1965 | 0.1606 | 0.0395 |
| | dots.ocr | ✅ | 0.0290 | 0.0231 | 0.0433 | 0.0788 | 0.0467 | 0.0221 | 0.0667 | 0.1116 | 0.0076 |
| | OCRVerse | ✅ | 0.0260 | 0.0427 | 0.0412 | 0.0921 | 0.0507 | 0.0303 | 0.0982 | 0.0695 | 0.0064 |
End-to-end reading order evaluation on OmniDocBench: results across different column layout types using Normalized Edit Distance.
| Model | Single Column | Double Column | Three Column | Other Layout |
|---|---|---|---|---|
| OCRVerse | 0.022 | 0.042 | 0.09 | 0.16 |
The following table reports the text recognition performance (Edit Distance) of OCRVerse across diverse text attributes, including language, background, and rotation, offering deeper insight into its capabilities and limitations under different text properties.
| Model | Language: EN | Language: ZH | Language: Mixed | Background: White | Background: Single | Background: Multi | Rotation: Normal | Rotation: 270 | Rotation: Horizontal |
|---|---|---|---|---|---|---|---|---|---|
| OCRVerse | 0.077 | 0.084 | 0.062 | 0.081 | 0.068 | 0.080 | 0.078 | 0.968 | 0.232 |
OCRVerse-code is evaluated across key technical document and code generation benchmarks, including ChartMimic (direct v2), UniSVG-ISVGEN, Design2Code, Image2Latex (plot), and ChemDraw. The evaluation focuses on its ability to recognize, parse, and convert specialized content, such as charts, SVG graphics, design layouts, LaTeX plots, and chemical structures, into accurate, executable code or structured formats. Results demonstrate OCRVerse-code's strong versatility and reliability in visual-to-code conversion across diverse professional scenarios.
| Model | Parameters | ChartMimic Exec. Rate | ChartMimic Low-Level | ChartMimic High-Level | UniSVG Low-Level | UniSVG High-Level | UniSVG Score | Design2Code Low-Level | Design2Code High-Level | Image2Latex Ren. Succ. | Image2Latex EMS | ChemDraw Exec. Rate | ChemDraw Tani. Sim. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-Source Models | | | | | | | | | | | | | |
| Gemini-2.5-Pro | - | 97.3 | 88.7 | 83.8 | 53.6 | 80.3 | 69.6 | 90.8 | 91.4 | 74.3 | 52.5 | 77.3 | 2.8 |
| Claude-4.5-Sonnet | - | 97.8 | 89.6 | 82.9 | 61.0 | 83.4 | 74.6 | 90.4 | 90.8 | 72.7 | 50.2 | 95.3 | 41.7 |
| GPT-5 | - | 94.8 | 81.9 | 78.3 | 60.8 | 88.3 | 77.3 | 90.6 | 91.0 | 78.7 | 57.4 | 93.8 | 52.1 |
| Open-Source Models | | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 7B | 68.7 | 42.2 | 40.1 | 47.5 | 73.8 | 63.3 | 83.4 | 87.6 | 42.7 | 25.5 | 21.1 | 11.7 |
| Qwen3-VL-8B | 8B | 78.3 | 62.5 | 67.8 | 53.0 | 77.0 | 67.4 | 85.5 | 87.2 | 47.7 | 33.0 | 78.9 | 41.2 |
| InternVL3.5-8B | 8B | 66.7 | 46.0 | 48.3 | 55.0 | 78.0 | 68.6 | 85.8 | 87.3 | 58.3 | 40.5 | 49.2 | 7.8 |
| InternVL3.5-14B | 14B | 73.2 | 52.8 | 55.4 | 52.0 | 75.0 | 65.9 | 86.1 | 87.8 | 73.0 | 50.2 | 71.9 | 39.3 |
| Qwen3-VL-32B | 32B | 83.0 | 66.9 | 77.5 | 68.0 | 86.0 | 78.8 | 88.6 | 89.8 | 75.7 | 53.3 | 37.5 | 48.8 |
| InternVL3.5-38B | 38B | 79.0 | 60.0 | 71.8 | 51.9 | 77.3 | 67.1 | 87.8 | 88.4 | 72.6 | 49.5 | 55.5 | 31.4 |
| Qwen2.5-VL-72B | 72B | 88.5 | 72.7 | 79.1 | 47.7 | 76.0 | 64.7 | 86.9 | 88.7 | 62.0 | 41.7 | 75.8 | 28.0 |
| OCRVerse | 4B | 82.0 | 65.7 | 74.3 | 82.1 | 93.4 | 88.8 | 83.6 | 86.1 | 71.0 | 50.4 | 85.2 | 60.4 |
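For the ChemDraw columns, Tani. Sim. presumably denotes Tanimoto similarity between the predicted and reference molecular structures. The snippet below is a hedged sketch of such a computation with RDKit; the fingerprint radius and size are common defaults, not the benchmark's confirmed settings.

```python
# Illustrative sketch: Tanimoto similarity between predicted and reference molecules.
# Requires `pip install rdkit`; fingerprint parameters are generic defaults.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_similarity(pred_smiles: str, ref_smiles: str) -> float:
    pred, ref = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:  # an unparsable prediction scores 0
        return 0.0
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, 2, nBits=2048)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_ref)

print(tanimoto_similarity("CCO", "CCO"))  # identical molecules -> 1.0
```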
Below is a simple example of how to use OCRVerse-text for document parsing tasks.
Please first install transformers using the following command:
```bash
pip install "transformers>=4.57.0"
```

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
# Load model
model_path = 'DocTron/OCRVerse-text'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Prepare input with image and text
image_path = "./assets/ocrverse-text_test.jpg"
# We recommend using the following prompt for better performance, since it is used throughout the training process.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]
# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# $$
# r = \frac{\alpha}{\beta} \sin \beta (\sigma_1 \pm \sigma_2)
# $$
```

Below is a simple example of how to use OCRVerse-code for chart-to-code generation tasks. We also recommend utilizing SGLang for inference.

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
# Load model
model_path = 'DocTron/OCRVerse-code'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Prepare input with image and text
image_path = "./assets/chart2code_example.png"
prompt = "You are an expert Python developer who specializes in writing matplotlib code based on a given picture. I found a very nice picture in a STEM paper, but there is no corresponding source code available. I need your help to generate the Python code that can reproduce the picture based on the picture I provide.\nNote that it is necessary to use figsize=(7.0, 5.0) to set the image size to match the original size.\nNow, please give me the matplotlib code that reproduces the picture below."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]
# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

Example scripts for launching the SGLang server:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m sglang.launch_server \
--model-path DocTron/OCRVerse-code \
--host 0.0.0.0 \
--dist-init-addr 127.0.0.1:10002 \
--tp 4 \
  --port 6002
```
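Once the server is up, it can be queried through SGLang's OpenAI-compatible endpoint. The snippet below is a minimal sketch; the host, port, image path, and prompt must be adapted to your deployment.

```python
# Minimal sketch: query the SGLang server via its OpenAI-compatible API.
# Host/port follow the launch command above; the image path is a placeholder.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:6002/v1", api_key="EMPTY")

with open("./assets/chart2code_example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="DocTron/OCRVerse-code",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Generate matplotlib code that reproduces the chart in the image."},
        ],
    }],
    max_tokens=4096,
    temperature=0.0,
)
print(response.choices[0].message.content)
```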
If you want to continue training based on our model, you can use LLaMA-Factory. For installation and usage of LLaMA-Factory, please refer to its official documentation. A reference fine-tuning script with pre-specified parameters is provided below:

```bash
PROJECT_DIR=/path/to/llama_factory
cd ${PROJECT_DIR}
# Set parameters
GPUS_PER_NODE=8 # Number of GPUs per node
NNODES=1 # Total number of nodes
NODE_RANK=0 # Rank of the current node (starts from 0)
MASTER_ADDR=localhost # IP address of the master node
MASTER_PORT=12345 # Port for communication between nodes
MODEL_DIR=/path/to/ocrverse_text_model # Path to the pre-trained OCRVerse model
DATA=/name/of/your/dataset # Name/path of your custom dataset
OUTPUT_DIR=/path/to/output # Directory to save fine-tuned results
# Llama Factory-based fine-tuning script
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
src/train.py \
--model_name_or_path "$MODEL_DIR" \
--stage sft \
--do_train True \
--finetuning_type full \
--dataset "$DATA" \
--template qwen3_vl_nothink \
--cutoff_len 8192 \
--preprocessing_num_workers 128 \
--preprocessing_batch_size 256 \
--dataloader_num_workers 128 \
--output_dir "$OUTPUT_DIR" \
--logging_steps 1 \
--save_steps 5000 \
--plot_loss True \
--save_only_model False \
--report_to none \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--warmup_ratio 0.1 \
  --bf16 True
```

We sincerely appreciate LLaMA-Factory for providing the reference training framework.

