DocLayout-YOLO: Advancing Document Layout Analysis with Mesh-candidate Bestfit and Global-to-local perception
Official PyTorch implementation of DocLayout-YOLO.
Zhiyuan Zhao, Hengrui Kang, Bin Wang, Conghui He
Abstract
We introduce DocLayout-YOLO, which not only enhances accuracy but also preserves the speed advantage through optimization from the pre-training and model perspectives in a document-tailored manner. In terms of robust document pre-training, we innovatively regard document synthesis as a 2D bin-packing problem and introduce Mesh-candidate Bestfit, which enables the generation of large-scale, diverse document datasets. The model, pre-trained on the resulting DocSynth300K dataset, significantly enhances fine-tuning performance across a variety of document types. In terms of model enhancement for document understanding, we propose a Global-to-Local Controllable Receptive Module, which emulates the human visual process from global to local perspectives and features a controllable module for feature extraction and integration. Experimental results on extensive downstream datasets show that the proposed DocLayout-YOLO excels in both speed and accuracy.
1. PyPI installation
pip install doclayout-yolo
2. Prediction
We provide a model fine-tuned on DocStructBench for prediction, which is capable of handling various document types. The model can be downloaded from here, and example images can be found under assets/example.
Example code for prediction:
import cv2
from doclayout_yolo import YOLOv10
model = YOLOv10("path to provided model")  # load the provided fine-tuned model
det_res = model.predict(
    "image to predict",   # path to the image to predict
    imgsz=1024,           # prediction image size
    conf=0.2,             # prediction score threshold
    device="0",           # device to use
)
annotated_frame = det_res[0].plot(pil=True, line_width=5, font_size=20)
cv2.imwrite("result.jpg", annotated_frame)
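DocLayout-YOLO is built on ultralytics (see the acknowledgement at the end), so det_res should follow the standard ultralytics Results API. A minimal sketch for reading out the detected layout regions, assuming that API is preserved in this fork:
result = det_res[0]  # single image, single result
for box in result.boxes:
    cls_id = int(box.cls)                  # predicted class index
    label = result.names[cls_id]           # human-readable class name
    score = float(box.conf)                # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # absolute pixel coordinates
    print(f"{label}: {score:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")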
You can also use predict_single.py for prediction with custom inference settings. For batch processing, please refer to PDF-Extract-Kit.
3. Training and evaluation
A conda virtual environment is recommended:
conda create -n doclayout_yolo python=3.9
conda activate doclayout_yolo
pip install -r requirements.txt
pip install -e .
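A quick sanity check that the environment works (YOLOv10 is the top-level export used in the prediction example above):
# Verify that the editable install is importable.
from doclayout_yolo import YOLOv10
print("doclayout_yolo import OK:", YOLOv10.__name__)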
- Specify the data root path: find your Ultralytics config file (for Linux users, at $HOME/.config/Ultralytics/settings.yaml) and change datasets_dir to the project root path (a scripted alternative is sketched after the file structure below).
- Download the prepared YOLO-format D4LA and DocLayNet data from the links below and put them under ./layout_data:
| Dataset | Download |
|---|---|
| D4LA | link |
| DocLayNet | link |
The file structure is as follows:
./layout_data
├── D4LA
│ ├── images
│ ├── labels
│ ├── test.txt
│ └── train.txt
└── doclaynet
├── images
├── labels
├── val.txt
└── train.txt
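Since the data is in YOLO format, each .txt file under labels/ holds one line per layout region: class_id x_center y_center width height, with all coordinates normalized to [0, 1]. A hypothetical example line for a wide region near the top of a page:
3 0.512 0.274 0.830 0.112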
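For the data-root step above, the sketch below scripts the settings.yaml change instead of editing it by hand. It assumes the default Linux config location and that PyYAML is installed; the project-root path is a placeholder:
from pathlib import Path

import yaml  # assumes PyYAML is available

# Default Ultralytics config location on Linux (see the data-root step above).
cfg_path = Path.home() / ".config" / "Ultralytics" / "settings.yaml"
settings = yaml.safe_load(cfg_path.read_text())

# Point datasets_dir at the DocLayout-YOLO project root (placeholder path).
settings["datasets_dir"] = "/path/to/DocLayout-YOLO"
cfg_path.write_text(yaml.safe_dump(settings))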
Training is conducted on 8 GPUs with a global batch size of 64 (8 images per device). Detailed settings and checkpoints are as follows:
| Dataset | Model | DocSynth300K Pretrained? | imgsz | Learning rate | Finetune | Evaluation | AP50 | mAP | Checkpoint |
|---|---|---|---|---|---|---|---|---|---|
| D4LA | DocLayout-YOLO | ✗ | 1600 | 0.04 | command | command | 81.7 | 69.8 | checkpoint |
| D4LA | DocLayout-YOLO | ✓ | 1600 | 0.04 | command | command | 82.4 | 70.3 | checkpoint |
| DocLayNet | DocLayout-YOLO | ✗ | 1120 | 0.02 | command | command | 93.0 | 77.7 | checkpoint |
| DocLayNet | DocLayout-YOLO | ✓ | 1120 | 0.02 | command | command | 93.4 | 79.7 | checkpoint |
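The linked commands drive training and evaluation; as a rough Python-API equivalent, the sketch below mirrors the D4LA rows above (imgsz 1600, learning rate 0.04, global batch 64 across 8 GPUs). It assumes this fork keeps ultralytics' train() interface, and both the weight path and the d4la.yaml dataset config name are placeholders:
from doclayout_yolo import YOLOv10

# Start from the DocSynth300K-pretrained weights (placeholder path).
model = YOLOv10("docsynth300k_pretrained.pt")

# Settings mirror the D4LA rows: imgsz 1600, learning rate 0.04,
# and a global batch of 64 split over 8 GPUs (8 images per device).
model.train(
    data="d4la.yaml",  # placeholder dataset config
    imgsz=1600,
    lr0=0.04,
    batch=64,
    device="0,1,2,3,4,5,6,7",
)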
The DocSynth300K-pretrained model can be downloaded from here. For evaluation, change checkpoint.pt to the path of the model to be evaluated.
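Correspondingly, a minimal evaluation sketch assuming ultralytics' standard val() interface, with checkpoint.pt pointing at the model to be evaluated (the dataset config name is again a placeholder):
from doclayout_yolo import YOLOv10

# Load the checkpoint to be evaluated (replace with your trained weights).
model = YOLOv10("checkpoint.pt")

# Reports detection metrics such as the AP50 and mAP in the table above.
metrics = model.val(data="d4la.yaml", imgsz=1600, batch=64, device="0")
print(metrics.box.map50, metrics.box.map)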
The codebase is built on ultralytics and YOLO-v10. Thanks for their great work!