DocLayout-YOLO: Advancing Document Layout Analysis with Mesh-candidate Bestfit and Global-to-local perception
Official PyTorch implementation of DocLayout-YOLO.
Zhiyuan Zhao, Hengrui Kang, Bin Wang, Conghui He
Abstract
We introduce DocLayout-YOLO, which not only enhances accuracy but also preserves the speed advantage through optimization from the pre-training and model perspectives in a document-tailored manner. In terms of robust document pre-training, we innovatively regard document synthesis as a 2D bin-packing problem and introduce Mesh-candidate Bestfit, which enables the generation of large-scale, diverse document datasets. The model, pre-trained on the resulting DocSynth300K dataset, significantly enhances fine-tuning performance across a variety of document types. In terms of model enhancement for document understanding, we propose a Global-to-Local Controllable Receptive Module, which emulates the human visual process from global to local perspectives and features a controllable module for feature extraction and integration. Experimental results on extensive downstream datasets show that the proposed DocLayout-YOLO excels in both speed and accuracy.
1. PyPI installation
pip install doclayout-yolo
2. Prediction
We provide a model fine-tuned on DocStructBench for prediction, which is capable of handling various document types. The model can be downloaded from here, and example images can be found under assets/example.
Example code for prediction:
import cv2
from doclayout_yolo import YOLOv10
model = YOLOv10("path to provided model")  # load the provided fine-tuned model
det_res = model.predict(
    "image to predict",   # path to the image to predict
    imgsz=1024,           # prediction image size
    conf=0.2,             # prediction score threshold
    device="0",           # device to use
)
annotated_frame = det_res[0].plot(pil=True, line_width=5, font_size=20)
cv2.imwrite("result.jpg", annotated_frame)
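DocLayout-YOLO is built on ultralytics (see the acknowledgement at the end), so det_res should follow the standard ultralytics Results API. A minimal sketch for reading out the detected layout regions, assuming that API is preserved in this fork:
result = det_res[0]  # single image, single result
for box in result.boxes:
    cls_id = int(box.cls)                  # predicted class index
    label = result.names[cls_id]           # human-readable class name
    score = float(box.conf)                # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # absolute pixel coordinates
    print(f"{label}: {score:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")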
You can also use predict_single.py for prediction with custom inference settings. For batch processing, please refer to PDF-Extract-Kit.
3. Training and evaluation
A conda virtual environment is recommended:
conda create -n doclayout_yolo python=3.9
conda activate doclayout_yolo
pip install -r requirements.txt
pip install -e .
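A quick sanity check that the environment works (YOLOv10 is the top-level export used in the prediction example above):
# Verify that the editable install is importable.
from doclayout_yolo import YOLOv10
print("doclayout_yolo import OK:", YOLOv10.__name__)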
- Specify the data root path: find your Ultralytics config file (for Linux users, at $HOME/.config/Ultralytics/settings.yaml) and change datasets_dir to the project root path (a scripted alternative is sketched after the file structure below).
- Download the prepared YOLO-format D4LA and DocLayNet data from the links below and put them under ./layout_data:
| Dataset | Download |
|---|---|
| D4LA | link |
| DocLayNet | link |
The file structure is as follows:
./layout_data
├── D4LA
│ ├── images
│ ├── labels
│ ├── test.txt
│ └── train.txt
└── doclaynet
├── images
├── labels
├── val.txt
└── train.txt
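Since the data is in YOLO format, each .txt file under labels/ holds one line per layout region: class_id x_center y_center width height, with all coordinates normalized to [0, 1]. A hypothetical example line for a wide region near the top of a page:
3 0.512 0.274 0.830 0.112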
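For the data-root step above, the sketch below scripts the settings.yaml change instead of editing it by hand. It assumes the default Linux config location and that PyYAML is installed; the project-root path is a placeholder:
from pathlib import Path

import yaml  # assumes PyYAML is available

# Default Ultralytics config location on Linux (see the data-root step above).
cfg_path = Path.home() / ".config" / "Ultralytics" / "settings.yaml"
settings = yaml.safe_load(cfg_path.read_text())

# Point datasets_dir at the DocLayout-YOLO project root (placeholder path).
settings["datasets_dir"] = "/path/to/DocLayout-YOLO"
cfg_path.write_text(yaml.safe_dump(settings))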
Training is conducted on 8 GPUs with a global batch size of 64 (8 images per device). Detailed settings and checkpoints are as follows:
| Dataset | Model | DocSynth300K Pretrained? | imgsz | Learning rate | Finetune | Evaluation | AP50 | mAP | Checkpoint |
|---|---|---|---|---|---|---|---|---|---|
| D4LA | DocLayout-YOLO | ✗ | 1600 | 0.04 | command | command | 81.7 | 69.8 | checkpoint |
| D4LA | DocLayout-YOLO | ✓ | 1600 | 0.04 | command | command | 82.4 | 70.3 | checkpoint |
| DocLayNet | DocLayout-YOLO | ✗ | 1120 | 0.02 | command | command | 93.0 | 77.7 | checkpoint |
| DocLayNet | DocLayout-YOLO | ✓ | 1120 | 0.02 | command | command | 93.4 | 79.7 | checkpoint |
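The linked commands drive training and evaluation; as a rough Python-API equivalent, the sketch below mirrors the D4LA rows above (imgsz 1600, learning rate 0.04, global batch 64 across 8 GPUs). It assumes this fork keeps ultralytics' train() interface, and both the weight path and the d4la.yaml dataset config name are placeholders:
from doclayout_yolo import YOLOv10

# Start from the DocSynth300K-pretrained weights (placeholder path).
model = YOLOv10("docsynth300k_pretrained.pt")

# Settings mirror the D4LA rows: imgsz 1600, learning rate 0.04,
# and a global batch of 64 split over 8 GPUs (8 images per device).
model.train(
    data="d4la.yaml",  # placeholder dataset config
    imgsz=1600,
    lr0=0.04,
    batch=64,
    device="0,1,2,3,4,5,6,7",
)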
The DocSynth300K-pretrained model can be downloaded from here. For evaluation, change checkpoint.pt to the path of the model to be evaluated.
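Correspondingly, a minimal evaluation sketch assuming ultralytics' standard val() interface, with checkpoint.pt pointing at the model to be evaluated (the dataset config name is again a placeholder):
from doclayout_yolo import YOLOv10

# Load the checkpoint to be evaluated (replace with your trained weights).
model = YOLOv10("checkpoint.pt")

# Reports detection metrics such as the AP50 and mAP in the table above.
metrics = model.val(data="d4la.yaml", imgsz=1600, batch=64, device="0")
print(metrics.box.map50, metrics.box.map)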
The codebase is built on ultralytics and YOLO-v10. Thanks for their great work!