return_word_box parameter with unexpected behavior #17156

@denpawy

Description

🔎 Search before asking

  • I have searched the PaddleOCR Docs and found no similar bug report.
  • I have searched the PaddleOCR Issues and found no similar bug report.
  • I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (Problem Description)

I am building a processing pipeline that uses PaddleOCR as one of its steps, because the recognition quality I have seen so far is excellent.
I would like to use the return_word_box parameter of the PaddleOCR Python package (https://github.com/PaddlePaddle/PaddleOCR).

I created a test PDF; for now I am focusing on German documents.

The issue is that, during word segmentation, PaddleOCR splits email addresses and words containing German diacritics (ä, ö, ü, ß) into several separate tokens, even though the recognized line text itself is correct.

The following are examples from the first-page result of PaddleOCR(use_angle_cls=True, lang="de", use_doc_unwarping=False, det_limit_side_len=4096, det_limit_type="max", return_word_box=True).ocr(img_path):
Value in rec_texts (index 7): 'Mit freundlichen Grüßen'
Value in text_word (index 7): ['Mit', ' ', 'freundlichen', ' ', 'Gr', 'üß', 'en']

Value in rec_texts (index 32): 'Manchmal begegnet man auch ungewöhnlichen Adressen wie alex.k-93@devmail.io oder'
Value in text_word (index 32): ['Manchmal', ' ', 'begegnet', ' ', 'man', ' ', 'auch', ' ', 'ungew', 'ö', 'hnlichen', ' ', 'Adressen', ' ', 'wie', ' ', 'alex', '.', 'k-93', '@', 'devmail', '.', 'io', ' ', 'oder']
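
As a possible interim workaround (a minimal sketch, not a fix for the underlying segmentation), the fragments can be merged back together on the caller's side. It assumes that whitespace-only entries in text_word are the only real word separators, as in the two outputs above. The function name merge_word_fragments and the boxes argument are hypothetical: the box format (x_min, y_min, x_max, y_max) and the idea that boxes run parallel to the tokens, including the whitespace entries, are assumptions and need to be checked against the actual result structure.

```python
def merge_word_fragments(tokens, boxes=None):
    """Merge consecutive non-whitespace tokens into whole words.

    tokens: list of strings as found in text_word; whitespace-only
            entries are treated as word separators.
    boxes:  optional list parallel to tokens holding axis-aligned boxes
            as (x_min, y_min, x_max, y_max). This format is an assumption.
    """
    words = []
    merged_boxes = []
    current = ""
    current_box = None

    for idx, token in enumerate(tokens):
        if token.strip() == "":
            # A whitespace token closes the word collected so far.
            if current:
                words.append(current)
                merged_boxes.append(current_box)
            current, current_box = "", None
            continue

        current += token
        if boxes is not None:
            x0, y0, x1, y1 = boxes[idx]
            if current_box is None:
                current_box = (x0, y0, x1, y1)
            else:
                # Grow the box so that it encloses every fragment of the word.
                current_box = (
                    min(current_box[0], x0),
                    min(current_box[1], y0),
                    max(current_box[2], x1),
                    max(current_box[3], y1),
                )

    # Flush the last word if the line does not end with whitespace.
    if current:
        words.append(current)
        merged_boxes.append(current_box)

    return (words, merged_boxes) if boxes is not None else words


if __name__ == "__main__":
    print(merge_word_fragments(["Mit", " ", "freundlichen", " ", "Gr", "üß", "en"]))
```

Note that this also glues trailing punctuation to the preceding word, which may or may not be what a downstream step expects.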

I consider this a bug, which is why I did not post it as a Q&A; please correct me if I am wrong.

Here is the PDF, which I converted into one image per page and then passed to the ocr method of PaddleOCR:
email_t_3.pdf

🏃‍♂️ Environment (Runtime Environment)

OS: Windows 11 Enterprise
OS build: 26200.7171
Environment: FastAPI
CPU: 13th Gen Intel(R) Core(TM) i9-13980HX (2.20 GHz)
RAM: 64 GB
CUDA: None
Install: Poetry/pip
Python: 3.12
PaddleOCR: 3.3.1

🌰 Minimal Reproducible Example (Minimal Demo That Reproduces the Problem)


import os
import tempfile

import fitz  # PyMuPDF
from paddleocr import PaddleOCR


def test_word_splitting(pdf_path):
    """Render each PDF page to a PNG, run OCR on it, and print every
    recognized line next to its word-level segmentation."""
    ocr = PaddleOCR(
        use_angle_cls=True,
        lang="de",
        use_doc_unwarping=False,
        det_limit_side_len=4096,
        det_limit_type="max",
        return_word_box=True
    )

    pdf_document = fitz.open(pdf_path)

    # Write page images into a temporary directory. Saving to the name of a
    # still-open NamedTemporaryFile fails on Windows, and the file would be
    # deleted again as soon as its context manager exits.
    with tempfile.TemporaryDirectory() as temp_dir:
        for page_num in range(pdf_document.page_count):
            page = pdf_document[page_num]
            pix = page.get_pixmap()

            image_path = os.path.join(temp_dir, f"page_{page_num}.png")
            pix.save(image_path)

            result = ocr.predict(image_path)
            for page_result in result:
                if page_result is None:
                    continue

                for i, rec_text in enumerate(page_result["rec_texts"]):
                    word_boxes = page_result["text_word"][i]  # Word-level tokens for this line

                    print(f"Index:         {i}")
                    print(f"Original text: {rec_text}")
                    print(f"Word boxes:    {word_boxes}")
                    print("-" * 50)

    pdf_document.close()


if __name__ == "__main__":
    pdf_path = "./email_t_3.pdf"
    test_word_splitting(pdf_path)
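
To narrow down which lines are affected, the printed output can be checked automatically. The following is a rough sketch with a hypothetical helper, find_split_words, that compares the whitespace-separated words of a rec_texts entry against the matching text_word entry and reports every word that does not appear as a single token. It assumes both values have the shapes shown in the examples above (a string and a flat list of strings).

```python
def find_split_words(rec_text, tokens):
    """Return the words of rec_text that do not occur as one token.

    rec_text: recognized line text, e.g. an entry of rec_texts.
    tokens:   the matching text_word entry (flat list of strings).
    """
    # Ignore the whitespace-only separator tokens.
    token_set = {token for token in tokens if token.strip()}
    return [word for word in rec_text.split() if word not in token_set]


if __name__ == "__main__":
    line = "Mit freundlichen Grüßen"
    segmentation = ["Mit", " ", "freundlichen", " ", "Gr", "üß", "en"]
    print(find_split_words(line, segmentation))
```

Calling it inside the inner loop of the example, with rec_text and page_result["text_word"][i], would list only the words that were broken apart instead of printing every line.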

Labels

task/inference: Related to model inference or prediction.
