return_word_box parameter with unexpected behavior

### 🔎 Search before asking

- [x] I have searched the PaddleOCR [Docs](https://paddlepaddle.github.io/PaddleOCR/) and found no similar bug report.
- [x] I have searched the PaddleOCR [Issues](https://github.com/PaddlePaddle/PaddleOCR/issues) and found no similar bug report.
- [x] I have searched the PaddleOCR [Discussions](https://github.com/PaddlePaddle/PaddleOCR/discussions) and found no similar bug report.

### 🐛 Bug (Problem Description)

I am building a processing pipeline and use PaddleOCR as one of its steps because of the excellent result quality I have seen so far.
I am now trying to use the `return_word_box` parameter of the PaddleOCR Python package (https://github.com/PaddlePaddle/PaddleOCR).

I created a test PDF; for now I am focusing on German documents.

The issue is that, during word segmentation, PaddleOCR splits email addresses and words containing German diacritics into several separate tokens.

The following examples are from the first-page result of the `PaddleOCR(use_angle_cls=True, lang="de", use_doc_unwarping=False, det_limit_side_len=4096, det_limit_type="max", return_word_box=True).ocr(img_path)` method:

- Value in `rec_texts` (index 7): 'Mit freundlichen **Grüßen**'
- Value in `text_word` (index 7): ['Mit', ' ', 'freundlichen', ' ', **'Gr', 'üß', 'en'**]
- Value in `rec_texts` (index 32): 'Manchmal begegnet man auch **ungewöhnlichen** Adressen wie **alex.k-93@devmail.io** oder'
- Value in `text_word` (index 32): ['Manchmal', ' ', 'begegnet', ' ', 'man', ' ', 'auch', ' ', **'ungew', 'ö', 'hnlichen'**, ' ', 'Adressen', ' ', 'wie', ' ', **'alex', '.', 'k-93', '@', 'devmail', '.', 'io'**, ' ', 'oder']
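
To illustrate the grouping I would have expected, here is a minimal sketch of a post-processing step. It is not part of PaddleOCR: `merge_word_tokens` is a hypothetical helper, and it assumes that `text_word` always marks word boundaries with whitespace-only tokens, as in the output above.

```python
def merge_word_tokens(tokens):
    """Join consecutive non-whitespace tokens into words.

    Assumption: whitespace-only tokens are the only word separators
    in the text_word list, as seen in the example output.
    """
    words = []
    current = ""
    for token in tokens:
        if token.isspace():
            # A whitespace token closes the word collected so far.
            if current:
                words.append(current)
                current = ""
        else:
            current += token
    if current:
        words.append(current)
    return words


tokens = ["Mit", " ", "freundlichen", " ", "Gr", "üß", "en"]
print(merge_word_tokens(tokens))  # ['Mit', 'freundlichen', 'Grüßen']
```

This only repairs the text side. The corresponding word boxes would still need to be merged in the same way, which this sketch does not attempt.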

I consider this a bug, so I did not file it as a Q&A; please correct me if I am wrong.

Here is my PDF, which I converted into one image per page and then passed to the `ocr` method of PaddleOCR:
[email_t_3.pdf](https://github.com/user-attachments/files/23676028/email_t_3.pdf)

### 🏃‍♂️ Environment (Runtime Environment)

```
OS: Windows 11 Enterprise
OS build: 26200.7171
Environment: FastAPI
CPU: 13th Gen Intel(R) Core(TM) i9-13980HX (2.20 GHz)
RAM: 64 GB
CUDA: None
Install: Poetry/pip
Python: 3.12
PaddleOCR: 3.3.1
```

### 🌰 Minimal Reproducible Example (Minimal Demo to Reproduce the Issue)

```python
import os
import tempfile

import fitz  # PyMuPDF
from paddleocr import PaddleOCR


def test_word_splitting(pdf_path):
    """Render each PDF page to a PNG, run OCR on it, and print the word split per text line."""
    ocr = PaddleOCR(
        use_angle_cls=True,
        lang="de",
        use_doc_unwarping=False,
        det_limit_side_len=4096,
        det_limit_type="max",
        return_word_box=True,
    )

    pdf_document = fitz.open(pdf_path)

    # Write the page images into a temporary directory. Reopening an
    # already-open NamedTemporaryFile by name fails on Windows.
    with tempfile.TemporaryDirectory() as temp_dir:
        for page_num in range(pdf_document.page_count):
            page = pdf_document[page_num]
            pix = page.get_pixmap()
            image_path = os.path.join(temp_dir, f"page_{page_num}.png")
            pix.save(image_path)

            result = ocr.predict(image_path)
            for page_result in result:
                if page_result is None:
                    continue

                for i, rec_text in enumerate(page_result["rec_texts"]):
                    word_tokens = page_result["text_word"][i]  # Word-level tokens for this line

                    print(f"Index:         {i}")
                    print(f"Original text: {rec_text}")
                    print(f"Word tokens:   {word_tokens}")
                    print("-" * 50)

    pdf_document.close()


if __name__ == "__main__":
    pdf_path = "./email_t_3.pdf"
    test_word_splitting(pdf_path)
```
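
To narrow the output of the example above to the affected lines only, a small check like the following could be used. This is a sketch under assumptions: `is_over_segmented` is a hypothetical helper, not a PaddleOCR function, and it treats a line as affected when `text_word` contains more non-whitespace tokens than `rec_texts` has whitespace-separated words.

```python
def is_over_segmented(rec_text, tokens):
    """Return True if tokens splits rec_text into more pieces than its spaces imply.

    Assumption: a correct segmentation yields exactly one non-whitespace
    token per whitespace-separated word in rec_text.
    """
    expected_words = rec_text.split()
    # Drop empty and whitespace-only tokens, which only act as separators.
    actual_tokens = [t for t in tokens if t.strip()]
    return len(actual_tokens) > len(expected_words)


line = "Mit freundlichen Grüßen"
tokens = ["Mit", " ", "freundlichen", " ", "Gr", "üß", "en"]
print(is_over_segmented(line, tokens))  # True
```

Inside the loop, the `print` calls could then be guarded with `if is_over_segmented(rec_text, word_tokens):` so that only the problematic lines are shown.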
