Refact: Change DOCX extraction to use HTML tags for whitespace by danielaskdd · Pull Request #2550 · HKUDS/LightRAG

danielaskdd · 2025-12-28T07:22:38Z

Refact: Change DOCX extraction to use HTML tags for whitespace

Summary

This PR modifies the DOCX text extraction function to represent whitespace with HTML markup (`&emsp;` entities for tabs, `<br>` tags for newlines) instead of escaped string sequences. The change improves how extracted document content looks when rendered in HTML contexts.

Changes

File: lightrag/api/routers/document_routes.py
Function: _extract_docx()

Whitespace Conversion Updates:

| Before | After | Purpose |
| --- | --- | --- |
| `\t` → `\\t` | `\t` → `&emsp;&emsp;` | Tabs rendered as HTML em spaces (double em space for visual tab width) |
| `\r\n` → `\\n` | `\r\n` → `<br>` | Windows newlines rendered as HTML line breaks |
| `\r` → `\\n` | `\r` → `<br>` | Mac (Classic) newlines rendered as HTML line breaks |
| `\n` → `\\n` | `\n` → `<br>` | Unix newlines rendered as HTML line breaks |
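
The mapping in the "After" column can be sketched as a small standalone function. This is a minimal illustration, not the actual body of `_extract_docx()`: the helper name `to_html_whitespace` is hypothetical, and the replacement order is an assumption (handling `\r\n` first so that a Windows newline becomes one `<br>` rather than two).

```python
def to_html_whitespace(text: str) -> str:
    """Hypothetical helper illustrating the new whitespace mapping."""
    # Handle "\r\n" before "\r" and "\n" so a Windows newline
    # produces a single <br> instead of two.
    text = text.replace("\r\n", "<br>")
    text = text.replace("\r", "<br>")
    text = text.replace("\n", "<br>")
    # Two em spaces approximate the visual width of a tab.
    return text.replace("\t", "&emsp;&emsp;")


if __name__ == "__main__":
    print(to_html_whitespace("col1\tcol2\r\nnext line"))
    # prints: col1&emsp;&emsp;col2<br>next line
```

Chained `str.replace` calls keep the sketch easy to read; a single regular-expression pass would work as well if the order dependence is a concern.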

Motivation

The previous implementation escaped whitespace characters as literal string sequences (`\\t`, `\\n`), which displayed as raw text in HTML rendering contexts. The new approach uses proper HTML entities and tags:

- `&emsp;&emsp;` provides visual spacing equivalent to a tab character
- `<br>` creates proper line breaks when content is rendered in HTML

Impact

- Operational Impact: DOCX document extraction will now produce HTML-compatible whitespace formatting
- Backward Compatibility: This affects how extracted text is formatted; consuming applications that expect escaped sequences may need adjustment
- No API Changes: The function signature and overall extraction logic remain unchanged
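
For a consumer that still expects the old escaped sequences, one possible adjustment is a small adapter that maps the new markup back. This is only a sketch under stated assumptions: the function name `html_whitespace_to_escaped` is hypothetical and not part of LightRAG, and the conversion is lossy if the source document itself contained literal `<br>` or `&emsp;&emsp;` text.

```python
def html_whitespace_to_escaped(text: str) -> str:
    """Hypothetical adapter: map the new HTML whitespace back to the
    escaped sequences produced before this change."""
    # "\\t" and "\\n" are two-character strings (backslash + letter),
    # matching the old escaped output rather than real control characters.
    text = text.replace("&emsp;&emsp;", "\\t")
    return text.replace("<br>", "\\n")
```

The adapter cannot tell whether a `<br>` came from `\r\n`, `\r`, or `\n`, but the old output collapsed all three to `\\n` as well, so nothing further is lost on that front.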

Testing

- Verified DOCX extraction produces properly formatted HTML whitespace
- Line breaks and tabs render correctly in HTML display contexts

- Replace tabs with HTML em spaces
- Convert all newlines to break tags

danielaskdd · 2025-12-28T07:22:59Z

@codex review

chatgpt-codex-connector · 2025-12-28T07:25:54Z

Codex Review: Didn't find any major issues. Breezy!


Change DOCX extraction to use HTML tags for whitespace

4ef52ec

- Replace tabs with HTML em spaces
- Convert all newlines to break tags

danielaskdd merged commit 2ff9c69 into HKUDS:main Dec 28, 2025
3 checks passed

danielaskdd deleted the docx-extraction branch December 28, 2025 07:58

danielaskdd merged 1 commit into HKUDS:main from danielaskdd:docx-extraction