Refact: Change DOCX extraction to use HTML tags for whitespace #2550

Merged
danielaskdd merged 1 commit into HKUDS:main from danielaskdd:docx-extraction
Dec 28, 2025

@danielaskdd
Collaborator

Refact: Change DOCX extraction to use HTML tags for whitespace

Summary

This PR modifies the DOCX text extraction function to use HTML tags for representing whitespace instead of escaped string sequences. This change improves the visual representation of extracted document content when rendered in HTML contexts.

Changes

  • File: lightrag/api/routers/document_routes.py
  • Function: _extract_docx()

Whitespace Conversion Updates:

| Before | After | Purpose |
| --- | --- | --- |
| \t → \\t | \t → &emsp;&emsp; | Tabs rendered as HTML em spaces (double em space for visual tab width) |
| \r\n → \\n | \r\n → <br> | Windows newlines rendered as HTML line breaks |
| \r → \\n | \r → <br> | Mac (Classic) newlines rendered as HTML line breaks |
| \n → \\n | \n → <br> | Unix newlines rendered as HTML line breaks |
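
The mapping above can be sketched as a small standalone helper. This is an illustration only, not the code from _extract_docx(): the function name to_html_whitespace is hypothetical, and the replacement order (Windows newlines before the lone \r and \n cases) is an assumption made so that a single \r\n yields one <br> rather than two.

```python
def to_html_whitespace(text: str) -> str:
    """Hypothetical helper: map raw whitespace to HTML-friendly markup."""
    return (
        text.replace("\t", "&emsp;&emsp;")  # tab -> two em spaces
        .replace("\r\n", "<br>")            # Windows newline first, so it counts once
        .replace("\r", "<br>")              # classic Mac newline
        .replace("\n", "<br>")              # Unix newline
    )


if __name__ == "__main__":
    sample = "Name\tValue\r\nfoo\tbar\nend"
    print(to_html_whitespace(sample))
    # Name&emsp;&emsp;Value<br>foo&emsp;&emsp;bar<br>end
```

Handling \r\n before \r and \n matters: if the single-character cases ran first, each Windows newline would produce two <br> tags.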

Motivation

The previous implementation escaped whitespace characters as literal string sequences (\\t, \\n), which displayed as raw text in HTML rendering contexts. The new approach uses proper HTML entities and tags:

  • &emsp;&emsp; provides visual spacing that approximates a tab character
  • <br> creates proper line breaks when content is rendered in HTML

Impact

  • Operational Impact: DOCX document extraction will now produce HTML-compatible whitespace formatting
  • Backward Compatibility: This affects how extracted text is formatted; consuming applications that expect escaped sequences may need adjustment
  • No API Changes: The function signature and overall extraction logic remain unchanged
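
For a consumer that still expects the old escaped sequences, one possible adjustment is to map the new markup back. The sketch below is an assumption about how such a shim might look, not part of this PR; to_escaped_whitespace is a hypothetical name. It cannot distinguish markup produced by extraction from a literal "<br>" or "&emsp;&emsp;" that was already in the document text.

```python
def to_escaped_whitespace(text: str) -> str:
    """Hypothetical shim: turn HTML whitespace markup back into the
    escaped sequences (\\t, \\n) that the previous output used."""
    return (
        text.replace("&emsp;&emsp;", "\\t")  # two em spaces -> literal backslash-t
        .replace("<br>", "\\n")              # line break -> literal backslash-n
    )


if __name__ == "__main__":
    print(to_escaped_whitespace("Name&emsp;&emsp;Value<br>foo"))
    # Name\tValue\nfoo
```

The old output also collapsed all three newline styles into \\n, so this round trip loses no more information than the previous format did.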

Testing

  • Verified DOCX extraction produces properly formatted HTML whitespace
  • Line breaks and tabs render correctly in HTML display contexts

- Replace tabs with HTML em spaces
- Convert all newlines to break tags
@danielaskdd
Collaborator Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Breezy!


@danielaskdd danielaskdd merged commit 2ff9c69 into HKUDS:main Dec 28, 2025
3 checks passed
@danielaskdd danielaskdd deleted the docx-extraction branch December 28, 2025 07:58