Refact: Change DOCX extraction to use HTML tags for whitespace #2550
Merged
danielaskdd merged 1 commit into HKUDS:main on Dec 28, 2025
- Replace tabs with HTML em spaces
- Convert all newlines to break tags
@codex review
Codex Review: Didn't find any major issues. Breezy!
Refact: Change DOCX extraction to use HTML tags for whitespace
Summary
This PR modifies the DOCX text extraction function to use HTML tags for representing whitespace instead of escaped string sequences. This change improves the visual representation of extracted document content when rendered in HTML contexts.
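The conversion this PR describes can be sketched roughly as follows. This is a minimal illustration, not the actual body of `_extract_docx()`: the helper name `convert_whitespace_for_html` is hypothetical, and it assumes the em space is emitted as the `&emsp;` entity.

```python
def convert_whitespace_for_html(text: str) -> str:
    """Hypothetical helper illustrating the mapping; not the code from this PR."""
    # Normalize "\r\n" and bare "\r" to "\n" first, so that a Windows
    # line ending produces one <br> rather than two.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: tabs become the HTML em space entity.
    text = text.replace("\t", "&emsp;")
    # Every remaining newline becomes an HTML break tag.
    return text.replace("\n", "<br>")


if __name__ == "__main__":
    print(convert_whitespace_for_html("Name\tValue\r\nfoo\tbar"))
```

Normalizing line endings before inserting `<br>` is a design choice of this sketch; replacing `\r` and `\n` independently without handling `\r\n` first would emit two breaks for each Windows line ending.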
Changes
lightrag/api/routers/document_routes.py: _extract_docx()
Whitespace Conversion Updates:
- \t: previously \\t, now &emsp;
- \r\n: previously \\n, now <br>
- \r: previously \\n, now <br>
- \n: previously \\n, now <br>
Motivation
The previous implementation escaped whitespace characters as literal string sequences (\\t, \\n), which displayed as raw text in HTML rendering contexts. The new approach uses proper HTML entities and tags:
- &emsp; provides visible horizontal spacing in place of a tab character
- <br> creates proper line breaks when content is rendered in HTML
Impact
Testing