Feat: Enhanced DOCX Extraction with Table Content Support #2383
danielaskdd merged 4 commits into HKUDS:main
• Include tables in extracted content
• Maintain original document order
• Add spacing around tables
• Use tabs to separate table cells
• Process all body elements sequentially
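The "spacing around tables" and "tabs to separate table cells" points can be illustrated with a small, self-contained sketch. The `join_blocks` helper and the `("p", text)` / `("tbl", rows)` tuples are hypothetical names chosen for this example, not code from the PR; the sketch assumes "spacing" means one blank line before and after each table.

```python
from typing import Iterable, List, Tuple, Union

# Hypothetical block representation (an assumption for this sketch):
#   ("p", "some text")            -> a paragraph
#   ("tbl", [["a", "b"], [...]])  -> a table as rows of cell strings
Block = Tuple[str, Union[str, List[List[str]]]]


def join_blocks(blocks: Iterable[Block]) -> str:
    """Join blocks in their given order, padding each table with blank lines."""
    lines: List[str] = []
    for kind, payload in blocks:
        if kind == "tbl":
            lines.append("")  # blank line before the table
            # One line per row, cells separated by tabs.
            lines.extend("\t".join(row) for row in payload)
            lines.append("")  # blank line after the table
        else:
            lines.append(payload)  # paragraph text, kept as-is
    return "\n".join(lines)
```

With this simple rule, two adjacent tables end up separated by two blank lines; a real implementation might track state to avoid that.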
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
• Always append cell text to maintain columns
• Preserve empty cells in table structure
• Check for any content before adding rows
• Use tab separation for proper alignment
• Improve table formatting consistency
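One possible reading of this commit summary, as a minimal sketch: every cell is appended (so empty cells still occupy a column), and a row is emitted only if at least one cell has content. The `format_row` name is hypothetical, and treating a whitespace-only cell as "content" is an assumption of this example.

```python
from typing import List, Optional


def format_row(cells: List[str]) -> Optional[str]:
    """Return a tab-separated row, or None if every cell is empty.

    Empty cells are kept so that later columns stay aligned.
    Cell text is not stripped, so a whitespace-only cell counts as content.
    """
    if not any(cells):
        return None  # nothing to emit for a fully empty row
    return "\t".join(cells)
```

Keeping the empty strings in the join is what preserves column positions: `["a", "", "c"]` yields two consecutive tabs between `a` and `c`.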
@codex review
• Remove text emptiness check
• Always append paragraph text
• Maintain document formatting
• Preserve original spacing
@codex review
• Keep original paragraph spacing
• Preserve cell whitespace in tables
• Maintain document formatting
• Don't strip leading/trailing spaces
@codex review
Codex Review: Didn't find any major issues. 🚀
🔧 Enhanced DOCX Extraction with Table Content Support
Problem Statement
The current _extract_docx() function only extracts text from doc.paragraphs, which excludes all content within table cells. Word documents often contain critical information in tables (data, specifications, comparisons, etc.), and this content was completely missing from the extracted text sent to the RAG system.

Impact: Documents with tables lose significant content, reducing the quality and completeness of knowledge extraction for the RAG system.
Solution
Enhanced the _extract_docx() function to extract table content alongside paragraphs, in original document order.
Changes Made
Modified File
lightrag/api/routers/document_routes.py
Key Implementation Details
Document Element Traversal
• Uses doc.element.body to access all elements in order
• Identifies paragraphs (element.tag.endswith('p'))
• Identifies tables (element.tag.endswith('tbl'))
Table Formatting
• Separates cells with tabs (\t)
Merged Cell Handling
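The body-order traversal and tab-separated table formatting described above can be sketched with the standard library alone. This is not the PR's code: it swaps python-docx for `zipfile` plus `xml.etree.ElementTree`, reads `word/document.xml` directly, and compares full qualified tag names rather than using the `endswith` checks. The `extract_docx_text` and `_text` names are hypothetical. Merged cells get no special treatment here, and tab or line-break elements inside runs are ignored.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _text(node: ET.Element) -> str:
    """Concatenate all w:t text under a node, without stripping whitespace."""
    return "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))


def extract_docx_text(data: bytes) -> str:
    """Return paragraph and table text in document order.

    Paragraphs become one line each; each table row becomes one line
    with cells separated by tabs. Empty paragraphs and cells are kept.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    body = root.find("w:body", NS)
    if body is None:
        return ""

    lines = []
    for child in body:  # direct children only, so order is preserved
        if child.tag == f"{{{W}}}p":
            lines.append(_text(child))
        elif child.tag == f"{{{W}}}tbl":
            for row in child.findall("w:tr", NS):
                cells = [_text(tc) for tc in row.findall("w:tc", NS)]
                lines.append("\t".join(cells))
        # Other body children (for example w:sectPr) are skipped.
    return "\n".join(lines)
```

Iterating over the direct children of the body, instead of collecting all paragraphs and then all tables, is what keeps tables at their original position in the output.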
Example Output
Before (only paragraphs):
After (paragraphs + tables):
Benefits
Testing Recommendations
Breaking Changes
None - This is a backward-compatible enhancement.
Related Issues
Resolves: #2380