Feat: Enhanced DOCX Extraction with Table Content Support by danielaskdd · Pull Request #2383 · HKUDS/LightRAG

danielaskdd · 2025-11-18T17:37:15Z

🔧 Enhanced DOCX Extraction with Table Content Support

Problem Statement

The current _extract_docx() function only extracts text from doc.paragraphs, which excludes all content within table cells. Word documents often contain critical information in tables (data, specifications, comparisons, etc.), and none of that content reaches the extracted text sent to the RAG system.

Impact: Documents with tables lose significant content, reducing the quality and completeness of knowledge extraction for the RAG system.

Solution

Enhanced the _extract_docx() function to:

✅ Extract all table content in addition to paragraphs
✅ Preserve document order (tables appear between correct paragraphs)
✅ Format tables with clear visual separation
✅ Handle merged cells naturally (preserves repeated content)

Changes Made

Modified File

lightrag/api/routers/document_routes.py

Key Implementation Details

Document Element Traversal

- Iterate through doc.element.body to access all elements in document order
- Identify paragraphs (element.tag.endswith('p'))
- Identify tables (element.tag.endswith('tbl'))

Table Formatting

- Cells within a row are separated by tab characters (\t)
- Blank lines are added before and after each table for clarity
- Table rows remain grouped (no blank lines between rows)

Merged Cell Handling

- Accepts python-docx's default behavior, in which a merged cell's content is repeated
- Keeps the implementation simple and robust
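
The traversal and formatting rules above can be sketched without python-docx by walking the WordprocessingML body directly. This is an illustrative stand-in using only the standard library, not the code merged in this PR: the function name extract_body_text, the helper _text_of, and the choice to take the contents of word/document.xml as a string are all assumptions. It compares full tag names instead of using endswith, and it concatenates the text runs of a cell, which may differ from python-docx's cell.text for multi-paragraph cells.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text_of(node: ET.Element) -> str:
    """Concatenate every w:t text run found under the given node."""
    return "".join(t.text or "" for t in node.iter(f"{W}t"))


def extract_body_text(document_xml: str) -> str:
    """Hypothetical sketch: return paragraphs and tables in document order."""
    body = ET.fromstring(document_xml).find(f"{W}body")
    if body is None:
        return ""

    parts: list[str] = []
    in_table = False
    for child in body:  # direct children only, so order is preserved
        if child.tag == f"{W}p":
            if in_table:
                parts.append("")  # blank line after a table
                in_table = False
            parts.append(_text_of(child))  # blank paragraphs are kept
        elif child.tag == f"{W}tbl":
            if not in_table:
                parts.append("")  # blank line before a table
                in_table = True
            for row in child.findall(f"{W}tr"):
                cells = [_text_of(cell) for cell in row.findall(f"{W}tc")]
                parts.append("\t".join(cells))  # tab-separated cells
        # other body children (for example w:sectPr) are ignored
    return "\n".join(parts)
```

In this sketch, a paragraph followed by a two-row table and another paragraph yields the first paragraph, a blank line, the two tab-separated rows, a blank line, and the last paragraph.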

Example Output

Before (only paragraphs):

Introduction to our team
This table shows our employees.
All data is current.

After (paragraphs + tables):

Introduction to our team

Name	Age	Department
John	30	Engineering
Jane	25	Marketing

This table shows our employees.

Product	Price	Quantity
Apple	$5	100
Orange	$3	150

All data is current.

Benefits

Complete Content Extraction - No more missing table data
Better RAG Quality - Knowledge graphs can now include structured data from tables
Preserved Context - Table position in document flow maintained
Clean Formatting - Tab-separated cells make structure clear to LLM processing

Testing Recommendations

✅ Test with DOCX files containing tables and paragraphs
✅ Verify table content appears in correct document position
✅ Test merged cell scenarios
✅ Verify backward compatibility (files without tables still work)
✅ Check performance with large tables
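
One way to check the "correct document position" item is to assert that known marker strings occur in the extracted text in the expected order. The helper below is a hypothetical sketch (appears_in_order is not part of LightRAG), and the sample text is taken from the Example Output section of this PR.

```python
def appears_in_order(text: str, markers: list[str]) -> bool:
    """Return True if every marker occurs in text, each after the previous one."""
    position = 0
    for marker in markers:
        index = text.find(marker, position)
        if index == -1:
            return False
        position = index + len(marker)
    return True


# Extracted text as shown in the "After" example of this PR description
extracted = (
    "Introduction to our team\n"
    "\n"
    "Name\tAge\tDepartment\n"
    "John\t30\tEngineering\n"
    "Jane\t25\tMarketing\n"
    "\n"
    "This table shows our employees.\n"
    "\n"
    "Product\tPrice\tQuantity\n"
    "Apple\t$5\t100\n"
    "Orange\t$3\t150\n"
    "\n"
    "All data is current."
)

# The first table should sit between the first and second paragraphs
first_table_in_place = appears_in_order(
    extracted,
    ["Introduction to our team", "Name\tAge\tDepartment", "This table shows our employees."],
)
```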

Breaking Changes

None - This is a backward-compatible enhancement.

Related Issues

Resolves: #2380

danielaskdd · 2025-11-18T17:37:41Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

lightrag/api/routers/document_routes.py

danielaskdd · 2025-11-18T17:52:53Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

lightrag/api/routers/document_routes.py

danielaskdd · 2025-11-18T18:03:50Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

lightrag/api/routers/document_routes.py

danielaskdd · 2025-11-18T18:13:50Z

@codex review

chatgpt-codex-connector · 2025-11-18T18:20:50Z

Codex Review: Didn't find any major issues. 🚀

Enhance DOCX extraction to preserve document order with tables

4438ba4

• Include tables in extracted content
• Maintain original document order
• Add spacing around tables
• Use tabs to separate table cells
• Process all body elements sequentially

chatgpt-codex-connector bot reviewed Nov 18, 2025

lightrag/api/routers/document_routes.py (outdated, resolved)

Fix table column structure preservation in DOCX extraction

fa887d8

• Always append cell text to maintain columns
• Preserve empty cells in table structure
• Check for any content before adding rows
• Use tab separation for proper alignment
• Improve table formatting consistency
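
The column-preservation rule in this commit can be sketched as follows. The function name format_table_rows and its list-of-lists input are hypothetical, chosen only for illustration. Treating any non-empty string as "content" is also an assumption about how the row check works.

```python
def format_table_rows(rows: list[list[str]]) -> list[str]:
    """Join each row's cells with tabs, keeping empty cells so columns line up."""
    lines: list[str] = []
    for cells in rows:
        # Skip a row only when every cell is empty (assumed content check)
        if any(cells):
            # Every cell is appended, including empty ones, to keep column positions
            lines.append("\t".join(cells))
    return lines


sample_rows = [
    ["Name", "", "Department"],  # empty middle cell is preserved
    ["", "", ""],                # fully empty row is dropped
    ["John", "30", ""],          # trailing empty cell is preserved
]
formatted = format_table_rows(sample_rows)
```

Dropping empty cells instead would shift later values into the wrong column, which is the problem the commit title refers to.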

chatgpt-codex-connector bot reviewed Nov 18, 2025

lightrag/api/routers/document_routes.py (outdated, resolved)

Preserve blank paragraphs in DOCX extraction to maintain spacing

186c8f0

• Remove text emptiness check
• Always append paragraph text
• Maintain document formatting
• Preserve original spacing

chatgpt-codex-connector bot reviewed Nov 18, 2025

lightrag/api/routers/document_routes.py (resolved)

lightrag/api/routers/document_routes.py (resolved)

Remove text stripping in DOCX extraction to preserve whitespace

e7d2803

• Keep original paragraph spacing
• Preserve cell whitespace in tables
• Maintain document formatting
• Don't strip leading/trailing spaces
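
The effect of the last two commits (keeping blank paragraphs and not stripping whitespace) can be illustrated with a small comparison. This is a hypothetical sketch: join_paragraph_texts and its preserve flag are not names from the PR, and the non-preserving branch only approximates the earlier behavior described in the commit messages.

```python
def join_paragraph_texts(texts: list[str], preserve: bool = True) -> str:
    """Join paragraph texts with newlines, optionally stripping and dropping blanks."""
    if preserve:
        # Keep every paragraph as-is, including blank ones and edge whitespace
        return "\n".join(texts)
    # Approximation of the earlier behavior: strip each text and skip empty results
    return "\n".join(t.strip() for t in texts if t.strip())


paragraphs = ["Title", "", "  indented line", "End "]
preserved = join_paragraph_texts(paragraphs)
stripped = join_paragraph_texts(paragraphs, preserve=False)
```

With preserve enabled, the blank paragraph and the leading and trailing spaces survive in the output. With it disabled, they are lost.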

danielaskdd merged commit efbbaaf into HKUDS:main Nov 18, 2025
4 checks passed

danielaskdd mentioned this pull request Nov 18, 2025

[Question]: RAG doc Not supported for processing tables？ #2380

Closed

2 tasks

danielaskdd deleted the doc-table branch November 18, 2025 18:33

danielaskdd merged 4 commits into HKUDS:main from danielaskdd:doc-table
