Feat: Enhanced DOCX Extraction with Table Content Support #2383

Merged
danielaskdd merged 4 commits into HKUDS:main from danielaskdd:doc-table
Nov 18, 2025

@danielaskdd
Collaborator

🔧 Enhanced DOCX Extraction with Table Content Support

Problem Statement

Before this change, the _extract_docx() function extracted text only from doc.paragraphs, which excludes all content inside table cells. Word documents often carry critical information in tables (data, specifications, comparisons, etc.), and that content never reached the RAG system.

Impact: Documents with tables lose significant content, reducing the quality and completeness of knowledge extraction for the RAG system.

Solution

Enhanced the _extract_docx() function to:

  • ✅ Extract all table content in addition to paragraphs
  • ✅ Preserve document order (tables appear between correct paragraphs)
  • ✅ Format tables with clear visual separation
  • ✅ Handle merged cells without special-casing (content that python-docx repeats for a merged cell is kept as-is)

Changes Made

Modified File

  • lightrag/api/routers/document_routes.py

Key Implementation Details

  1. Document Element Traversal

    • Iterate through doc.element.body to access all elements in order
    • Identify paragraphs (element.tag.endswith('p'))
    • Identify tables (element.tag.endswith('tbl'))
  2. Table Formatting

    • Cells within a row are separated by tab characters (\t)
    • Blank lines added before and after tables for clarity
    • Table rows remain grouped (no blank lines between rows)
  3. Merged Cell Handling

    • Accepts the default python-docx behavior, where a merged cell's content is repeated for each grid position it spans
    • Keeps implementation simple and robust
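The traversal and formatting steps above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: the function name extract_docx_text and the stricter "}p" / "}tbl" suffix checks are assumptions made for illustration, while Document, Paragraph, Table, row.cells and cell.text are regular python-docx API.

```python
from io import BytesIO


def extract_docx_text(file_bytes: bytes) -> str:
    """Return paragraph and table text from a DOCX file in document order."""
    # python-docx is imported lazily so this module loads even without it.
    from docx import Document
    from docx.table import Table
    from docx.text.paragraph import Paragraph

    doc = Document(BytesIO(file_bytes))
    parts = []

    # doc.element.body yields paragraphs and tables in their original order.
    for element in doc.element.body:
        # Tags are namespaced, e.g. "{...wordprocessingml/2006/main}p".
        # Assumption: matching on "}p" instead of "p" avoids catching any
        # other tag whose name merely ends in the letter p.
        if element.tag.endswith("}p"):
            parts.append(Paragraph(element, doc).text)
        elif element.tag.endswith("}tbl"):
            table = Table(element, doc)
            parts.append("")  # blank line before the table
            for row in table.rows:
                # Merged cells show up once per grid position they span,
                # so their text is repeated; that is accepted as-is.
                parts.append("\t".join(cell.text for cell in row.cells))
            parts.append("")  # blank line after the table

    return "\n".join(parts)
```

Other body children, such as the trailing section properties element, match neither branch and are skipped.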

Example Output

Before (only paragraphs):

Introduction to our team
This table shows our employees.
All data is current.

After (paragraphs + tables):

Introduction to our team

Name	Age	Department
John	30	Engineering
Jane	25	Marketing

This table shows our employees.

Product	Price	Quantity
Apple	$5	100
Orange	$3	150

All data is current.
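The "After" layout can be reproduced with a small pure-Python formatter. This is an illustrative sketch only: render_blocks and its input shape (a string for a paragraph, a list of rows for a table) are hypothetical and not part of the PR.

```python
def render_blocks(blocks):
    """Render paragraphs (str) and tables (list of rows) in document order."""
    lines = []
    for block in blocks:
        if isinstance(block, str):
            lines.append(block)
        else:
            lines.append("")  # blank line before the table
            # Rows stay grouped: one line per row, cells joined by tabs.
            lines.extend("\t".join(row) for row in block)
            lines.append("")  # blank line after the table
    return "\n".join(lines)


blocks = [
    "Introduction to our team",
    [["Name", "Age", "Department"],
     ["John", "30", "Engineering"],
     ["Jane", "25", "Marketing"]],
    "This table shows our employees.",
    [["Product", "Price", "Quantity"],
     ["Apple", "$5", "100"],
     ["Orange", "$3", "150"]],
    "All data is current.",
]
print(render_blocks(blocks))
```

With this simple scheme, two adjacent tables would be separated by two blank lines instead of one.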

Benefits

  1. Complete Content Extraction - No more missing table data
  2. Better RAG Quality - Knowledge graphs can now include structured data from tables
  3. Preserved Context - Table position in document flow maintained
  4. Clean Formatting - Tab-separated cells keep the column structure visible to downstream LLM processing

Testing Recommendations

  • ✅ Test with DOCX files containing tables and paragraphs
  • ✅ Verify table content appears in correct document position
  • ✅ Test merged cell scenarios
  • ✅ Verify backward compatibility (files without tables still work)
  • ✅ Check performance with large tables

Breaking Changes

None - This is a backward-compatible enhancement.

Related Issues

Resolves: #2380

• Include tables in extracted content
• Maintain original document order
• Add spacing around tables
• Use tabs to separate table cells
• Process all body elements sequentially
@danielaskdd
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

• Always append cell text to maintain columns
• Preserve empty cells in table structure
• Check for any content before adding rows
• Use tab separation for proper alignment
• Improve table formatting consistency
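A short sketch of why empty cells have to be kept: dropping them shifts every later value one column to the left. The helper name format_row is hypothetical, and reading "check for any content before adding rows" as "skip rows whose cells are all empty" is an assumption, not something this page confirms.

```python
def format_row(cells):
    """Join cell texts with tabs, keeping empty cells as placeholders.

    Returns None when every cell is empty, so the caller can skip the row.
    """
    if not any(cells):
        return None
    return "\t".join(cells)


row = ["John", "", "Engineering"]

# Empty cell kept: "Engineering" stays in the third column.
kept = format_row(row)

# Empty cell dropped: "Engineering" slides into the second column.
dropped = "\t".join(cell for cell in row if cell)
```

Because nothing is stripped here, a cell that holds only whitespace still counts as content.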
@danielaskdd
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

• Remove text emptiness check
• Always append paragraph text
• Maintain document formatting
• Preserve original spacing
@danielaskdd
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

• Keep original paragraph spacing
• Preserve cell whitespace in tables
• Maintain document formatting
• Don't strip leading/trailing spaces
@danielaskdd
Collaborator Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🚀

@danielaskdd danielaskdd merged commit efbbaaf into HKUDS:main Nov 18, 2025
4 checks passed
@danielaskdd danielaskdd deleted the doc-table branch November 18, 2025 18:33

Development

Successfully merging this pull request may close these issues.

[Question]: RAG doc Not supported for processing tables?

1 participant