Feat: Enhance XLSX Extraction by Adding Separators and Escape Special Characters #2386
…ping - Add clear sheet separators - Escape special characters - Trim trailing empty columns - Preserve row structure - Single-pass optimization
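As a rough illustration of the separator and trailing-column items in this commit message, here is a minimal sketch. The header format is taken from the PR description's Key Changes; the helper names `sheet_header` and `trim_trailing_empty` are hypothetical and are not taken from the PR's code. Later commits in this PR revise how column width is determined, so this reflects only the behaviour described in this first commit.

```python
from typing import List, Sequence, Union

# Hypothetical alias for the cell types mentioned in the PR description.
Cell = Union[str, int, float, None]

SEPARATOR = "=" * 20  # the description shows runs of 20 '=' characters


def sheet_header(title: str) -> str:
    """Build the delimiter line that introduces a sheet."""
    return f"{SEPARATOR} Sheet: {title} {SEPARATOR}"


def trim_trailing_empty(row: Sequence[Cell]) -> List[Cell]:
    """Drop empty cells from the end of a row, keeping interior gaps."""
    end = len(row)
    while end > 0 and (row[end - 1] is None or row[end - 1] == ""):
        end -= 1
    return list(row[:end])
```

With these helpers, a row such as `["data", None, "", None]` shrinks to `["data"]`, while an interior gap like `["a", None, "b"]` is left untouched so the remaining columns keep their positions.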
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
• Two-pass approach for consistent width • Maintain tabular structure integrity • Determine max columns first pass • Extract with alignment second pass • Prevent column misalignment issues
@codex review
• Remove intermediate row storage • Use iterator twice instead of list() • Preserve column alignment logic • Reduce memory footprint • Maintain same output format
@codex review
…s scan • Remove two-pass row scanning approach • Use built-in sheet.max_column property • Simplify column width detection logic • Improve memory efficiency • Maintain column alignment preservation
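The idea of replacing a width-detection scan with `sheet.max_column` might look roughly like the sketch below. It assumes an openpyxl-style worksheet that exposes `max_column` and `iter_rows(values_only=True)`; the function name `rows_with_fixed_width` is hypothetical, and the PR's actual code may differ.

```python
from typing import Any, Iterator, List


def rows_with_fixed_width(sheet: Any) -> Iterator[List[Any]]:
    """Yield each row padded (or cut) to sheet.max_column cells.

    `sheet` is assumed to behave like an openpyxl worksheet.
    """
    width = sheet.max_column or 0  # treat a missing width as zero columns
    for row in sheet.iter_rows(values_only=True):
        cells = list(row[:width])
        # Pad short rows with None so every row has the same column count.
        cells.extend([None] * (width - len(cells)))
        yield cells
```

Because the width comes from a single property lookup, the rows are read only once and never need to be stored in a list to measure them.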
@codex review
Codex Review: Didn't find any major issues. Delightful!
- Add escape_cell() function - Escape backslashes first - Handle tabs and newlines - Preserve tab-delimited format - Prevent double-escaping issues
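A minimal sketch of what an `escape_cell()` helper along these lines could look like is shown below. The name comes from the commit message, but the signature, the `CellValue` alias, and the handling of `None` are assumptions. Backslashes are escaped first so that the backslashes introduced for tabs and newlines are not escaped a second time.

```python
from typing import Union

# Assumed cell types, based on the type hint quoted in the PR description.
CellValue = Union[str, int, float, None]


def escape_cell(value: CellValue) -> str:
    """Return a single-line, tab-safe text form of a cell value."""
    if value is None:
        return ""  # assumption: empty cells become empty strings
    text = str(value)
    text = text.replace("\\", "\\\\")  # must run first to avoid double-escaping
    text = text.replace("\t", "\\t")   # real tab becomes the two characters \t
    text = text.replace("\n", "\\n")   # real newline becomes the two characters \n
    return text
```

For example, `escape_cell(r"C:\Users\test\file.txt")` returns the path with every backslash doubled, and a cell containing a real tab stays distinguishable from a cell containing a literal backslash followed by `t`.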
@codex review
Codex Review: Didn't find any major issues. Keep them coming!
🎯 Enhance XLSX Extraction by Adding Separators and Escape Special Characters
Summary
Improved the `_extract_xlsx` function to produce better-structured, LLM-friendly output with clear sheet delimiters, proper special character escaping, and performance optimization through single-pass processing.
🔍 Problem Statement
The previous XLSX extraction had several limitations:
✨ Solution
1. Clear Sheet Delimiters
`====================` separators
2. Robust Special Character Escaping
3. Sheet Title Sanitization
4. Single-Pass Optimization ⚡
5. Smart Trailing Column Trimming
`data\t\t\t\t\t\t` (long empty trailing cells)
6. Comprehensive Documentation
(`str | int | float | None` vs `Any`)
📊 Key Changes
`Sheet: {title}\n` → `==================== Sheet: {title} ====================`
(`\t`, `\n`, `\\`)
`Any` → `str | int | float | None`
🎁 Benefits
🧪 Testing Recommendations
`C:\Users\test\file.txt`
Fix DOCX table extraction by escaping special characters in cells