Fix UTF-8 Encoding Issues Causing Document Processing Failures #2017

Merged
danielaskdd merged 1 commit into HKUDS:main from danielaskdd:improve-text-sanitize on Aug 27, 2025

@danielaskdd (Collaborator)

Fix UTF-8 Encoding Issues Causing Document Processing Failures

Problem

LightRAG was experiencing document processing failures when encountering surrogate characters (U+D800-U+DFFF) in text content. There were two root causes:

  1. sanitize_text_for_encoding used a "silent failure" approach, returning placeholder text like [TEXT_ENCODING_ERROR: X characters] instead of raising exceptions. This allowed corrupted data to propagate through the system until it hit Neo4j's strict UTF-8 encoding requirements.
  2. The LLM can return content with encoding problems during entity/relationship extraction, and no Unicode sanitization was applied at that stage (a minimal reproduction follows this list).
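
For context, a minimal Python reproduction of the failure mode (illustrative code, not taken from LightRAG): a lone surrogate is legal inside a Python str, but encoding it as UTF-8, which Neo4j requires, raises immediately.

```python
# Lone surrogates (U+D800-U+DFFF) typically arrive via text decoded with
# errors="surrogateescape" or via malformed LLM output. They are legal
# Python str content but fail at the first strict UTF-8 boundary.
text = "valid text \ud800 more text"  # lone high surrogate

try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print(f"storage layer would crash here: {exc}")
```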

Solution

Implemented a fail-fast approach that prevents corrupted data from propagating through the system:

1. Enhanced sanitize_text_for_encoding Function

  • Before: Returned error placeholders like [TEXT_ENCODING_ERROR: X characters] for uncleanable text
  • After: Raises ValueError when encountering uncleanable encoding issues
  • Benefit: Forces callers to handle encoding problems explicitly instead of silently corrupting data (see the sketch below)
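
A minimal sketch of the new contract, assuming only standard-library behavior (the real function performs additional cleanup before giving up; only the fail-fast behavior is shown):

```python
def sanitize_text_for_encoding(text: str) -> str:
    """Sketch of the fail-fast contract, not the exact implementation."""
    try:
        text.encode("utf-8")  # lone surrogates raise UnicodeEncodeError
        return text
    except UnicodeEncodeError as exc:
        # Raise instead of returning an "[TEXT_ENCODING_ERROR: N characters]"
        # placeholder, so callers must handle the problem explicitly.
        raise ValueError(f"uncleanable text encoding issue: {exc}") from exc
```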

2. Strict Text Cleaning in Entity/Relationship Extraction

  • Added: Three-stage cleaning pipeline: sanitize_text_for_encoding → clean_str → normalize_extracted_info
  • Applied to: All entity and relationship fields (names, descriptions, keywords)
  • Error Handling: Catches ValueError from sanitization and returns None to skip problematic extractions (wiring sketched below)
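
Roughly, the per-field wiring looks like this sketch; clean_str and normalize_extracted_info are existing LightRAG helpers, while _clean_field, the import path, and the logging are illustrative glue:

```python
import logging

# Assumed import path for the helpers named above.
from lightrag.utils import clean_str, normalize_extracted_info, sanitize_text_for_encoding

logger = logging.getLogger("lightrag")

def _clean_field(raw: str) -> str | None:
    """Hypothetical helper: the three-stage pipeline applied to one field."""
    try:
        text = sanitize_text_for_encoding(raw)  # stage 1: strict UTF-8 check
    except ValueError as exc:
        logger.warning("skipping field with uncleanable encoding: %s", exc)
        return None                             # signal caller to skip this record
    text = clean_str(text)                      # stage 2: general string cleanup
    return normalize_extracted_info(text)       # stage 3: normalize extracted value
```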

3. Comprehensive Error Handling

  • Entity Extraction: _handle_single_entity_extraction now handles encoding errors gracefully
  • Relationship Extraction: _handle_single_relationship_extraction uses the same fail-safe approach
  • Logging: Clear error messages for debugging problematic documents (caller-side pattern sketched below)
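
The caller-side pattern, sketched for entities using the hypothetical _clean_field above (relationship extraction follows the same shape; field indices and the returned dict are illustrative, not the exact LightRAG signature):

```python
async def _handle_single_entity_extraction(
    record_attributes: list[str], chunk_key: str
) -> dict | None:
    """Sketch of the fail-safe pattern; the real function differs in detail."""
    entity_name = _clean_field(record_attributes[1])
    if entity_name is None:
        logger.error("encoding error in entity from chunk %s; skipped", chunk_key)
        return None  # drop this entity; document processing continues
    description = _clean_field(record_attributes[3]) or ""
    return {
        "entity_name": entity_name,
        "description": description,
        "source_id": chunk_key,
    }
```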

Impact

Before Fix:

  • Documents with surrogate characters caused complete processing failure
  • Corrupted data was stored in the knowledge graph
  • Neo4j operations crashed the entire document processing pipeline

After Fix:

  • Encoding issues are detected early at the extraction stage
  • Problematic entities/relationships are skipped instead of corrupting data
  • Document processing continues successfully with valid entities
  • Storage layer is protected from receiving invalid UTF-8 data

Backward Compatibility

  • ✅ All existing functionality preserved
  • ✅ No API changes required
  • ✅ No configuration changes needed
  • ✅ Existing data storage formats unchanged

Commit message:

- Change sanitize_text_for_encoding to fail-fast instead of returning error placeholders
- Add strict UTF-8 cleaning pipeline to entity/relationship extraction
- Skip problematic entities/relationships instead of corrupting data

Fixes document processing crashes when encountering surrogate characters (U+D800-U+DFFF)
danielaskdd merged commit 57ba2ca into HKUDS:main on Aug 27, 2025
1 check passed
danielaskdd deleted the improve-text-sanitize branch on August 29, 2025 07:29