Fix UTF-8 Encoding Issues Causing Document Processing Failures #2017

Merged
danielaskdd merged 1 commit into HKUDS:main from danielaskdd:improve-text-sanitize on Aug 27, 2025

@danielaskdd (Collaborator)

Fix UTF-8 Encoding Issues Causing Document Processing Failures

Problem

LightRAG was experiencing document processing failures when encountering surrogate characters (U+D800-U+DFFF) in text content. There were two root causes:

  1. sanitize_text_for_encoding used a "silent failure" approach, returning placeholder text like [TEXT_ENCODING_ERROR: X characters] instead of raising exceptions. This allowed corrupted data to propagate through the system until it hit Neo4j's strict UTF-8 encoding requirements.
  2. The LLM can return content with encoding problems during entity/relationship extraction, and no Unicode sanitization was applied at that stage (a minimal reproduction follows this list).
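
For context, a minimal Python reproduction of the failure mode (illustrative code, not taken from LightRAG): a lone surrogate is legal inside a Python str, but encoding it as UTF-8, which Neo4j requires, raises immediately.

```python
# Lone surrogates (U+D800-U+DFFF) typically arrive via text decoded with
# errors="surrogateescape" or via malformed LLM output. They are legal
# Python str content but fail at the first strict UTF-8 boundary.
text = "valid text \ud800 more text"  # lone high surrogate

try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print(f"storage layer would crash here: {exc}")
```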

Solution

Implemented a fail-fast approach that prevents corrupted data from propagating through the system:

1. Enhanced sanitize_text_for_encoding Function

  • Before: Returned error placeholders like [TEXT_ENCODING_ERROR: X characters] for uncleanable text
  • After: Raises ValueError when encountering uncleanable encoding issues
  • Benefit: Forces callers to handle encoding problems explicitly instead of silently corrupting data (see the sketch below)
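
A minimal sketch of the new contract, assuming only standard-library behavior (the real function performs additional cleanup before giving up; only the fail-fast behavior is shown):

```python
def sanitize_text_for_encoding(text: str) -> str:
    """Sketch of the fail-fast contract, not the exact implementation."""
    try:
        text.encode("utf-8")  # lone surrogates raise UnicodeEncodeError
        return text
    except UnicodeEncodeError as exc:
        # Raise instead of returning an "[TEXT_ENCODING_ERROR: N characters]"
        # placeholder, so callers must handle the problem explicitly.
        raise ValueError(f"uncleanable text encoding issue: {exc}") from exc
```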

2. Strict Text Cleaning in Entity/Relationship Extraction

  • Added: Three-stage cleaning pipeline: sanitize_text_for_encoding → clean_str → normalize_extracted_info
  • Applied to: All entity and relationship fields (names, descriptions, keywords)
  • Error Handling: Catches ValueError from sanitization and returns None to skip problematic extractions (wiring sketched below)
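
Roughly, the per-field wiring looks like this sketch; clean_str and normalize_extracted_info are existing LightRAG helpers, while _clean_field, the import path, and the logging are illustrative glue:

```python
import logging

# Assumed import path for the helpers named above.
from lightrag.utils import clean_str, normalize_extracted_info, sanitize_text_for_encoding

logger = logging.getLogger("lightrag")

def _clean_field(raw: str) -> str | None:
    """Hypothetical helper: the three-stage pipeline applied to one field."""
    try:
        text = sanitize_text_for_encoding(raw)  # stage 1: strict UTF-8 check
    except ValueError as exc:
        logger.warning("skipping field with uncleanable encoding: %s", exc)
        return None                             # signal caller to skip this record
    text = clean_str(text)                      # stage 2: general string cleanup
    return normalize_extracted_info(text)       # stage 3: normalize extracted value
```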

3. Comprehensive Error Handling

  • Entity Extraction: _handle_single_entity_extraction now handles encoding errors gracefully
  • Relationship Extraction: _handle_single_relationship_extraction uses the same fail-safe approach
  • Logging: Clear error messages for debugging problematic documents (caller-side pattern sketched below)
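
The caller-side pattern, sketched for entities using the hypothetical _clean_field above (relationship extraction follows the same shape; field indices and the returned dict are illustrative, not the exact LightRAG signature):

```python
async def _handle_single_entity_extraction(
    record_attributes: list[str], chunk_key: str
) -> dict | None:
    """Sketch of the fail-safe pattern; the real function differs in detail."""
    entity_name = _clean_field(record_attributes[1])
    if entity_name is None:
        logger.error("encoding error in entity from chunk %s; skipped", chunk_key)
        return None  # drop this entity; document processing continues
    description = _clean_field(record_attributes[3]) or ""
    return {
        "entity_name": entity_name,
        "description": description,
        "source_id": chunk_key,
    }
```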

Impact

Before Fix:

  • Documents with surrogate characters caused complete processing failure
  • Corrupted data was stored in the knowledge graph
  • Neo4j operations crashed the entire document processing pipeline

After Fix:

  • Encoding issues are detected early at the extraction stage
  • Problematic entities/relationships are skipped instead of corrupting data
  • Document processing continues successfully with valid entities
  • Storage layer is protected from receiving invalid UTF-8 data

Backward Compatibility

  • ✅ All existing functionality preserved
  • ✅ No API changes required
  • ✅ No configuration changes needed
  • ✅ Existing data storage formats unchanged

Commit message:

- Change sanitize_text_for_encoding to fail-fast instead of returning error placeholders
- Add strict UTF-8 cleaning pipeline to entity/relationship extraction
- Skip problematic entities/relationships instead of corrupting data

Fixes document processing crashes when encountering surrogate characters (U+D800-U+DFFF)
danielaskdd merged commit 57ba2ca into HKUDS:main on Aug 27, 2025
1 check passed
danielaskdd deleted the improve-text-sanitize branch on August 29, 2025 07:29