Fix UTF-8 Encoding Issues Causing Document Processing Failures #2017
Merged
danielaskdd merged 1 commit into HKUDS:main on Aug 27, 2025
…ters
- Change sanitize_text_for_encoding to fail-fast instead of returning error placeholders
- Add strict UTF-8 cleaning pipeline to entity/relationship extraction
- Skip problematic entities/relationships instead of corrupting data
Fixes document processing crashes when encountering surrogate characters (U+D800-U+DFFF)
Fix UTF-8 Encoding Issues Causing Document Processing Failures
Problem
LightRAG was experiencing document processing failures when encountering surrogate characters (U+D800-U+DFFF) in text content. The root cause is:
sanitize_text_for_encoding used a "silent failure" approach, returning placeholder text like [TEXT_ENCODING_ERROR: X characters] instead of raising exceptions. This allowed corrupted data to propagate through the system until it reached Neo4j's strict UTF-8 encoding requirements.
Solution
Implemented a fail-fast approach that prevents corrupted data from propagating through the system:
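As a rough illustration of the difference between the two approaches, the sketch below contrasts a placeholder-returning sanitizer with a fail-fast one. Both functions are hypothetical stand-ins written for this example, not LightRAG's actual sanitize_text_for_encoding; in particular, it is an assumption that the "X" in the placeholder is the character count.

```python
def sanitize_silent(text: str) -> str:
    """Hypothetical 'silent failure' style: hide the problem behind a placeholder."""
    try:
        text.encode("utf-8")  # strict encoding rejects lone surrogates
        return text
    except UnicodeEncodeError:
        # Assumption: the placeholder reports the length of the original text.
        return f"[TEXT_ENCODING_ERROR: {len(text)} characters]"


def sanitize_fail_fast(text: str) -> str:
    """Hypothetical fail-fast style: surface the problem to the caller immediately."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError(f"Text cannot be encoded as UTF-8: {exc.reason}") from exc
    return text


bad = "caf\ud83d latte"  # contains a lone high surrogate (U+D83D)

# The silent version returns a string that looks valid to downstream code.
print(sanitize_silent(bad))

# The fail-fast version forces the caller to decide what to do with bad input.
try:
    sanitize_fail_fast(bad)
except ValueError as err:
    print("rejected:", err)
```

With the silent version, the placeholder flows onward as if it were real content; with the fail-fast version, the caller has to handle the error explicitly at the point where it occurs.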
1. Enhanced sanitize_text_for_encoding Function
- Previously returned [TEXT_ENCODING_ERROR: X characters] for uncleanable text
- Now raises ValueError when encountering uncleanable encoding issues
2. Strict Text Cleaning in Entity/Relationship Extraction
- Cleaning pipeline: sanitize_text_for_encoding → clean_str → normalize_extracted_info
- Catches ValueError from sanitization and returns None to skip problematic extractions
3. Comprehensive Error Handling
- _handle_single_entity_extraction now handles encoding errors gracefully
- _handle_single_relationship_extraction uses the same fail-safe approach
Impact
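The intended effect of skipping problematic extractions can be sketched as follows. The names sanitize_or_raise and extract_entity, and the record layout, are hypothetical and chosen only for this example; the real handlers in LightRAG take different arguments.

```python
from typing import Optional


def sanitize_or_raise(text: str) -> str:
    """Hypothetical fail-fast sanitizer: raises ValueError on unencodable text."""
    try:
        text.encode("utf-8")  # strict encoding rejects lone surrogates
    except UnicodeEncodeError as exc:
        raise ValueError(f"uncleanable text: {exc.reason}") from exc
    return text.strip()


def extract_entity(record: dict) -> Optional[dict]:
    """Hypothetical handler: return None to skip a record that fails sanitization."""
    try:
        name = sanitize_or_raise(record["name"])
        description = sanitize_or_raise(record["description"])
    except ValueError:
        return None  # skip this entity instead of storing corrupted data
    return {"name": name, "description": description}


records = [
    {"name": "Alice", "description": "Engineer"},
    {"name": "B\udc00b", "description": "Contains a lone low surrogate"},
    {"name": "Carol", "description": "Analyst"},
]

# Only records that sanitize cleanly survive; the bad one is dropped.
entities = [e for e in map(extract_entity, records) if e is not None]
print([e["name"] for e in entities])  # ['Alice', 'Carol']
```

One corrupted record no longer aborts the whole document: the remaining entities are still processed, and nothing unencodable reaches the storage layer.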
Before Fix:
After Fix:
Backward Compatibility