Fix: Prevent UnicodeEncodeError in JSON storage operations #2344
danielaskdd merged 3 commits into HKUDS:main
• Add _sanitize_json_data helper function
• Recursively clean strings in data
• Sanitize before JSON serialization
• Prevent encoding-related crashes
• Use existing sanitize_text_for_encoding
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
• Remove surrogate characters (U+D800-DFFF)
• Filter Unicode non-characters
• Direct char-by-char filtering
@codex review
- Sanitize dictionary keys
- Preserve tuple types
- Handle nested structures better
@codex review
Codex Review: Didn't find any major issues. Nice work!
Fix: Prevent UnicodeEncodeError in JSON storage operations
Problem
Document deletion and other operations were failing with UnicodeEncodeError: 'utf-8' codec can't encode character '\udc9a' in position 201: surrogates not allowed. This occurred when the JSON storage system attempted to persist data containing surrogate characters (U+D800 to U+DFFF), which are invalid in UTF-8 encoding.
Error Stack Trace:
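A minimal reproduction of this class of failure, as a sketch rather than the project's actual code path. It assumes the storage layer serializes with ensure_ascii=False and then encodes to UTF-8; the `record` dictionary is purely illustrative.

```python
import json

# Illustrative payload containing a lone low surrogate (U+DC9A).
record = {"text": "bad byte: \udc9a"}

# With the default ensure_ascii=True the surrogate is written as an
# ASCII escape sequence, so serialization itself does not fail.
escaped = json.dumps(record)

# With ensure_ascii=False the surrogate stays in the output string...
raw = json.dumps(record, ensure_ascii=False)

# ...and the failure appears only when that string is encoded as UTF-8.
error = None
try:
    raw.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__, "-", error.reason)
```

The error is raised at encoding time, not at serialization time, which is why it surfaces in whichever operation happens to persist the affected record.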
Root Cause Analysis
The initial fix attempt used sanitize_text_for_encoding(), which had a critical flaw: it attempted to validate UTF-8 encoding before removing surrogate characters. This caused ValueError exceptions instead of fixing the problem, so operations like write_json({"text": "\udc9a"}, path) still failed.
Solution
Implemented a dedicated JSON sanitization approach that avoids pre-validation:
- Added _sanitize_string_for_json() helper function
- Modified _sanitize_json_data() function
  - Uses _sanitize_string_for_json() for all string values
- Updated write_json() function
Changes Made
File: lightrag/utils.py
- _sanitize_string_for_json(text: str) -> str
- _sanitize_json_data(data: Any) -> Any
- write_json(json_obj, file_name)
Key Features
Impact
UnicodeEncodeError
Testing
The fix correctly handles edge cases that previously caused failures:
- write_json({"text": "\udc9a"}, path) - now succeeds (surrogate removed)
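The approach described in this PR (character-by-character filtering with no pre-validation, applied recursively to keys and values while preserving tuples) could be sketched roughly as follows. The names `strip_unencodable` and `sanitize_for_json` are hypothetical stand-ins, not the functions in lightrag/utils.py, and the choice of U+FFFE and U+FFFF as the filtered non-characters is an assumption.

```python
import json
from typing import Any


def strip_unencodable(text: str) -> str:
    """Drop surrogates (U+D800-U+DFFF) and, by assumption, U+FFFE/U+FFFF.

    Filtering happens directly on code points, so no encoding attempt
    (and therefore no exception) occurs before the bad characters are gone.
    """
    return "".join(
        ch
        for ch in text
        if not (0xD800 <= ord(ch) <= 0xDFFF) and ord(ch) not in (0xFFFE, 0xFFFF)
    )


def sanitize_for_json(data: Any) -> Any:
    """Recursively clean strings in nested dicts, lists, and tuples."""
    if isinstance(data, str):
        return strip_unencodable(data)
    if isinstance(data, dict):
        # Keys are sanitized too, since they are also written to the file.
        return {sanitize_for_json(k): sanitize_for_json(v) for k, v in data.items()}
    if isinstance(data, list):
        return [sanitize_for_json(item) for item in data]
    if isinstance(data, tuple):
        # Rebuild as a tuple so the container type is preserved.
        return tuple(sanitize_for_json(item) for item in data)
    return data  # numbers, booleans, None pass through unchanged


cleaned = sanitize_for_json({"te\udc9axt": ["a\udc9ab", ("ok", "\ud800")], "n": 1})

# After sanitization the UTF-8 encoding step no longer raises.
encoded = json.dumps(cleaned, ensure_ascii=False).encode("utf-8")
```

One trade-off of sanitizing keys in this sketch: two keys that differ only by removed characters would collapse into one, with the later value winning.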