Fix: Prevent UnicodeEncodeError in JSON storage operations#2344

Merged
danielaskdd merged 3 commits into HKUDS:main from danielaskdd:fix-josn-serialization-error
Nov 11, 2025
@danielaskdd (Collaborator) commented Nov 11, 2025

Fix: Prevent UnicodeEncodeError in JSON storage operations

Problem

Document deletion and other operations were failing with UnicodeEncodeError: 'utf-8' codec can't encode character '\udc9a' in position 201: surrogates not allowed. This occurred when the JSON storage system attempted to persist data containing surrogate characters (code points U+D800 to U+DFFF), which UTF-8 cannot encode.

Error Stack Trace:

File "lightrag/kg/json_kv_impl.py", line 84, in index_done_callback
File "lightrag/utils.py", line 911, in write_json
    json.dump(json_obj, f, indent=2, ensure_ascii=False)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc9a'
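
The failure can be reproduced without LightRAG. The sketch below is an illustration only: `encode_like_write_json` is a hypothetical helper that mimics the `json.dump(..., ensure_ascii=False)` call from the traceback, followed by the UTF-8 encoding step that a file opened with `encoding="utf-8"` would perform.

```python
import json


def encode_like_write_json(obj):
    """Hypothetical stand-in: serialize like the traceback, then encode as UTF-8."""
    # ensure_ascii=False leaves non-ASCII characters (including surrogates) unescaped.
    text = json.dumps(obj, indent=2, ensure_ascii=False)
    # Encoding is where a surrogate code point is rejected.
    return text.encode("utf-8")


if __name__ == "__main__":
    try:
        encode_like_write_json({"text": "\udc9a"})
    except UnicodeEncodeError as exc:
        print(f"failed as expected: {exc}")
```

Serialization itself succeeds; the error only appears when the resulting string is encoded, which is why it surfaces inside the file write.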

Root Cause Analysis

The initial fix attempt used sanitize_text_for_encoding(), which had a critical flaw: it tried to validate UTF-8 encoding before removing surrogate characters. As a result it raised ValueError instead of fixing the problem, so calls such as write_json({"text": "\udc9a"}, path) still failed.
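
The ordering problem can be illustrated with two hypothetical functions. Neither is the actual sanitize_text_for_encoding() code, which is not shown in this PR; they only contrast "validate, then strip" with "strip, then validate" under the assumption that validation means attempting a UTF-8 encode.

```python
import re

# Matches any single surrogate code point (U+D800 to U+DFFF).
_SURROGATES = re.compile("[\ud800-\udfff]")


def validate_then_strip(text: str) -> str:
    """Hypothetical flawed ordering: validation rejects the input before cleanup runs."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError(f"text is not UTF-8 encodable: {exc}") from exc
    return _SURROGATES.sub("", text)


def strip_then_validate(text: str) -> str:
    """Hypothetical corrected ordering: remove surrogates first, then encoding succeeds."""
    cleaned = _SURROGATES.sub("", text)
    cleaned.encode("utf-8")  # no longer raises for surrogate input
    return cleaned
```

With the first ordering, the strip step is unreachable for exactly the inputs that need it.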

Solution

Implemented a dedicated JSON sanitization approach that avoids pre-validation:

  1. Added _sanitize_string_for_json() helper function

    • Directly removes surrogate characters (U+D800 to U+DFFF) without attempting to encode first
    • Also removes the Unicode noncharacters U+FFFE and U+FFFF
    • No exceptions raised - problematic characters are silently stripped
  2. Modified _sanitize_json_data() function

    • Recursively traverses data structures (dicts, lists)
    • Uses the new _sanitize_string_for_json() for all string values
    • Preserves non-string data types unchanged
  3. Updated write_json() function

    • Now sanitizes all data before JSON serialization
    • Protects all JSON write operations system-wide
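
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the names `strip_unencodable`, `sanitize_for_json`, and `write_json_safely` are placeholders for the functions listed above, and the bodies are assumptions based only on the description in this PR.

```python
import json
from typing import Any


def strip_unencodable(text: str) -> str:
    """Drop surrogates (U+D800 to U+DFFF) and U+FFFE/U+FFFF without encoding first."""
    return "".join(
        ch
        for ch in text
        if not (0xD800 <= ord(ch) <= 0xDFFF) and ord(ch) not in (0xFFFE, 0xFFFF)
    )


def sanitize_for_json(data: Any) -> Any:
    """Recursively clean string values in dicts and lists; leave other types as-is."""
    if isinstance(data, str):
        return strip_unencodable(data)
    if isinstance(data, dict):
        return {key: sanitize_for_json(value) for key, value in data.items()}
    if isinstance(data, list):
        return [sanitize_for_json(item) for item in data]
    return data


def write_json_safely(json_obj: Any, file_name: str) -> None:
    """Sanitize first, then serialize with the same json.dump arguments as before."""
    with open(file_name, "w", encoding="utf-8") as f:
        json.dump(sanitize_for_json(json_obj), f, indent=2, ensure_ascii=False)
```

Because the filter compares code points directly, it never calls an encoder and so has no path that raises on surrogate input.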

Changes Made

File: lightrag/utils.py

  • New function: _sanitize_string_for_json(text: str) -> str
  • Modified function: _sanitize_json_data(data: Any) -> Any
  • Modified function: write_json(json_obj, file_name)

Key Features

  • ✅ No pre-validation - removes surrogates without encoding attempts
  • ✅ No exceptions raised - silent character removal for safe operation
  • ✅ Handles nested data structures recursively
  • ✅ Preserves non-string data types unchanged
  • ✅ Protects all JSON write operations system-wide
  • ✅ Backward compatible - doesn't affect other sanitization code paths

Impact

  • Before: Document deletion and storage operations with corrupted data failed completely with UnicodeEncodeError
  • After: System gracefully handles and sanitizes problematic data during persistence, allowing operations to complete successfully

Testing

The fix correctly handles edge cases that previously caused failures:

  • write_json({"text": "\udc9a"}, path) - now succeeds (surrogate removed)
  • Mixed data with surrogates in nested structures - handled correctly
  • Document deletion with corrupted cache data - completes successfully

• Add _sanitize_json_data helper function
• Recursively clean strings in data
• Sanitize before JSON serialization
• Prevent encoding-related crashes
• Use existing sanitize_text_for_encoding

@danielaskdd (Collaborator, Author) commented:

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

• Remove surrogate characters (U+D800-DFFF)
• Filter Unicode non-characters
• Direct char-by-char filtering

@danielaskdd (Collaborator, Author) commented:

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

• Sanitize dictionary keys
• Preserve tuple types
• Handle nested structures better
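
A rough sketch of what key sanitization and tuple preservation could look like is shown below. The function names are hypothetical and the body is an assumption drawn from these commit notes, not the merged implementation.

```python
from typing import Any


def _clean(text: str) -> str:
    """Hypothetical helper: drop surrogate code points (U+D800 to U+DFFF)."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


def sanitize_nested(data: Any) -> Any:
    """Clean string keys and values; rebuild lists as lists and tuples as tuples."""
    if isinstance(data, str):
        return _clean(data)
    if isinstance(data, dict):
        return {
            (_clean(key) if isinstance(key, str) else key): sanitize_nested(value)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [sanitize_nested(item) for item in data]
    if isinstance(data, tuple):
        return tuple(sanitize_nested(item) for item in data)
    return data
```

One caveat of this sketch: two keys that differ only by stripped characters collapse into one entry, and the later value wins.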

@danielaskdd (Collaborator, Author) commented:

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Nice work!

@danielaskdd danielaskdd merged commit 69ca366 into HKUDS:main Nov 11, 2025
1 check passed
@danielaskdd danielaskdd deleted the fix-josn-serialization-error branch November 11, 2025 16:59