Fix: Prevent UnicodeEncodeError in JSON storage operations by danielaskdd · Pull Request #2344 · HKUDS/LightRAG

danielaskdd · 2025-11-11T16:13:01Z

Fix: Prevent UnicodeEncodeError in JSON storage operations

Problem

Document deletion and other operations were failing with UnicodeEncodeError: 'utf-8' codec can't encode character '\udc9a' in position 201: surrogates not allowed. This occurred when the JSON storage system attempted to persist data containing surrogate characters (U+D800 to U+DFFF), which are invalid in UTF-8 encoding.

Error Stack Trace:

File "lightrag/kg/json_kv_impl.py", line 84, in index_done_callback
File "lightrag/utils.py", line 911, in write_json
    json.dump(json_obj, f, indent=2, ensure_ascii=False)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc9a'
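
The failure can be reproduced outside LightRAG. The snippet below is a minimal sketch rather than project code: `encode_as_utf8` is a hypothetical helper that mimics what happens when `json.dump(..., ensure_ascii=False)` writes to a file opened with UTF-8 encoding.

```python
import json


def encode_as_utf8(obj, ensure_ascii):
    """Serialize obj to JSON text, then encode it the way a UTF-8 file write would."""
    text = json.dumps(obj, ensure_ascii=ensure_ascii)
    return text.encode("utf-8")


payload = {"text": "bad \udc9a char"}

# With ensure_ascii=True the surrogate is written as an ASCII escape, so encoding works.
escaped = encode_as_utf8(payload, ensure_ascii=True)

# With ensure_ascii=False the raw surrogate reaches the UTF-8 encoder and fails.
try:
    encode_as_utf8(payload, ensure_ascii=False)
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    reason = exc.reason
```

Because `write_json` passes `ensure_ascii=False`, any lone surrogate in the data reaches the encoder unescaped, which matches the trace above.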

Root Cause Analysis

The initial fix attempt used sanitize_text_for_encoding(), which had a critical flaw: it tried to validate UTF-8 encoding before removing surrogate characters. As a result it raised ValueError instead of fixing the problem, so calls such as write_json({"text": "\udc9a"}, path) still failed.
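
To illustrate the ordering problem, here is a hedged sketch. Both functions are hypothetical and are not the actual `sanitize_text_for_encoding()` code. They only show why validating before stripping defeats the purpose.

```python
def validate_then_strip(text: str) -> str:
    """Hypothetical flawed ordering: validation runs first, so stripping is never reached."""
    try:
        text.encode("utf-8")  # raises on any lone surrogate
    except UnicodeEncodeError as exc:
        raise ValueError("text is not UTF-8 encodable") from exc
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


def strip_only(text: str) -> str:
    """Stripping without pre-validation cannot fail on surrogates."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)
```

With input `"a\udc9ab"`, the first function raises ValueError while the second returns `"ab"`.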

Solution

Implemented a dedicated JSON sanitization approach that avoids pre-validation:

1. Added _sanitize_string_for_json() helper function
   - Directly removes surrogate characters (U+D800 to U+DFFF) without attempting to encode first
   - Also removes other Unicode non-characters (U+FFFE, U+FFFF)
   - No exceptions raised - problematic characters are silently stripped
2. Modified _sanitize_json_data() function
   - Recursively traverses data structures (dicts, lists)
   - Uses the new _sanitize_string_for_json() for all string values
   - Preserves non-string data types unchanged
3. Updated write_json() function
   - Now sanitizes all data before JSON serialization
   - Protects all JSON write operations system-wide

Changes Made

File: lightrag/utils.py

New function: _sanitize_string_for_json(text: str) -> str
Modified function: _sanitize_json_data(data: Any) -> Any
Modified function: write_json(json_obj, file_name)

Key Features

✅ No pre-validation - removes surrogates without encoding attempts
✅ No exceptions raised - silent character removal for safe operation
✅ Handles nested data structures recursively
✅ Preserves non-string data types unchanged
✅ Protects all JSON write operations system-wide
✅ Backward compatible - doesn't affect other sanitization code paths

Impact

Before: Document deletion and storage operations with corrupted data failed completely with UnicodeEncodeError
After: System gracefully handles and sanitizes problematic data during persistence, allowing operations to complete successfully

Testing

The fix correctly handles edge cases that previously caused failures:

write_json({"text": "\udc9a"}, path) - now succeeds (surrogate removed)
Mixed data with surrogates in nested structures - handled correctly
Document deletion with corrupted cache data - completes successfully

danielaskdd · 2025-11-11T16:13:12Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

lightrag/utils.py

danielaskdd · 2025-11-11T16:39:19Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

lightrag/utils.py

danielaskdd · 2025-11-11T16:50:55Z

@codex review

chatgpt-codex-connector · 2025-11-11T16:53:15Z

Codex Review: Didn't find any major issues. Nice work!

Add data sanitization to JSON writing to prevent UTF-8 encoding errors

d1f4b6e

- Add _sanitize_json_data helper function
- Recursively clean strings in data
- Sanitize before JSON serialization
- Prevent encoding-related crashes
- Use existing sanitize_text_for_encoding

chatgpt-codex-connector bot reviewed Nov 11, 2025

lightrag/utils.py (resolved)

Add specialized JSON string sanitizer to prevent UTF-8 encoding errors

6918a88

- Remove surrogate characters (U+D800-DFFF)
- Filter Unicode non-characters
- Direct char-by-char filtering

chatgpt-codex-connector bot reviewed Nov 11, 2025

lightrag/utils.py (outdated, resolved)

Improve JSON data sanitization to handle tuples and dict keys

f28a0c2

- Sanitize dictionary keys
- Preserve tuple types
- Handle nested structures better
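
A rough sketch of what this last commit describes is shown below. The names `_strip_surrogates` and `sanitize_keys_and_tuples` are hypothetical and the merged implementation may differ. The point is that string keys are cleaned as well as values, and tuples come back as tuples rather than lists.

```python
from typing import Any


def _strip_surrogates(text: str) -> str:
    """Remove lone surrogate code points (U+D800..U+DFFF)."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


def sanitize_keys_and_tuples(data: Any) -> Any:
    """Recursively sanitize strings, including dict keys, while keeping tuples as tuples."""
    if isinstance(data, str):
        return _strip_surrogates(data)
    if isinstance(data, dict):
        return {
            (_strip_surrogates(k) if isinstance(k, str) else k): sanitize_keys_and_tuples(v)
            for k, v in data.items()
        }
    if isinstance(data, tuple):
        return tuple(sanitize_keys_and_tuples(item) for item in data)
    if isinstance(data, list):
        return [sanitize_keys_and_tuples(item) for item in data]
    return data
```

One caveat of this sketch: two keys that differ only in stripped characters collapse into one entry, and the later value wins.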

danielaskdd merged commit 69ca366 into HKUDS:main Nov 11, 2025
1 check passed

danielaskdd deleted the fix-josn-serialization-error branch November 11, 2025 16:59

danielaskdd merged 3 commits into HKUDS:main from danielaskdd:fix-josn-serialization-error
