Fix: Add chunk token limit validation with detailed error reporting #2389
danielaskdd merged 4 commits into HKUDS:main
- Add ChunkTokenLimitExceededError exception
- Validate chunks against token limits
- Include chunk preview in error messages
- Add comprehensive test coverage
- Log warnings for oversized chunks
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Prevent passing two separate string objects to __init__
- Maintain same error output
@codex review
Codex Review: Didn't find any major issues. Keep it up!
- Test recursive split mode
- Add edge case coverage
- Test parameter combinations
- Verify chunk order indexing
- Add integration test scenarios
@codex Conduct a comprehensive analysis and review of the correctness and completeness of the test cases in test_chunking.py
- Add MultiTokenCharacterTokenizer for testing
- Test token vs character counting accuracy
- Verify delimiter splitting precision
- Test overlap with distinctive content
- Add decode content preservation tests
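Why a multi-token tokenizer matters for these tests can be illustrated with a small sketch. The class below is not the MultiTokenCharacterTokenizer from this PR; it is a hypothetical stand-in that assumes uppercase characters cost two tokens and every other character costs one, so that token counts and character counts diverge.

```python
class MultiTokenSketchTokenizer:
    """Hypothetical test tokenizer: uppercase characters encode to two tokens.

    The mapping is an assumption made for illustration only; the tokenizer
    added in the PR may use a different scheme.
    """

    CONTINUATION = -1  # filler token emitted after each uppercase character

    def encode(self, text: str) -> list[int]:
        tokens: list[int] = []
        for ch in text:
            tokens.append(ord(ch))
            if ch.isupper():
                tokens.append(self.CONTINUATION)
        return tokens

    def decode(self, tokens: list[int]) -> str:
        # Filler tokens carry no content, so they are dropped when decoding.
        return "".join(chr(t) for t in tokens if t != self.CONTINUATION)


tokenizer = MultiTokenSketchTokenizer()
text = "ABC def"
token_count = len(tokenizer.encode(text))

# A limit check based on len(text) would undercount this string.
assert token_count > len(text)
assert tokenizer.decode(tokenizer.encode(text)) == text
```

With a tokenizer like this, a test that accidentally compares character length against chunk_token_size passes for lowercase input but fails for uppercase input, which is what exposes the mistake.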
@codex review
Codex Review: Didn't find any major issues. Swish!
Fix: Add chunk token limit validation with detailed error reporting
Summary
This PR adds explicit validation when chunking text to prevent oversized chunks from being processed silently. When split_by_character_only=True, chunks that exceed the configured chunk_token_size now raise a descriptive error instead of being passed through.
Fix: #2387
Changes Made
1. New Exception Type (lightrag/exceptions.py)
- Adds the ChunkTokenLimitExceededError exception class
- The error carries chunk_tokens, chunk_token_limit, and chunk_preview (first 80 chars)
2. Validation Logic (lightrag/operate.py)
- Validates chunks in chunking_by_token_size() for the split_by_character_only path
- Raises ChunkTokenLimitExceededError when a chunk exceeds chunk_token_size
3. Test Coverage (tests/test_chunking.py)
- Uses DummyTokenizer for deterministic testing without external dependencies
Operational Impact
Before: Oversized chunks in split_by_character_only mode would silently proceed, potentially causing downstream issues with LLM token limits or unexpected behavior.
After: Clear, actionable errors are raised immediately.
Risk: Low - Only adds validation to an existing code path. Normal-sized chunks are unaffected.
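The before/after behavior can be sketched roughly as follows. This is not the code from lightrag/exceptions.py or lightrag/operate.py: the ValueError base class, the message wording, the count_tokens callable, and the split_and_validate helper are all assumptions made for illustration. Only the exception name, its three fields, and the 80-character preview come from the PR description.

```python
from typing import Callable


class ChunkTokenLimitExceededError(ValueError):
    """Sketch of the exception; base class and message text are assumed."""

    def __init__(self, chunk_tokens: int, chunk_token_limit: int, chunk_preview: str = ""):
        self.chunk_tokens = chunk_tokens
        self.chunk_token_limit = chunk_token_limit
        self.chunk_preview = chunk_preview[:80]  # first 80 chars, as in the PR
        message = (
            f"Chunk has {chunk_tokens} tokens, "
            f"which exceeds chunk_token_size={chunk_token_limit}."
        )
        if self.chunk_preview:
            message += f" Preview: {self.chunk_preview!r}"
        super().__init__(message)


def split_and_validate(
    content: str,
    split_by_character: str,
    chunk_token_size: int,
    count_tokens: Callable[[str], int],
) -> list[dict]:
    """Hypothetical helper: split on a delimiter only, then check each piece."""
    chunks = []
    for index, piece in enumerate(content.split(split_by_character)):
        n_tokens = count_tokens(piece)
        if n_tokens > chunk_token_size:
            # Fail loudly instead of passing the oversized chunk downstream.
            raise ChunkTokenLimitExceededError(n_tokens, chunk_token_size, piece)
        chunks.append(
            {"tokens": n_tokens, "content": piece.strip(), "chunk_order_index": index}
        )
    return chunks


# Usage: len() stands in for a real tokenizer (one token per character).
try:
    split_and_validate("short\na much longer paragraph", "\n", 10, len)
except ChunkTokenLimitExceededError as err:
    print(err)
```

Keeping the counts and preview as attributes, not only in the message, lets callers decide programmatically whether to re-split the offending chunk or to surface the error to the user.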
Testing
All tests pass:
pytest tests/test_chunking.py -v -m offline
# 5 passed in <1s
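A deterministic tokenizer keeps tests like these independent of any external tokenizer package. The sketch below is an assumption about what such a test might look like, not a copy of tests/test_chunking.py: the one-token-per-character rule and the find_oversized helper are hypothetical, and the functions are written as plain pytest-style test functions with bare asserts.

```python
class DummyTokenizer:
    """Assumed behavior: one token per character, so counts are predictable."""

    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]

    def decode(self, tokens: list[int]) -> str:
        return "".join(chr(t) for t in tokens)


def find_oversized(text: str, delimiter: str, limit: int, tokenizer) -> list[tuple[int, int]]:
    """Hypothetical helper: (index, token_count) for each piece over the limit."""
    oversized = []
    for index, piece in enumerate(text.split(delimiter)):
        count = len(tokenizer.encode(piece))
        if count > limit:
            oversized.append((index, count))
    return oversized


def test_chunks_within_limit_are_accepted():
    assert find_oversized("abc\ndef", "\n", 3, DummyTokenizer()) == []


def test_oversized_chunk_is_reported_with_its_index():
    assert find_oversized("abc\ndefgh", "\n", 3, DummyTokenizer()) == [(1, 5)]


def test_decode_round_trips_content():
    tokenizer = DummyTokenizer()
    assert tokenizer.decode(tokenizer.encode("chunk text")) == "chunk text"
```

Because every count follows directly from string length, a failing assertion points at the chunking logic and not at tokenizer behavior.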
Notes
split_by_character_only=True