Fix: Add chunk token limit validation with detailed error reporting #2389

Merged
danielaskdd merged 4 commits into HKUDS:main from danielaskdd:fix-chunk-size
Nov 19, 2025
@danielaskdd (Collaborator) commented Nov 19, 2025

Fix: Add chunk token limit validation with detailed error reporting

Summary

This PR adds explicit validation when chunking text to prevent oversized chunks from being processed silently. When split_by_character_only=True, chunks that exceed the configured chunk_token_size now raise a descriptive error instead of being passed through.

Fix: #2387

Changes Made

1. New Exception Type (lightrag/exceptions.py)

  • Added ChunkTokenLimitExceededError exception class
  • Includes structured error data: chunk_tokens, chunk_token_limit, and chunk_preview (first 80 chars)
  • Provides clear error messages with token counts and chunk preview for debugging
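A minimal sketch of what an exception carrying these three fields might look like. It is not the code in lightrag/exceptions.py: the ValueError base class, the constructor signature, and the message wording are assumptions. Only the class name, the three attribute names, and the 80-character preview come from the description above.

```python
class ChunkTokenLimitExceededError(ValueError):
    """Raised when a chunk holds more tokens than the configured limit.

    Sketch only: base class and message format are assumptions.
    """

    PREVIEW_CHARS = 80  # preview length stated in the PR description

    def __init__(self, chunk_tokens: int, chunk_token_limit: int, chunk_preview: str = ""):
        # Keep the structured data on the instance so callers can inspect it.
        self.chunk_tokens = chunk_tokens
        self.chunk_token_limit = chunk_token_limit
        self.chunk_preview = chunk_preview[: self.PREVIEW_CHARS]

        message = (
            f"Chunk token length {chunk_tokens} exceeds "
            f"chunk_token_size {chunk_token_limit}."
        )
        if self.chunk_preview:
            message += f" Preview: {self.chunk_preview!r}"
        # Build the message once and hand a single string to the base class.
        super().__init__(message)
```

Storing the counts as attributes, rather than only formatting them into the message, lets calling code react to the numbers without parsing a string.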

2. Validation Logic (lightrag/operate.py)

  • Added token limit check in chunking_by_token_size() for the split_by_character_only path
  • Logs warning before raising exception to aid debugging
  • Raises ChunkTokenLimitExceededError when chunk exceeds chunk_token_size
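The check described above can be sketched as follows. This is an illustration under stated assumptions, not the body of chunking_by_token_size(): the function name split_by_character_only_sketch, the OversizedChunkError stand-in, the whitespace-based token count, and the result dictionary keys are all hypothetical.

```python
import logging

logger = logging.getLogger("chunking_sketch")


class OversizedChunkError(ValueError):
    """Hypothetical stand-in for ChunkTokenLimitExceededError."""


def split_by_character_only_sketch(
    content: str, delimiter: str, chunk_token_size: int
) -> list[dict]:
    """Split on `delimiter` only; pieces are never re-split, so oversized ones are rejected."""
    results = []
    for index, piece in enumerate(content.split(delimiter)):
        # Assumption: one token per whitespace-separated word.
        # A real tokenizer would count differently.
        token_count = len(piece.split())
        if token_count > chunk_token_size:
            preview = piece.strip()[:80]
            # Log first, then raise, so the warning survives even if the
            # exception is swallowed further up the call stack.
            logger.warning(
                "Chunk %d has %d tokens, limit is %d: %r",
                index, token_count, chunk_token_size, preview,
            )
            raise OversizedChunkError(
                f"Chunk has {token_count} tokens, limit is {chunk_token_size}. "
                f"Preview: {preview!r}"
            )
        results.append(
            {"tokens": token_count, "content": piece.strip(), "chunk_order_index": index}
        )
    return results
```

The comparison is strictly greater-than, so a piece with exactly chunk_token_size tokens is accepted.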

3. Test Coverage (tests/test_chunking.py)

  • Added 5 offline test cases covering:
    • ✅ Chunks within limit (normal operation)
    • ✅ Chunks exceeding limit (error raised)
    • ✅ Error message includes preview
    • ✅ Chunks at exact limit (boundary condition)
    • ✅ Chunks one token over limit (minimal overflow)
  • Uses DummyTokenizer for deterministic testing without external dependencies
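To illustrate the two boundary cases, here is a small self-contained sketch in the style of such tests. It does not reproduce tests/test_chunking.py: the one-token-per-character behavior of DummyTokenizer, the check_chunk helper, and the OversizedChunkError stand-in are assumptions made for the example.

```python
class DummyTokenizer:
    """Deterministic tokenizer: every character is exactly one token."""

    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]

    def decode(self, tokens: list[int]) -> str:
        return "".join(chr(t) for t in tokens)


class OversizedChunkError(ValueError):
    """Hypothetical stand-in for ChunkTokenLimitExceededError."""

    def __init__(self, chunk_tokens: int, chunk_token_limit: int):
        self.chunk_tokens = chunk_tokens
        self.chunk_token_limit = chunk_token_limit
        super().__init__(f"{chunk_tokens} tokens exceeds limit of {chunk_token_limit}")


def check_chunk(tokenizer: DummyTokenizer, chunk: str, limit: int) -> int:
    """Return the token count, or raise if the chunk is over the limit."""
    count = len(tokenizer.encode(chunk))
    if count > limit:
        raise OversizedChunkError(count, limit)
    return count


def test_exact_limit_passes() -> None:
    # Boundary condition: exactly at the limit is still valid.
    assert check_chunk(DummyTokenizer(), "a" * 10, limit=10) == 10


def test_one_token_over_raises() -> None:
    # Minimal overflow: a single extra token must be caught.
    try:
        check_chunk(DummyTokenizer(), "a" * 11, limit=10)
    except OversizedChunkError as err:
        assert (err.chunk_tokens, err.chunk_token_limit) == (11, 10)
    else:
        raise AssertionError("expected OversizedChunkError")


if __name__ == "__main__":
    test_exact_limit_passes()
    test_one_token_over_raises()
```

Because the tokenizer is purely arithmetic, the token count of any test string is known in advance and no model download or network access is needed.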

Operational Impact

Before: Oversized chunks in split_by_character_only mode would silently proceed, potentially causing downstream issues with LLM token limits or unexpected behavior.

After: Clear, actionable errors are raised immediately with:

  • Exact token counts (actual vs. limit)
  • Preview of problematic content
  • Helpful error messages for debugging

Risk: Low - Only adds validation to an existing code path. Normal-sized chunks are unaffected.
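From a caller's point of view, the structured fields make the failure easy to act on. The sketch below is hypothetical: OversizedChunkError stands in for ChunkTokenLimitExceededError, report_oversized is an invented helper, and the token numbers are example values only.

```python
class OversizedChunkError(ValueError):
    """Hypothetical stand-in exposing the three fields named in the PR."""

    def __init__(self, chunk_tokens: int, chunk_token_limit: int, chunk_preview: str):
        self.chunk_tokens = chunk_tokens
        self.chunk_token_limit = chunk_token_limit
        self.chunk_preview = chunk_preview
        super().__init__(f"Chunk has {chunk_tokens} tokens, limit is {chunk_token_limit}")


def report_oversized(err: OversizedChunkError) -> str:
    """Turn the structured fields into a one-line, actionable report."""
    overflow = err.chunk_tokens - err.chunk_token_limit
    return (
        f"{overflow} token(s) over the limit of {err.chunk_token_limit}; "
        f"chunk starts with {err.chunk_preview!r}"
    )


try:
    # Simulate the failure for a 1500-token chunk against a 1200-token limit.
    raise OversizedChunkError(1500, 1200, "Lorem ipsum dolor sit amet")
except OversizedChunkError as err:
    report = report_oversized(err)
    print(report)
```

A caller could use the same attributes to decide whether to raise the limit or to pick a delimiter that yields smaller pieces.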

Testing

All tests pass:

pytest tests/test_chunking.py -v -m offline
# 5 passed in <1s

The tests verify:

  • No false positives (valid chunks still pass)
  • No false negatives (oversized chunks are caught)
  • Error attributes are correct
  • Boundary conditions are handled properly

Notes

  • This validation only applies when split_by_character_only=True
  • The non-character-only path already handles oversized chunks differently (recursive splitting)
  • Preview truncation at 80 chars balances debuggability with log readability

- Add ChunkTokenLimitExceededError exception
- Validate chunks against token limits
- Include chunk preview in error messages
- Add comprehensive test coverage
- Log warnings for oversized chunks

@danielaskdd (Collaborator, Author)

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Prevent passing two separate string objects to __init__
- Maintain same error output

@danielaskdd (Collaborator, Author)

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep it up!

- Test recursive split mode
- Add edge case coverage
- Test parameter combinations
- Verify chunk order indexing
- Add integration test scenarios

@danielaskdd (Collaborator, Author)

@codex Conduct a comprehensive analysis and review of the correctness and completeness of the test cases in test_chunking.py

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

• Add MultiTokenCharacterTokenizer for testing
• Test token vs character counting accuracy
• Verify delimiter splitting precision
• Test overlap with distinctive content
• Add decode content preservation tests

@danielaskdd (Collaborator, Author)

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!

@danielaskdd merged commit f72f435 into HKUDS:main Nov 19, 2025
4 checks passed
@danielaskdd deleted the fix-chunk-size branch November 19, 2025 15:10
Development

Successfully merging this pull request may close these issues.

[Bug]:chunk size too large causes embedding API call to time out.
