Fix: Add chunk token limit validation with detailed error reporting by danielaskdd · Pull Request #2389 · HKUDS/LightRAG

danielaskdd · 2025-11-19T10:33:46Z

Fix: Add chunk token limit validation with detailed error reporting

Summary

This PR adds explicit validation when chunking text to prevent oversized chunks from being processed silently. When split_by_character_only=True, chunks that exceed the configured chunk_token_size now raise a descriptive error instead of being passed through.

Fix: #2387

Changes Made

1. New Exception Type (lightrag/exceptions.py)

Added ChunkTokenLimitExceededError exception class
Includes structured error data: chunk_tokens, chunk_token_limit, and chunk_preview (first 80 chars)
Provides clear error messages with token counts and chunk preview for debugging
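A minimal sketch of what such an exception could look like, assuming only the three attributes named above. The `ValueError` base class, the constructor signature, and the message wording are illustrative assumptions, not copied from lightrag/exceptions.py.

```python
class ChunkTokenLimitExceededError(ValueError):
    """Raised when a chunk holds more tokens than the configured limit (sketch)."""

    PREVIEW_CHARS = 80  # preview length stated in the PR description

    def __init__(self, chunk_tokens: int, chunk_token_limit: int, chunk_preview: str = ""):
        # Keep the structured data on the instance so callers can inspect it.
        self.chunk_tokens = chunk_tokens
        self.chunk_token_limit = chunk_token_limit
        self.chunk_preview = chunk_preview[: self.PREVIEW_CHARS]

        # Build a single message string (wording is an assumption).
        message = (
            f"Chunk token length {chunk_tokens} exceeds "
            f"chunk_token_size {chunk_token_limit}."
        )
        if self.chunk_preview:
            message += f" Preview: '{self.chunk_preview}'"
        super().__init__(message)
```

Keeping the counts as attributes lets a caller branch on `err.chunk_tokens` without parsing the message text.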

2. Validation Logic (lightrag/operate.py)

Added token limit check in chunking_by_token_size() for the split_by_character_only path
Logs warning before raising exception to aid debugging
Raises ChunkTokenLimitExceededError when chunk exceeds chunk_token_size
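The check described above can be sketched roughly as follows. This is not the body of chunking_by_token_size(); the function name, the `encode` callable, the stand-in exception, and the chunk dictionary keys are assumptions made for illustration.

```python
import logging
from typing import Callable, Dict, List, Sequence, Union

logger = logging.getLogger("chunking_sketch")


class ChunkTooLargeError(ValueError):
    """Hypothetical stand-in for ChunkTokenLimitExceededError."""


def chunk_by_character_only(
    content: str,
    split_by_character: str,
    chunk_token_size: int,
    encode: Callable[[str], Sequence],
) -> List[Dict[str, Union[int, str]]]:
    """Split on a delimiter only and refuse any piece that exceeds the token limit."""
    chunks = []
    for index, piece in enumerate(content.split(split_by_character)):
        token_count = len(encode(piece))
        if token_count > chunk_token_size:
            preview = piece[:80]
            # Log first, then raise, so the event is recorded even if the
            # exception is caught further up the call stack.
            logger.warning(
                "Chunk %d has %d tokens, limit is %d: %r",
                index, token_count, chunk_token_size, preview,
            )
            raise ChunkTooLargeError(
                f"Chunk has {token_count} tokens, limit is {chunk_token_size}. "
                f"Preview: {preview!r}"
            )
        chunks.append(
            {"tokens": token_count, "content": piece.strip(), "chunk_order_index": index}
        )
    return chunks
```

A chunk whose token count equals the limit passes, because the comparison is strictly greater-than.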

3. Test Coverage (tests/test_chunking.py)

Added 5 offline test cases covering:
- ✅ Chunks within limit (normal operation)
- ✅ Chunks exceeding limit (error raised)
- ✅ Error message includes preview
- ✅ Chunks at exact limit (boundary condition)
- ✅ Chunks one token over limit (minimal overflow)
Uses DummyTokenizer for deterministic testing without external dependencies
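For readers unfamiliar with the approach, a deterministic tokenizer makes the two boundary cases easy to pin down. The sketch below assumes a one-token-per-character tokenizer and a hypothetical `is_oversized` helper; the real DummyTokenizer and test bodies in tests/test_chunking.py may differ.

```python
class DummyTokenizer:
    """One token per character, so token counts are trivial to predict."""

    def encode(self, text: str) -> list:
        return [ord(ch) for ch in text]

    def decode(self, tokens: list) -> str:
        return "".join(chr(t) for t in tokens)


def is_oversized(text: str, limit: int, tokenizer: DummyTokenizer) -> bool:
    """Hypothetical helper: True when the text needs more tokens than the limit."""
    return len(tokenizer.encode(text)) > limit


def test_chunk_at_exact_limit_is_accepted():
    assert not is_oversized("a" * 10, limit=10, tokenizer=DummyTokenizer())


def test_chunk_one_token_over_is_rejected():
    assert is_oversized("a" * 11, limit=10, tokenizer=DummyTokenizer())
```

Because no model files or network calls are involved, tests of this shape fit the offline marker used in the Testing section.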

Operational Impact

Before: Oversized chunks in split_by_character_only mode would silently proceed, potentially causing downstream issues with LLM token limits or unexpected behavior.

After: Clear, actionable errors are raised immediately with:

Exact token counts (actual vs. limit)
Preview of problematic content
Helpful error messages for debugging

Risk: Low - Only adds validation to an existing code path. Normal-sized chunks are unaffected.

Testing

All tests pass:

pytest tests/test_chunking.py -v -m offline
# 5 passed in <1s

The tests verify:

No false positives (valid chunks still pass)
No false negatives (oversized chunks are caught)
Error attributes are correct
Boundary conditions are handled properly

Notes

This validation only applies when split_by_character_only=True
The non-character-only path already handles oversized chunks differently (recursive splitting)
Preview truncation at 80 chars balances debuggability with log readability
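To illustrate the contrast with the non-character-only path, the sketch below cuts an oversized token sequence into overlapping windows instead of raising. It is a plain sliding window, which is an assumption about what the splitting in that path amounts to; the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def split_tokens_with_overlap(
    tokens: Sequence[int],
    chunk_token_size: int,
    overlap_token_size: int = 0,
) -> List[List[int]]:
    """Cut a token sequence into windows of at most chunk_token_size tokens."""
    if not 0 <= overlap_token_size < chunk_token_size:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    # Each window starts `step` tokens after the previous one, so consecutive
    # windows share `overlap_token_size` tokens.
    step = chunk_token_size - overlap_token_size
    return [
        list(tokens[start : start + chunk_token_size])
        for start in range(0, len(tokens), step)
    ]


windows = split_tokens_with_overlap(list(range(10)), chunk_token_size=4, overlap_token_size=1)
# windows == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

No window can exceed the limit in this scheme, which is why the new validation is only needed on the split_by_character_only path.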


danielaskdd · 2025-11-19T10:34:42Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

lightrag/exceptions.py


danielaskdd · 2025-11-19T10:54:07Z

@codex review

chatgpt-codex-connector · 2025-11-19T10:56:54Z

Codex Review: Didn't find any major issues. Keep it up!



danielaskdd · 2025-11-19T11:11:41Z

@codex Conduct a comprehensive analysis and review of the correctness and completeness of the test cases in test_chunking.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


tests/test_chunking.py

- Add MultiTokenCharacterTokenizer for testing
- Test token vs character counting accuracy
- Verify delimiter splitting precision
- Test overlap with distinctive content
- Add decode content preservation tests
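The point of a multi-token tokenizer in tests is that token counts stop matching character counts, so a limit check that accidentally measures characters would be exposed. A minimal sketch follows, assuming uppercase letters cost two tokens; the encoding rule here is hypothetical and the real MultiTokenCharacterTokenizer may use a different one.

```python
class MultiTokenCharacterTokenizer:
    """Illustrative tokenizer: uppercase letters cost two tokens, everything else one."""

    PAD = 0  # extra token emitted after each uppercase character

    def encode(self, text: str) -> list:
        tokens = []
        for ch in text:
            tokens.append(ord(ch))
            if ch.isupper():
                tokens.append(self.PAD)
        return tokens

    def decode(self, tokens: list) -> str:
        # Padding tokens carry no text, so they are skipped when decoding.
        return "".join(chr(t) for t in tokens if t != self.PAD)


tokenizer = MultiTokenCharacterTokenizer()
text = "ABC"
# Three characters but six tokens: a character-based check with a limit of 3
# would wrongly accept this chunk.
assert len(text) == 3
assert len(tokenizer.encode(text)) == 6
```

Round-tripping through `decode(encode(text))` returns the original text, which is the property the decode content preservation tests rely on.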

danielaskdd · 2025-11-19T11:33:32Z

@codex review

chatgpt-codex-connector · 2025-11-19T11:36:35Z

Codex Review: Didn't find any major issues. Swish!


Add token limit validation for character-only chunking

f988a22

- Add ChunkTokenLimitExceededError exception
- Validate chunks against token limits
- Include chunk preview in error messages
- Add comprehensive test coverage
- Log warnings for oversized chunks

chatgpt-codex-connector bot reviewed Nov 19, 2025


lightrag/exceptions.py (outdated, resolved)

Fix ChunkTokenLimitExceededError message formatting

6fea68b

- Prevent passing two separate string objects to __init__
- Maintain same error output
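The commit message is terse. One plausible reading is the standard Python pitfall shown below: when an exception's `__init__` forwards two strings to `super().__init__`, `str(exc)` renders the argument tuple rather than a sentence. The class names are hypothetical and this is not the actual diff.

```python
class TwoArgError(Exception):
    def __init__(self, summary: str, detail: str):
        # Two positional arguments end up in self.args as a 2-tuple.
        super().__init__(summary, detail)


class OneArgError(Exception):
    def __init__(self, summary: str, detail: str):
        # Joining first keeps str(exc) a readable single message.
        super().__init__(f"{summary} {detail}")


print(str(TwoArgError("Chunk too large.", "Preview: 'abc'")))
# ('Chunk too large.', "Preview: 'abc'")
print(str(OneArgError("Chunk too large.", "Preview: 'abc'")))
# Chunk too large. Preview: 'abc'
```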

Add comprehensive tests for chunking with recursive splitting

5733292

- Test recursive split mode
- Add edge case coverage
- Test parameter combinations
- Verify chunk order indexing
- Add integration test scenarios

chatgpt-codex-connector bot reviewed Nov 19, 2025


tests/test_chunking.py (resolved)

tests/test_chunking.py (resolved)

danielaskdd mentioned this pull request Nov 19, 2025

[Bug]:chunk size too large causes embedding API call to time out. #2387

Closed

2 tasks

danielaskdd merged commit f72f435 into HKUDS:main Nov 19, 2025
4 checks passed

danielaskdd deleted the fix-chunk-size branch November 19, 2025 15:10

danielaskdd merged 4 commits into HKUDS:main from danielaskdd:fix-chunk-size
