
sumukshashidhar

Summary

  • Implemented customizable sentence-based text chunking with configurable delimiters
  • Added support for Chinese, Japanese, and other languages that use different sentence-ending punctuation
  • Maintained full backward compatibility with existing token-based chunking

Changes

  • Added new chunking mode (sentence vs token) to ChunkingConfig
  • Implemented split_into_sentences() and split_into_sentence_chunks() functions
  • Added configurable sentence delimiters via regex patterns
  • Created comprehensive test suite for sentence splitting and chunking
  • Added documentation and example configuration for Chinese text
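As a rough illustration of the splitting described above: the function name split_into_sentences() comes from this PR, but the signature, the default delimiter set, and the implementation details below are assumptions, not the merged code.

```python
import re

# Default covers ASCII sentence enders plus CJK full-width punctuation.
# (Hypothetical default; the actual ChunkingConfig default may differ.)
DEFAULT_DELIMITERS = r"[.!?。！？]"

def split_into_sentences(text: str, delimiters: str = DEFAULT_DELIMITERS) -> list[str]:
    """Split text into sentences, keeping each trailing delimiter attached."""
    # A capturing group makes re.split return the delimiters as separate
    # items, so each one can be re-attached to the sentence it terminates.
    parts = re.split(f"({delimiters})", text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()
        if sentence:
            sentences.append(sentence)
    # Trailing text with no final delimiter still forms a sentence.
    tail = parts[-1].strip() if len(parts) % 2 == 1 else ""
    if tail:
        sentences.append(tail)
    return sentences
```

With the delimiter set configurable, the same function handles English periods and Chinese full-width punctuation alike, e.g. `split_into_sentences("你好。世界！")` yields the two sentences `"你好。"` and `"世界！"`.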

Test Plan

  • All existing tests pass
  • New unit tests for sentence splitting pass
  • New unit tests for sentence chunking pass
  • Backward compatibility verified
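The chunking behavior exercised by these tests can be sketched as follows. The function name split_into_sentence_chunks() is from the PR, but the parameter names (max_sentences, overlap, min_chunk_len) and grouping logic are illustrative assumptions:

```python
def split_into_sentence_chunks(sentences: list[str],
                               max_sentences: int = 3,
                               overlap: int = 1,
                               min_chunk_len: int = 0) -> list[str]:
    """Group sentences into chunks of up to max_sentences, sharing
    `overlap` sentences between adjacent chunks; chunks shorter than
    min_chunk_len characters are dropped."""
    chunks = []
    # Advance by (max_sentences - overlap) so consecutive chunks share
    # `overlap` sentences; step is clamped to at least 1 to avoid looping.
    step = max(1, max_sentences - overlap)
    for start in range(0, len(sentences), step):
        chunk = " ".join(sentences[start:start + max_sentences])
        if len(chunk) >= min_chunk_len:
            chunks.append(chunk)
        if start + max_sentences >= len(sentences):
            break
    return chunks
```

For example, five sentences with `max_sentences=3, overlap=1` produce two chunks, with the middle sentence of the document repeated at the boundary to preserve context across chunks.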

Fixes #85

Implements customizable sentence delimiters for text chunking to support
multiple languages including Chinese, Japanese, and mixed-language content.

- Added sentence-based chunking mode alongside existing token-based mode
- Configurable sentence delimiters via regex patterns
- Support for sentence overlap and minimum chunk length
- Full backward compatibility with existing token-based chunking
- Added comprehensive tests and documentation
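A configuration for Chinese text along the lines described above might look like this. The field names are assumptions based on the feature list, not the actual ChunkingConfig schema:

```python
# Hypothetical sentence-mode configuration for Chinese text;
# field names are illustrative, not the merged ChunkingConfig API.
chinese_chunking_config = {
    "mode": "sentence",                   # "sentence", vs. the default "token" mode
    "sentence_delimiters": r"[。！？；]",   # full-width Chinese sentence enders
    "sentence_overlap": 1,                # sentences shared between adjacent chunks
    "min_chunk_length": 20,               # drop chunks shorter than this many characters
}
```

Token-based chunking remains the default, so existing configurations that omit a mode continue to behave exactly as before.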

Fixes #85
Successfully merging this pull request may close these issues.

Allow Customizable Splitting Symbols in chunking.py for Multilingual Support?