
sumukshashidhar

Summary

  • Implemented customizable sentence-based text chunking with configurable delimiters
  • Added support for Chinese, Japanese, and other languages that use different sentence-ending punctuation
  • Maintained full backward compatibility with existing token-based chunking

Changes

  • Added new chunking mode (sentence vs token) to ChunkingConfig
  • Implemented split_into_sentences() and split_into_sentence_chunks() functions
  • Added configurable sentence delimiters via regex patterns
  • Created comprehensive test suite for sentence splitting and chunking
  • Added documentation and example configuration for Chinese text
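As a rough illustration of the splitting described above: the function name split_into_sentences() comes from this PR, but the signature, the default delimiter set, and the implementation details below are assumptions, not the merged code.

```python
import re

# Default covers ASCII sentence enders plus CJK full-width punctuation.
# (Hypothetical default; the actual ChunkingConfig default may differ.)
DEFAULT_DELIMITERS = r"[.!?。！？]"

def split_into_sentences(text: str, delimiters: str = DEFAULT_DELIMITERS) -> list[str]:
    """Split text into sentences, keeping each trailing delimiter attached."""
    # A capturing group makes re.split return the delimiters as separate
    # items, so each one can be re-attached to the sentence it terminates.
    parts = re.split(f"({delimiters})", text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()
        if sentence:
            sentences.append(sentence)
    # Trailing text with no final delimiter still forms a sentence.
    tail = parts[-1].strip() if len(parts) % 2 == 1 else ""
    if tail:
        sentences.append(tail)
    return sentences
```

With the delimiter set configurable, the same function handles English periods and Chinese full-width punctuation alike, e.g. `split_into_sentences("你好。世界！")` yields the two sentences `"你好。"` and `"世界！"`.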

Test Plan

  • All existing tests pass
  • New unit tests for sentence splitting pass
  • New unit tests for sentence chunking pass
  • Backward compatibility verified
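The chunking behavior exercised by these tests can be sketched as follows. The function name split_into_sentence_chunks() is from the PR, but the parameter names (max_sentences, overlap, min_chunk_len) and grouping logic are illustrative assumptions:

```python
def split_into_sentence_chunks(sentences: list[str],
                               max_sentences: int = 3,
                               overlap: int = 1,
                               min_chunk_len: int = 0) -> list[str]:
    """Group sentences into chunks of up to max_sentences, sharing
    `overlap` sentences between adjacent chunks; chunks shorter than
    min_chunk_len characters are dropped."""
    chunks = []
    # Advance by (max_sentences - overlap) so consecutive chunks share
    # `overlap` sentences; step is clamped to at least 1 to avoid looping.
    step = max(1, max_sentences - overlap)
    for start in range(0, len(sentences), step):
        chunk = " ".join(sentences[start:start + max_sentences])
        if len(chunk) >= min_chunk_len:
            chunks.append(chunk)
        if start + max_sentences >= len(sentences):
            break
    return chunks
```

For example, five sentences with `max_sentences=3, overlap=1` produce two chunks, with the middle sentence of the document repeated at the boundary to preserve context across chunks.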

Fixes #85

Implements customizable sentence delimiters for text chunking to support
multiple languages including Chinese, Japanese, and mixed-language content.

- Added sentence-based chunking mode alongside existing token-based mode
- Configurable sentence delimiters via regex patterns
- Support for sentence overlap and minimum chunk length
- Full backward compatibility with existing token-based chunking
- Added comprehensive tests and documentation
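A configuration for Chinese text along the lines described above might look like this. The field names are assumptions based on the feature list, not the actual ChunkingConfig schema:

```python
# Hypothetical sentence-mode configuration for Chinese text;
# field names are illustrative, not the merged ChunkingConfig API.
chinese_chunking_config = {
    "mode": "sentence",                   # "sentence", vs. the default "token" mode
    "sentence_delimiters": r"[。！？；]",   # full-width Chinese sentence enders
    "sentence_overlap": 1,                # sentences shared between adjacent chunks
    "min_chunk_length": 20,               # drop chunks shorter than this many characters
}
```

Token-based chunking remains the default, so existing configurations that omit a mode continue to behave exactly as before.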

Fixes #85
Successfully merging this pull request may close these issues.

Allow Customizable Splitting Symbols in chunking.py for Multilingual Support?