Allow Customizable Splitting Symbols in chunking.py for Multilingual Support?

Hello, sumukshashidhar.

I'm currently working with Chinese text, and I've noticed that the symbols used to split text in `chunking.py` are hard-coded. Specifically, line 431 of `chunking.py` contains the following code:

```python
    # Split using capturing parentheses to retain delimiters, then recombine.
    segments = re.split(r"([.!?])", normalized_text)
    sentences: list[str] = []
```

This code splits text on English punctuation marks such as `.`, `!`, and `?`. In Chinese, however, the usual sentence delimiter is `。` (the Chinese full stop). For instance, consider the following typical Chinese sentences:

`最近UV非常火，各种MCP教程中都有UV的影子，我们今天来看一下是为什么。 传统python项目，创建一个新项目时需要设置虚拟环境、安装依赖，这个过程不仅繁琐，而且往往非常耗时。`

These sentences are not split at all by the current hard-coded symbols, because the text contains none of `.`, `!`, or `?`. To make the text-chunking functionality more adaptable to different languages, I would like to propose allowing users to customize and extend the splitting symbols in the `yaml` configuration file. This way, users can add symbols such as `。` according to their language requirements.
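
To make the proposal concrete, here is a minimal sketch of what a configurable splitter could look like. The function name `split_sentences`, the `delimiters` parameter, and `DEFAULT_DELIMITERS` are hypothetical names I made up for illustration; they are not part of the current `chunking.py`. The delimiter list is assumed to come from the `yaml` configuration.

```python
import re
from collections.abc import Sequence

# Hypothetical default mirroring the current hard-coded pattern r"([.!?])".
DEFAULT_DELIMITERS = (".", "!", "?")


def split_sentences(text: str, delimiters: Sequence[str] = DEFAULT_DELIMITERS) -> list[str]:
    """Split text after each delimiter, keeping the delimiter with its sentence."""
    # Escape every delimiter so characters such as "." and "?" are matched literally.
    # Longer delimiters go first so a multi-character one is not shadowed by a shorter one.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(d) for d in ordered) + ")"

    # The capturing group makes re.split return [text, delim, text, delim, ..., tail].
    segments = re.split(pattern, text)

    sentences: list[str] = []
    for i in range(0, len(segments) - 1, 2):
        sentence = (segments[i] + segments[i + 1]).strip()
        if sentence:
            sentences.append(sentence)

    # Keep any trailing text that does not end with a delimiter.
    tail = segments[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    # Delimiters a user might list in the config for Chinese text (assumed values).
    chinese_delimiters = [".", "!", "?", "。", "！", "？"]
    for s in split_sentences("你好。再见！Ok?", chinese_delimiters):
        print(s)
```

With the default delimiters the behaviour should stay the same as today for English text, so existing users would not be affected; only users who add extra symbols would see different splits.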

Thank you very much for your consideration.

Allow Customizable Splitting Symbols in chunking.py for Multilingual Support? #85
