Allow Customizable Splitting Symbols in chunking.py for Multilingual Support? #85

@wei12314

Hello, sumukshashidhar.

I'm currently working with Chinese text, and I've noticed that the symbols used for splitting text in chunking.py are hard-coded. Specifically, line 431 of chunking.py contains the following code:

    # Split using capturing parentheses to retain delimiters, then recombine.
    segments = re.split(r"([.!?])", normalized_text)
    sentences: list[str] = []

This code splits text on English punctuation marks: ., !, and ?. In Chinese, however, the common sentence delimiter is 。 (the Chinese full stop). For instance, consider the following typical Chinese sentences:

最近UV非常火,各种MCP教程中都有UV的影子,我们今天来看一下是为什么。 传统python项目,创建一个新项目时需要设置虚拟环境、安装依赖,这个过程不仅繁琐,而且往往非常耗时。

These sentences are not split properly by the current hard-coded symbols. To make the text-chunking functionality more flexible across languages, I would like to propose allowing users to customize and extend the splitting symbols in the configuration YAML file. This way, users can add symbols such as 。 according to their specific language requirements.
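
To make the proposal concrete, here is a minimal sketch of what a configurable splitter could look like. It is not the project's actual implementation: the function name `split_sentences`, the `delimiters` parameter, and the idea of reading that list from a YAML key are all hypothetical. The default list only mirrors the `[.!?]` pattern quoted above.

```python
import re
from typing import List, Optional, Sequence

# Hypothetical default, mirroring the hard-coded r"([.!?])" pattern.
DEFAULT_DELIMITERS = (".", "!", "?")


def split_sentences(text: str, delimiters: Optional[Sequence[str]] = None) -> List[str]:
    """Split text into sentences, keeping each delimiter attached to its sentence."""
    delims = [d for d in (delimiters or DEFAULT_DELIMITERS) if d]
    # Escape every symbol and try longer delimiters first, so that
    # multi-character delimiters are not shadowed by shorter ones.
    alternatives = sorted((re.escape(d) for d in delims), key=len, reverse=True)
    pattern = "(" + "|".join(alternatives) + ")"

    # The capturing group makes re.split return the delimiters as well:
    # [text, delimiter, text, delimiter, ..., trailing text]
    segments = re.split(pattern, text)

    sentences: List[str] = []
    for i in range(0, len(segments), 2):
        body = segments[i]
        delim = segments[i + 1] if i + 1 < len(segments) else ""
        sentence = (body + delim).strip()
        if sentence:
            sentences.append(sentence)
    return sentences


chinese_text = (
    "最近UV非常火,各种MCP教程中都有UV的影子,我们今天来看一下是为什么。 "
    "传统python项目,创建一个新项目时需要设置虚拟环境、安装依赖,这个过程不仅繁琐,而且往往非常耗时。"
)

# With the English-only defaults, the whole text comes back as one "sentence".
print(len(split_sentences(chinese_text)))

# With the Chinese full stop added (for example, loaded from the YAML config),
# the two sentences are separated.
print(len(split_sentences(chinese_text, [".", "!", "?", "。", "!", "?"])))
```

Passing the delimiters as a plain list keeps the change small: the existing behavior stays the default, and a user only needs to supply extra symbols for their language.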

Thank you very much for your consideration.
