speed up chunking & add separator chunking #48
Merged
gusye1234 merged 6 commits into gusye1234:main on Sep 19, 2024
gusye1234 reviewed Sep 18, 2024
20fed70 to 9900d35
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@           Coverage Diff           @@
##             main      #48   +/-   ##
=======================================
  Coverage   94.36%   94.36%
=======================================
  Files          11       11
  Lines        1189     1189
=======================================
  Hits         1122     1122
  Misses         67       67
Collaborator, Author
NOW I GUESS ALL SHOULD BE WELL :)
gusye1234 reviewed Sep 19, 2024
gusye1234 (Owner) left a comment
Great work! A few typing errors, I think.
AhmaddAbbass pushed a commit to AhmaddAbbass/nano-graphrag that referenced this pull request Nov 14, 2025
* speed up chunking & add separator chunking
* add test code for splitter & reformat chunking methods
* typo
* fix overlap behaviour
* typo
* typo for type check
Hey, so this PR's got two main changes:
1. chunking_by_token_size now converts docs to tokens in bulk, which gives us roughly a 3X speedup when dealing with a ton of docs (we tested it with 30k). It's not gonna make much difference for small-scale stuff, but 30k is still pretty much toy-level (both industry and research usually work with way more), so this is a solid upgrade.
2. We've added support for separator-based splitting without needing any extra dependencies. This splitting method tries to keep the grammatical structure intact, so with no overlap configured each chunk ends on a complete clause or sentence. We adapted the logic from LangChain, so the behavior might not be exactly the same, but it does what it says on the tin.
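To give a rough feel for the bulk tokenization change, here's a minimal sketch and not the code from this PR. It assumes the tokenizer exposes some batch call that encodes every doc in one go (tiktoken's `encode_batch` is one real example). To keep the snippet dependency-free, `toy_encode_batch` is a hypothetical stand-in that gives each whitespace-separated word an id, and `chunk_token_lists` plus its dict keys are illustrative names only.

```python
from typing import Dict, List


def toy_encode_batch(docs: List[str]) -> List[List[int]]:
    # Hypothetical stand-in for a real batch tokenizer call:
    # every distinct whitespace-separated word gets its own integer id.
    vocab: Dict[str, int] = {}
    return [[vocab.setdefault(word, len(vocab)) for word in doc.split()] for doc in docs]


def chunk_token_lists(
    tokens_list: List[List[int]],
    max_token_size: int,
    overlap_token_size: int,
) -> List[Dict]:
    """Slide a fixed-size window over each already-tokenized doc."""
    if overlap_token_size >= max_token_size:
        raise ValueError("overlap must be smaller than the window size")
    step = max_token_size - overlap_token_size
    chunks = []
    for doc_index, tokens in enumerate(tokens_list):
        for order, start in enumerate(range(0, len(tokens), step)):
            chunks.append(
                {
                    "doc_index": doc_index,
                    "chunk_order_index": order,
                    "tokens": tokens[start : start + max_token_size],
                }
            )
    return chunks


docs = ["a b c d e f g h i j", "k l m"]
tokens_list = toy_encode_batch(docs)  # one bulk call instead of one call per doc
chunks = chunk_token_lists(tokens_list, max_token_size=4, overlap_token_size=1)
```

The point is just that tokenization happens once for the whole corpus, and the windowing loop then works on plain lists of ids. Where the speedup actually comes from depends on the tokenizer's batch implementation.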
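And here's a small sketch of what separator-based splitting with no overlap can look like. This is an illustration under simplifying assumptions, not the PR's implementation and not LangChain's: it measures size in characters instead of tokens, and `split_by_separators` and `merge_pieces` are made-up helper names.

```python
import re
from typing import List


def split_by_separators(text: str, separators: List[str]) -> List[str]:
    """Cut the text after each separator so every piece is a complete clause."""
    pattern = "(" + "|".join(re.escape(sep) for sep in separators) + ")"
    # With a capturing group, re.split alternates: text, separator, text, ...
    parts = re.split(pattern, text)
    pieces = []
    for i in range(0, len(parts), 2):
        separator = parts[i + 1] if i + 1 < len(parts) else ""
        piece = (parts[i] + separator).strip()
        if piece:
            pieces.append(piece)
    return pieces


def merge_pieces(pieces: List[str], max_chars: int) -> List[str]:
    """Greedily pack whole pieces into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a clause boundary
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Graphs store entities. Edges link them! Chunks feed the extractor. Done?"
pieces = split_by_separators(text, [".", "!", "?"])
chunks = merge_pieces(pieces, max_chars=40)
```

Since pieces are never cut in the middle, every chunk ends on a separator. The tradeoff in this sketch is that a single piece longer than `max_chars` is kept whole rather than split further.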