Make "fast_chunking" the default chunking mode and preserve semantic mode under config #38

sumukshashidhar · 2025-03-31T12:25:35Z

Summary of Changes

Default to fast_chunking: Introduced a new "fast_chunking" logic that creates chunks purely based on maximum token length. This mode avoids embedding computation and similarity checks.
Retain Semantic Chunking via Config: The existing embedding-based (semantic) mode is now activated only if the config specifies "chunking_mode": "semantic_chunking".
Refactored Chunking Flow: The fast mode does not load or run any embedding model, minimizing overhead.
Added chunking_mode Field: The new field in ChunkingParameters decides which approach to use, defaulting to "fast_chunking" if not explicitly configured.
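
For readers skimming the PR, the dispatch described above could look roughly like the sketch below. It is not the code from this diff: only `ChunkingParameters`, `chunking_mode`, and the two mode strings come from the description, while `max_tokens`, `fast_chunk`, `chunk_document`, and the whitespace tokenizer are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkingParameters:
    # chunking_mode and its default are taken from the PR description.
    chunking_mode: str = "fast_chunking"
    # max_tokens is a hypothetical field name used only for this sketch.
    max_tokens: int = 512


def fast_chunk(text: str, max_tokens: int) -> List[str]:
    """Split text into chunks of at most max_tokens tokens, with no embeddings."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Assumption: whitespace splitting stands in for the real tokenizer.
    tokens = text.split()
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def chunk_document(text: str, params: ChunkingParameters) -> List[str]:
    """Pick the chunking path from the config; the fast path is the default."""
    if params.chunking_mode == "fast_chunking":
        return fast_chunk(text, params.max_tokens)
    if params.chunking_mode == "semantic_chunking":
        # The embedding-based path is left out of this sketch on purpose.
        raise NotImplementedError("semantic chunking is not part of this sketch")
    raise ValueError(f"unknown chunking_mode: {params.chunking_mode!r}")
```

With this shape, a config that omits `chunking_mode` goes through the fast path, and the embedding model would only need to be loaded inside the semantic branch.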

Tested to work

clefourrier

LGTM!
Maybe add some doc on how this affects/would affect multihop (I assume we shouldn't multihop with the fast chunking, right?)

clefourrier · 2025-03-31T12:29:21Z

And ofc, fix the style first

sumukshashidhar · 2025-03-31T12:31:17Z

fixed style!

sumukshashidhar · 2025-03-31T12:32:25Z

multi-hop wouldn't be affected, since all multi-hop chunks are just combinations of single-hop chunks!
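
To illustrate the point, here is a minimal sketch of building multi-hop chunks on top of whatever single-hop chunks the chunker produced. The function name `build_multihop_chunks` and the `hops` parameter are hypothetical, and enumerating every combination is an assumption; the actual pipeline may sample or order them differently.

```python
from itertools import combinations
from typing import List, Tuple


def build_multihop_chunks(
    single_hop_chunks: List[str], hops: int = 2
) -> List[Tuple[str, ...]]:
    """Group single-hop chunks into tuples of size `hops`.

    The input can come from either chunking mode, so this step does not
    depend on how the single-hop chunks were produced.
    """
    return list(combinations(single_hop_chunks, hops))
```

Because the function only sees the list of single-hop chunks, switching between the fast and semantic modes changes its input but not its logic.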


add fast chunking

ab2bfe2

sumukshashidhar requested review from clefourrier and alozowski March 31, 2025 12:26

clefourrier approved these changes Mar 31, 2025


fix cq

838de5d

sumukshashidhar merged commit 316201e into main Mar 31, 2025
1 check passed

sumukshashidhar mentioned this pull request Mar 31, 2025

Make pytorch, transformers, etc, optional dependencies #39

Closed

Josephrp pushed a commit to Josephrp/yourbench that referenced this pull request Jun 5, 2025

Merge pull request huggingface#38 from huggingface/fast_chunk

cda253b

Make "fast_chunking" the default chunking mode and preserve semantic mode under config
