About adding a new n-gram tokenizer #9130

TY0909 · 2026-05-22T02:40:51Z

Hi,

I noticed that Qdrant currently supports four types of tokenizers. I think we could add a new n-gram tokenizer, because it would add fuzziness support for both sparse vectors and full_text_index.

Is there any particular reason why Qdrant hasn’t added support for an n-gram tokenizer yet?

reallyticsai · 2026-05-30T09:55:44Z

N-gram tokenization is a solid idea for improving fuzzy search, especially for typo tolerance and partial-word matching. In production, we've used n-gram approaches (like 3-grams or 4-grams) to boost recall for names, addresses, and multilingual datasets. They're especially valuable when users make common input mistakes or when dealing with OCR/noisy text.

Why Qdrant might not have added it yet:

Index Bloat: n-gram tokenizers can dramatically increase the number of tokens per document, which can spike memory use and slow down indexing. For example, "machine" tokenized to 3-grams produces mac, ach, chi, hin, ine: five tokens from a single seven-letter word. Multiplied over millions of docs, that's a lot of tokens.
Performance Trade-offs: In full-text search systems like Elasticsearch, n-gram tokenization is a common cause of slow queries and larger inverted indices unless you carefully tune min/max n-gram size and filter out stopwords.
Alternative Fuzziness: Qdrant's current fuzzy parameter (for full_text_index) leverages edit distance internally. It’s fast but less flexible than n-gram-based fuzziness.
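
To make the token growth concrete, here is a minimal sketch in plain Python. The helper name char_ngrams and its min_n/max_n parameters are illustrative assumptions, not part of Qdrant or Elasticsearch:

```python
def char_ngrams(text: str, min_n: int = 3, max_n: int = 3) -> list[str]:
    """Return all character n-grams of `text` for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        # A string of length L yields L - n + 1 n-grams (none if L < n).
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print(char_ngrams("machine"))             # ['mac', 'ach', 'chi', 'hin', 'ine']
print(len(char_ngrams("machine", 2, 4)))  # 6 + 5 + 4 = 15 tokens for one word
```

Widening the range from (3, 3) to (2, 4) triples the token count for this one word, which is why the min/max n-gram size matters so much for index size.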

How we’ve implemented this:

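We use n-gram tokenization in tandem with a custom analyzer in Elasticsearch. A minimal Python illustration of character 3-grams (using scikit-learn's CountVectorizer here, not Elasticsearch itself):

from sklearn.feature_extraction.text import CountVectorizer

# Character-level 3-grams: each feature is a 3-character window of the input.
vectorizer = CountVectorizer(ngram_range=(3, 3), analyzer='char')
counts = vectorizer.fit_transform(["qdrant"])
print(vectorizer.get_feature_names_out())  # features are sorted: 'ant', 'dra', 'qdr', 'ran'
print(counts.toarray())                    # one count per feature, here all 1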

For vector DBs, you’d need to update the tokenizer and indexing pipeline to split incoming text into n-grams, hash them, and store the resulting sparse vectors or tokens.
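
A rough sketch of that split, hash, and store pipeline, assuming character 3-grams and the hashing trick. The function name ngram_sparse_vector, the dim value, and the choice of zlib.crc32 are all illustrative assumptions; this is not how Qdrant builds sparse vectors internally:

```python
import zlib
from collections import Counter

def ngram_sparse_vector(text: str, n: int = 3, dim: int = 2**20):
    """Map character n-grams of `text` to (indices, values) via the hashing trick."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    # zlib.crc32 is deterministic across runs, unlike Python's built-in hash().
    counts = Counter(zlib.crc32(g.encode("utf-8")) % dim for g in grams)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return indices, values

indices, values = ngram_sparse_vector("qdrant")
print(indices, values)
```

The resulting indices and values lists have the shape most sparse-vector stores expect. The same function has to be applied to queries so that matching n-grams land on the same indices.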

Recommendation: If you want n-gram fuzziness, you could prototype this externally by pre-tokenizing your inputs into n-grams before sending them to Qdrant. If the use case justifies more efficient native support, it's worth opening a feature request with a proposal for n-gram config (min/max n, token length limits, etc.) to minimize the impact on index size and performance.
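
Before wiring an external prototype into Qdrant, it can help to check offline how much typo tolerance n-grams actually give you. A small sketch using Jaccard overlap of 3-gram sets (the helper names ngram_set and jaccard are hypothetical, and Jaccard is just one reasonable similarity to try):

```python
def ngram_set(text: str, n: int = 3) -> set[str]:
    """Set of distinct character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Share of n-grams that two strings have in common."""
    ga, gb = ngram_set(a, n), ngram_set(b, n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

print(jaccard("machine", "machine"))  # 1.0
print(jaccard("machine", "mashine"))  # 0.25 (shares 'hin' and 'ine')
print(jaccard("machine", "vector"))   # 0.0
```

A single substituted character already drops the overlap to 0.25 for a seven-letter word, so short tokens may need a smaller n to stay matchable.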
