Replies: 1 comment
N-gram tokenization is a solid idea for improving fuzzy search, especially for typo tolerance and partial-word matching. In production, we've used n-gram approaches (such as 3-grams or 4-grams) to boost recall for names, addresses, and multilingual datasets. They're especially valuable when users make common input mistakes or when dealing with OCR or otherwise noisy text.

Why Qdrant might not have added it yet:
How we’ve implemented this:
Recommendation: If you want n-gram fuzziness, you could prototype this externally by pre-tokenizing your inputs into n-grams before sending them to Qdrant. If the use case justifies more efficient native support, it's worth opening a feature request with a proposal for the n-gram config (min/max n, token length limits, etc.) that keeps the index size and performance impact small.
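To make the external prototype concrete, here is a minimal sketch of client-side character n-gram tokenization in plain Python. It is not Qdrant's API and not our production code; the function names `char_ngrams` and `ngram_overlap`, the lowercasing step, and the rule of keeping short words whole are all assumptions for illustration. The `min_n`/`max_n` parameters mirror the config knobs suggested above.

```python
from typing import List


def char_ngrams(text: str, min_n: int = 3, max_n: int = 4) -> List[str]:
    """Return the unique character n-grams of each whitespace-separated word.

    Hypothetical helper: lowercases the input and keeps words shorter
    than min_n as a single token so they are not dropped entirely.
    """
    if min_n < 1 or max_n < min_n:
        raise ValueError("expected 1 <= min_n <= max_n")

    grams: List[str] = []
    seen = set()
    for word in text.lower().split():
        if len(word) < min_n:
            candidates = [word]
        else:
            candidates = [
                word[i:i + n]
                for n in range(min_n, max_n + 1)
                for i in range(len(word) - n + 1)
            ]
        # Deduplicate while preserving first-seen order.
        for gram in candidates:
            if gram not in seen:
                seen.add(gram)
                grams.append(gram)
    return grams


def ngram_overlap(a: str, b: str, min_n: int = 3, max_n: int = 4) -> float:
    """Jaccard similarity of the two n-gram sets (0.0 when both are empty)."""
    set_a = set(char_ngrams(a, min_n, max_n))
    set_b = set(char_ngrams(b, min_n, max_n))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(char_ngrams("Qdrant", min_n=3, max_n=3))
    print(ngram_overlap("qdrant", "qdrnat", min_n=3, max_n=3))
```

One way to use the output is to join the n-grams with spaces and store that string in a separate payload field, then apply the same function to the query text so that a typo still shares some tokens with the indexed value. Note that every word expands into several tokens, which is the index-size cost mentioned above.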
Hi,
I noticed that Qdrant currently supports four types of tokenizers. I think we could add a new n-gram tokenizer, because it would add fuzziness support for both sparse vectors and full_text_index.

Is there any particular reason why Qdrant hasn't added support for an n-gram tokenizer yet?