
Conversation

@GMartin-dev
Contributor

Continuing the Tolkien-inspired series of langchain tools, I bring to you:
The Fellowship of the Vectors, AKA EmbeddingsClusteringFilter.
This document filter uses embeddings to group document vectors into clusters, then lets you pick an arbitrary number of documents from each cluster based on proximity to the cluster center. That gives you a representative sample of each cluster.

The original idea comes from Greg Kamradt, from this video (Level 4):
https://www.youtube.com/watch?v=qaPMdcCqtWk&t=365s

I added a few tricks to make it a bit more versatile: you can parametrize what to do with duplicate documents in case of cluster overlap, either replacing the duplicates with the next closest document or removing them. This also lets you use it as a special kind of redundant filter.
Additionally, you can choose between two different orderings: grouped by cluster, or respecting the original retriever scores.
In my use case I grouped the docs by cluster and ran a refine chain per cluster to generate a summary over a large corpus of documents.
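A rough sketch of that use, to make the parameters concrete. The import path and parameter names (num_clusters, num_closest, sorted, remove_duplicates) follow the description above and may not exactly match the merged code:

```python
# Sketch: sample a representative subset of a corpus with the clustering
# filter. Parameter names are assumed from the PR description.
from langchain.document_transformers import EmbeddingsClusteringFilter
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document

# Placeholder corpus standing in for a large set of document chunks.
docs = [Document(page_content=f"chunk {i} of some large corpus") for i in range(100)]

clustering_filter = EmbeddingsClusteringFilter(
    embeddings=OpenAIEmbeddings(),
    num_clusters=10,          # k-means clusters to form
    num_closest=3,            # docs to keep per cluster, closest to its center
    sorted=False,             # False: grouped by cluster; True: original retriever order
    remove_duplicates=False,  # on cluster overlap, replace dupes with the next closest doc
)

representative_docs = clustering_filter.transform_documents(docs)
# -> at most num_clusters * num_closest documents, a small sample per cluster
```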
Let me know if you want to change anything!

@rlancemartin, @eyurtsev, @hwchase17,

Contributor

@rlancemartin rlancemartin left a comment


Cool idea! Similar in spirit to MMR, which aims to improve diversity in the retrieved docs. Of course, this approach is especially useful for cases like the merge retriever, with docs coming from several different retrievers. You can think of this as a post-processing de-dupe?

@rlancemartin
Contributor

This is a good general theme in retrieval: how to enforce diversity among the retrieved docs? MMR should be effective for a single retriever, but it would be interesting to test this vs MMR. In the case of multiple retrievers, like you show, dupes will be unavoidable even if each is using MMR. So, this type of clustering / de-dupe as a post-processing stage makes sense. What other ideas along these lines did you consider?
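To make the post-processing angle concrete, here is a sketch of wiring the filter behind a MergerRetriever. This assumes the filter can slot into a DocumentCompressorPipeline the way EmbeddingsRedundantFilter does, and its parameter names are taken from the PR description rather than the merged code:

```python
# Sketch: clustering filter as a post-processing / de-dupe stage on top of
# a MergerRetriever combining two vector stores.
from langchain.document_transformers import EmbeddingsClusteringFilter
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever, MergerRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.schema import Document
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()

# Two toy stores standing in for retrievers over different sources;
# their results will overlap semantically.
store_a = FAISS.from_documents(
    [Document(page_content=f"notes on topic {i}") for i in range(5)], embeddings
)
store_b = FAISS.from_documents(
    [Document(page_content=f"articles on topic {i}") for i in range(5)], embeddings
)
lotr = MergerRetriever(retrievers=[store_a.as_retriever(), store_b.as_retriever()])

# Cluster the merged results and keep only the docs closest to each center.
clustering_filter = EmbeddingsClusteringFilter(
    embeddings=embeddings, num_clusters=3, num_closest=1
)
retriever = ContextualCompressionRetriever(
    base_compressor=DocumentCompressorPipeline(transformers=[clustering_filter]),
    base_retriever=lotr,
)

results = retriever.get_relevant_documents("topic 2")
```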

@rlancemartin rlancemartin merged commit 3ce4e46 into langchain-ai:master Jul 7, 2023
@GMartin-dev
Contributor Author

Hi Lance! The de-dupe is an interesting effect; I was using the good old Redundant filter for that too. But what had the most value for me was controlling the balance between redundancy and diversity.
With clustering and a dynamic "n" of samples per cluster you can choose different "portions": whether you're dealing with merger_retriever or with a giant collection that you must cover completely, as during summarization, you can take samples that are diverse and representative of the whole corpus in just the right amount. The de-dupe is just one side of the coin.

@gkamradt

gkamradt commented Jul 7, 2023

This is super cool! Just here to follow along.

Awesome work @GMartin-dev and @rlancemartin

@rlancemartin
Contributor

rlancemartin commented Jul 7, 2023

The de-dupe is an interesting effect; I was using the good old Redundant filter too

Right, when I say de-dupe I also mean "semantically" de-dupe (in addition to de-duping identical chunks): like you are saying, we will compress the retrieved docs to enforce diversity among the results. Also, it looks like @gkamradt used it initially in the context of summarization.

So, overall, a neat idea for compression that it seems we could use in at least two places: (1) pre-processing of chunks (e.g., from a large doc, like a book) that are passed to the LLM for summarization (like @gkamradt used it for), and (2) post-processing of retrieved docs (especially with the merge retriever), which can enforce diversity (like MMR does) / compress semantically similar results.
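A sketch of use (1), assuming the filter's parameter names match the description above; "book.txt" and the chunk sizes are placeholders:

```python
# Sketch: sample representative chunks from a long document before
# summarizing, instead of feeding every chunk to the LLM.
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from langchain.document_transformers import EmbeddingsClusteringFilter
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("book.txt") as f:  # placeholder path to a large document
    text = f.read()

chunks = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=200
).create_documents([text])

# Keep a few chunks per cluster as a representative sample of the whole text.
sampled = EmbeddingsClusteringFilter(
    embeddings=OpenAIEmbeddings(), num_clusters=8, num_closest=2
).transform_documents(chunks)

# Refine over the much smaller representative sample.
chain = load_summarize_chain(ChatOpenAI(temperature=0), chain_type="refine")
summary = chain.run(sampled)
```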

Any other uses you guys had in mind?
