Description
What problem does the new feature solve?
I went through the documentation looking for this feature, but it doesn't appear to exist yet.
I'm looking for a vectorizer parameter that lets a user define which columns from the metadata (main) table they want in the view.
Currently in my project, I have multiple PDFs exceeding 200 pages. As a result,
each of those PDFs produces hundreds of rows in the embeddings table, and the view returns the same text redundantly for every one of those rows. The client side of the application therefore consumes a lot of bandwidth. I know this is not a storage problem; rather, the client has to load much more data, most of it redundant, which leads to a spike in our database read metrics.
I found two workarounds. The first is to avoid querying unwanted columns from the view and to join with the metadata table to get the actual text only at the very last stage, once the filters have reduced the row count.
The second is to manually drop the extracted_text column from the view and join the data into the final result set.
But each time my data changes and the vectorizer runs, that column is created again in the view, so I have to drop it manually each week.
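To make the first workaround concrete, here is a minimal sketch of the "filter first, join the text last" pattern. It uses Python's built-in sqlite3 as a stand-in for the real database, and every table name, column name (apart from extracted_text), and the score column is a made-up placeholder rather than what the vectorizer actually generates.

```python
import sqlite3

# Hypothetical schema: `docs` stands in for the source table that holds the
# wide extracted_text column, `docs_embedding_store` for the per-chunk rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, extracted_text TEXT);
    CREATE TABLE docs_embedding_store (
        doc_id INTEGER REFERENCES docs(id),
        chunk_seq INTEGER,
        chunk TEXT,
        score REAL  -- placeholder for a similarity score
    );
""")
conn.execute("INSERT INTO docs VALUES (1, 'manual.pdf', 'very long text ...')")
conn.execute("INSERT INTO docs VALUES (2, 'notes.pdf', 'another long text ...')")
conn.executemany(
    "INSERT INTO docs_embedding_store VALUES (?, ?, ?, ?)",
    [(1, 0, "chunk a", 0.9), (1, 1, "chunk b", 0.2), (2, 0, "chunk c", 0.7)],
)


def top_documents(conn, limit):
    """Rank on narrow columns only, then join the wide text column last."""
    query = """
        WITH best AS (
            -- Step 1: reduce the chunk rows to one row per document.
            SELECT doc_id, MAX(score) AS best_score
            FROM docs_embedding_store
            GROUP BY doc_id
            ORDER BY best_score DESC
            LIMIT ?
        )
        -- Step 2: fetch extracted_text only for the surviving documents.
        SELECT d.id, d.title, d.extracted_text, best.best_score
        FROM best JOIN docs AS d ON d.id = best.doc_id
        ORDER BY best.best_score DESC
    """
    return conn.execute(query, (limit,)).fetchall()


rows = top_documents(conn, 1)
```

With this shape, the long text is read once per returned document instead of once per chunk row, which is the saving I'm after.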
I feel this should be a feature, or maybe one of you can suggest a better workaround.
PS: I'm a Data Engineer at a startup with less than a year of experience, so any help would mean a lot to me.
What does the feature do?
It would give the user the option to configure which columns they want in the view created by the vectorizer.
The user would pass an array of columns from the main table that should also be present in the view.
This would be passed as an optional parameter to the ai.vectorizer function.
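As a rough illustration of the intended behavior (not the project's actual implementation), the sketch below builds a view definition from a user-supplied column list. The helper build_view_sql, its parameters, and the table layout are all hypothetical, and sqlite3 is used only so the example can run on its own.

```python
import sqlite3


def build_view_sql(source, store, view_name, columns):
    """Build CREATE VIEW DDL exposing only the requested source columns.

    Hypothetical helper: `columns` plays the role of the proposed optional
    parameter listing which main-table columns should appear in the view.
    """
    def q(name):
        # Quote identifiers so unusual column names stay valid in the DDL.
        return '"' + name.replace('"', '""') + '"'

    selected = ", ".join(f"s.{q(c)}" for c in columns)
    prefix = selected + ", " if selected else ""
    return (
        f"CREATE VIEW {q(view_name)} AS "
        f"SELECT {prefix}e.chunk_seq, e.chunk "
        f"FROM {q(store)} AS e JOIN {q(source)} AS s ON s.id = e.doc_id"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, extracted_text TEXT);
    CREATE TABLE docs_store (doc_id INTEGER, chunk_seq INTEGER, chunk TEXT);
    INSERT INTO docs VALUES (1, 'manual.pdf', 'very long text ...');
    INSERT INTO docs_store VALUES (1, 0, 'chunk a');
""")

# Only `title` is requested, so extracted_text is left out of the view.
conn.execute(build_view_sql("docs", "docs_store", "docs_embedding", ["title"]))
cursor = conn.execute("SELECT * FROM docs_embedding")
view_columns = [d[0] for d in cursor.description]
view_rows = cursor.fetchall()
```

Passing an empty list would leave only the chunk columns in the view; what the default should be when the parameter is omitted is an open design question.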
Implementation challenges
No response
Are you going to work on this feature?
🦸 Yes , I will submit a PR soon!