Add DotProduct function #25508
Conversation
tdcmeehan
left a comment
Need some test cases for edge cases. Some that come to mind:
- What happens if there is an overflow and we start yielding Infinity?
- What happens if there is NaN or Infinity in the arrays?
- What happens if there's a SQL null in the array? Do we need to check for that and validate?
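For reference, the first two edge cases can be sketched with plain IEEE 754 arithmetic. This is only an illustration using Python floats (which are doubles), not the Presto implementation; the helper name `dot` is hypothetical, and an array(real) input would overflow at a much smaller magnitude than shown here.

```python
import math

def dot(xs, ys):
    """Unvalidated dot product over IEEE 754 doubles (illustrative helper)."""
    return sum(x * y for x, y in zip(xs, ys))

big = 1e308  # close to the largest finite double (about 1.797e308)

# Each product exceeds the finite range, so the result is Infinity.
overflow = dot([big, big], [10.0, 10.0])

# A single NaN element makes the whole sum NaN.
with_nan = dot([1.0, math.nan], [2.0, 3.0])

# An Infinity element multiplied by a nonzero finite value stays Infinity.
with_inf = dot([1.0, math.inf], [2.0, 3.0])

# Infinity multiplied by zero is NaN, so Infinity inputs can also yield NaN.
inf_times_zero = dot([math.inf], [0.0])
```

Under these assumptions, overflow and special values do not raise errors; they flow through to the result, which is why explicit test cases for them are worth having.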
Thank you, updated the release notes.
Thanks, I have updated the test cases with Infinity and NaN value tests.
steveburnett
left a comment
Please add documentation for the new function.
Added! |
steveburnett
left a comment
One nit of phrasing.
Please update the release note: the link to the function didn't work when I put the current release note into the local doc build to test it. Below is a working link to the newly added function. The external link to Wikipedia needs to be in .rst format, not GitHub format; I've converted that one for you as well in the revised example below. Consider whether the Wikipedia sentence belongs in the release note at all, or whether it should be added to the documentation entry in math.rst.
df2c360 to
9ecdc06
Compare
steveburnett
left a comment
LGTM! (docs)
Pulled updated branch, new local doc build, looks good.
Thanks for the doc!
Saved that user @Raaghav0 is from Meta |
@tdcmeehan From what I see in Velox, there is already a test for the cosine_similarity() function with nulls in an array, and it expects the function to return null, as in this PR. So there is intent to return null for other functions with a similar signature. I also tried the cosine_similarity() function in prestodb, and it returns null too.
@tdcmeehan Would it be reasonable to follow the same pattern as Velox here?
tdcmeehan
left a comment
Ultimately I think propagating nulls makes the most sense, but I think we should get more opinions. Let me tag a couple of folks who I think would give a good opinion @aditi-pandit @rschlussel
16f6ff2 to
2fb6761
Compare
2fb6761 to
89e72d5
Compare
steveburnett
left a comment
LGTM! (docs)
Pulled branch, local doc build, looks good. Thanks!
Yeah, I think propagating nulls like in this proposal is better than throwing.
@tdcmeehan I talked to @kaikalur and @rschlussel offline and they both would like to throw error on nulls (like duckdb) using |
It also looks like other systems do the same: they error out on null.
@tdcmeehan Pinging once more to request a review. If you are unavailable, could you recommend another reviewer?
tdcmeehan
left a comment
Looks good % nit, please squash commits according to our guidelines. Reword commit to Add DotProduct function.
03476cd to
999ea3b
Compare
Add DotProduct function
- adding nan and infinity checks
- Adding documentation
- null inside array test
- fix typo in doc
- making the return type nullable
- improving performance after null checks
- adding documentation for nulls
- throwing error for having nulls in the array
- fixing a nit
999ea3b to
b36030f
Compare
9e202ff
into
prestodb:master
Description
This PR introduces the dot product function (dot_product) for two vectors of identical size, represented either as array(real) or as array(double). dot_product is used to measure similarity between embeddings from models such as DRAGON, which are built using the dot product as the similarity measure.
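As a rough illustration of the semantics described above and settled on in the review (an error on null elements rather than null propagation), here is a minimal Python sketch. It is not the Presto implementation: the exception type, the error messages, and the explicit length check are assumptions made for the example.

```python
from typing import Optional, Sequence

def dot_product(a: Sequence[Optional[float]], b: Sequence[Optional[float]]) -> float:
    """Reference-style dot product of two equally sized vectors.

    Assumptions (not taken from the PR code): mismatched lengths and
    null (None) elements both raise ValueError.
    """
    if len(a) != len(b):
        raise ValueError("both arrays must have the same number of elements")
    total = 0.0
    for x, y in zip(a, b):
        # A null element is rejected instead of producing a null result.
        if x is None or y is None:
            raise ValueError("null array elements are not supported")
        total += x * y
    return total

# 1*4 + 2*5 + 3*6
example = dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Here `example` evaluates to 32.0, while a call such as `dot_product([1.0, None], [2.0, 3.0])` raises an error in this sketch.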
Motivation and Context
Since we are introducing vector search capabilities into Presto, this PR adds another common distance function. This functionality will enable users to perform efficient similarity searches for embedding models that measure similarity using dot products.
Impact
The addition of the dot_product function will enhance Presto's capabilities in handling embeddings built from dot product similarity and enable users to perform more complex analytics tasks.
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.