Skip to content

Add DotProduct function#25508

Merged
facebook-github-bot merged 1 commit into
prestodb:masterfrom
Raaghav0:DotProductFunction
Jul 24, 2025
Merged

Add DotProduct function#25508
facebook-github-bot merged 1 commit into
prestodb:masterfrom
Raaghav0:DotProductFunction

Conversation

@Raaghav0

@Raaghav0 Raaghav0 commented Jul 8, 2025

Copy link
Copy Markdown
Contributor

Description

This PR introduces the dot product (dot_product) function between identical sized vectors represented both either as array(real) type or as array(double) type. The dot_product is used to measure similarities between embeddings from models such as DRAGON that are built using dot product as the similarity measure.

Motivation and Context

Since we are introducing vector search capabilities into Presto this PR adds another common distance function. This functionality will enable users to perform efficient similarity searches for embedding models that measures similarity using dot products.

Impact

The addition of the dot_product function will enhance Presto's capabilities in handling embeddings built from dot product similarity and enable users to perform more complex analytics tasks.

Test Plan

  • Unit tests
  • Manual tests: running HiveQueryRunner with various input scenarios to ensure correctness and performance.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==
General Changes
* Add :func:`dot_product` to calculate to calculate the sum of element wise product between two identically sized vectors represented as arrays. This function supports both array(real) and array(double) input types. For more information, refer to the [Dot Product definition](https://en.wikipedia.org/wiki/Dot_product). 

@Raaghav0 Raaghav0 requested a review from a team as a code owner July 8, 2025 23:07
@linux-foundation-easycla

linux-foundation-easycla Bot commented Jul 8, 2025

Copy link
Copy Markdown

CLA Signed

The committers listed above are authorized under a signed CLA.

@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@steveburnett

Copy link
Copy Markdown
Contributor
  • Please add documentation for the new function.

  • Revise the release note entry to link to the new documentation. See Formatting in the Release Notes Guidelines for examples of linking.

    The release note entry can be much shorter when the examples you give are in the linked doc. Something like the following example:

== RELEASE NOTES ==

General Changes
* Add :func:`dotproduct` to calculate the sum of element wise product between two identically sized vectors represented as arrays. This function supports both array(real) and array(double) input types. 

@tdcmeehan tdcmeehan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need some test cases for edge cases. Some that come to mind:

  • What happens if there is an overflow and we start yielding Infinity?
  • What happens if there is NaN or Infinity in the arrays?
  • What happens if there's a SQL null in the array? Do we need to check for that and validate?

@Raaghav0 Raaghav0 marked this pull request as draft July 9, 2025 18:04
@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@Raaghav0 Raaghav0 marked this pull request as ready for review July 9, 2025 19:30
@Raaghav0

Raaghav0 commented Jul 9, 2025

Copy link
Copy Markdown
Contributor Author
  • Please add documentation for the new function.
  • Revise the release note entry to link to the new documentation. See Formatting in the Release Notes Guidelines for examples of linking.
    The release note entry can be much shorter when the examples you give are in the linked doc. Something like the following example:
== RELEASE NOTES ==

General Changes
* Add :func:`dotproduct` to calculate the sum of element wise product between two identically sized vectors represented as arrays. This function supports both array(real) and array(double) input types. 

Thank you, updated the release notes.

@Raaghav0

Raaghav0 commented Jul 9, 2025

Copy link
Copy Markdown
Contributor Author

Need some test cases for edge cases. Some that come to mind:

  • What happens if there is an overflow and we start yielding Infinity?
  • What happens if there is NaN or Infinity in the arrays?
  • What happens if there's a SQL null in the array? Do we need to check for that and validate?

Thanks, I have updated the test cases with infinity and Nan value tests.

@Raaghav0 Raaghav0 requested a review from tdcmeehan July 9, 2025 20:15

@steveburnett steveburnett left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add documentation for the new function.

@Raaghav0 Raaghav0 requested a review from elharo as a code owner July 9, 2025 20:39
@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@Raaghav0

Raaghav0 commented Jul 9, 2025

Copy link
Copy Markdown
Contributor Author

Please add documentation for the new function.

Added!

@Raaghav0 Raaghav0 requested a review from steveburnett July 9, 2025 20:41
@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@Raaghav0 Raaghav0 requested a review from tdcmeehan July 9, 2025 21:26

@steveburnett steveburnett left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit of phrasing.

Comment thread presto-docs/src/main/sphinx/functions/math.rst Outdated
@steveburnett

Copy link
Copy Markdown
Contributor

Please update the release note:

The link to the function didn't work when I put the current release note into the local doc build to test it. Below is a working link to the newly added function.

The external link to wikipedia needs to be in .rst format, not GitHub. I've converted that one for you as well in the revised example below.

Consider if the wikipedia sentence belongs in the release note at all, or if it should be added to the documentation entry in math.rst.

== RELEASE NOTES ==

General Changes
* Add :func:`dot_product(x, y) -> double()` to calculate to calculate the sum of element wise product between two identically sized vectors represented as arrays. This function supports both array(real) and array(double) input types. For more information, refer to the `Dot Product definition <https://en.wikipedia.org/wiki/Dot_product>`_.

@Raaghav0 Raaghav0 force-pushed the DotProductFunction branch from df2c360 to 9ecdc06 Compare July 9, 2025 21:39
@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@Raaghav0 Raaghav0 requested a review from steveburnett July 9, 2025 21:42
steveburnett
steveburnett previously approved these changes Jul 9, 2025

@steveburnett steveburnett left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local doc build, looks good.

Thanks for the doc!

@ethanyzhang ethanyzhang added the from:Meta PR from Meta label Jul 10, 2025
@prestodb-ci

Copy link
Copy Markdown
Contributor

Saved that user @Raaghav0 is from Meta

@Raaghav0

Copy link
Copy Markdown
Contributor Author

I'm going back and forth on what the behavior should be on NULL element. I just checked DuckDB and found that its equivalent actually throws on NULL:

duckdb> SELECT LIST_DOT_PRODUCT(ARRAY[1, NULL], ARRAY[2, NULL]);
Invalid Input Error: list_dot_product: left argument can not contain NULL values

Can you check other SQL systems and try to see what their behavior is? I would be comfortable following the consensus. It might also not hurt to create a discussion on this at Velox and seeing what you find there--eventually these functions will move there.

@tdcmeehan from what i see in Velox, there is already a cosine_similarity() function test for nulls in an array that then expects it to return null like in this PR. So there is intent to return null for other functions of a similar signature. I tried out the cosine_similarity() function present in prestodb, and that too returns null.

@Raaghav0

Copy link
Copy Markdown
Contributor Author

I'm going back and forth on what the behavior should be on NULL element. I just checked DuckDB and found that its equivalent actually throws on NULL:

duckdb> SELECT LIST_DOT_PRODUCT(ARRAY[1, NULL], ARRAY[2, NULL]);
Invalid Input Error: list_dot_product: left argument can not contain NULL values

Can you check other SQL systems and try to see what their behavior is? I would be comfortable following the consensus. It might also not hurt to create a discussion on this at Velox and seeing what you find there--eventually these functions will move there.

@tdcmeehan from what i see in Velox, there is already a cosine_similarity() function test for nulls in an array that then expects it to return null like in this PR. So there is intent to return null for other functions of a similar signature. I tried out the cosine_similarity() function present in prestodb, and that too returns null.

@tdcmeehan would the above be reasonable to follow the same pattern as Velox?

@tdcmeehan tdcmeehan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately I think propagating nulls makes the most sense, but I think we should get more opinions. Let me tag a couple of folks who I think would give a good opinion @aditi-pandit @rschlussel

Comment thread presto-docs/src/main/sphinx/functions/math.rst
Comment thread presto-docs/src/main/sphinx/functions/math.rst
@Raaghav0 Raaghav0 force-pushed the DotProductFunction branch from 16f6ff2 to 2fb6761 Compare July 15, 2025 17:50
@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@Raaghav0 Raaghav0 requested a review from tdcmeehan July 15, 2025 17:59
@Raaghav0 Raaghav0 force-pushed the DotProductFunction branch from 2fb6761 to 89e72d5 Compare July 17, 2025 18:35
@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

steveburnett
steveburnett previously approved these changes Jul 17, 2025

@steveburnett steveburnett left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull branch, local doc build, looks good. Thanks!

@rschlussel

Copy link
Copy Markdown
Contributor

Ultimately I think propagating nulls makes the most sense, but I think we should get more opinions. Let me tag a couple of folks who I think would give a good opinion @aditi-pandit @rschlussel

yeah, I think propagating nulls like in this proposal is better than throwing.

@Raaghav0

Copy link
Copy Markdown
Contributor Author

Ultimately I think propagating nulls makes the most sense, but I think we should get more opinions. Let me tag a couple of folks who I think would give a good opinion @aditi-pandit @rschlussel

yeah, I think propagating nulls like in this proposal is better than throwing.

@tdcmeehan I talked to @kaikalur and @rschlussel offline and they both would like to throw error on nulls (like duckdb) using mayHaveNull check. Primarily because this function is intended to be used for dense vectors. Would you be able to offer another review?

@kaikalur

Copy link
Copy Markdown
Contributor

Ultimately I think propagating nulls makes the most sense, but I think we should get more opinions. Let me tag a couple of folks who I think would give a good opinion @aditi-pandit @rschlussel

yeah, I think propagating nulls like in this proposal is better than throwing.

@tdcmeehan I talked to @kaikalur and @rschlussel offline and they both would like to throw error on nulls (like duckdb) using mayHaveNull check. Primarily because this function is intended to be used for dense vectors. Would you be able to offer another review?

Also looks like other systems are doing the same - error out on null

@Raaghav0

Copy link
Copy Markdown
Contributor Author

Ultimately I think propagating nulls makes the most sense, but I think we should get more opinions. Let me tag a couple of folks who I think would give a good opinion @aditi-pandit @rschlussel

yeah, I think propagating nulls like in this proposal is better than throwing.

@tdcmeehan I talked to @kaikalur and @rschlussel offline and they both would like to throw error on nulls (like duckdb) using mayHaveNull check. Primarily because this function is intended to be used for dense vectors. Would you be able to offer another review?

Also looks like other systems are doing the same - error out on null

@tdcmeehan pinging once more requesting a review. If you are unavailable, could you recommend another reviewer?

@tdcmeehan tdcmeehan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good % nit, please squash commits according to our guidelines. Reword commit to Add DotProduct function.

@Raaghav0 Raaghav0 changed the title Registering DotProduct as a Presto Math Function Add DotProduct function Jul 22, 2025
@Raaghav0 Raaghav0 force-pushed the DotProductFunction branch from 03476cd to 999ea3b Compare July 22, 2025 19:16
@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

Add DotProduct function

adding nan and infinity checks

Adding documentation

null inside array test

fix typo in doc

making the return type nullable

improving performance after null checks

adding documentation for nulls

throwing error for having nulls in the array

fixing a nit
@Raaghav0 Raaghav0 force-pushed the DotProductFunction branch from 999ea3b to b36030f Compare July 23, 2025 18:23
@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@Raaghav0 has imported this pull request. If you are a Meta employee, you can view this in D77969974.

@facebook-github-bot facebook-github-bot merged commit 9e202ff into prestodb:master Jul 24, 2025
170 of 172 checks passed
@prestodb-ci prestodb-ci mentioned this pull request Jul 28, 2025
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants