Skip to content

optimize APPROX_DISTINCT operations on constant conditional values#25428

Merged
feilong-liu merged 1 commit into
prestodb:masterfrom
hdikeman:henryd/approxdistinct.main
Jun 25, 2025
Merged

optimize APPROX_DISTINCT operations on constant conditional values#25428
feilong-liu merged 1 commit into
prestodb:masterfrom
hdikeman:henryd/approxdistinct.main

Conversation

@hdikeman

Copy link
Copy Markdown
Contributor

Description

APPROX_DISTINCT operations on a constant value (e.g. APPROX_DISTINCT(IF(..., const))) are more expensive than and functionally equivalent to ARBITRARY(IF(..., 1, 0))
Adding an optimizer rule to replace any APPROX_DISTINCT operations on constant conditionals with equivalent calls to ARBITRARY

This is a copy of the PR here: #25262 which had to be recreated due to GitHub weirdness

Motivation and Context

Some autogenerated queries use this pattern, which is inefficient and causes OOM errors for complex queries

Impact

Queries which use this APPROX_DISTINCT pattern will consume less memory

Test Plan

Adding test coverage, E2E and unit tests

Also did some E2E testing manually to make sure the substitution was occurring:

presto:tiny> explain select approx_distinct(if(orderkey = orderkey, 'const')) from orders;
                                                                                                                    Query Plan                                                               >
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 - Output[PlanNodeId 8][_col0] => [approx_distinct:bigint]                                                                                                                                   >
         _col0 := approx_distinct (1:16)                                                                                                                                                     >
     - Aggregate(FINAL)[PlanNodeId 3] => [approx_distinct:bigint]                                                                                                                            >
             approx_distinct := "presto.default.arbitrary"((arbitrary)) (1:16)                                                                                                               >
         - LocalExchange[PlanNodeId 285][SINGLE] () => [arbitrary:bigint]                                                                                                                    >
             - RemoteStreamingExchange[PlanNodeId 291][GATHER - COLUMNAR] => [arbitrary:bigint]                                                                                              >
                 - Aggregate(PARTIAL)[PlanNodeId 289] => [arbitrary:bigint]                                                                                                                  >
                         arbitrary := "presto.default.arbitrary"((expr_4)) (1:16)                                                                                                            >
                     - ScanProject[PlanNodeId 0,1][table = TableHandle {connectorId='tpch', connectorHandle='orders:sf0.01', layout='Optional[orders:sf0.01]'}, projectLocality = LOCAL] => [>
                             Estimates: {source: CostBasedSourceInfo, rows: 15,000 (131.84kB), cpu: 135,000.00, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: 15,000 (131.>
                             expr_4 := IF((orderkey) = (orderkey), BIGINT'1', BIGINT'0') (1:72)                                                                                              >
                             orderkey := tpch:orderkey (1:71)                                                                                                                                >
                             tpch:orderstatus                                                                                                                                                >
                                 :: [["F"], ["O"], ["P"]] 

All unit tests were run on latest revision. Also, a verifier run was performed:

2025-06-23T16:28:11.319-0700    INFO    main    com.facebook.presto.verifier.framework.VerificationManager      Progress: 4637 succeeded, 2510 skipped, 1 resolved, 13 failed, 210 resubmitted, 100.00% done                                
[...]
2025-06-23 16:28:23 INFO: [verifier.py:1484] Imported verifier logs. Success rate: 99.87%. Correctness rate: 99.94%.

failed queries were due to load

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Jun 24, 2025
@hdikeman hdikeman marked this pull request as ready for review June 24, 2025 18:24
@hdikeman

Copy link
Copy Markdown
Contributor Author

And just for clarity, the reason I had to duplicate the PR from #25262 here is explained in a comment here: #25262 (comment)

I don't know how to resolve Sreeni's change request. He's out on vacation. I was told the easiest way around this is to close and reopen the PR. New one is here: #25428

@kaikalur requested some changes to skip the aggregation entirely, but that's not possible in the case of aggregations over conditionals like this. Will handle that in a follow-up change as discussed here: #25262 (comment)

cc @feilong-liu

@feilong-liu feilong-liu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unblock since the changes requested by @kaikalur is addressed

…restodb#25262)

Summary:
Pull Request resolved: prestodb#25262

`APPROX_DISTINCT` operations on a conditional constant value (e.g. `APPROX_DISTINCT(IF(expr, 'abcd'))`) are more expensive than and functionally equivalent to `ARBITRARY(IF(expr, 1, 0))`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant conditional values with equivalent calls to `ARBITRARY`

This comes up in some automated queries

Differential Revision: D76161617
@hdikeman hdikeman force-pushed the henryd/approxdistinct.main branch from 25cbd02 to 8e75193 Compare June 25, 2025 20:27
@feilong-liu feilong-liu merged commit d7af265 into prestodb:master Jun 25, 2025
106 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants