Run empty values node simplify optimizer after connector optimizer #25155

Merged: feilong-liu merged 1 commit into prestodb:master from feilong-liu:add_simplify_empty_input on May 22, 2025

@feilong-liu feilong-liu commented May 20, 2025

Description

The connector optimizer can convert a table scan that returns no data into an empty values node. The SimplifyPlanWithEmptyInput optimizer then simplifies plans that contain empty values nodes.
The connector optimizer runs twice, once in a logical phase and once in a physical phase:

https://github.com/prestodb/presto/blob/master/presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java#L727-L730

builder.add(
                new ApplyConnectorOptimization(() -> planOptimizerManager.getOptimizers(LOGICAL)),
                projectionPushDown,
                new PruneUnreferencedOutputs());

https://github.com/prestodb/presto/blob/master/presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java#L951-L958

builder.add(
                new ApplyConnectorOptimization(() -> planOptimizerManager.getOptimizers(PHYSICAL)),
                new IterativeOptimizer(
                        metadata,
                        ruleStats,
                        statsCalculator,
                        costCalculator,
                        ImmutableSet.of(new RemoveRedundantIdentityProjections(), new PruneRedundantProjectionAssignments())));

Previously, SimplifyPlanWithEmptyInput ran only after the logical run of the connector optimizer. However, it turns out that the conversion to an empty values node can also happen in the physical run.

One example:

set session hive.pushdown_filter_enabled=true;

CREATE TABLE t1 with (partitioned_by = ARRAY['ds']) as select * from (values (1, '2024-01-05')) t(col1, ds);

create table t2 with (partitioned_by = ARRAY['ds']) as select * from (values (1, '2024-01-05')) t(col1, ds);

create table t3 with (partitioned_by = ARRAY['col2']) as select * from (values (1, '2024-01-05')) t(col1, col2);

explain (type distributed) SELECT e.col1 FROM t1 e INNER JOIN t2 m ON e.col1 = m.col1 AND e.ds = '2024-01-05' RIGHT JOIN t3 b ON b.col2 = 'xxx';

The reason is that the query relies on the first run of the connector optimizer to push filters down into the table scan. Later, during predicate pushdown, the optimizer finds that no col2 value in table t3 equals 'xxx', which leads to the empty values conversion in the later run of the connector optimizer.

So run SimplifyPlanWithEmptyInput once more after the physical run.
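The ordering issue can be illustrated with a toy model. This is only a sketch under stated assumptions: `PassOrderSketch`, its `Node` types, and the idea that a scan carries the phase in which it is proven empty are all hypothetical and are not Presto classes or APIs. It shows why a simplification pass that runs only after the logical connector pass misses an empty values node produced by the physical pass.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model only: none of these types are Presto classes.
final class PassOrderSketch {
    enum Phase { LOGICAL, PHYSICAL }

    interface Node {}
    // Assumption of this sketch: a scan records the phase in which the
    // connector is able to prove that it returns no data.
    record Scan(String table, Phase provenEmptyIn) implements Node {}
    record EmptyValues() implements Node {}
    record Filter(Node source) implements Node {}

    // Stand-in for one connector optimizer run: replaces scans that this
    // phase can prove empty with an empty values node.
    static Node connectorPass(Phase phase, Node node) {
        if (node instanceof Scan scan && scan.provenEmptyIn() == phase) {
            return new EmptyValues();
        }
        if (node instanceof Filter filter) {
            return new Filter(connectorPass(phase, filter.source()));
        }
        return node;
    }

    // Stand-in for SimplifyPlanWithEmptyInput: a filter over empty input is empty.
    static Node simplifyEmptyInput(Node node) {
        if (node instanceof Filter filter) {
            Node source = simplifyEmptyInput(filter.source());
            return source instanceof EmptyValues ? source : new Filter(source);
        }
        return node;
    }

    static Node run(Node plan, List<UnaryOperator<Node>> passes) {
        for (UnaryOperator<Node> pass : passes) {
            plan = pass.apply(plan);
        }
        return plan;
    }

    // Pass order before the change: simplify only after the logical run.
    static Node before(Node plan) {
        List<UnaryOperator<Node>> passes = List.of(
                n -> connectorPass(Phase.LOGICAL, n),
                PassOrderSketch::simplifyEmptyInput,
                n -> connectorPass(Phase.PHYSICAL, n));
        return run(plan, passes);
    }

    // Pass order after the change: simplify once more after the physical run.
    static Node after(Node plan) {
        return simplifyEmptyInput(before(plan));
    }

    public static void main(String[] args) {
        Node plan = new Filter(new Scan("t3", Phase.PHYSICAL));
        System.out.println("before: " + before(plan)); // filter still wraps the empty values node
        System.out.println("after:  " + after(plan));  // whole plan collapses to empty values
    }
}
```

In this model the extra pass is cheap when there is nothing to simplify, since it only walks the plan and returns it unchanged.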

Motivation and Context

Optimize query plans with empty input

Impact

Optimize query plans with empty input

Test Plan

Existing unit tests

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Improve query plans using the ``SimplifyPlanWithEmptyInput`` optimizer to convert a table scan which returns no data to an empty values node.

@feilong-liu feilong-liu requested a review from a team as a code owner May 20, 2025 19:13
@feilong-liu feilong-liu requested a review from hantangwangd May 20, 2025 19:13
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label May 20, 2025
@feilong-liu feilong-liu marked this pull request as draft May 20, 2025 19:13
@feilong-liu feilong-liu force-pushed the add_simplify_empty_input branch from 8317642 to 9d8039d Compare May 21, 2025 04:02
@feilong-liu feilong-liu marked this pull request as ready for review May 21, 2025 18:29
@feilong-liu feilong-liu requested a review from jaystarshot May 21, 2025 18:29

@hantangwangd hantangwangd left a comment

A noob question: do you have any non-default configuration for the optimizer or the Hive connector? In my local environment, the plan for the SQL in your example was as follows when first sent to SimplifyPlanWithEmptyInput, after being optimized by the logical phase of the connector optimizer:

- Output[PlanNodeId 15][col1] => [col1:integer]
    - RightJoin[PlanNodeId 10][(VARCHAR'xxx') = (col2)] => [col1:integer]
        - InnerJoin[PlanNodeId 4][("col1" = "col1_0")] => [col1:integer]
            - Project[PlanNodeId 421][projectLocality = LOCAL] => [col1:integer]
                - Values[PlanNodeId 454] => [col1:integer]
            - Project[PlanNodeId 422][projectLocality = LOCAL] => [col1_0:integer]
                - Values[PlanNodeId 455] => [col1_0:integer]
        - ScanProject[PlanNodeId 6,423][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=default, tableName=t3, analyzePartitionValues=Optional.empty}', layout='Optional[default.t3{}]'}, projectLocality = LOCAL] => [col2:varchar(10)]
                LAYOUT: default.t3{}
                col2 := col2:varchar(10):-13:PARTITION_KEY (1:95)
                    :: [["2024-01-05"]]

And after SimplifyPlanWithEmptyInput's optimization, the plan was as follows:

- Output[PlanNodeId 15][col1] => [col1:integer]
    - Project[PlanNodeId 475][projectLocality = LOCAL] => [col1:integer]
            col1 := null (1:20)
        - ScanProject[PlanNodeId 6,423][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=default, tableName=t3, analyzePartitionValues=Optional.empty}', layout='Optional[default.t3{}]'}, projectLocality = LOCAL] => [col2:varchar(10)]
                LAYOUT: default.t3{}
                col2 := col2:varchar(10):-13:PARTITION_KEY (1:95)
                    :: [["2024-01-05"]]

After this, the physical phase of the connector optimizer did not change the plan any further. Am I missing anything?

@feilong-liu feilong-liu requested a review from rschlussel May 22, 2025 16:31
@feilong-liu

Oh, hive.pushdown_filter_enabled needs to be set to true. Thanks for the catch. The description is now updated as well.

@feilong-liu feilong-liu requested a review from hantangwangd May 22, 2025 17:40
ImmutableSet.of(new RemoveRedundantIdentityProjections(), new PruneRedundantProjectionAssignments())));

// Pass after connector optimizer, as it relies on connector optimizer to identify empty input tables and convert them to empty ValuesNode
builder.add(new SimplifyPlanWithEmptyInput());

@rschlussel rschlussel May 22, 2025

Does it need PruneUnreferencedOutputs too, like after the logical connector optimizer in PlanOptimizers?

@feilong-liu

PruneUnreferencedOutputs matters only for outer joins that are converted to a projection when the inner side is empty; in some of those cases the join keys are not in the output and need to be pruned. It does not affect the validity of the plan, but it simplifies the plan so that later optimizers can simplify it further. Since this instance of SimplifyPlanWithEmptyInput runs at the very end, and PruneUnreferencedOutputs does not handle some nodes that may be in the plan by this stage (for example, the merge join node), not adding it should be fine here. I can draft another PR to fix PruneUnreferencedOutputs if needed.
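As a rough illustration of the outer join case described above, here is a toy sketch. Everything in it is hypothetical (`OuterJoinEmptySideSketch` and its node types are not Presto classes, and the rewrite is a simplification of what the real optimizer does). It rewrites a right join whose left side is empty into a projection that assigns null to the left-side columns, then lists the source columns the projection no longer references, which is the kind of leftover that output pruning would remove.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are Presto classes.
final class OuterJoinEmptySideSketch {
    interface Node {
        List<String> outputs();
    }
    record EmptyValues(List<String> outputs) implements Node {}
    record Scan(String table, List<String> outputs) implements Node {}
    record RightJoin(Node left, Node right, List<String> outputs) implements Node {}
    // assignments maps each output column to a source column name or the literal "null".
    record Project(Node source, Map<String, String> assignments) implements Node {
        @Override
        public List<String> outputs() {
            return List.copyOf(assignments.keySet());
        }
    }

    // A right join with an empty left side keeps every right-side row, so it
    // can become a projection over the right side with nulls for left columns.
    static Node simplify(Node node) {
        if (node instanceof RightJoin join && join.left() instanceof EmptyValues) {
            Map<String, String> assignments = new LinkedHashMap<>();
            for (String column : join.outputs()) {
                boolean fromRight = join.right().outputs().contains(column);
                assignments.put(column, fromRight ? column : "null");
            }
            return new Project(join.right(), assignments);
        }
        return node;
    }

    // Source columns the projection does not use, for example a join key
    // that was needed only by the removed join.
    static List<String> unreferencedSourceColumns(Project project) {
        List<String> unused = new ArrayList<>(project.source().outputs());
        unused.removeAll(project.assignments().values());
        return unused;
    }

    public static void main(String[] args) {
        Node join = new RightJoin(
                new EmptyValues(List.of("col1")),
                new Scan("t3", List.of("col2")),
                List.of("col1"));
        Project project = (Project) simplify(join);
        System.out.println(project.assignments());
        System.out.println(unreferencedSourceColumns(project));
    }
}
```

The rewritten plan is valid with or without pruning; the unreferenced column only means the scan produces more than the projection needs.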

@steveburnett

Thanks for the release note entry! Suggestion to help follow the Order of changes recommended phrasing in the Release Notes Guidelines.

== RELEASE NOTES ==

General Changes
* Improve query plans using the ``SimplifyPlanWithEmptyInput`` optimizer to convert a table scan which returns no data to an empty values node.

@feilong-liu feilong-liu merged commit fcc4735 into prestodb:master May 22, 2025
100 checks passed
@feilong-liu feilong-liu deleted the add_simplify_empty_input branch May 22, 2025 18:19
@hantangwangd

@feilong-liu Got it, thanks for the explanation.
