feat(optimizer): Add per-column predicates for ROW IN/NOT IN #27708

Open
kaikalur wants to merge 1 commit into prestodb:master from kaikalur:optimize-row-predicates

@kaikalur kaikalur commented May 3, 2026

Summary

  • Add OptimizeRowInPredicate, a PlanOptimizer (in optimizations/, runs before PickTableLayout) that rewrites ROW(c1, c2) IN (ROW('a', 1), ROW('b', 2)) to add per-column IN predicates (c1 IN ('a', 'b') AND c2 IN (1, 2)) alongside the original OR-of-ANDs expansion. ROW NOT IN is rewritten symmetrically with per-column NOT IN disjuncts plus the original ROW NOT IN as a safety net.
  • The added per-column predicates let the domain translator extract per-column constraints, enabling partition pruning, predicate pushdown, and join optimization. The original ROW IN/NOT IN structure is preserved for correctness.
  • Only fires when the filter sits directly on a TableScan (optionally through ProjectNodes) — otherwise the rewrite just bloats the predicate without payoff. Gated by the optimize_row_in_predicate session property, disabled by default.
  • Replaces the prior RewriteRowConstructorInToDisjunction rule (which produced only the OR-of-ANDs and didn't help domain extraction by itself).
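
The rewrite shape described in the summary can be sketched on plain SQL text. This is only an illustration: the optimizer itself works on RowExpression trees, and the class name, method name, and the de-duplication of per-column values below are assumptions made for the sketch, not details confirmed by the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: builds the rewritten predicate as SQL text for illustration.
final class RowInRewriteSketch
{
    private RowInRewriteSketch() {}

    // columns: the ROW(...) target fields; rows: candidate rows as SQL literals, one list per row.
    static String rewriteRowIn(List<String> columns, List<List<String>> rows)
    {
        // Per-column IN lists: collect the i-th literal of every candidate row.
        List<String> perColumn = new ArrayList<>();
        for (int i = 0; i < columns.size(); i++) {
            Set<String> values = new LinkedHashSet<>(); // assumption: drop duplicates, keep order
            for (List<String> row : rows) {
                if (row.size() != columns.size()) {
                    throw new IllegalArgumentException("arity mismatch");
                }
                values.add(row.get(i));
            }
            perColumn.add(columns.get(i) + " IN (" + String.join(", ", values) + ")");
        }

        // OR-of-ANDs expansion: one conjunction of equalities per candidate row.
        List<String> disjuncts = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> equalities = new ArrayList<>();
            for (int i = 0; i < columns.size(); i++) {
                equalities.add(columns.get(i) + " = " + row.get(i));
            }
            disjuncts.add("(" + String.join(" AND ", equalities) + ")");
        }

        return "(" + String.join(" AND ", perColumn) + ") AND (" + String.join(" OR ", disjuncts) + ")";
    }

    public static void main(String[] args)
    {
        System.out.println(rewriteRowIn(
                Arrays.asList("c1", "c2"),
                Arrays.asList(Arrays.asList("'a'", "1"), Arrays.asList("'b'", "2"))));
    }
}
```

For the example in the summary, the per-column conjuncts are what a domain translator can turn into column constraints, while the OR-of-ANDs part keeps combinations such as ('a', 2) from matching.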

Test plan

  • Unit tests in TestOptimizeRowInPredicate: ROW IN rewrite shape, ROW NOT IN rewrite shape, gating session property, non-ROW-constructor target no-op, recursion under top-level AND and OR (via RowExpressionTreeRewriter), partition-key domain extraction (using RowExpressionDomainTranslator), filter-not-on-TableScan no-op
  • TestFeaturesConfig covers the new optimize_row_in_predicate session property
  • E2E tests in AbstractTestQueries (testOptimizeRowInPredicate, testOptimizeRowNotInPredicate) compare results with optimization enabled vs. disabled across multi-column, AND-conjunction, single-row, three-column, partition-key-shaped (varchar enum + date), and lineitem (returnflag, linestatus) cases — verified via TestLocalQueries (530/530 green)
  • Session property documented in presto-docs/.../properties-session.rst

== RELEASE NOTES ==

General Changes

  • Add :func:optimize_row_in_predicate session property that rewrites multi-column ROW IN/NOT IN predicates to expose per-column IN/NOT IN predicates, enabling partition pruning and other domain-based optimizations.

@prestodb-ci prestodb-ci added the from:Meta PR from Meta label May 3, 2026

sourcery-ai Bot commented May 3, 2026

Reviewer's Guide

Introduces a new OptimizeRowInPredicate plan optimizer that rewrites ROW(...) IN/NOT IN predicates on table scans into per-column IN/NOT IN predicates (while preserving the original semantics), wires it to a new optimize_row_in_predicate session property, replaces the old RewriteRowConstructorInToDisjunction rule, and adds unit/E2E tests plus documentation for the new behavior.

Sequence diagram for ROW IN/NOT IN optimization on table scans

sequenceDiagram
    actor User
    participant Session
    participant PlanOptimizers
    participant OptimizeRowInPredicate
    participant SimplePlanRewriter
    participant Rewriter
    participant PlanNode

    User->>PlanOptimizers: createPlan(query, session)
    PlanOptimizers->>OptimizeRowInPredicate: optimize(plan, session, types, varAllocator, idAllocator, warningCollector)
    OptimizeRowInPredicate->>Session: isOptimizeRowInPredicate(session)
    Session-->>OptimizeRowInPredicate: enabled/disabled

    alt session property disabled
        OptimizeRowInPredicate-->>PlanOptimizers: PlanOptimizerResult(plan, false)
    else session property enabled
        OptimizeRowInPredicate->>SimplePlanRewriter: rewriteWith(new Rewriter(functionResolution, idAllocator), plan, null)
        SimplePlanRewriter->>Rewriter: visit(plan)
        loop walk plan
            Rewriter->>Rewriter: visitFilter(FilterNode)
            Rewriter->>Rewriter: isFilterOnScan(source)
            alt filter sits on TableScan (through ProjectNode)
                Rewriter->>Rewriter: rewritePredicate(predicate)
                alt ROW IN pattern
                    Rewriter->>Rewriter: tryRewriteRowIn(inExpr)
                    Rewriter->>Rewriter: extractRowFields(inExpr)
                    Rewriter->>Rewriter: buildColumnInPredicate(rowFields, fieldIdx)
                    Rewriter-->>Rewriter: combined AND(per-column IN, OR-of-ANDs)
                else ROW NOT IN pattern
                    Rewriter->>Rewriter: tryRewriteRowNotIn(inExpr)
                    Rewriter->>Rewriter: extractRowFields(inExpr)
                    Rewriter->>Rewriter: buildColumnInPredicate(rowFields, fieldIdx)
                    Rewriter-->>Rewriter: OR(per-column NOT IN, original ROW NOT IN)
                end
            else not on TableScan
                Rewriter-->>Rewriter: leave predicate unchanged
            end
        end
        SimplePlanRewriter-->>OptimizeRowInPredicate: rewrittenPlan
        OptimizeRowInPredicate-->>PlanOptimizers: PlanOptimizerResult(rewrittenPlan, planChanged)
    end

Class diagram for OptimizeRowInPredicate plan optimizer

classDiagram

class PlanOptimizer {
  <<interface>>
  +optimize(PlanNode plan, Session session, TypeProvider types, VariableAllocator variableAllocator, PlanNodeIdAllocator idAllocator, WarningCollector warningCollector) PlanOptimizerResult
}

class OptimizeRowInPredicate {
  -FunctionResolution functionResolution
  +OptimizeRowInPredicate(Metadata metadata)
  +optimize(PlanNode plan, Session session, TypeProvider types, VariableAllocator variableAllocator, PlanNodeIdAllocator idAllocator, WarningCollector warningCollector) PlanOptimizerResult
}

class Rewriter {
  -FunctionResolution functionResolution
  -PlanNodeIdAllocator idAllocator
  -boolean planChanged
  +Rewriter(FunctionResolution functionResolution, PlanNodeIdAllocator idAllocator)
  +isPlanChanged() boolean
  +visitFilter(FilterNode node, RewriteContext context) PlanNode
  -rewritePredicate(RowExpression predicate) RowExpression
  -tryRewriteRowIn(SpecialFormExpression inExpr) Optional~RowExpression~
  -tryRewriteRowNotIn(SpecialFormExpression inExpr) Optional~RowExpression~
  -extractRowFields(SpecialFormExpression inExpr) Optional~RowFields~
  -buildColumnInPredicate(RowFields rowFields, int fieldIdx) SpecialFormExpression
}

class RowFields {
  +List~VariableReferenceExpression~ fieldVars
  +List~SpecialFormExpression~ candidateRows
  +RowFields(List~VariableReferenceExpression~ fieldVars, List~SpecialFormExpression~ candidateRows)
}

class SimplePlanRewriter {
  +rewriteWith(SimplePlanRewriter rewriter, PlanNode plan, Object context) PlanNode
}

OptimizeRowInPredicate ..> FunctionResolution
OptimizeRowInPredicate ..> Metadata

PlanOptimizer <|.. OptimizeRowInPredicate
SimplePlanRewriter <|-- Rewriter
OptimizeRowInPredicate o--> Rewriter
Rewriter o--> RowFields
Rewriter ..> FilterNode
Rewriter ..> ProjectNode
Rewriter ..> TableScanNode
Rewriter ..> RowExpression
Rewriter ..> SpecialFormExpression
Rewriter ..> CallExpression
Rewriter ..> VariableReferenceExpression

File-Level Changes

Change: Add OptimizeRowInPredicate plan optimizer to rewrite ROW IN/NOT IN on table scans into per-column predicates plus preserved original structure.
Details:
  • Implement PlanOptimizer that scans FilterNodes on top of TableScan/Project chains and rewrites eligible ROW IN predicates into AND of per-column INs combined with an OR-of-ANDs expansion using equality calls.
  • Implement symmetric rewrite for ROW NOT IN by transforming NOT(ROW IN (...)) into an OR of per-column NOT(IN) predicates plus NOT(original ROW IN) as a safety disjunct.
  • Ensure rewrite only applies when the session property optimize_row_in_predicate is enabled and track whether the plan changed to report through PlanOptimizerResult.
  • Add helper logic (RowFields, extractRowFields, buildColumnInPredicate, recursive rewritePredicate over AND-conjuncts) to safely recognize and transform only ROW_CONSTRUCTOR-based IN expressions with consistent arity and variable targets.
Files:
  • presto-main-base/src/main/java/com/facebook/presto/sql/planner/optimizations/OptimizeRowInPredicate.java

Change: Wire the new optimizer into the planning pipeline and replace the previous RewriteRowConstructorInToDisjunction rule and config/session wiring.
Details:
  • Replace the IterativeOptimizer that used RewriteRowConstructorInToDisjunction in PlanOptimizers with a direct OptimizeRowInPredicate step executed before PickTableLayout-related optimizations.
  • Remove RewriteRowConstructorInToDisjunction class and its dedicated rule test suite from the codebase.
  • Switch FeaturesConfig boolean from rewriteRowConstructorInToDisjunction to optimizeRowInPredicate, updating getter/setter names, @config key, and description to reflect the new optimization semantics.
  • Update SystemSessionProperties to rename REWRITE_ROW_CONSTRUCTOR_IN_TO_DISJUNCTION to OPTIMIZE_ROW_IN_PREDICATE, adjust the property metadata declaration, and expose an isOptimizeRowInPredicate(Session) accessor used by the optimizer.
Files:
  • presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java
  • presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
  • presto-main-base/src/main/java/com/facebook/presto/SystemSessionProperties.java
  • presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteRowConstructorInToDisjunction.java
  • presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteRowConstructorInToDisjunction.java

Change: Extend configuration tests and session property docs to cover optimize_row_in_predicate.
Details:
  • Adjust TestFeaturesConfig default and explicit mapping expectations to refer to setOptimizeRowInPredicate and the new optimizer.optimize-row-in-predicate config key.
  • Ensure the explicit property mapping test toggles the new optimize_row_in_predicate config and matches the updated FeaturesConfig instance.
  • Document the new session property in properties-session.rst, describing its purpose (add per-column IN/NOT IN predicates for ROW IN/NOT IN) and usage.
Files:
  • presto-main-base/src/test/java/com/facebook/presto/sql/analyzer/TestFeaturesConfig.java
  • presto-docs/src/main/sphinx/admin/properties-session.rst

Change: Add unit tests for OptimizeRowInPredicate behavior and its interaction with domain extraction and plan shape constraints.
Details:
  • Create TestOptimizeRowInPredicate to cover ROW IN rewrite shape, ROW NOT IN rewrite shape, session-property gating, non-ROW IN no-op, recursion under top-level AND, partition-key domain extraction via RowExpressionDomainTranslator, and no-op behavior when the filter is not on a TableScan.
  • Use helper builders to construct row-constructor IN expressions and assert on resulting conjunct/disjunct structure, including per-column INs/NOT INs and preservation of the original disjunction.
  • Verify that before the rewrite, the domain translator yields an unconstrained TupleDomain for the ROW IN predicate, and after the rewrite it yields concrete per-column domains for the underlying variables.
Files:
  • presto-main-base/src/test/java/com/facebook/presto/sql/planner/optimizations/TestOptimizeRowInPredicate.java

Change: Add end-to-end query tests validating correctness of the optimization with ROW IN and ROW NOT IN over TPCH tables.
Details:
  • Extend AbstractTestQueries with testOptimizeRowInPredicate that runs a variety of ROW IN patterns (multi-column, AND-conjunction with other predicates, single-row, three-column, partition-key-shaped, and lineitem partition demo) comparing results with optimization enabled vs disabled.
  • Add testOptimizeRowNotInPredicate that similarly compares results for ROW NOT IN patterns, including restricted partitions and combinations with additional filters, ensuring semantic equivalence between optimized and unoptimized plans.
  • Configure sessions within the tests by setting the OPTIMIZE_ROW_IN_PREDICATE system property to true/false and using assertQueryWithSameQueryRunner to validate identical results.
Files:
  • presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The current rewritePredicate only descends through top-level AND-conjuncts; any ROW ... IN/NOT IN nested under ORs or other logical structures will be missed, so consider a more general recursive walk over RowExpression trees if you want broader coverage without relying on extractConjuncts.
  • Detection of NOT in rewritePredicate is based on CallExpression.getDisplayName().equalsIgnoreCase("not"), which is a bit brittle; using the resolved function identity (or a dedicated helper from FunctionResolution) would make this match more robust against renames or different display names.
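
The "more general recursive walk" suggested in the first point could look roughly like the following. This is a toy model under stated assumptions: `Expr`, the operator names (`ROW_IN`, `NOT`, and so on), and the placeholder rule are all hypothetical stand-ins and not Presto's RowExpression API. Matching on operator strings is used here only to keep the sketch short; it is exactly the kind of name-based matching the second point advises against in real code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Toy model: Expr and its operator names are hypothetical, not Presto classes.
final class ExprWalkSketch
{
    static final class Expr
    {
        final String op;
        final List<Expr> children;

        Expr(String op, Expr... children)
        {
            this.op = op;
            this.children = Arrays.asList(children);
        }

        @Override
        public String toString()
        {
            if (children.isEmpty()) {
                return op;
            }
            List<String> parts = new ArrayList<>();
            for (Expr child : children) {
                parts.add(child.toString());
            }
            return op + "(" + String.join(", ", parts) + ")";
        }
    }

    // Top-down walk: try the rule on the node first, so NOT(ROW_IN) is seen before
    // the inner ROW_IN. If the rule does not match, recurse into every child,
    // whatever the parent operator is (AND, OR, or anything else).
    static Expr rewrite(Expr node, Function<Expr, Optional<Expr>> rule)
    {
        Optional<Expr> replaced = rule.apply(node);
        if (replaced.isPresent()) {
            return replaced.get();
        }
        Expr[] children = new Expr[node.children.size()];
        for (int i = 0; i < children.length; i++) {
            children[i] = rewrite(node.children.get(i), rule);
        }
        return new Expr(node.op, children);
    }

    // Placeholder rule: tags matches instead of building real per-column predicates.
    static Optional<Expr> rowInRule(Expr node)
    {
        if (node.op.equals("NOT") && node.children.size() == 1
                && node.children.get(0).op.equals("ROW_IN")) {
            return Optional.of(new Expr("REWRITTEN_ROW_NOT_IN"));
        }
        if (node.op.equals("ROW_IN")) {
            return Optional.of(new Expr("REWRITTEN_ROW_IN"));
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        Expr predicate = new Expr("OR",
                new Expr("AND", new Expr("ROW_IN"), new Expr("x")),
                new Expr("NOT", new Expr("ROW_IN")));
        // prints: OR(AND(REWRITTEN_ROW_IN, x), REWRITTEN_ROW_NOT_IN)
        System.out.println(rewrite(predicate, ExprWalkSketch::rowInRule));
    }
}
```

Both occurrences are reached even though one sits under an OR and the other under a NOT, which a walk limited to top-level AND-conjuncts would miss.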

## Individual Comments

### Comment 1
<location path="presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java" line_range="8611" />
<code_context>
-        tester = null;
-    }
-
-    @Test
-    public void testRewriteEnablesPartitionPruningViaTupleDomain()
-    {
</code_context>
<issue_to_address>
**suggestion (testing):** Add ROW IN/NOT IN end-to-end cases involving NULLs to validate correctness under three-valued logic

The new `testOptimizeRowInPredicate`/`testOptimizeRowNotInPredicate` cover multiple shapes but only with non-NULL values. Please add a few end-to-end cases where at least one row element is NULL (e.g., `(c1, c2) IN (ROW('a', NULL), ...)` and `(c1, c2) NOT IN (ROW('a', NULL))`) to verify ROW IN/NOT IN behavior under three-valued logic when the optimizer rewrites to per-column predicates.

Suggested implementation:

```java
        // Multi-column ROW IN
        assertQueryWithSameQueryRunner(enabled,
                "SELECT count(*) FROM orders WHERE (orderstatus, custkey) IN (('O', 370), ('F', 781), ('P', 1234))",
                disabled);

        // Multi-column ROW IN with NULL in a non-leading column
        assertQueryWithSameQueryRunner(enabled,
                "SELECT count(*) FROM orders WHERE (orderstatus, custkey) IN (ROW('O', CAST(NULL AS bigint)), ROW('F', 781))",
                disabled);

        // Multi-column ROW IN with NULL in a leading column
        assertQueryWithSameQueryRunner(enabled,
                "SELECT count(*) FROM orders WHERE (orderstatus, custkey) IN (ROW(CAST(NULL AS varchar), 370), ROW('F', 781))",
                disabled);

        // Multi-column ROW NOT IN with NULL in a non-leading column
        assertQueryWithSameQueryRunner(enabled,
                "SELECT count(*) FROM orders WHERE (orderstatus, custkey) NOT IN (ROW('O', CAST(NULL AS bigint)))",
                disabled);

        // Multi-column ROW NOT IN with NULL in a leading column
        assertQueryWithSameQueryRunner(enabled,
                "SELECT count(*) FROM orders WHERE (orderstatus, custkey) NOT IN (ROW(CAST(NULL AS varchar), 370))",
                disabled);

```

If there are other ROW IN/NOT IN shapes tested later in `testOptimizeRowInPredicate` (e.g., different arities or nested ROWs), consider adding analogous NULL-containing variants for those shapes as well, following the same `assertQueryWithSameQueryRunner(enabled, ..., disabled)` pattern to exercise three-valued logic across all rewrite cases.
</issue_to_address>
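
The three-valued-logic concern can also be checked outside the engine with a small brute-force model. The sketch below is an assumption-laden illustration, not part of the PR: it represents SQL UNKNOWN as a Java `null`, uses integer columns only, and compares the OR-of-ANDs expansion against the rewritten forms over a tiny domain that includes NULL. It does not model how the engine evaluates a native ROW comparison with NULL fields, so it complements the end-to-end tests rather than replacing them.

```java
import java.util.Objects;

// Kleene three-valued logic with Boolean: TRUE, FALSE, or null for UNKNOWN.
final class ThreeValuedRowInCheck
{
    static Boolean and(Boolean a, Boolean b)
    {
        if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) {
            return Boolean.FALSE;
        }
        return (a == null || b == null) ? null : Boolean.TRUE;
    }

    static Boolean or(Boolean a, Boolean b)
    {
        if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
            return Boolean.TRUE;
        }
        return (a == null || b == null) ? null : Boolean.FALSE;
    }

    static Boolean not(Boolean a)
    {
        if (a == null) {
            return null;
        }
        return Boolean.valueOf(!a.booleanValue());
    }

    static Boolean eq(Integer a, Integer b)
    {
        if (a == null || b == null) {
            return null;
        }
        return Boolean.valueOf(a.equals(b));
    }

    // (c1 = r[0] AND c2 = r[1]) OR ... over all candidate rows.
    static Boolean orOfAnds(Integer c1, Integer c2, Integer[][] rows)
    {
        Boolean result = Boolean.FALSE;
        for (Integer[] row : rows) {
            result = or(result, and(eq(c1, row[0]), eq(c2, row[1])));
        }
        return result;
    }

    // c IN (values of column idx), as an OR of equalities.
    static Boolean columnIn(Integer c, Integer[][] rows, int idx)
    {
        Boolean result = Boolean.FALSE;
        for (Integer[] row : rows) {
            result = or(result, eq(c, row[idx]));
        }
        return result;
    }

    // Returns true when both rewrites agree with the original on every input, including UNKNOWN results.
    static boolean rewritesAgree(Integer[][] rows)
    {
        Integer[] domain = {null, 1, 2, 3};
        for (Integer c1 : domain) {
            for (Integer c2 : domain) {
                Boolean in = orOfAnds(c1, c2, rows);
                Boolean c1In = columnIn(c1, rows, 0);
                Boolean c2In = columnIn(c2, rows, 1);
                Boolean rewrittenIn = and(and(c1In, c2In), in);
                Boolean rewrittenNotIn = or(or(not(c1In), not(c2In)), not(in));
                if (!Objects.equals(in, rewrittenIn) || !Objects.equals(not(in), rewrittenNotIn)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        Integer[][] rows = {{1, null}, {2, 3}}; // first candidate row has a NULL field
        System.out.println(rewritesAgree(rows));
    }
}
```

Within this model the rewritten IN and NOT IN forms agree with the expanded original for candidate rows containing NULL, which is the property the suggested end-to-end cases are meant to confirm on the real engine.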

@kaikalur kaikalur force-pushed the optimize-row-predicates branch 2 times, most recently from 090bb54 to 5130fe8 Compare May 3, 2026 15:38

@steveburnett steveburnett left a comment

Thank you for the documentation! Just one nit of formatting.

Comment thread presto-docs/src/main/sphinx/admin/properties-session.rst Outdated
…tor IN/NOT IN

Adds OptimizeRowInPredicate, a PlanOptimizer that rewrites:

  ROW(c1, c2) IN (ROW('a', 1), ROW('b', 2))

into:

  (c1 IN ('a', 'b') AND c2 IN (1, 2))
  AND ((c1 = 'a' AND c2 = 1) OR (c1 = 'b' AND c2 = 2))

ROW NOT IN is handled symmetrically:

  ROW(c1, c2) NOT IN (ROW('a', 1), ROW('b', 2))

into:

  (c1 NOT IN ('a', 'b') OR c2 NOT IN (1, 2)
   OR ROW(c1, c2) NOT IN (ROW('a', 1), ROW('b', 2)))

The added per-column predicates let the domain translator extract per-column
constraints, enabling partition pruning, predicate pushdown, and join
optimization. The original ROW IN/NOT IN structure is preserved for correctness.

Only fires when the filter sits directly on a TableScan (optionally through
ProjectNodes). Runs once before PickTableLayout. Gated by the
optimize_row_in_predicate session property, disabled by default.

== RELEASE NOTES ==

General Changes
* Add :func:\`optimize_row_in_predicate\` session property that rewrites
  multi-column ROW IN/NOT IN predicates to expose per-column IN/NOT IN
  predicates, enabling partition pruning and other domain-based optimizations.
@kaikalur kaikalur force-pushed the optimize-row-predicates branch from 5130fe8 to 532f9a0 Compare May 4, 2026 18:59

kaikalur commented May 4, 2026

Thanks for catching that! Fixed the RST formatting in 532f9a0 — replaced the ANDs pattern with plain text "disjunctive OR of AND clauses expansion" so the trailing s no longer breaks RST.

steveburnett commented May 4, 2026

> Thanks for catching that! Fixed the RST formatting in 532f9a0 — replaced the ANDs pattern with plain text "disjunctive OR of AND clauses expansion" so the trailing s no longer breaks RST.

Yes! I've learned that the best way to fix this kind of problem - and a lot of problems with English grammar unrelated to formatting - is to rewrite the sentence to avoid and fix the problem. This works for me, thank you for fixing it so quickly.

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, verified the formatting is fixed in a new local doc build. Everything looks good. Thanks!

@jja725 jja725 self-requested a review May 4, 2026 23:26

@jja725 jja725 left a comment

The ROW IN optimization LGTM, but I have some doubts about the NOT IN path.

functionResolution.notFunction(),
BOOLEAN,
ImmutableList.of(buildColumnInPredicate(rowFields, fieldIdx))));
}

Do we have tests in the Meta environment that demonstrate the effectiveness of decomposing ROW NOT IN? I don't think this would help partition pruning, since the original ROW NOT IN is still part of the disjunction. If it's not helpful, we can just support the ROW IN case.
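
One way to picture this concern is a simplified model of how a per-column domain is derived from a disjunction: the column may take any value that some disjunct allows, so a single disjunct with no constraint on that column leaves it unconstrained. The sketch below assumes that model and a four-value universe; it is not Presto's TupleDomain or RowExpressionDomainTranslator, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Simplified model: a column domain is the set of values still allowed out of a small universe.
final class DisjunctionDomainSketch
{
    static final Set<String> UNIVERSE = new TreeSet<>(Arrays.asList("a", "b", "c", "d"));

    // Domain of "col NOT IN (values)": the universe minus the listed values.
    static Set<String> notIn(String... values)
    {
        Set<String> domain = new TreeSet<>(UNIVERSE);
        domain.removeAll(Arrays.asList(values));
        return domain;
    }

    // Domain a disjunction allows for one column: the union over all disjuncts.
    @SafeVarargs
    static Set<String> union(Set<String>... disjunctDomains)
    {
        Set<String> result = new TreeSet<>();
        for (Set<String> domain : disjunctDomains) {
            result.addAll(domain);
        }
        return result;
    }

    // c1 NOT IN ('a', 'b')  OR  c2 NOT IN (...)  OR  ROW(c1, c2) NOT IN (...)
    static Set<String> c1DomainOfRewrittenNotIn()
    {
        return union(
                notIn("a", "b"), // first disjunct constrains c1
                UNIVERSE,        // second disjunct only mentions c2
                UNIVERSE);       // original ROW NOT IN: assumed to yield no per-column domain
    }

    public static void main(String[] args)
    {
        System.out.println(c1DomainOfRewrittenNotIn().equals(UNIVERSE));
    }
}
```

Under these assumptions the derived domain for c1 is the whole universe, which is consistent with the doubt raised above; measurements on real partitioned tables would be needed to settle the question.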

Labels

from:Meta PR from Meta

4 participants