feat(optimizer): Add per-column predicates for ROW IN/NOT IN#27708
feat(optimizer): Add per-column predicates for ROW IN/NOT IN#27708kaikalur wants to merge 1 commit intoprestodb:masterfrom
Conversation
Reviewer's GuideIntroduces a new OptimizeRowInPredicate plan optimizer that rewrites ROW(...) IN/NOT IN predicates on table scans into per-column IN/NOT IN predicates (while preserving the original semantics), wires it to a new optimize_row_in_predicate session property, replaces the old RewriteRowConstructorInToDisjunction rule, and adds unit/E2E tests plus documentation for the new behavior. Sequence diagram for ROW IN/NOT IN optimization on table scanssequenceDiagram
actor User
participant Session
participant PlanOptimizers
participant OptimizeRowInPredicate
participant SimplePlanRewriter
participant Rewriter
participant PlanNode
User->>PlanOptimizers: createPlan(query, session)
PlanOptimizers->>OptimizeRowInPredicate: optimize(plan, session, types, varAllocator, idAllocator, warningCollector)
OptimizeRowInPredicate->>Session: isOptimizeRowInPredicate(session)
Session-->>OptimizeRowInPredicate: enabled/disabled
alt session property disabled
OptimizeRowInPredicate-->>PlanOptimizers: PlanOptimizerResult(plan, false)
else session property enabled
OptimizeRowInPredicate->>SimplePlanRewriter: rewriteWith(new Rewriter(functionResolution, idAllocator), plan, null)
SimplePlanRewriter->>Rewriter: visit(plan)
loop walk plan
Rewriter->>Rewriter: visitFilter(FilterNode)
Rewriter->>Rewriter: isFilterOnScan(source)
alt filter sits on TableScan (through ProjectNode)
Rewriter->>Rewriter: rewritePredicate(predicate)
alt ROW IN pattern
Rewriter->>Rewriter: tryRewriteRowIn(inExpr)
Rewriter->>Rewriter: extractRowFields(inExpr)
Rewriter->>Rewriter: buildColumnInPredicate(rowFields, fieldIdx)
Rewriter-->>Rewriter: combined AND(per-column IN, OR-of-ANDs)
else ROW NOT IN pattern
Rewriter->>Rewriter: tryRewriteRowNotIn(inExpr)
Rewriter->>Rewriter: extractRowFields(inExpr)
Rewriter->>Rewriter: buildColumnInPredicate(rowFields, fieldIdx)
Rewriter-->>Rewriter: OR(per-column NOT IN, original ROW NOT IN)
end
else not on TableScan
Rewriter-->>Rewriter: leave predicate unchanged
end
end
SimplePlanRewriter-->>OptimizeRowInPredicate: rewrittenPlan
OptimizeRowInPredicate-->>PlanOptimizers: PlanOptimizerResult(rewrittenPlan, planChanged)
end
Class diagram for OptimizeRowInPredicate plan optimizerclassDiagram
class PlanOptimizer {
<<interface>>
+optimize(PlanNode plan, Session session, TypeProvider types, VariableAllocator variableAllocator, PlanNodeIdAllocator idAllocator, WarningCollector warningCollector) PlanOptimizerResult
}
class OptimizeRowInPredicate {
-FunctionResolution functionResolution
+OptimizeRowInPredicate(Metadata metadata)
+optimize(PlanNode plan, Session session, TypeProvider types, VariableAllocator variableAllocator, PlanNodeIdAllocator idAllocator, WarningCollector warningCollector) PlanOptimizerResult
}
class Rewriter {
-FunctionResolution functionResolution
-PlanNodeIdAllocator idAllocator
-boolean planChanged
+Rewriter(FunctionResolution functionResolution, PlanNodeIdAllocator idAllocator)
+isPlanChanged() boolean
+visitFilter(FilterNode node, RewriteContext context) PlanNode
-rewritePredicate(RowExpression predicate) RowExpression
-tryRewriteRowIn(SpecialFormExpression inExpr) Optional~RowExpression~
-tryRewriteRowNotIn(SpecialFormExpression inExpr) Optional~RowExpression~
-extractRowFields(SpecialFormExpression inExpr) Optional~RowFields~
-buildColumnInPredicate(RowFields rowFields, int fieldIdx) SpecialFormExpression
}
class RowFields {
+List~VariableReferenceExpression~ fieldVars
+List~SpecialFormExpression~ candidateRows
+RowFields(List~VariableReferenceExpression~ fieldVars, List~SpecialFormExpression~ candidateRows)
}
class SimplePlanRewriter {
+rewriteWith(SimplePlanRewriter rewriter, PlanNode plan, Object context) PlanNode
}
OptimizeRowInPredicate ..> FunctionResolution
OptimizeRowInPredicate ..> Metadata
PlanOptimizer <|.. OptimizeRowInPredicate
SimplePlanRewriter <|-- Rewriter
OptimizeRowInPredicate o--> Rewriter
Rewriter o--> RowFields
Rewriter ..> FilterNode
Rewriter ..> ProjectNode
Rewriter ..> TableScanNode
Rewriter ..> RowExpression
Rewriter ..> SpecialFormExpression
Rewriter ..> CallExpression
Rewriter ..> VariableReferenceExpression
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The current
rewritePredicateonly descends through top-level AND-conjuncts; anyROW ... IN/NOT INnested under ORs or other logical structures will be missed, so consider a more general recursive walk overRowExpressiontrees if you want broader coverage without relying onextractConjuncts. - Detection of NOT in
rewritePredicateis based onCallExpression.getDisplayName().equalsIgnoreCase("not"), which is a bit brittle; using the resolved function identity (or a dedicated helper fromFunctionResolution) would make this match more robust against renames or different display names.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The current `rewritePredicate` only descends through top-level AND-conjuncts; any `ROW ... IN/NOT IN` nested under ORs or other logical structures will be missed, so consider a more general recursive walk over `RowExpression` trees if you want broader coverage without relying on `extractConjuncts`.
- Detection of NOT in `rewritePredicate` is based on `CallExpression.getDisplayName().equalsIgnoreCase("not")`, which is a bit brittle; using the resolved function identity (or a dedicated helper from `FunctionResolution`) would make this match more robust against renames or different display names.
## Individual Comments
### Comment 1
<location path="presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java" line_range="8611" />
<code_context>
- tester = null;
- }
-
- @Test
- public void testRewriteEnablesPartitionPruningViaTupleDomain()
- {
</code_context>
<issue_to_address>
**suggestion (testing):** Add ROW IN/NOT IN end-to-end cases involving NULLs to validate correctness under three-valued logic
The new `testOptimizeRowInPredicate`/`testOptimizeRowNotInPredicate` cover multiple shapes but only with non-NULL values. Please add a few end-to-end cases where at least one row element is NULL (e.g., `(c1, c2) IN (ROW('a', NULL), ...)` and `(c1, c2) NOT IN (ROW('a', NULL))`) to verify ROW IN/NOT IN behavior under three-valued logic when the optimizer rewrites to per-column predicates.
Suggested implementation:
```java
// Multi-column ROW IN
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) IN (('O', 370), ('F', 781), ('P', 1234))",
disabled);
// Multi-column ROW IN with NULL in a non-leading column
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) IN (ROW('O', CAST(NULL AS bigint)), ROW('F', 781))",
disabled);
// Multi-column ROW IN with NULL in a leading column
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) IN (ROW(CAST(NULL AS varchar), 370), ROW('F', 781))",
disabled);
// Multi-column ROW NOT IN with NULL in a non-leading column
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) NOT IN (ROW('O', CAST(NULL AS bigint)))",
disabled);
// Multi-column ROW NOT IN with NULL in a leading column
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) NOT IN (ROW(CAST(NULL AS varchar), 370))",
disabled);
```
If there are other ROW IN/NOT IN shapes tested later in `testOptimizeRowInPredicate` (e.g., different arities or nested ROWs), consider adding analogous NULL-containing variants for those shapes as well, following the same `assertQueryWithSameQueryRunner(enabled, ..., disabled)` pattern to exercise three-valued logic across all rewrite cases.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| disabledSession); | ||
| } | ||
|
|
||
| @Test |
There was a problem hiding this comment.
suggestion (testing): Add ROW IN/NOT IN end-to-end cases involving NULLs to validate correctness under three-valued logic
The new testOptimizeRowInPredicate/testOptimizeRowNotInPredicate cover multiple shapes but only with non-NULL values. Please add a few end-to-end cases where at least one row element is NULL (e.g., (c1, c2) IN (ROW('a', NULL), ...) and (c1, c2) NOT IN (ROW('a', NULL))) to verify ROW IN/NOT IN behavior under three-valued logic when the optimizer rewrites to per-column predicates.
Suggested implementation:
// Multi-column ROW IN
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) IN (('O', 370), ('F', 781), ('P', 1234))",
disabled);
// Multi-column ROW IN with NULL in a non-leading column
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) IN (ROW('O', CAST(NULL AS bigint)), ROW('F', 781))",
disabled);
// Multi-column ROW IN with NULL in a leading column
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) IN (ROW(CAST(NULL AS varchar), 370), ROW('F', 781))",
disabled);
// Multi-column ROW NOT IN with NULL in a non-leading column
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) NOT IN (ROW('O', CAST(NULL AS bigint)))",
disabled);
// Multi-column ROW NOT IN with NULL in a leading column
assertQueryWithSameQueryRunner(enabled,
"SELECT count(*) FROM orders WHERE (orderstatus, custkey) NOT IN (ROW(CAST(NULL AS varchar), 370))",
disabled);If there are other ROW IN/NOT IN shapes tested later in testOptimizeRowInPredicate (e.g., different arities or nested ROWs), consider adding analogous NULL-containing variants for those shapes as well, following the same assertQueryWithSameQueryRunner(enabled, ..., disabled) pattern to exercise three-valued logic across all rewrite cases.
090bb54 to
5130fe8
Compare
steveburnett
left a comment
There was a problem hiding this comment.
Thank you for the documentation! Just one nit of formatting.
…tor IN/NOT IN
Adds OptimizeRowInPredicate, a PlanOptimizer that rewrites:
ROW(c1, c2) IN (ROW('a', 1), ROW('b', 2))
into:
(c1 IN ('a', 'b') AND c2 IN (1, 2))
AND ((c1 = 'a' AND c2 = 1) OR (c1 = 'b' AND c2 = 2))
ROW NOT IN is handled symmetrically:
ROW(c1, c2) NOT IN (ROW('a', 1), ROW('b', 2))
into:
(c1 NOT IN ('a', 'b') OR c2 NOT IN (1, 2)
OR ROW(c1, c2) NOT IN (ROW('a', 1), ROW('b', 2)))
The added per-column predicates let the domain translator extract per-column
constraints, enabling partition pruning, predicate pushdown, and join
optimization. The original ROW IN/NOT IN structure is preserved for correctness.
Only fires when the filter sits directly on a TableScan (optionally through
ProjectNodes). Runs once before PickTableLayout. Gated by the
optimize_row_in_predicate session property, disabled by default.
== RELEASE NOTES ==
General Changes
* Add :func:\`optimize_row_in_predicate\` session property that rewrites
multi-column ROW IN/NOT IN predicates to expose per-column IN/NOT IN
predicates, enabling partition pruning and other domain-based optimizations.
5130fe8 to
532f9a0
Compare
|
Thanks for catching that! Fixed the RST formatting in 532f9a0 — replaced the |
Yes! I've learned that the best way to fix this kind of problem - and a lot of problems with English grammar unrelated to formatting - is to rewrite the sentence to avoid and fix the problem. This works for me, thank you for fixing it so quickly. |
steveburnett
left a comment
There was a problem hiding this comment.
LGTM! (docs)
Pull updated branch, verified the formatting is fixed in a new local doc build. Everything looks good. Thanks!
jja725
left a comment
There was a problem hiding this comment.
The ROW IN optimization LGTM but have some doubt for the NOT IN path
| functionResolution.notFunction(), | ||
| BOOLEAN, | ||
| ImmutableList.of(buildColumnInPredicate(rowFields, fieldIdx)))); | ||
| } |
There was a problem hiding this comment.
do we have tests in meta environment to demonstrate the effectiveness of decomposing ROW NOT IN? I don't think this would help any partition pruning since we still have original ROW NOT IN. If it's not helpful we can just support the ROW IN case.
Summary
OptimizeRowInPredicate, aPlanOptimizer(inoptimizations/, runs beforePickTableLayout) that rewritesROW(c1, c2) IN (ROW('a', 1), ROW('b', 2))to add per-columnINpredicates(c1 IN ('a', 'b') AND c2 IN (1, 2))alongside the originalOR-of-ANDs expansion.ROW NOT INis rewritten symmetrically with per-columnNOT INdisjuncts plus the originalROW NOT INas a safety net.TableScan(optionally throughProjectNodes) — otherwise the rewrite just bloats the predicate without payoff. Gated by theoptimize_row_in_predicatesession property, disabled by default.RewriteRowConstructorInToDisjunctionrule (which produced only the OR-of-ANDs and didn't help domain extraction by itself).Test plan
TestOptimizeRowInPredicate: ROW IN rewrite shape, ROW NOT IN rewrite shape, gating session property, non-ROW-constructor target no-op, recursion under top-level AND and OR (viaRowExpressionTreeRewriter), partition-key domain extraction (usingRowExpressionDomainTranslator), filter-not-on-TableScan no-opTestFeaturesConfigcovers the newoptimize_row_in_predicatesession propertyAbstractTestQueries(testOptimizeRowInPredicate,testOptimizeRowNotInPredicate) compare results with optimization enabled vs. disabled across multi-column, AND-conjunction, single-row, three-column, partition-key-shaped (varchar enum + date), and lineitem(returnflag, linestatus)cases — verified viaTestLocalQueries(530/530 green)presto-docs/.../properties-session.rst== RELEASE NOTES ==
General Changes
optimize_row_in_predicatesession property that rewrites multi-column ROW IN/NOT IN predicates to expose per-column IN/NOT IN predicates, enabling partition pruning and other domain-based optimizations.