[SPARK-2961][SQL] Use statistics to prune batches within cached partitions #2188
QA tests have started for PR 2188 at commit
QA tests have finished for PR 2188 at commit
Scala style check failed, although the code style is actually OK...
Tried to combine Michael's partition pruning branch and the batched column buffer building. In this way, we actually got "batch pruning" rather than partition pruning.

Conflicts:
    sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
* Bugfix: gatherStats() should be called in NullableColumnBuilder, otherwise null values are skipped
* Bugfix: fixed lower bound comparison in StringColumnStats and TimestampColumnStats
270ca61 to 062c315
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions._

val buildFilter: PartialFunction[Expression, Expression] = {
Can we document the contract for these filters? "Returns false iff it is impossible for the input expression to evaluate to true based on statistics collected about this partition"?
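One way to read the proposed contract: the statistics-based filter may return true for a batch that turns out to contain no matching row, but it must never return false for a batch that does contain one. A minimal sketch in plain Scala, assuming a hypothetical `IntStats` that holds a per-batch lower and upper bound (these names are illustrative and are not the classes in this PR):

```scala
// Hypothetical per-batch statistics for one Int column; not the PR's ColumnStats classes.
final case class IntStats(lower: Int, upper: Int)

object FilterContract {
  // Collects min/max over a non-empty batch.
  def gather(batch: Seq[Int]): IntStats = IntStats(batch.min, batch.max)

  // Stats-level counterpart of `column = value`:
  // false only when the bounds rule out every row in the batch.
  def mightContain(stats: IntStats, value: Int): Boolean =
    stats.lower <= value && value <= stats.upper

  // Brute-force check of the contract for one batch and one literal:
  // whenever the stats filter says false, no row may satisfy the predicate.
  def contractHolds(batch: Seq[Int], value: Int): Boolean =
    mightContain(gather(batch), value) || !batch.contains(value)
}
```

For the batch `Seq(3, 7, 9)`, `mightContain` returns true for 5 even though no row equals 5. That false positive is allowed under the contract and only costs a scan of the batch; a false negative would silently drop rows.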
Thanks for taking this over! A few minor comments.
@@ -31,7 +31,7 @@ class BooleanBitSetSuite extends FunSuite {
   // Tests encoder
   // -------------

-  val builder = TestCompressibleColumnBuilder(new BooleanColumnStats, BOOLEAN, BooleanBitSet)
+  val builder = TestCompressibleColumnBuilder(new NoopColumnStats, BOOLEAN, BooleanBitSet)
laziness :) We should probably implement the statistics that make sense here.
@marmbrus Addressed all the comments, thanks for the detailed review!
ok to test
…tions

This PR is based on apache#1883 authored by marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When apache#1883 was authored, batched column buffer building (apache#1880) hadn't been introduced. This PR combines the two and provides partition-batch-level pruning, which leads to a smaller memory footprint and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (the number of partitions times the number of batches per partition).

2. More filters are supported

   Filter predicates consisting of `=`, `<`, `<=`, `>`, `>=` and their conjunctions and disjunctions are supported.

Author: Cheng Lian <[email protected]>

Closes apache#2188 from liancheng/in-mem-batch-pruning and squashes the following commits:

68cf019 [Cheng Lian] Marked sqlContext as @transient
4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite
3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext
d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default
062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup
16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions
16195c5 [Cheng Lian] Enabled both disjunction and conjunction
89950d0 [Cheng Lian] Worked around Scala style check
9c167f6 [Cheng Lian] Minor code cleanup
3c4d5c7 [Cheng Lian] Minor code cleanup
ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
fc517d0 [Cheng Lian] More test cases
1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests
cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes
385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning
This PR is based on #1883 authored by @marmbrus. Key differences:
Batch pruning instead of partition pruning
When #1883 ([WIP][SPARK-2961][SQL] Use statistics to skip cached partitions) was authored, batched column buffer building (#1880, [SPARK-2650][SQL] Build column buffers in smaller batches) hadn't been introduced. This PR combines the two and provides partition-batch-level pruning, which leads to a smaller memory footprint and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (the number of partitions times the number of batches per partition).
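To illustrate this trade-off, here is a rough sketch in plain Scala. It assumes each cached partition is a sequence of batches, each carrying min/max bounds for a single Int column; `Batch`, `cache` and `scanEquals` are hypothetical names for this example only and are not the PR's `InMemoryColumnarTableScan` code.

```scala
// Hypothetical model: a non-empty batch of Int values plus the bounds gathered when it was built.
final case class Batch(rows: Vector[Int]) {
  val lower: Int = rows.min
  val upper: Int = rows.max
}

object BatchPruning {
  // Splits each partition into batches of at most `batchSize` rows.
  def cache(partitions: Seq[Vector[Int]], batchSize: Int): Seq[Seq[Batch]] =
    partitions.map(_.grouped(batchSize).map(Batch(_)).toVector)

  // Scans for `column = value`.
  // Returns (matching rows, pruning-predicate evaluations, batches skipped).
  def scanEquals(cached: Seq[Seq[Batch]], value: Int): (Vector[Int], Int, Int) = {
    var evaluated = 0
    var skipped = 0
    val hits = Vector.newBuilder[Int]
    for (partition <- cached; batch <- partition) {
      evaluated += 1 // one stats check per batch, not per partition
      if (batch.lower <= value && value <= batch.upper) {
        hits ++= batch.rows.filter(_ == value)
      } else {
        skipped += 1 // the whole batch is ruled out without reading its rows
      }
    }
    (hits.result(), evaluated, skipped)
  }
}
```

With two partitions of ten rows each (`1 to 10` and `11 to 20`) and a batch size of 5, looking up the value 7 evaluates the pruning predicate four times (2 partitions times 2 batches) and skips three of the four batches. Pruning whole partitions would need only two evaluations, but could skip just one partition, leaving ten rows to scan instead of five.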
More filters are supported
Filter predicates consisting of `=`, `<`, `<=`, `>`, `>=` and their conjunctions and disjunctions are supported.
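Each of these row-level predicates can be read as a check on a batch's lower and upper bounds. The sketch below is a toy version in plain Scala, assuming a single Int column compared against a literal and hypothetical `Pred` and `Stats` types; it is not the PR's `buildFilter`, which operates on Catalyst expressions.

```scala
// Hypothetical predicate AST over a single Int column compared with a literal.
sealed trait Pred
final case class Equal(v: Int) extends Pred
final case class Less(v: Int) extends Pred
final case class LessEq(v: Int) extends Pred
final case class Greater(v: Int) extends Pred
final case class GreaterEq(v: Int) extends Pred
final case class And(l: Pred, r: Pred) extends Pred
final case class Or(l: Pred, r: Pred) extends Pred

// Hypothetical per-batch bounds for that column.
final case class Stats(lower: Int, upper: Int)

object StatsFilter {
  // True when some row in a batch with these bounds might satisfy `p`;
  // false only when the bounds rule out every row.
  def mightMatch(p: Pred, s: Stats): Boolean = p match {
    case Equal(v)     => s.lower <= v && v <= s.upper
    case Less(v)      => s.lower < v
    case LessEq(v)    => s.lower <= v
    case Greater(v)   => s.upper > v
    case GreaterEq(v) => s.upper >= v
    case And(l, r)    => mightMatch(l, s) && mightMatch(r, s)
    case Or(l, r)     => mightMatch(l, s) || mightMatch(r, s)
  }
}
```

Note that the conjunction case is conservative: for bounds `Stats(10, 20)`, `And(Greater(15), Less(12))` still returns true because each side is individually possible, even though no single row can satisfy both. That only costs an unnecessary scan and never drops a matching row.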