[SPARK-2961][SQL] Use statistics to prune batches within cached partitions #2188

liancheng · 2014-08-29T00:38:05Z

This PR is based on #1883 authored by @marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When #1883 ([WIP][SPARK-2961][SQL] Use statistics to skip cached partitions) was authored, batched column buffer building (#1880, [SPARK-2650][SQL] Build column buffers in smaller batches) hadn't been introduced. This PR combines the two and provides pruning at the partition batch level, which leads to a smaller memory footprint and can generally skip more elements. The cost is that the pruning predicates are evaluated more often: once per batch, i.e. the number of partitions times the number of batches per partition.

2. More filters are supported

   Filter predicates consisting of `=`, `<`, `<=`, `>`, `>=` and their conjunctions and disjunctions are supported.
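To make the batch-level idea concrete, here is a minimal, self-contained sketch. It is not the code in this PR: `BatchStats`, `CachedBatch`, `buildBatches`, and `scanGreaterThan` are hypothetical names, and a single `Int` column stands in for a cached table. Each partition is split into fixed-size batches, min/max statistics are recorded per batch, and a `>` predicate skips every batch whose upper bound rules out a match.

```scala
object BatchPruningSketch {
  // Hypothetical per-batch statistics for a single Int column.
  final case class BatchStats(min: Int, max: Int)

  // A cached batch: its rows plus the statistics gathered while building it.
  final case class CachedBatch(rows: Vector[Int], stats: BatchStats)

  // Splits one partition into fixed-size batches and records min/max per batch.
  def buildBatches(partition: Seq[Int], batchSize: Int): Vector[CachedBatch] =
    partition.grouped(batchSize).map { chunk =>
      CachedBatch(chunk.toVector, BatchStats(chunk.min, chunk.max))
    }.toVector

  // Evaluates `value > threshold`, skipping batches whose max rules out a match.
  // Returns the matching rows and the number of batches actually scanned.
  def scanGreaterThan(batches: Seq[CachedBatch], threshold: Int): (Vector[Int], Int) = {
    val survivors = batches.filter(_.stats.max > threshold)
    (survivors.flatMap(_.rows).filter(_ > threshold).toVector, survivors.size)
  }
}
```

For a partition holding `1 to 100` cut into batches of 10, `scanGreaterThan(batches, 95)` scans only the last batch. With whole-partition statistics (min 1, max 100) nothing could have been skipped, which is the trade-off described above: more predicate evaluations in exchange for finer-grained skipping.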

SparkQA · 2014-08-29T00:44:13Z

QA tests have started for PR 2188 at commit b4a8281.

This patch merges cleanly.

SparkQA · 2014-08-29T00:45:12Z

QA tests have finished for PR 2188 at commit b4a8281.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

liancheng · 2014-08-29T00:54:57Z

Scala style check failed, although the code style is actually OK...

SparkQA · 2014-08-29T00:59:20Z

QA tests have started for PR 2188 at commit 9bd234b.

This patch merges cleanly.

SparkQA · 2014-08-29T02:23:43Z

QA tests have finished for PR 2188 at commit 9bd234b.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

SparkQA · 2014-08-29T08:49:08Z

QA tests have started for PR 2188 at commit b6f9f6c.

This patch merges cleanly.

SparkQA · 2014-08-29T10:03:51Z

QA tests have finished for PR 2188 at commit b6f9f6c.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

Tried to combine Michael's partition pruning branch and the batched column buffer building. In this way, we actually got "batch pruning" rather than partition pruning.

Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala


SparkQA · 2014-08-30T08:54:11Z

QA tests have started for PR 2188 at commit 270ca61.

This patch does not merge cleanly!

SparkQA · 2014-08-30T08:59:09Z

QA tests have started for PR 2188 at commit 062c315.

This patch merges cleanly.

SparkQA · 2014-08-30T10:19:16Z

QA tests have finished for PR 2188 at commit 270ca61.

This patch passes unit tests.
This patch does not merge cleanly!

SparkQA · 2014-08-30T10:23:28Z

QA tests have finished for PR 2188 at commit 062c315.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2014-09-03T04:05:20Z

sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala

+  import org.apache.spark.sql.catalyst.dsl.expressions._
+  import org.apache.spark.sql.catalyst.expressions._
+
+  val buildFilter: PartialFunction[Expression, Expression] = {


Can we document the contract for these filters? "Returns false iff it is impossible for the input expression to evaluate to true based on statistics collected about this partition"?
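One way to read the proposed contract is sketched below. This is an illustration only, not the `buildFilter` in this PR: the `Predicate` types, `ColumnStats`, and `mightMatch` are hypothetical stand-ins for Catalyst expressions and the collected statistics. The sketch guarantees only the safe direction: it returns `false` only when the min/max bounds prove that no row in the batch can satisfy the predicate, and returns `true` whenever it cannot prove that (including when no statistics exist for a column).

```scala
object StatsFilterSketch {
  // Hypothetical min/max statistics for one Int column in one batch.
  final case class ColumnStats(lower: Int, upper: Int)

  // A tiny predicate language standing in for Catalyst expressions.
  sealed trait Predicate
  final case class EqualTo(column: String, value: Int) extends Predicate
  final case class LessThan(column: String, value: Int) extends Predicate
  final case class LessThanOrEqual(column: String, value: Int) extends Predicate
  final case class GreaterThan(column: String, value: Int) extends Predicate
  final case class GreaterThanOrEqual(column: String, value: Int) extends Predicate
  final case class And(left: Predicate, right: Predicate) extends Predicate
  final case class Or(left: Predicate, right: Predicate) extends Predicate

  // Returns false only when the statistics prove the predicate cannot hold for
  // any row in the batch. Missing statistics never cause a batch to be pruned.
  def mightMatch(p: Predicate, stats: Map[String, ColumnStats]): Boolean = p match {
    case And(l, r)                => mightMatch(l, stats) && mightMatch(r, stats)
    case Or(l, r)                 => mightMatch(l, stats) || mightMatch(r, stats)
    case EqualTo(c, v)            => stats.get(c).forall(s => s.lower <= v && v <= s.upper)
    case LessThan(c, v)           => stats.get(c).forall(_.lower < v)
    case LessThanOrEqual(c, v)    => stats.get(c).forall(_.lower <= v)
    case GreaterThan(c, v)        => stats.get(c).forall(_.upper > v)
    case GreaterThanOrEqual(c, v) => stats.get(c).forall(_.upper >= v)
  }
}
```

A batch is scanned exactly when `mightMatch` returns `true`, so an overly cautious answer costs only time, while a wrong `false` would silently drop rows.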

marmbrus · 2014-09-03T04:20:10Z

Thanks for taking this over! A few minor comments.

liancheng · 2014-09-03T05:52:39Z

sql/core/src/test/scala/org/apache/spark/sql/columnar/compression/BooleanBitSetSuite.scala

@@ -31,7 +31,7 @@ class BooleanBitSetSuite extends FunSuite {
    // Tests encoder
    // -------------

-    val builder = TestCompressibleColumnBuilder(new BooleanColumnStats, BOOLEAN, BooleanBitSet)
+    val builder = TestCompressibleColumnBuilder(new NoopColumnStats, BOOLEAN, BooleanBitSet)


@marmbrus Would you mind elaborating a bit on why you changed BooleanColumnStats to NoopColumnStats in #1883?

laziness :) We should probably implement the statistics that make sense here.
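As a rough idea of what statistics "that make sense" for a boolean column could look like, here is a hypothetical sketch. `BooleanStats` and its methods are invented for illustration and do not follow Spark's `ColumnStats` interface; the only assumption is the ordering `false < true`.

```scala
object BooleanStatsSketch {
  // Hypothetical statistics collector for a Boolean column.
  final class BooleanStats {
    private var seenTrue = false
    private var seenFalse = false

    // Records one non-null value of the column.
    def gatherStats(value: Boolean): Unit =
      if (value) seenTrue = true else seenFalse = true

    // With false < true, bounds are defined once at least one value is seen.
    def lowerBound: Option[Boolean] =
      if (seenFalse) Some(false) else if (seenTrue) Some(true) else None

    def upperBound: Option[Boolean] =
      if (seenTrue) Some(true) else if (seenFalse) Some(false) else None

    // A batch can be skipped for `column = value` when this returns false.
    def mightContain(value: Boolean): Boolean =
      if (value) seenTrue else seenFalse
  }
}
```

With only two possible values, the bounds are equivalent to two flags, so a batch containing only `false` can be skipped for a `column = true` filter.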

liancheng · 2014-09-03T07:44:31Z

@marmbrus Addressed all the comments, thanks for the detailed review!

SparkQA · 2014-09-03T07:49:09Z

QA tests have started for PR 2188 at commit d2a1d66.

This patch merges cleanly.

SparkQA · 2014-09-03T09:04:21Z

QA tests have finished for PR 2188 at commit d2a1d66.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

SparkQA · 2014-09-03T20:39:21Z

QA tests have started for PR 2188 at commit 4254f6c.

This patch merges cleanly.

SparkQA · 2014-09-03T21:50:44Z

QA tests have finished for PR 2188 at commit 4254f6c.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

liancheng · 2014-09-03T23:24:27Z

ok to test

SparkQA · 2014-09-03T23:29:28Z

QA tests have started for PR 2188 at commit 68cf019.

This patch merges cleanly.

SparkQA · 2014-09-04T01:15:26Z

QA tests have finished for PR 2188 at commit 68cf019.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SparkListenerBlockManagerAdded(time: Long, blockManagerId: BlockManagerId, maxMem: Long)
- case class SparkListenerBlockManagerRemoved(time: Long, blockManagerId: BlockManagerId)
- case class SparkListenerApplicationStart(appName: String, appId: Option[String], time: Long,
- class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])


…tions

This PR is based on apache#1883 authored by marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When apache#1883 was authored, batched column buffer building (apache#1880) hadn't been introduced. This PR combines the two and provides pruning at the partition batch level, which leads to a smaller memory footprint and can generally skip more elements. The cost is that the pruning predicates are evaluated more often (the number of partitions times the number of batches per partition).

2. More filters are supported

   Filter predicates consisting of `=`, `<`, `<=`, `>`, `>=` and their conjunctions and disjunctions are supported.

Author: Cheng Lian

Closes apache#2188 from liancheng/in-mem-batch-pruning and squashes the following commits:

68cf019 [Cheng Lian] Marked sqlContext as @transient
4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite
3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext
d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default
062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup
16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions
16195c5 [Cheng Lian] Enabled both disjunction and conjunction
89950d0 [Cheng Lian] Worked around Scala style check
9c167f6 [Cheng Lian] Minor code cleanup
3c4d5c7 [Cheng Lian] Minor code cleanup
ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
fc517d0 [Cheng Lian] More test cases
1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests
cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes
385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning

marmbrus mentioned this pull request Aug 29, 2014

[WIP][SPARK-2961][SQL] Use statistics to skip cached partitions #1883

Closed

2 tasks

liancheng added 11 commits August 30, 2014 01:50

Added more predicate filters, fixed table scan stats for testing purposes

cb76da4

Code cleanup, bugfix, and adding tests

1868c18

* Bugfix: gatherStats() should be called in NullableColumnBuilder, otherwise null values are skipped
* Bugfix: fixed lower bound comparison in StringColumnStats and TimestampColumnStats

More test cases

fc517d0

Renamed PartitionSkippingSuite to PartitionBatchPruningSuite

ea59ee5

Minor code cleanup

3c4d5c7

Minor code cleanup

9c167f6

Worked around Scala style check

89950d0

Enabled both disjunction and conjunction

16195c5

Fixed pruning predication conjunctions and disjunctions

16b77bf

HiveCompatibilitySuite code cleanup

062c315

liancheng force-pushed the in-mem-batch-pruning branch from 270ca61 to 062c315 August 30, 2014 08:54

marmbrus reviewed Sep 3, 2014

liancheng reviewed Sep 3, 2014

Disables in-memory partition pruning by default

d2a1d66

liancheng added 2 commits September 3, 2014 13:31

Overrides InMemoryColumnarTableScan.sqlContext

3784105

Enables in-memory partition pruning in PartitionBatchPruningSuite

4254f6c

Marked sqlContext as @transient

68cf019

asfgit closed this in 248067a Sep 4, 2014

liancheng deleted the in-mem-batch-pruning branch September 24, 2014 00:05
