liancheng
Contributor

This PR is based on #1883 authored by @marmbrus. Key differences:

  1. Batch pruning instead of partition pruning

    When #1883 ([WIP][SPARK-2961][SQL] Use statistics to skip cached partitions) was authored, batched column buffer building (#1880, [SPARK-2650][SQL] Build column buffers in smaller batches) had not yet been introduced. This PR combines the two and provides batch-level pruning within each partition, which leads to a smaller memory footprint and can generally skip more elements. The cost is that the pruning predicates are evaluated more often: once per batch, i.e. the number of partitions multiplied by the number of batches per partition.

  2. More filters are supported

    Filter predicates consisting of =, <, <=, >, >= and their conjunctions and disjunctions are supported.
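
As a rough illustration of the first point, the sketch below keeps min/max statistics per batch and skips any batch whose upper bound rules out a `value > threshold` filter. All names here (`BatchStats`, `CachedBatch`, `BatchPruningSketch`) are hypothetical and are not the classes touched by this PR; it assumes a single integer column and non-empty batches.

```scala
// Hypothetical types for illustration only; not the classes used in this PR.
final case class BatchStats(lower: Int, upper: Int)

final case class CachedBatch(values: Vector[Int]) {
  // Assumes a non-empty batch.
  val stats: BatchStats = BatchStats(values.min, values.max)
}

object BatchPruningSketch {
  /** Splits every partition into batches of at most `batchSize` rows. */
  def buildBatches(partitions: Seq[Vector[Int]], batchSize: Int): Seq[Seq[CachedBatch]] =
    partitions.map(p => p.grouped(batchSize).map(rows => CachedBatch(rows)).toVector)

  /** Scans for `value > threshold`; returns the matching rows and the number of skipped batches. */
  def scanGreaterThan(cache: Seq[Seq[CachedBatch]], threshold: Int): (Vector[Int], Int) = {
    // The statistics check runs once per batch, so it is evaluated
    // (number of partitions) * (batches per partition) times in total.
    val (kept, pruned) = cache.flatten.partition(b => b.stats.upper > threshold)
    (kept.flatMap(b => b.values.filter(v => v > threshold)).toVector, pruned.size)
  }
}
```

With two partitions holding 1 to 10 and 11 to 20 and a batch size of 5, a threshold of 17 leaves a single batch to scan and skips the other three, whereas partition-level statistics could only have skipped the first partition.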

@SparkQA
SparkQA commented Aug 29, 2014

QA tests have started for PR 2188 at commit b4a8281.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Aug 29, 2014

QA tests have finished for PR 2188 at commit b4a8281.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

@liancheng
Contributor Author

Scala style check failed, although the code style is actually OK...

@SparkQA
SparkQA commented Aug 29, 2014

QA tests have started for PR 2188 at commit 9bd234b.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Aug 29, 2014

QA tests have finished for PR 2188 at commit 9bd234b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

@SparkQA
SparkQA commented Aug 29, 2014

QA tests have started for PR 2188 at commit b6f9f6c.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Aug 29, 2014

QA tests have finished for PR 2188 at commit b6f9f6c.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

Tried to combine Michael's partition pruning branch and the batched
column buffer building. In this way, we actually got "batch pruning"
rather than partition pruning.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
* Bugfix: gatherStats() should be called in NullableColumnBuilder,
  otherwise null values are skipped
* Bugfix: fixed lower bound comparison in StringColumnStats and
  TimestampColumnStats
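
To make the two bugfixes above concrete, here is a minimal sketch under stated assumptions: `StringStatsSketch` and `NullableBuilderSketch` are hypothetical stand-ins, not the PR's `StringColumnStats` or `NullableColumnBuilder`. It shows statistics being gathered before the null check, so nulls are counted, and a lower bound that only moves when a value sorts before it.

```scala
// Hypothetical collector; names and structure are illustrative, not the PR's classes.
final class StringStatsSketch {
  var lower: Option[String] = None
  var upper: Option[String] = None
  var nullCount: Int = 0
  var count: Int = 0

  /** Must be called for every appended value, including nulls (modelled as None). */
  def gatherStats(value: Option[String]): Unit = {
    count += 1
    value match {
      case None => nullCount += 1
      case Some(v) =>
        // The lower bound moves only when v sorts before it,
        // the upper bound only when v sorts after it.
        if (lower.forall(lo => v.compareTo(lo) < 0)) lower = Some(v)
        if (upper.forall(hi => v.compareTo(hi) > 0)) upper = Some(v)
    }
  }
}

// Hypothetical nullable builder that stores only non-null values.
final class NullableBuilderSketch {
  val stats = new StringStatsSketch
  private val nonNull = Vector.newBuilder[String]

  def append(value: Option[String]): Unit = {
    // Gather statistics before the null check; doing it after an early
    // return for nulls would leave null values out of the statistics.
    stats.gatherStats(value)
    value.foreach(v => nonNull += v)
  }

  def result(): Vector[String] = nonNull.result()
}
```

Appending "b", a null, "a" and "c" in that order should leave the bounds at "a" and "c" with one null counted out of four values.
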
@SparkQA
SparkQA commented Aug 30, 2014

QA tests have started for PR 2188 at commit 270ca61.

  • This patch does not merge cleanly!

@liancheng liancheng force-pushed the in-mem-batch-pruning branch from 270ca61 to 062c315 Compare August 30, 2014 08:54
@SparkQA
SparkQA commented Aug 30, 2014

QA tests have started for PR 2188 at commit 062c315.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Aug 30, 2014

QA tests have finished for PR 2188 at commit 270ca61.

  • This patch passes unit tests.
  • This patch does not merge cleanly!

@SparkQA
SparkQA commented Aug 30, 2014

QA tests have finished for PR 2188 at commit 062c315.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions._

val buildFilter: PartialFunction[Expression, Expression] = {
Contributor

Can we document the contract for these filters? "Returns false iff it is impossible for the input expression to evaluate to true based on statistics collected about this partition"?
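
One way to read the proposed contract is sketched below. It uses a hypothetical miniature predicate tree (`Pred`, `Stats`, `mightMatch`) rather than Catalyst expressions, and assumes a single integer column with min/max statistics. The check is conservative: `true` only means a match cannot be ruled out, while `false` means no row in the batch can satisfy the predicate.

```scala
// Hypothetical miniature predicate tree over one Int column;
// the PR itself builds filters from Catalyst expressions.
sealed trait Pred
final case class Eq(value: Int) extends Pred   // column = value
final case class Lt(value: Int) extends Pred   // column < value
final case class LtEq(value: Int) extends Pred // column <= value
final case class Gt(value: Int) extends Pred   // column > value
final case class GtEq(value: Int) extends Pred // column >= value
final case class And(left: Pred, right: Pred) extends Pred
final case class Or(left: Pred, right: Pred) extends Pred

// Min/max statistics collected for one batch (assumed names).
final case class Stats(lower: Int, upper: Int)

object PruningContractSketch {
  /** Returns false only if no row in a batch with statistics `s` can satisfy `p`. */
  def mightMatch(p: Pred, s: Stats): Boolean = p match {
    case Eq(v)     => s.lower <= v && v <= s.upper
    case Lt(v)     => s.lower < v
    case LtEq(v)   => s.lower <= v
    case Gt(v)     => s.upper > v
    case GtEq(v)   => s.upper >= v
    case And(l, r) => mightMatch(l, s) && mightMatch(r, s)
    case Or(l, r)  => mightMatch(l, s) || mightMatch(r, s)
  }
}
```

A batch is skipped exactly when `mightMatch` returns `false`; a `true` result still requires evaluating the original filter on the rows, since min/max bounds cannot prove that a matching row exists.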

@marmbrus
Contributor

marmbrus commented Sep 3, 2014

Thanks for taking this over! A few minor comments.

@@ -31,7 +31,7 @@ class BooleanBitSetSuite extends FunSuite {
// Tests encoder
// -------------

- val builder = TestCompressibleColumnBuilder(new BooleanColumnStats, BOOLEAN, BooleanBitSet)
+ val builder = TestCompressibleColumnBuilder(new NoopColumnStats, BOOLEAN, BooleanBitSet)
Contributor Author

@marmbrus Would you mind elaborating a bit on why you changed BooleanColumnStats to NoopColumnStats in #1883?

Contributor

laziness :) We should probably implement the statistics that make sense here.
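
For reference, statistics that "make sense" for a Boolean column could look roughly like the sketch below. This is only an assumption about what such statistics might track (whether `true` and `false` have each been seen, with `false` ordered before `true`); `BooleanStatsSketch` is a hypothetical name and not code from this PR or #1883.

```scala
// Hypothetical sketch: min/max statistics for a Boolean column, ordering false < true.
final class BooleanStatsSketch {
  private var seenFalse = false
  private var seenTrue = false

  /** Records one non-null value. */
  def gatherStats(value: Boolean): Unit =
    if (value) seenTrue = true else seenFalse = true

  /** Smallest value seen so far, or None if nothing has been recorded. */
  def lower: Option[Boolean] =
    if (seenFalse) Some(false) else if (seenTrue) Some(true) else None

  /** Largest value seen so far, or None if nothing has been recorded. */
  def upper: Option[Boolean] =
    if (seenTrue) Some(true) else if (seenFalse) Some(false) else None

  /** Returns false only if no recorded value equals `value`. */
  def mightContain(value: Boolean): Boolean =
    if (value) seenTrue else seenFalse
}
```

A batch holding only `true` values could then be skipped for a `column = false` filter, because `mightContain(false)` returns `false`.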

@liancheng
Contributor Author

@marmbrus Addressed all the comments, thanks for the detailed review!

@SparkQA
SparkQA commented Sep 3, 2014

QA tests have started for PR 2188 at commit d2a1d66.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Sep 3, 2014

QA tests have finished for PR 2188 at commit d2a1d66.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

@SparkQA
SparkQA commented Sep 3, 2014

QA tests have started for PR 2188 at commit 4254f6c.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Sep 3, 2014

QA tests have finished for PR 2188 at commit 4254f6c.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

@liancheng
Contributor Author

ok to test

@SparkQA
SparkQA commented Sep 3, 2014

QA tests have started for PR 2188 at commit 68cf019.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Sep 4, 2014

QA tests have finished for PR 2188 at commit 68cf019.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SparkListenerBlockManagerAdded(time: Long, blockManagerId: BlockManagerId, maxMem: Long)
    • case class SparkListenerBlockManagerRemoved(time: Long, blockManagerId: BlockManagerId)
    • case class SparkListenerApplicationStart(appName: String, appId: Option[String], time: Long,
    • class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

@asfgit asfgit closed this in 248067a Sep 4, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…tions

This PR is based on apache#1883 authored by marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When apache#1883 was authored, batched column buffer building (apache#1880) had not yet been introduced. This PR combines the two and provides batch-level pruning within each partition, which leads to a smaller memory footprint and can generally skip more elements. The cost is that the pruning predicates are evaluated more often: once per batch, i.e. the number of partitions multiplied by the number of batches per partition.

2. More filters are supported

   Filter predicates consisting of `=`, `<`, `<=`, `>`, `>=` and their conjunctions and disjunctions are supported.

Author: Cheng Lian

Closes apache#2188 from liancheng/in-mem-batch-pruning and squashes the following commits:

68cf019 [Cheng Lian] Marked sqlContext as @transient
4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite
3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext
d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default
062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup
16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions
16195c5 [Cheng Lian] Enabled both disjunction and conjunction
89950d0 [Cheng Lian] Worked around Scala style check
9c167f6 [Cheng Lian] Minor code cleanup
3c4d5c7 [Cheng Lian] Minor code cleanup
ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
fc517d0 [Cheng Lian] More test cases
1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests
cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes
385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning
@liancheng liancheng deleted the in-mem-batch-pruning branch September 24, 2014 00:05