
Conversation

marmbrus
Contributor

No description provided.

@SparkQA

SparkQA commented Aug 10, 2014

QA tests have started for PR 1880. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18285/consoleFull

@marmbrus marmbrus changed the title [SPARK-2650] Build column buffers in smaller batches [SPARK-2650][SQL] Build column buffers in smaller batches Aug 10, 2014
@SparkQA

SparkQA commented Aug 10, 2014

QA results for PR 1880:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18285/consoleFull

@SparkQA

SparkQA commented Aug 10, 2014

QA tests have started for PR 1880. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18286/consoleFull

@SparkQA

SparkQA commented Aug 10, 2014

QA results for PR 1880:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18286/consoleFull

@SparkQA

SparkQA commented Aug 10, 2014

QA tests have started for PR 1880. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18288/consoleFull

@SparkQA

SparkQA commented Aug 11, 2014

QA results for PR 1880:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18288/consoleFull

// Find the ordinals of the requested columns. If none are requested, use the first.
val requestedColumns =
  if (attributes.isEmpty) {
    Seq(0)
Contributor

Maybe we can use the narrowest one instead of the 1st one by checking default sizes of columns:

val narrowest = relation.output.indices.minBy { i =>
  ColumnType(relation.output(i).dataType).defaultSize
}
Seq(narrowest)

Contributor Author

Yeah, that would be better. Really though I think we should use statistics from #1883 to skip decoding entirely.
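
For context, a sketch of how the scan's column selection could look with that narrowest-column fallback folded in. The non-empty branch and the exprId lookup are illustrative assumptions, not the PR's actual code:

val requestedColumns =
  if (attributes.isEmpty) {
    // Nothing requested (e.g. COUNT(*)): decode only the narrowest column.
    val narrowest = relation.output.indices.minBy { i =>
      ColumnType(relation.output(i).dataType).defaultSize
    }
    Seq(narrowest)
  } else {
    // Illustrative: look up the ordinal of each requested attribute by exprId.
    attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
  }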

@liancheng
Contributor

I believe this PR can alleviate OOMs a lot. Below are some ideas for making the in-memory columnar store more memory efficient; they can be done in separate PRs based on this one.

While building column buffers in batches, we still use 1MB as the initial buffer size for each column (defined as ColumnBuilder.DEFAULT_INITIAL_BUFFER_SIZE). If T tasks are running in parallel to squeeze a table with C columns into memory, we allocate at least T * C * 1MB per batch (e.g. 8 parallel tasks over a 100-column table reserve at least 800MB up front).

The initial column buffer size estimation used in Shark can be useful, but unfortunately the implementation is actually buggy and usually gives a fairly small initial buffer size. A more reasonable estimation heuristic could be:

  1. Let D[i] be the default size of the i-th column
  2. Let I = sum(D[i]) * batchSize
  3. Let the default buffer size for the i-th column be S[i] = I * D[i] / sum(D[i])

This estimation is precise for all primitive types whose default sizes equal their actual sizes, since the row count in a batch (i.e. batchSize) is known.
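
A minimal sketch of that heuristic, assuming only the per-column default sizes and the batch row count are known (the helper name and signature are hypothetical, not Spark API):

// Hypothetical helper: defaultSizes(i) corresponds to D[i] above.
def initialBufferSizes(defaultSizes: Seq[Int], batchSize: Int): Seq[Long] = {
  val rowSize    = defaultSizes.map(_.toLong).sum   // sum(D[i])
  val batchBytes = rowSize * batchSize              // I = sum(D[i]) * batchSize
  defaultSizes.map(d => batchBytes * d / rowSize)   // S[i] = I * D[i] / sum(D[i])
}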

@liancheng
Contributor

Ah, just realized I made things too complex... Just use columnType.defaultSize * batchSize as the initial column buffer size; it's equivalent to the verbose version above.
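
Substituting I back into step 3 shows the equivalence:

S[i] = I * D[i] / sum(D[i]) = (sum(D[i]) * batchSize) * D[i] / sum(D[i]) = D[i] * batchSize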

new Iterator[Array[ByteBuffer]] {
  def next() = {
    val columnBuilders = output.map { attribute =>
      ColumnBuilder(ColumnType(attribute.dataType).typeId, 0, attribute.name, useCompression)
Contributor

A more precise initial buffer size can be used here:

val columnType = ColumnType(attribute.dataType)
ColumnBuilder(columnType.typeId, columnType.defaultSize * batchSize, attribute.name, useCompression)

@marmbrus
Contributor Author

@liancheng, thanks for reviewing! Would you mind creating a JIRA/follow-up PR to set the defaults correctly as you propose?

@marmbrus
Contributor Author

Merged to master and 1.1

@asfgit asfgit closed this in bad21ed Aug 12, 2014
asfgit pushed a commit that referenced this pull request Aug 12, 2014
Author: Michael Armbrust <[email protected]>

Closes #1880 from marmbrus/columnBatches and squashes the following commits:

0649987 [Michael Armbrust] add test
4756fad [Michael Armbrust] fix compilation
2314532 [Michael Armbrust] Build column buffers in smaller batches

(cherry picked from commit bad21ed)
Signed-off-by: Michael Armbrust <[email protected]>
@liancheng
Contributor

Opened #1901 for precise initial buffer size estimation.

asfgit pushed a commit that referenced this pull request Aug 14, 2014
…memory column buffer

This is a follow-up of #1880.

Since the row number within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.

Author: Cheng Lian <[email protected]>

Closes #1901 from liancheng/precise-init-buffer-size and squashes the following commits:

d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer

(cherry picked from commit 376a82e)
Signed-off-by: Michael Armbrust <[email protected]>
asfgit pushed a commit that referenced this pull request Aug 14, 2014
…memory column buffer

This is a follow-up of #1880.

Since the row number within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.

Author: Cheng Lian <[email protected]>

Closes #1901 from liancheng/precise-init-buffer-size and squashes the following commits:

d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer
@marmbrus marmbrus deleted the columnBatches branch August 27, 2014 20:44
asfgit pushed a commit that referenced this pull request Sep 4, 2014
…tions

This PR is based on #1883 authored by marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When #1883 was authored, batched column buffer building (#1880) hadn't been introduced. This PR combines the two and provides partition-batch-level pruning, which leads to smaller memory footprints and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (partition count multiplied by the number of batches per partition).

2. More filters are supported

   Filter predicates consisting of `=`, `<`, `<=`, `>`, `>=`, and their conjunctions and disjunctions are supported.

Author: Cheng Lian <[email protected]>

Closes #2188 from liancheng/in-mem-batch-pruning and squashes the following commits:

68cf019 [Cheng Lian] Marked sqlContext as @transient
4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite
3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext
d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default
062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup
16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions
16195c5 [Cheng Lian] Enabled both disjunction and conjunction
89950d0 [Cheng Lian] Worked around Scala style check
9c167f6 [Cheng Lian] Minor code cleanup
3c4d5c7 [Cheng Lian] Minor code cleanup
ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
fc517d0 [Cheng Lian] More test cases
1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests
cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes
385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning
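
A rough usage sketch of the batch-level pruning described above. The configuration keys and the table name are assumptions based on the commit text and Spark SQL conventions of the time, not verified against a specific release (note the commit disables pruning by default):

import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ???  // an existing SQLContext

// Assumed config keys: enable in-memory batch pruning and set the batch row count.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.partitionPruning", "true")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

// Cache a (hypothetical) table; scans with simple predicates can then skip whole batches.
sqlContext.cacheTable("events")
sqlContext.sql("SELECT * FROM events WHERE key >= 100 AND key <= 200")
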
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Author: Michael Armbrust <[email protected]>

Closes apache#1880 from marmbrus/columnBatches and squashes the following commits:

0649987 [Michael Armbrust] add test
4756fad [Michael Armbrust] fix compilation
2314532 [Michael Armbrust] Build column buffers in smaller batches
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…memory column buffer

This is a follow-up of apache#1880.

Since the row number within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.

Author: Cheng Lian <[email protected]>

Closes apache#1901 from liancheng/precise-init-buffer-size and squashes the following commits:

d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…tions

This PR is based on apache#1883 authored by marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When apache#1883 was authored, batched column buffer building (apache#1880) hadn't been introduced. This PR combines the two and provides partition-batch-level pruning, which leads to smaller memory footprints and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (partition count multiplied by the number of batches per partition).

2. More filters are supported

   Filter predicates consisting of `=`, `<`, `<=`, `>`, `>=`, and their conjunctions and disjunctions are supported.

Author: Cheng Lian <[email protected]>

Closes apache#2188 from liancheng/in-mem-batch-pruning and squashes the following commits:

68cf019 [Cheng Lian] Marked sqlContext as @transient
4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite
3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext
d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default
062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup
16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions
16195c5 [Cheng Lian] Enabled both disjunction and conjunction
89950d0 [Cheng Lian] Worked around Scala style check
9c167f6 [Cheng Lian] Minor code cleanup
3c4d5c7 [Cheng Lian] Minor code cleanup
ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
fc517d0 [Cheng Lian] More test cases
1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests
cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes
385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024