[SPARK-1292] In-memory columnar representation for Spark SQL #205


Closed
liancheng wants to merge 19 commits

Conversation

liancheng
Contributor

This PR is rebased from the Catalyst repository, and contains the first version of in-memory columnar representation for Spark SQL. Compression support is not included yet and will be added later in a separate PR.

liancheng and others added 17 commits March 22, 2014 13:22
Signed-off-by: Cheng Lian <[email protected]>
* ColumnAccessors/ColumnBuilders are now both built with stackable traits
* Columnar byte buffer layout changed: column type ID was moved to the first 4 bytes (before null value information)
* Generic objects are serialised/deserialised before/after being appended/extracted into/from columnar byte buffers.
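
A minimal sketch of the stackable-trait layering and byte-buffer layout described above (class names, signatures, and layout details here are simplified and illustrative, not the exact PR code):

```scala
import java.nio.ByteBuffer

// Basic builder: writes the column type ID into the first 4 bytes.
trait ColumnBuilder {
  def initialize(columnTypeId: Int, initialSize: Int): Unit
  def build(): ByteBuffer
}

class BasicColumnBuilder extends ColumnBuilder {
  protected var buffer: ByteBuffer = _

  def initialize(columnTypeId: Int, initialSize: Int): Unit = {
    buffer = ByteBuffer.allocate(initialSize)
    buffer.putInt(columnTypeId)  // column type ID comes before null value information
  }

  def build(): ByteBuffer = { buffer.flip(); buffer }
}

// Stackable modification: layers null tracking on top of any ColumnBuilder.
trait NullableColumnBuilder extends ColumnBuilder {
  private var nullPositions = Vector.empty[Int]

  abstract override def initialize(columnTypeId: Int, initialSize: Int): Unit = {
    nullPositions = Vector.empty[Int]   // reset null tracking for a fresh column
    super.initialize(columnTypeId, initialSize)
  }

  def appendNullAt(position: Int): Unit = nullPositions :+= position
  // A full implementation would splice the null count and positions into the
  // buffer produced by build().
}

// Mixed in at instantiation time:
// val builder = new BasicColumnBuilder with NullableColumnBuilder
```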

Signed-off-by: Cheng Lian <[email protected]>
Signed-off-by: Cheng Lian <[email protected]>
* SparkSqlSerializer is moved to a separate source file
* SparkSqlSerializer.newKryo calls super.newKryo
* Class registration is no longer required, since we may de/serialise objects of any class with generic column accessor/builder.
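
A rough sketch of a Kryo-based serializer along these lines; only the `SparkSqlSerializer`/`newKryo` structure is taken from the commit message above, the rest (constructor shape, registration settings) is illustrative:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Illustrative sketch: reuse Spark's Kryo setup via super.newKryo(), then relax
// registration so the generic column accessor/builder can handle any class.
class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()
    kryo.setRegistrationRequired(false)  // no explicit class registration needed
    kryo
  }
}
```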

Signed-off-by: Cheng Lian <[email protected]>
Signed-off-by: Cheng Lian <[email protected]>
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13342/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13343/

@@ -45,3 +45,4 @@ dist/
spark-*-bin.tar.gz
unit-tests.log
/lib/
sql/hive/src/test/resources
Contributor

The test files are merged in, so we don't want this any more.

@marmbrus
Contributor

Hey Cheng, this looks really good. My only comment is that we should probably restrict the visibility of all these new objects/classes so they don't show up in the Scaladoc. I think everything should be `private[sql]`.

Otherwise, LGTM.
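
In Scala, `private[sql]` is a scoped access modifier: the definition is visible only within the enclosing `sql` package (and its subpackages) and stays out of the published Scaladoc. A tiny illustrative example; the member shown here is just a placeholder:

```scala
package org.apache.spark.sql.columnar

// Accessible anywhere under org.apache.spark.sql, but hidden from the public API.
private[sql] trait ColumnAccessor {
  def hasNext: Boolean
}
```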

@liancheng liancheng changed the title [SPARK-1292] Port Shark's In-Memory Columnar Representation to Spark SQL [SPARK-1292] In-memory columnar representation for Spark SQL Mar 23, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13354/

@liancheng
Contributor Author

Visibility issue addressed. I think this is ready to be merged now. Please merge this PR before #208, since that one touches many files and may cause conflicts. After this PR is merged, I'll merge master into #208, which should be easier to handle.

@pwendell
Contributor

Okay I merged this into master. Feel free to bump #208 now.

@asfgit asfgit closed this in 57a4379 Mar 23, 2014
asfgit pushed a commit that referenced this pull request Mar 23, 2014
This PR addresses various coding style issues in Spark SQL, including but not limited to those mentioned by @mateiz in PR #146.

As this PR affects lots of source files and may cause potential conflicts, it would be better to merge this as soon as possible *after* PR #205 (In-memory columnar representation for Spark SQL) is merged.

Author: Cheng Lian <[email protected]>

Closes #208 from liancheng/fixCodingStyle and squashes the following commits:

fc2b528 [Cheng Lian] Merge branch 'master' into fixCodingStyle
b531273 [Cheng Lian] Fixed coding style issues in sql/hive
0b56f77 [Cheng Lian] Fixed coding style issues in sql/core
fae7b02 [Cheng Lian] Addressed styling issues mentioned by @marmbrus
9265366 [Cheng Lian] Fixed coding style issues in sql/core
3dcbbbd [Cheng Lian] Fixed relative package imports for package catalyst
asfgit pushed a commit that referenced this pull request Apr 2, 2014
…r storage

JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)

(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)

This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:

*   `CompressionScheme`

    Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression (a minimal sketch appears after this list). Algorithms implemented include:

    * `RunLengthEncoding`
    * `DictionaryEncoding`

    Algorithms to be implemented include:

    * `BooleanBitSet`
    * `IntDelta`
    * `LongDelta`

*   `CompressibleColumnBuilder`

    A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. The `CompressionScheme` with the lowest estimated compression ratio is chosen for each column, based on statistical information gathered while elements are appended to the `ColumnBuilder`. However, if no `CompressionScheme` can achieve a compression ratio better than 80%, the column is left uncompressed to save CPU time.

    The memory layout of the final byte buffer is shown below:

    ```
     .--------------------------- Column type ID (4 bytes)
     |   .----------------------- Null count N (4 bytes)
     |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
     |   |   |     .------------- Compression scheme ID (4 bytes)
     |   |   |     |   .--------- Compressed non-null elements
     V   V   V     V   V
    +---+---+-----+---+---------+
    |   |   | ... |   | ... ... |
    +---+---+-----+---+---------+
     \-----------/ \-----------/
        header         body
    ```

*   `CompressibleColumnAccessor`

    A stackable `ColumnAccessor` trait used to iterate over a (possibly compressed) data column.

*   `ColumnStats`

    Used to collect statistical information while loading data into an in-memory columnar table. Optimizations like partition pruning rely on this information.

    Strictly speaking, the `ColumnStats` related code is not part of the compression support. It is included in this PR to validate the row-based API design (which is used to avoid boxing/unboxing costs whenever possible).
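
As a rough illustration of the `Encoder`/`Decoder` split inside a `CompressionScheme`, here is a minimal run-length encoding sketch for an `Int` column. Buffer handling and statistics gathering are greatly simplified, and the member names are hypothetical rather than copied from the PR:

```scala
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

// Simplified run-length encoding: each run of equal Ints is stored as a
// (value, runLength) pair. A real scheme would also estimate the compression
// ratio from these statistics and skip compression when it exceeds 80%.
object IntRunLengthEncoding {
  class Encoder {
    private val runs = ArrayBuffer.empty[(Int, Int)]

    def gatherCompressibilityStats(value: Int): Unit = {
      if (runs.nonEmpty && runs.last._1 == value) {
        val (v, count) = runs.last
        runs(runs.length - 1) = (v, count + 1)  // extend the current run
      } else {
        runs += ((value, 1))                    // start a new run
      }
    }

    def compress(to: ByteBuffer): ByteBuffer = {
      runs.foreach { case (value, count) => to.putInt(value).putInt(count) }
      to.flip()
      to
    }
  }

  // Decompresses lazily while the column is scanned.
  class Decoder(buffer: ByteBuffer) extends Iterator[Int] {
    private var remaining = 0
    private var current = 0

    def hasNext: Boolean = remaining > 0 || buffer.hasRemaining

    def next(): Int = {
      if (remaining == 0) {
        current = buffer.getInt()
        remaining = buffer.getInt()
      }
      remaining -= 1
      current
    }
  }
}
```

In this picture, an encoder is fed values as they are appended to the builder, and its output buffer is later wrapped by the matching decoder when the column is scanned.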

A major refactoring change since PR #205 is:

* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.
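
The `ColumnType` consolidation mentioned above can be pictured roughly as follows; this is a simplified sketch, and the type IDs and exact method set are assumptions for illustration:

```scala
import java.nio.ByteBuffer

// One object per column type owns the buffer getters/setters, so builders and
// accessors no longer duplicate per-type read/write code.
sealed abstract class ColumnType[T](val typeId: Int) {
  def extract(buffer: ByteBuffer): T
  def append(value: T, buffer: ByteBuffer): Unit
}

object INT extends ColumnType[Int](typeId = 0) {
  def extract(buffer: ByteBuffer): Int = buffer.getInt()
  def append(value: Int, buffer: ByteBuffer): Unit = { buffer.putInt(value) }
}

object DOUBLE extends ColumnType[Double](typeId = 1) {
  def extract(buffer: ByteBuffer): Double = buffer.getDouble()
  def append(value: Double, buffer: ByteBuffer): Unit = { buffer.putDouble(value) }
}
```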

Author: Cheng Lian <[email protected]>

Closes #285 from liancheng/memColumnarCompression and squashes the following commits:

ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
@liancheng liancheng deleted the memColumnarSupport branch September 24, 2014 00:15
// Register the row and collection classes that appear when shuffling SchemaRDDs
kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.GenericMutableRow])
kryo.register(classOf[scala.collection.mutable.ArrayBuffer[_]])
kryo.register(classOf[scala.math.BigDecimal], new BigDecimalSerializer)
// Disable Kryo reference tracking to improve shuffle performance (see discussion below)
kryo.setReferences(false)


Why did you set `setReferences` to false? Is there a requirement for it? Why not use the value specified in the configuration?

Contributor

This serializer is only used to shuffle `SchemaRDD`s, not user types. Setting `setReferences` to false improves performance.

liancheng pushed a commit to liancheng/spark that referenced this pull request Mar 17, 2017
## What changes were proposed in this pull request?

Mark analysis/optimization/execution code paths for DataFrames as trusted for file access. The idea is that we should block file access that does not come from trusted code paths in Databricks when ACLs are enabled.

Also (temporarily) always use the DBFS s3a client for file access, since we can make the trusted check in the DBFS client.

## How was this patch tested?

Added unit tests and modified the existing test Spark session to check the trusted property.

Author: Srinath Shankar <[email protected]>

Closes apache#205 from srinathshankar/executionIf.
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
…rized HDFS (apache#205)

* [INFINITY-2622] Initial teragen|sort|validate test with kerberos.

* Added a KDC and Kerberized HDFS, copying over the entire dcos-commons/testing/ directory. All of the HDFS testing is now in a separate test_hdfs module. Unpinned the HDFS version. Temporarily added an HDFS stub universe with Kerberos support.

* In require_spark(), avoid mutable default param value, use `None` instead.

* Moved hdfs stub universe and hdfs constants into test_hdfs.py. Replaced for-loop with itertools.product.

* Collect the itertools.product result into a tuple

* Updated soak test for kerberos.

* Removed extra kdc.json files.
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
- Add the "base" job to openlab-zuul-jobs repo to get rid of project-config changes
- Move the "site_logs" secret to openlab-zuul-jobs
- Move pipelines definition from project-config to openlab-zuul-jobs
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020