[SPARK-1292] In-memory columnar representation for Spark SQL #205
Conversation
Signed-off-by: Cheng Lian <[email protected]>
* ColumnAccessors/ColumnBuilders are now both built with stackable traits
* Columnar byte buffer layout changed: the column type ID was moved to the first 4 bytes (before null value information)
* Generic objects are serialised/deserialised before/after being appended to/extracted from columnar byte buffers

Signed-off-by: Cheng Lian <[email protected]>
* SparkSqlSerializer is moved to a separate source file
* SparkSqlSerializer.newKryo calls super.newKryo
* Class registration is no longer required, since we may de/serialise objects of any class with the generic column accessor/builder

Signed-off-by: Cheng Lian <[email protected]>
…a workaround

Signed-off-by: Cheng Lian <[email protected]>
Merged build triggered.
Merged build started.
Merged build finished.
One or more automated tests failed.
Merged build triggered.
Merged build started.
Merged build finished.
All automated tests passed.
```diff
@@ -45,3 +45,4 @@ dist/
 spark-*-bin.tar.gz
 unit-tests.log
 /lib/
+sql/hive/src/test/resources
```
The test files are merged in, so we don't want this any more.
Hey Cheng, this looks really good. My only comment is that we should probably restrict the visibility of all these new objects/classes so they don't show up in the Scaladoc. I think everything should be `private[sql]`. Otherwise, LGTM.
Merged build triggered.
Merged build started.
Merged build finished.
All automated tests passed.

Okay, I merged this into master. Feel free to bump #208 now.
This PR addresses various coding style issues in Spark SQL, including but not limited to those mentioned by @mateiz in PR #146. As this PR affects lots of source files and may cause potential conflicts, it would be better to merge this as soon as possible *after* PR #205 (In-memory columnar representation for Spark SQL) is merged.

Author: Cheng Lian <[email protected]>

Closes #208 from liancheng/fixCodingStyle and squashes the following commits:

fc2b528 [Cheng Lian] Merge branch 'master' into fixCodingStyle
b531273 [Cheng Lian] Fixed coding style issues in sql/hive
0b56f77 [Cheng Lian] Fixed coding style issues in sql/core
fae7b02 [Cheng Lian] Addressed styling issues mentioned by @marmbrus
9265366 [Cheng Lian] Fixed coding style issues in sql/core
3dcbbbd [Cheng Lian] Fixed relative package imports for package catalyst
…r storage

JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)

(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)

This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:

* `CompressionScheme`

  Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:

  * `RunLengthEncoding`
  * `DictionaryEncoding`

  Algorithms to be implemented include:

  * `BooleanBitSet`
  * `IntDelta`
  * `LongDelta`

* `CompressibleColumnBuilder`

  A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. The `CompressionScheme` with the lowest compression ratio is chosen for each column, according to statistical information gathered while elements are appended into the `ColumnBuilder`. However, if no `CompressionScheme` can achieve a compression ratio better than 80%, the column is left uncompressed to save CPU time.

  Memory layout of the final byte buffer is shown below:

  ```
     .--------------------------- Column type ID (4 bytes)
     |   .----------------------- Null count N (4 bytes)
     |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
     |   |   |     .------------- Compression scheme ID (4 bytes)
     |   |   |     |   .--------- Compressed non-null elements
     V   V   V     V   V
     +---+---+-----+---+---------+
     |   |   | ... |   | ... ... |
     +---+---+-----+---+---------+
      \-----------/ \-----------/
         header         body
  ```

* `CompressibleColumnAccessor`

  A stackable `ColumnAccessor` trait used to iterate over a (possibly) compressed data column.

* `ColumnStats`

  Used to collect statistical information while loading data into an in-memory columnar table. Optimizations like partition pruning rely on this information.

Strictly speaking, `ColumnStats` related code is not part of the compression support. It's contained in this PR to ensure and validate the row-based API design (which is used to avoid boxing/unboxing costs whenever possible).

A major refactoring change since PR #205 is:

* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.

Author: Cheng Lian <[email protected]>

Closes #285 from liancheng/memColumnarCompression and squashes the following commits:

ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
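The two implemented schemes, `RunLengthEncoding` and `DictionaryEncoding`, are classic lightweight columnar encodings. As a rough illustration of the idea behind run-length encoding (a minimal sketch in plain Scala, not Spark's actual `Encoder`/`Decoder` API; `encodeRuns` and `decodeRuns` are made-up names):

```scala
// Minimal sketch of run-length encoding, the idea behind `RunLengthEncoding`.
// `encodeRuns`/`decodeRuns` are illustrative names, not Spark's actual API.

// Compress a sequence into (value, run length) pairs.
def encodeRuns[T](values: Seq[T]): Seq[(T, Int)] =
  if (values.isEmpty) Seq.empty
  else values.tail.foldLeft(List((values.head, 1))) {
    case ((v, n) :: rest, next) if next == v => (v, n + 1) :: rest
    case (acc, next)                         => (next, 1) :: acc
  }.reverse

// Expand (value, run length) pairs back into the original sequence.
def decodeRuns[T](runs: Seq[(T, Int)]): Seq[T] =
  runs.flatMap { case (v, n) => Seq.fill(n)(v) }
```

Run-length encoding only pays off on columns with long runs of repeated values (e.g. sorted or low-cardinality columns), which is exactly the kind of case the per-column statistics and the 80% compression-ratio threshold are meant to detect.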
```scala
kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.GenericMutableRow])
kryo.register(classOf[scala.collection.mutable.ArrayBuffer[_]])
kryo.register(classOf[scala.math.BigDecimal], new BigDecimalSerializer)
kryo.setReferences(false)
```
Why did you set `setReferences` to false? Is there a requirement for it? Why don't you use the value specified in the configuration?
This is only used to shuffle SchemaRDDs and not user types. Setting `setReferences` to false improves performance.
## What changes were proposed in this pull request?

Mark analysis/optimization/execution code paths for dataframes as trusted for file access. The idea is that we should block file access that is not from trusted code paths in Databricks when ACLs are enabled. Also (temporarily) use the DBFS s3a client for file access always, since we can make the trusted check in the DBFS client.

## How was this patch tested?

Added unit tests and modified the existing test Spark session to check the trusted property.

Author: Srinath Shankar <[email protected]>

Closes apache#205 from srinathshankar/executionIf.
…he#205) (cherry picked from commit 3137e3a)
…rized HDFS (apache#205)

* [INFINITY-2622] Initial teragen|sort|validate test with kerberos.
* Added kdc and kerberized HDFS, copying over the entire dcos-commons/testing/ directory. All of the HDFS testing is now in a separate test_hdfs module. Unpinned the HDFS version. Temporarily added an HDFS stub universe with Kerberos support.
* In require_spark(), avoid mutable default param value, use `None` instead.
* Moved hdfs stub universe and hdfs constants into test_hdfs.py. Replaced for-loop with itertools.product.
* Select itertool.product into a tuple
* Updated soak test for kerberos.
* Removed extra kdc.json files.
- Add the "base" job to openlab-zuul-jobs repo to get rid of project-config changes
- Move the "site_logs" secret to openlab-zuul-jobs
- Move pipelines definition from project-config to openlab-zuul-jobs
This PR is rebased from the Catalyst repository, and contains the first version of in-memory columnar representation for Spark SQL. Compression support is not included yet and will be added later in a separate PR.
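To make the representation concrete, here is a hypothetical sketch (plain Scala over `java.nio.ByteBuffer`) of how an uncompressed int column could be laid out following the byte-buffer layout discussed in this thread: column type ID in the first 4 bytes, then null information, then the elements. The `INT_TYPE_ID` constant and `buildIntColumn` helper are made-up names for illustration, not Spark's actual code:

```scala
import java.nio.ByteBuffer

// Hypothetical layout helper: type ID (4 bytes), null count, null positions,
// then the non-null elements. `INT_TYPE_ID` and `buildIntColumn` are
// illustrative names, not Spark's actual constants or API.
val INT_TYPE_ID = 0

def buildIntColumn(values: Seq[Option[Int]]): ByteBuffer = {
  val nullPositions = values.zipWithIndex.collect { case (None, i) => i }
  val nonNulls = values.flatten
  val buf = ByteBuffer.allocate(4 + 4 + 4 * nullPositions.size + 4 * nonNulls.size)
  buf.putInt(INT_TYPE_ID)                   // column type ID (first 4 bytes)
  buf.putInt(nullPositions.size)            // null count N
  nullPositions.foreach(p => buf.putInt(p)) // null positions (4 x N bytes)
  nonNulls.foreach(v => buf.putInt(v))      // non-null elements
  buf.flip()                                // rewind for reading
  buf
}
```

Storing null positions out of band like this keeps the element region densely packed, so a column accessor can read values sequentially without per-element null checks.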