[SPARK-1292] In-memory columnar representation for Spark SQL #205


Closed
liancheng wants to merge 19 commits

Conversation

liancheng
Contributor

This PR is rebased from the Catalyst repository, and contains the first version of in-memory columnar representation for Spark SQL. Compression support is not included yet and will be added later in a separate PR.

liancheng and others added 17 commits March 22, 2014 13:22
Signed-off-by: Cheng Lian <[email protected]>
* ColumnAccessors/ColumnBuilders are now both built with stackable traits
* Columnar byte buffer layout changed: column type ID was moved to the first 4 bytes (before null value information)
* Generic objects are serialised/deserialised before/after being appended/extracted into/from columnar byte buffers.
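
A minimal sketch of the stackable-trait layering and byte-buffer layout described above (class names, signatures, and layout details here are simplified and illustrative, not the exact PR code):

```scala
import java.nio.ByteBuffer

// Basic builder: writes the column type ID into the first 4 bytes.
trait ColumnBuilder {
  def initialize(columnTypeId: Int, initialSize: Int): Unit
  def build(): ByteBuffer
}

class BasicColumnBuilder extends ColumnBuilder {
  protected var buffer: ByteBuffer = _

  def initialize(columnTypeId: Int, initialSize: Int): Unit = {
    buffer = ByteBuffer.allocate(initialSize)
    buffer.putInt(columnTypeId)  // column type ID comes before null value information
  }

  def build(): ByteBuffer = { buffer.flip(); buffer }
}

// Stackable modification: layers null tracking on top of any ColumnBuilder.
trait NullableColumnBuilder extends ColumnBuilder {
  private var nullPositions = Vector.empty[Int]

  abstract override def initialize(columnTypeId: Int, initialSize: Int): Unit = {
    nullPositions = Vector.empty[Int]   // reset null tracking for a fresh column
    super.initialize(columnTypeId, initialSize)
  }

  def appendNullAt(position: Int): Unit = nullPositions :+= position
  // A full implementation would splice the null count and positions into the
  // buffer produced by build().
}

// Mixed in at instantiation time:
// val builder = new BasicColumnBuilder with NullableColumnBuilder
```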

Signed-off-by: Cheng Lian <[email protected]>
Signed-off-by: Cheng Lian <[email protected]>
* SparkSqlSerializer is moved to a separate source file
* SparkSqlSerializer.newKryo calls super.newKryo
* Class registration is no longer required, since we may de/serialise objects of any class with generic column accessor/builder.
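
A rough sketch of a Kryo-based serializer along these lines; only the `SparkSqlSerializer`/`newKryo` structure is taken from the commit message above, the rest (constructor shape, registration settings) is illustrative:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Illustrative sketch: reuse Spark's Kryo setup via super.newKryo(), then relax
// registration so the generic column accessor/builder can handle any class.
class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()
    kryo.setRegistrationRequired(false)  // no explicit class registration needed
    kryo
  }
}
```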

Signed-off-by: Cheng Lian <[email protected]>
Signed-off-by: Cheng Lian <[email protected]>
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13342/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13343/

@@ -45,3 +45,4 @@ dist/
spark-*-bin.tar.gz
unit-tests.log
/lib/
sql/hive/src/test/resources
Contributor

The test files are merged in, so we don't want this any more.

@marmbrus
Contributor

Hey Cheng, this looks really good. My only comment is that we should probably restrict the visibility of all these new objects/classes so they don't show up in the Scaladoc. I think everything should be `private[sql]`.

Otherwise, LGTM.
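
In Scala, `private[sql]` is a scoped access modifier: the definition is visible only within the enclosing `sql` package (and its subpackages) and stays out of the published Scaladoc. A tiny illustrative example; the member shown here is just a placeholder:

```scala
package org.apache.spark.sql.columnar

// Accessible anywhere under org.apache.spark.sql, but hidden from the public API.
private[sql] trait ColumnAccessor {
  def hasNext: Boolean
}
```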

@liancheng liancheng changed the title [SPARK-1292] Port Shark's In-Memory Columnar Representation to Spark SQL [SPARK-1292] In-memory columnar representation for Spark SQL Mar 23, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13354/

@liancheng
Contributor Author

Visibility issue addressed. I think this is ready to be merged now. Please merge this PR before #208, since that one touches many files and may cause conflicts. After this PR is merged, I'll merge master into #208, which should be easier to handle.

@pwendell
Contributor

Okay I merged this into master. Feel free to bump #208 now.

@asfgit asfgit closed this in 57a4379 Mar 23, 2014
asfgit pushed a commit that referenced this pull request Mar 23, 2014
This PR addresses various coding style issues in Spark SQL, including but not limited to those mentioned by @mateiz in PR #146.

As this PR affects lots of source files and may cause potential conflicts, it would be better to merge this as soon as possible *after* PR #205 (In-memory columnar representation for Spark SQL) is merged.

Author: Cheng Lian <[email protected]>

Closes #208 from liancheng/fixCodingStyle and squashes the following commits:

fc2b528 [Cheng Lian] Merge branch 'master' into fixCodingStyle
b531273 [Cheng Lian] Fixed coding style issues in sql/hive
0b56f77 [Cheng Lian] Fixed coding style issues in sql/core
fae7b02 [Cheng Lian] Addressed styling issues mentioned by @marmbrus
9265366 [Cheng Lian] Fixed coding style issues in sql/core
3dcbbbd [Cheng Lian] Fixed relative package imports for package catalyst
asfgit pushed a commit that referenced this pull request Apr 2, 2014
…r storage

JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)

(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)

This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:

*   `CompressionScheme`

    Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression (a minimal sketch appears after this list). Algorithms implemented include:

    * `RunLengthEncoding`
    * `DictionaryEncoding`

    Algorithms to be implemented include:

    * `BooleanBitSet`
    * `IntDelta`
    * `LongDelta`

*   `CompressibleColumnBuilder`

    A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. The `CompressionScheme` with the lowest estimated compression ratio is chosen for each column, based on statistical information gathered while elements are appended to the `ColumnBuilder`. However, if no `CompressionScheme` can achieve a compression ratio better than 80%, the column is left uncompressed to save CPU time.

    The memory layout of the final byte buffer is shown below:

    ```
     .--------------------------- Column type ID (4 bytes)
     |   .----------------------- Null count N (4 bytes)
     |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
     |   |   |     .------------- Compression scheme ID (4 bytes)
     |   |   |     |   .--------- Compressed non-null elements
     V   V   V     V   V
    +---+---+-----+---+---------+
    |   |   | ... |   | ... ... |
    +---+---+-----+---+---------+
     \-----------/ \-----------/
        header         body
    ```

*   `CompressibleColumnAccessor`

    A stackable `ColumnAccessor` trait used to iterate over a (possibly compressed) data column.

*   `ColumnStats`

    Used to collect statistical information while loading data into an in-memory columnar table. Optimizations like partition pruning rely on this information.

    Strictly speaking, the `ColumnStats` related code is not part of the compression support. It is included in this PR to validate the row-based API design (which is used to avoid boxing/unboxing costs whenever possible).
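
As a rough illustration of the `Encoder`/`Decoder` split inside a `CompressionScheme`, here is a minimal run-length encoding sketch for an `Int` column. Buffer handling and statistics gathering are greatly simplified, and the member names are hypothetical rather than copied from the PR:

```scala
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

// Simplified run-length encoding: each run of equal Ints is stored as a
// (value, runLength) pair. A real scheme would also estimate the compression
// ratio from these statistics and skip compression when it exceeds 80%.
object IntRunLengthEncoding {
  class Encoder {
    private val runs = ArrayBuffer.empty[(Int, Int)]

    def gatherCompressibilityStats(value: Int): Unit = {
      if (runs.nonEmpty && runs.last._1 == value) {
        val (v, count) = runs.last
        runs(runs.length - 1) = (v, count + 1)  // extend the current run
      } else {
        runs += ((value, 1))                    // start a new run
      }
    }

    def compress(to: ByteBuffer): ByteBuffer = {
      runs.foreach { case (value, count) => to.putInt(value).putInt(count) }
      to.flip()
      to
    }
  }

  // Decompresses lazily while the column is scanned.
  class Decoder(buffer: ByteBuffer) extends Iterator[Int] {
    private var remaining = 0
    private var current = 0

    def hasNext: Boolean = remaining > 0 || buffer.hasRemaining

    def next(): Int = {
      if (remaining == 0) {
        current = buffer.getInt()
        remaining = buffer.getInt()
      }
      remaining -= 1
      current
    }
  }
}
```

In this picture, an encoder is fed values as they are appended to the builder, and its output buffer is later wrapped by the matching decoder when the column is scanned.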

A major refactoring change since PR #205 is:

* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.
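
The `ColumnType` consolidation mentioned above can be pictured roughly as follows; this is a simplified sketch, and the type IDs and exact method set are assumptions for illustration:

```scala
import java.nio.ByteBuffer

// One object per column type owns the buffer getters/setters, so builders and
// accessors no longer duplicate per-type read/write code.
sealed abstract class ColumnType[T](val typeId: Int) {
  def extract(buffer: ByteBuffer): T
  def append(value: T, buffer: ByteBuffer): Unit
}

object INT extends ColumnType[Int](typeId = 0) {
  def extract(buffer: ByteBuffer): Int = buffer.getInt()
  def append(value: Int, buffer: ByteBuffer): Unit = { buffer.putInt(value) }
}

object DOUBLE extends ColumnType[Double](typeId = 1) {
  def extract(buffer: ByteBuffer): Double = buffer.getDouble()
  def append(value: Double, buffer: ByteBuffer): Unit = { buffer.putDouble(value) }
}
```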

Author: Cheng Lian <[email protected]>

Closes #285 from liancheng/memColumnarCompression and squashes the following commits:

ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
@liancheng liancheng deleted the memColumnarSupport branch September 24, 2014 00:15
// Register the row and collection classes that appear when shuffling SchemaRDDs
kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.GenericMutableRow])
kryo.register(classOf[scala.collection.mutable.ArrayBuffer[_]])
kryo.register(classOf[scala.math.BigDecimal], new BigDecimalSerializer)
// Disable Kryo reference tracking to improve shuffle performance (see discussion below)
kryo.setReferences(false)


Why did you set `setReferences` to false? Is there a requirement for it? Why not use the value specified in the configuration?

Contributor

This serializer is only used to shuffle `SchemaRDD`s, not user types. Setting `setReferences` to false improves performance.

liancheng pushed a commit to liancheng/spark that referenced this pull request Mar 17, 2017
## What changes were proposed in this pull request?

Mark analysis/optimization/execution code paths for DataFrames as trusted for file access. The idea is that we should block file access that does not come from trusted code paths in Databricks when ACLs are enabled.

Also (temporarily) always use the DBFS s3a client for file access, since we can make the trusted check in the DBFS client.

## How was this patch tested?

Added unit tests and modified the existing test Spark session to check the trusted property.

Author: Srinath Shankar <[email protected]>

Closes apache#205 from srinathshankar/executionIf.
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
…rized HDFS (apache#205)

* [INFINITY-2622] Initial teragen|sort|validate test with kerberos.

* Added a KDC and Kerberized HDFS, copying over the entire dcos-commons/testing/ directory. All of the HDFS testing is now in a separate test_hdfs module. Unpinned the HDFS version. Temporarily added an HDFS stub universe with Kerberos support.

* In require_spark(), avoid mutable default param value, use `None` instead.

* Moved hdfs stub universe and hdfs constants into test_hdfs.py. Replaced for-loop with itertools.product.

* Collect the itertools.product result into a tuple

* Updated soak test for kerberos.

* Removed extra kdc.json files.
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
- Add the "base" job to openlab-zuul-jobs repo to get rid of project-config changes
- Move the "site_logs" secret to openlab-zuul-jobs
- Move pipelines definition from project-config to openlab-zuul-jobs
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020