[SPARK-1371][WIP] Compression support for Spark SQL in-memory columnar storage
JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)
(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)
This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:
* `CompressionScheme`
Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:
  * `RunLengthEncoding`
  * `DictionaryEncoding`

Algorithms to be implemented include:
  * `BooleanBitSet`
  * `IntDelta`
  * `LongDelta`
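
To make the encoder/decoder pairing concrete, here is a minimal sketch of run-length and dictionary encoding over plain Scala collections. The objects `RunLengthSketch` and `DictionarySketch` and their method signatures are hypothetical names invented for illustration. They are not the `RunLengthEncoding` or `DictionaryEncoding` classes from this PR, which work on byte buffers.

```scala
// Illustrative only: hypothetical helpers, not the PR's byte-buffer based encoders.
object RunLengthSketch {
  /** Collapses consecutive equal elements into (value, runLength) pairs. */
  def encode[T](values: Seq[T]): Vector[(T, Int)] =
    values.foldLeft(Vector.empty[(T, Int)]) { (runs, v) =>
      runs.lastOption match {
        // Same value as the current run: extend that run by one.
        case Some((last, count)) if last == v => runs.init :+ ((last, count + 1))
        // Otherwise start a new run of length one.
        case _ => runs :+ ((v, 1))
      }
    }

  /** Expands (value, runLength) pairs back into the original sequence. */
  def decode[T](runs: Seq[(T, Int)]): Vector[T] =
    runs.iterator.flatMap { case (v, n) => Iterator.fill(n)(v) }.toVector
}

object DictionarySketch {
  /** Returns the distinct values in first-seen order plus one integer code per element. */
  def encode[T](values: Seq[T]): (Vector[T], Vector[Int]) = {
    val index = scala.collection.mutable.LinkedHashMap.empty[T, Int]
    // A value seen for the first time gets the next free code.
    val codes = values.map(v => index.getOrElseUpdate(v, index.size)).toVector
    (index.keys.toVector, codes)
  }

  /** Looks every code up in the dictionary to rebuild the original sequence. */
  def decode[T](dictionary: IndexedSeq[T], codes: Seq[Int]): Vector[T] =
    codes.map(dictionary).toVector
}
```

Run-length encoding pays off on columns with long runs of repeated values, while dictionary encoding suits columns with few distinct values regardless of their order.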
* `CompressibleColumnBuilder`
A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. For each column, the `CompressionScheme` with the lowest compression ratio (lower is better here) is chosen according to statistical information gathered while elements are appended to the `ColumnBuilder`. However, if no `CompressionScheme` achieves a compression ratio below 80%, the column is left uncompressed to save CPU time.
The memory layout of the final byte buffer is shown below:
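
The selection rule can be sketched as follows. This is a simplified illustration that makes three assumptions not spelled out in the description: the ratio is compressed size divided by uncompressed size, the 80% cutoff is a strict "less than 0.8" comparison, and the per-scheme statistics reduce to an estimated compressed size. `SchemeSelectionSketch`, `Estimate`, and `choose` are hypothetical names.

```scala
// Illustrative only: hypothetical names, not the PR's CompressibleColumnBuilder.
object SchemeSelectionSketch {
  /** Stand-in for the statistics a scheme gathers while elements are appended. */
  final case class Estimate(scheme: String, compressedSize: Long)

  /**
   * Picks the scheme with the lowest ratio (compressed / uncompressed).
   * Returns None when no scheme beats the threshold, meaning "store uncompressed".
   */
  def choose(
      uncompressedSize: Long,
      estimates: Seq[Estimate],
      threshold: Double = 0.8): Option[Estimate] =
    if (uncompressedSize <= 0 || estimates.isEmpty) None
    else {
      // The uncompressed size is shared, so the smallest size has the lowest ratio.
      val best = estimates.minBy(_.compressedSize)
      val ratio = best.compressedSize.toDouble / uncompressedSize
      if (ratio < threshold) Some(best) else None
    }
}
```

With these assumptions, a column of 1000 uncompressed bytes whose best estimate is 900 bytes (ratio 0.9) stays uncompressed, while an estimate of 400 bytes (ratio 0.4) is accepted.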
```
   .--------------------------- Column type ID (4 bytes)
   |   .----------------------- Null count N (4 bytes)
   |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
   |   |   |     .------------- Compression scheme ID (4 bytes)
   |   |   |     |   .--------- Compressed non-null elements
   V   V   V     V   V
 +---+---+-----+---+---------+
 |   |   | ... |   | ... ... |
 +---+---+-----+---+---------+
  \-----------/ \-----------/
     header         body
```
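
As a rough illustration of this layout, the sketch below writes and reads the fields in the order shown, using `java.nio.ByteBuffer`. `ColumnLayoutSketch`, `ParsedColumn`, `write`, and `read` are hypothetical names, and the byte order (the `ByteBuffer` default, big-endian) is an assumption. The description above does not state which byte order the real buffers use.

```scala
import java.nio.ByteBuffer

// Illustrative only: hypothetical helpers that follow the field order in the diagram.
object ColumnLayoutSketch {
  final case class ParsedColumn(
      columnTypeId: Int,
      nullPositions: Vector[Int],
      schemeId: Int,
      compressed: Vector[Byte])

  /** Serializes the header (type ID, null count, null positions) followed by the body. */
  def write(columnTypeId: Int, nullPositions: Seq[Int], schemeId: Int, body: Array[Byte]): ByteBuffer = {
    val buffer = ByteBuffer.allocate(4 + 4 + 4 * nullPositions.length + 4 + body.length)
    buffer.putInt(columnTypeId)                   // column type ID (4 bytes)
    buffer.putInt(nullPositions.length)           // null count N (4 bytes)
    nullPositions.foreach(p => buffer.putInt(p))  // null positions (4 x N bytes)
    buffer.putInt(schemeId)                       // compression scheme ID (4 bytes)
    buffer.put(body)                              // compressed non-null elements
    buffer.flip()                                 // switch from writing to reading
    buffer
  }

  /** Parses a buffer produced by `write` without moving the caller's position. */
  def read(buffer: ByteBuffer): ParsedColumn = {
    val in = buffer.duplicate()
    val columnTypeId = in.getInt()
    val nullCount = in.getInt()
    val nullPositions = Vector.fill(nullCount)(in.getInt())
    val schemeId = in.getInt()
    val body = new Array[Byte](in.remaining())
    in.get(body)
    ParsedColumn(columnTypeId, nullPositions, schemeId, body.toVector)
  }
}
```

Because the null count comes before the null positions, a reader can skip the whole header in one step: the body always starts at byte offset `8 + 4 * N`.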
* `CompressibleColumnAccessor`
A stackable `ColumnAccessor` trait used to iterate over a (possibly) compressed data column.
* `ColumnStats`
Used to collect statistical information while loading data into an in-memory columnar table. Optimizations like partition pruning rely on this information.
Strictly speaking, `ColumnStats` related code is not part of the compression support. It's contained in this PR to ensure and validate the row-based API design (which is used to avoid boxing/unboxing cost whenever possible).
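
To show how such statistics can support pruning, here is a minimal sketch that tracks the lower and upper bounds of an integer column and answers whether a batch might contain a given value. `ColumnStatsSketch`, `IntRange`, `gather`, and `mightContain` are hypothetical names, not the `ColumnStats` API from this PR. The sketch takes primitive `Int` arguments in the spirit of avoiding boxing, but it does not reproduce the row-based API.

```scala
// Illustrative only: hypothetical min/max tracker, not the PR's ColumnStats classes.
object ColumnStatsSketch {
  /** Tracks the range of the Int values seen so far, using primitive fields only. */
  final class IntRange {
    private var lower = Int.MaxValue
    private var upper = Int.MinValue
    private var count = 0

    /** Called once per element while the column is being loaded. */
    def gather(value: Int): Unit = {
      if (value < lower) lower = value
      if (value > upper) upper = value
      count += 1
    }

    /** False means an equality filter on `value` can skip this batch entirely. */
    def mightContain(value: Int): Boolean =
      count > 0 && lower <= value && value <= upper
  }

  /** Convenience helper that gathers statistics for a whole sequence. */
  def statsOf(values: Seq[Int]): IntRange = {
    val stats = new IntRange
    values.foreach(stats.gather)
    stats
  }
}
```

A `true` result is only a hint: the value lies within the observed range but may still be absent, so the batch has to be scanned. A `false` result is definitive.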
A major refactoring change since PR #205 is:
* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.
Author: Cheng Lian
Closes #285 from liancheng/memColumnarCompression and squashes the following commits:
ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code