---
slug: why-row-based-sort
title: "Why Sort is row-based in Velox — A Quantitative Assessment"
authors: [duanmeng, xiaoxmeng]
tags: [tech-blog, sort, operator]
---

## TL;DR

Velox is a fully vectorized execution engine [1]. Its internal columnar memory layout enhances cache
locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions,
significantly accelerating large-scale query processing.

However, some operators in Velox use a hybrid layout, temporarily converting datasets
to a row-oriented format. The `OrderBy` operator is one example: our implementation first
materializes the input vectors into rows containing both sort keys and payload columns, sorts them, and
converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence
supporting it. We present a prototype of a hybrid sorting strategy that materializes only the
sort-key columns, avoiding the overhead of materializing payload columns. Contrary to expectations, the
end-to-end performance did not improve; in fact, it was up to **3×** slower. We describe the two
variants and discuss why one is counter-intuitively faster than the other.

## Row-based vs. Non-Materialized

### Row-based Sort

The `OrderBy` operator in Velox’s current implementation uses a utility called `SortBuffer` to perform
the sorting, which consists of three stages:
1. Input Stage: Serializes the input columnar vectors into a row format stored in a `RowContainer`.
2. Sort Stage: Sorts the rows within the `RowContainer` based on the sort keys.
3. Output Stage: Extracts output vectors column by column from the `RowContainer` in sorted order.

<figure>
  <img src="/img/row-sort.png" height="100%" width="100%"/>
</figure>
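
As a rough illustration, the three stages can be modeled with plain `std::vector`s. This is a hypothetical sketch, not Velox's actual `SortBuffer`/`RowContainer` API; real rows live in serialized buffers:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins: a "column" is a std::vector<int64_t> and a "row"
// is a small struct. The real SortBuffer serializes rows into RowContainer
// buffers instead.
struct Row {
  int64_t key;                  // sort key
  std::vector<int64_t> payload; // payload columns
};

// 1. Input stage: serialize columnar input into rows.
std::vector<Row> toRows(const std::vector<int64_t>& keys,
                        const std::vector<std::vector<int64_t>>& payloadCols) {
  std::vector<Row> rows(keys.size());
  for (size_t r = 0; r < keys.size(); ++r) {
    rows[r].key = keys[r];
    for (const auto& col : payloadCols) {
      rows[r].payload.push_back(col[r]);
    }
  }
  return rows;
}

// 2. Sort stage: order the rows by key.
void sortRows(std::vector<Row>& rows) {
  std::sort(rows.begin(), rows.end(),
            [](const Row& a, const Row& b) { return a.key < b.key; });
}

// 3. Output stage: extract output columns from the sorted rows.
std::vector<std::vector<int64_t>> toColumns(const std::vector<Row>& rows,
                                            size_t numPayloadCols) {
  std::vector<std::vector<int64_t>> cols(numPayloadCols + 1);
  for (const auto& row : rows) {
    cols[0].push_back(row.key);
    for (size_t c = 0; c < numPayloadCols; ++c) {
      cols[c + 1].push_back(row.payload[c]);
    }
  }
  return cols;
}
```

The point to note is the output stage: extracting every output column walks the sorted rows sequentially, so a row fetched into cache serves all columns.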

Prior work shows that row-based sorting is more efficient than column-based sorting [2,3]. Still, what if we only
materialized the sort-key columns? We could then use the resulting sort indices to gather
the payload data into the output vectors directly. This would save the cost of converting
the payload columns to rows and back again. More importantly, it would allow us to spill
the original vectors directly to disk rather than first converting rows back into vectors
for spilling.

### Non-Materialized Sort

We have implemented a [non-materializing sort strategy](https://github.com/facebookincubator/velox/pull/15157) designed
to improve sorting performance. The approach materializes only the sort-key columns and their original vector indices,
which are used after the sort completes to gather the corresponding rows from the original input vectors into the
output vectors. It replaces `SortBuffer` with `NonMaterializedSortBuffer`, which consists of three stages:

1. Input Stage: Holds a shared pointer to each input vector in a list, and serializes the key columns
plus two additional index columns (`VectorIndex` and `RowIndex`) into rows stored in a `RowContainer`.
2. Sort Stage: Sorts the rows within the `RowContainer` based on the sort keys.
3. Output Stage: Extracts the `VectorIndex` and `RowIndex` columns and uses them together to gather
the corresponding rows from the original input vectors into the output vectors.

<figure>
  <img src="/img/column-sort.png" height="100%" width="100%"/>
</figure>
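
The Input and Sort stages can be sketched as follows. This is a hypothetical model: `Batch` stands in for a retained input vector, and the key/index triple for the rows kept in the `RowContainer`:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Only the key plus two index columns are materialized; the payload stays
// inside the retained input batches.
struct KeyEntry {
  int64_t key;          // materialized sort key
  uint32_t vectorIndex; // which retained input batch the row came from
  uint32_t rowIndex;    // row within that batch
};

using Batch = std::vector<std::vector<int64_t>>; // [column][row]

// Input stage: record each row's key and its source coordinates.
std::vector<KeyEntry> buildEntries(const std::vector<Batch>& batches,
                                   size_t keyCol) {
  std::vector<KeyEntry> entries;
  for (uint32_t v = 0; v < batches.size(); ++v) {
    const auto& keys = batches[v][keyCol];
    for (uint32_t r = 0; r < keys.size(); ++r) {
      entries.push_back({keys[r], v, r});
    }
  }
  return entries;
}

// Sort stage: order only the lightweight entries by key.
void sortEntries(std::vector<KeyEntry>& entries) {
  std::sort(entries.begin(), entries.end(),
            [](const KeyEntry& a, const KeyEntry& b) { return a.key < b.key; });
}
```

After sorting, the `(vectorIndex, rowIndex)` pairs tell the output stage where each row of the sorted result lives in the original, still-columnar input.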

In theory, this should have significantly reduced the overhead of materializing payload columns,
especially for wide tables, since only the sort keys are materialized. However, the benchmark results
were the exact opposite of our expectations. Despite successfully eliminating the expensive serialization
overhead and reducing the total instruction count, the end-to-end performance was up to **3× slower**.

## Benchmark Result

To validate the effectiveness of our new strategy, we designed a benchmark with a varying number of
payload columns:

- Inputs: 1,000 input vectors, 4,096 rows per vector.
- Number of payload columns: 64, 128, 256.
- L2 cache: 80 MiB, L3 cache: 108 MiB.

| numPayloadColumns | Mode | Input Time | Sorting Time | Output Time | Total Time | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | Row-based | 4.27s | 0.79s | 4.23s | 11.64s | Row-based is 3.9x faster |
| | Columnar | 0.28s | 0.84s | 42.30s | 45.90s | |
| 128 | Row-based | 20.25s | 1.11s | 5.49s | 31.43s | Row-based is 2.0x faster |
| | Columnar | 0.27s | 0.51s | 59.15s | 64.20s | |
| 256 | Row-based | 29.34s | 1.02s | 12.85s | 51.48s | Row-based is 3.0x faster |
| | Columnar | 0.87s | 1.10s | 144.00s | 154.80s | |

The benchmark results confirm that Row-based Sort is the superior strategy,
delivering a 2.0x to 3.9x overall speedup compared to Columnar Sort. While Row-based Sort
incurs a significantly higher upfront cost during the Input phase (peaking at 29.34s), it
maintains a stable and efficient Output phase (at most 12.85s). In contrast, Columnar
Sort suffers from severe performance degradation in the Output phase as the payload grows,
with execution times surging from 42.30s to 144.00s, resulting in a much slower total execution
time despite its negligible input overhead.

To identify the root cause of the performance divergence, we utilized `perf stat` to analyze
micro-architectural efficiency and `perf mem` to profile memory access patterns during the critical
Output phase.

| Metrics | Row-based | Columnar | Notes |
| --- | --- | --- | --- |
| Total Instructions | 555.6 Billion | 475.6 Billion | Row +17% |
| IPC (Instructions Per Cycle) | 2.4 | 0.82 | Row 2.9x Higher |
| LLC Load Misses (Last Level Cache) | 0.14 Billion | 5.01 Billion | Columnar 35x Higher |

| Memory Level | Row-based Output | Columnar Output |
| --- | --- | --- |
| RAM Hit | 5.8% | 38.1% |
| LFB Hit (Line Fill Buffer) | 1.7% | 18.9% |

The results reveal a stark contrast in CPU utilization. Although the Row-based approach
executes 17% more instructions (due to serialization overhead), it maintains a high IPC of 2.4,
indicating a fully utilized pipeline. In contrast, the Columnar approach suffers from a low IPC
of 0.82, meaning the CPU is stalled for the majority of cycles. This is directly driven by the
35x difference in LLC Load Misses, which forces the Columnar implementation to fetch data from
main memory repeatedly. The memory profile further confirms this bottleneck: Columnar mode is
severely latency-bound, with 38.1% of its sampled loads served from DRAM (RAM Hit) and
significant congestion in the Line Fill Buffer (18.9% LFB Hit), while Row-based
mode effectively utilizes the cache hierarchy.

## The Memory Access Pattern

Why does the non-materializing sort, specifically its gather step, cause so many cache misses?
The answer lies in its memory access pattern. Since Velox is a columnar engine, the output is
constructed column by column. For each column in an output vector, the gather process does the following:
1. Iterates through all rows of the current output vector.
2. For each row, locates the corresponding input vector via the sorted vector index.
3. Locates the source row within that input vector.
4. Copies the data from that single source cell to the target cell.
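
The four steps amount to a doubly nested loop, sketched below. This is a hypothetical model: `entries` holds the index pairs already in sorted-key order, and each `Batch` is a retained columnar input vector:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Batch = std::vector<std::vector<int64_t>>; // [column][row]

struct SortedEntry {
  uint32_t vectorIndex; // which input batch the row came from
  uint32_t rowIndex;    // row within that batch
};

Batch gather(const std::vector<Batch>& batches,
             const std::vector<SortedEntry>& entries,
             size_t numCols) {
  Batch out(numCols);
  for (size_t c = 0; c < numCols; ++c) {  // one full pass per output column
    out[c].reserve(entries.size());
    for (const auto& e : entries) {       // step 1: every output row
      // Steps 2-3: locate the source vector and row via the sorted indices.
      // Step 4: copy the single source cell. Consecutive entries point at
      // arbitrary batches, so consecutive reads rarely share a cache line.
      out[c].push_back(batches[e.vectorIndex][c][e.rowIndex]);
    }
  }
  return out;
}
```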

The sorted indices are, by their nature, unpredictable. This forces the gather operation for a single
output column to jump across up to 1,000 different input vectors, fetching just one
value from each. This random access pattern has two devastating consequences for performance.

First, at the micro level, every single data read becomes a "long-distance" memory jump.
The CPU's hardware prefetcher is rendered completely ineffective by this chaotic access pattern,
resulting in almost every lookup yielding a cache miss.

Second, at the macro level, the problem compounds with each column processed. The sheer
volume of data touched can exceed the size of the L3 cache. This ensures
that by the time we start processing the next payload column, the necessary vectors have already
been evicted from the cache. Consequently, the gather process must re-fetch the same vector
metadata and data from main memory over and over again for each of the 256 payload columns.
This results in 256 passes of cache-thrashing random memory access, leading to a catastrophic
number of cache misses and explaining the severe performance degradation.
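
To put rough numbers on this, assume 8-byte values and 64-byte cache lines; the post does not state the benchmark's column types, so this is only a back-of-envelope sketch:

```cpp
#include <cstdint>

// Benchmark shape: 1,000 input vectors of 4,096 rows each.
constexpr std::uint64_t kVectors = 1000;
constexpr std::uint64_t kRowsPerVector = 4096;
constexpr std::uint64_t kValueBytes = 8;  // assumed value width
constexpr std::uint64_t kLineBytes = 64;  // typical cache-line size

constexpr std::uint64_t kTotalRows = kVectors * kRowsPerVector; // 4,096,000

// Useful bytes gathered for one payload column.
constexpr std::uint64_t kUsefulPerColumn = kTotalRows * kValueBytes;

// If sorted order makes nearly every access land on a distinct line,
// line traffic per column approaches one full cache line per value.
constexpr std::uint64_t kWorstCasePerColumn = kTotalRows * kLineBytes;

static_assert(kUsefulPerColumn == 32768000, "~32.8 MB per column");
static_assert(kWorstCasePerColumn == 262144000, "~262 MB per column");
```

Under these assumptions, even the useful bytes of a single column pass occupy a large fraction of the 108 MiB L3, and the worst-case line traffic per pass far exceeds it, so each of the 256 passes effectively starts with a cold cache.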

<figure>
  <img src="/img/columnar-mem.png" height="100%" width="100%"/>
</figure>

In contrast, Velox’s current row-based approach serializes all input vectors into rows, with
each allocation producing a contiguous buffer that holds a subset of those rows. Despite the
serialization, the row layout preserves strong locality when materializing output
vectors: once rows are in the cache, they can be used to extract multiple output columns.
This leads to much better cache-line utilization and fewer cache misses than a columnar layout,
where each fetched line often yields only a single value per column. Moreover, the largely
sequential scans over contiguous buffers let the hardware prefetcher operate effectively,
boosting throughput even in the presence of serialization overhead.

<figure>
  <img src="/img/row-mem.png" height="100%" width="100%"/>
</figure>

## Conclusion

This study reinforces a core principle of performance engineering: hardware sympathy.
Without understanding the characteristics of the memory hierarchy and optimizing for it,
simply reducing the instruction count does not guarantee better performance.

## References

- [1] https://velox-lib.io/blog/velox-primer-part-1/
- [2] https://duckdb.org/pdf/ICDE2023-kuiper-muehleisen-sorting.pdf
- [3] https://duckdb.org/2021/08/27/external-sorting