---
slug: why-row-based-sort
title: "Why Sort is row-based in Velox — A Quantitative Assessment"
authors: [duanmeng, xiaoxmeng]
tags: [tech-blog, sort, operator]
---

## TL;DR

Velox is a fully vectorized execution engine [1]. Its internal columnar memory layout enhances cache
locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions,
significantly accelerating large-scale query processing.

However, some operators in Velox use a hybrid layout, where datasets are temporarily converted
to a row-oriented format. The `OrderBy` operator is one example: our implementation first
materializes the input vectors into rows containing both sort keys and payload columns, sorts them, and
converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence
for it. We show a prototype of a hybrid sorting strategy that materializes only the
sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the
end-to-end performance did not improve; in fact, it was up to **3x** slower. We present the two
variants and discuss why one is counter-intuitively faster than the other.

## Row-based vs. Non-Materialized

### Row-based Sort

The `OrderBy` operator in Velox’s current implementation uses a utility called `SortBuffer` to perform
the sorting, which consists of three stages (a simplified sketch follows the figure below):

1. Input Stage: Serializes the input columnar vectors into a row format, stored in a `RowContainer`.
2. Sort Stage: Sorts the rows based on the sort keys within the `RowContainer`.
3. Output Stage: Extracts output vectors column by column from the `RowContainer` in sorted order.

<figure>
<img src="/img/row-sort.png" height="100%" width="100%"/>
</figure>

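To make the three stages concrete, here is a minimal, self-contained C++ sketch. All the types and
method names (`Batch`, `Row`, `SortBufferSketch`) are simplified illustrations for this post, not the
actual Velox `SortBuffer` or `RowContainer` API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-ins for Velox types: a Batch is one column-major chunk of
// input, and a Row is one fully serialized row (sort key plus payloads).
using Column = std::vector<int64_t>;
struct Batch {
  std::vector<Column> columns; // columns[0] is the sort key.
};

struct Row {
  int64_t key;                  // Sort key, serialized first.
  std::vector<int64_t> payload; // All payload columns, serialized alongside.
};

class SortBufferSketch {
 public:
  // Input stage: serialize every incoming batch row by row.
  void addInput(const Batch& batch) {
    const size_t numRows = batch.columns[0].size();
    for (size_t r = 0; r < numRows; ++r) {
      Row row;
      row.key = batch.columns[0][r];
      for (size_t c = 1; c < batch.columns.size(); ++c) {
        row.payload.push_back(batch.columns[c][r]);
      }
      rows_.push_back(std::move(row));
    }
  }

  // Sort stage: order the serialized rows by key.
  void noMoreInput() {
    std::sort(rows_.begin(), rows_.end(),
              [](const Row& a, const Row& b) { return a.key < b.key; });
  }

  // Output stage: extract output vectors column by column, in sorted order.
  // The rows are revisited once per column, but they sit in compact buffers,
  // so these passes stay cache-friendly.
  Batch getOutput(size_t numPayloadColumns) const {
    Batch out;
    out.columns.resize(numPayloadColumns + 1);
    for (const Row& row : rows_) {
      out.columns[0].push_back(row.key);
    }
    for (size_t c = 0; c < numPayloadColumns; ++c) {
      for (const Row& row : rows_) {
        out.columns[c + 1].push_back(row.payload[c]);
      }
    }
    return out;
  }

 private:
  std::vector<Row> rows_; // Stand-in for Velox's RowContainer.
};
```
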
While row-based sorting is generally more efficient than column-based sorting [2,3], what if we only
materialized the sort key columns? We could then use the resulting sort indices to gather
the payload data into the output vectors directly. This would save the cost of converting
the payload columns to rows and back again. More importantly, it would allow us to spill
the original vectors directly to disk rather than first converting rows back into vectors
for spilling.

### Non-Materialized Sort

We have implemented a [non-materializing sort strategy](https://github.com/facebookincubator/velox/pull/15157) designed to improve sorting performance. The approach materializes only the sort key columns and their original vector indices, which are then used to gather the corresponding rows from the original input vectors into the output vector after the sort is complete. It replaces the `SortBuffer` with a `NonMaterizedSortBuffer`, which consists of three stages (see the sketch after the figure below):

1. Input Stage: Holds the input vector (via its shared pointer) in a list, and serializes the key columns
plus two additional index columns (VectorIndex and RowIndex) into rows, stored in a `RowContainer`.
2. Sort Stage: Sorts the rows based on the sort keys within the `RowContainer`.
3. Output Stage: Extracts the VectorIndex and RowIndex columns and uses them together to gather
the corresponding rows from the original input vectors into the output vector.

<figure>
<img src="/img/column-sort.png" height="100%" width="100%"/>
</figure>
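
The index-tracking idea can be sketched as follows. This is a minimal illustration using the same
simplified `Batch` type as the earlier sketch; `IndexedKey` and `NonMaterializedSortSketch` are
hypothetical names for this post, not the actual `NonMaterizedSortBuffer` implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

using Column = std::vector<int64_t>;
struct Batch {
  std::vector<Column> columns; // columns[0] is the sort key.
};

// Only the sort key plus two small index columns are materialized;
// the payload stays inside the original input batches.
struct IndexedKey {
  int64_t key;
  uint32_t vectorIndex; // Which input batch the row came from.
  uint32_t rowIndex;    // Row position within that batch.
};

class NonMaterializedSortSketch {
 public:
  // Input stage: hold a shared pointer to the batch and serialize only
  // the key and its (vectorIndex, rowIndex) pair.
  void addInput(std::shared_ptr<Batch> batch) {
    const auto vecIdx = static_cast<uint32_t>(inputs_.size());
    for (uint32_t r = 0; r < batch->columns[0].size(); ++r) {
      keys_.push_back({batch->columns[0][r], vecIdx, r});
    }
    inputs_.push_back(std::move(batch));
  }

  // Sort stage: sort the compact key rows; payloads are never touched.
  void noMoreInput() {
    std::sort(keys_.begin(), keys_.end(),
              [](const IndexedKey& a, const IndexedKey& b) { return a.key < b.key; });
  }

  const std::vector<IndexedKey>& sortedKeys() const { return keys_; }
  const std::vector<std::shared_ptr<Batch>>& inputs() const { return inputs_; }

 private:
  std::vector<std::shared_ptr<Batch>> inputs_;
  std::vector<IndexedKey> keys_;
};
```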

In theory, this should have significantly reduced the overhead of materializing payload columns,
especially for wide tables, since only the sort keys are materialized. However, the benchmark results
were the exact opposite of our expectations. Despite successfully eliminating the expensive serialization
overhead and reducing the total instruction count, the end-to-end performance was up to **3x slower**.

## Benchmark Results

To validate the effectiveness of the new strategy, we designed a benchmark with a varying number of
payload columns:

- Inputs: 1000 input vectors, 4096 rows per vector.
- Number of payload columns: 64, 128, 256.
- L2 cache: 80 MiB, L3 cache: 108 MiB.

| numPayloadColumns | Mode | Input Time | Sorting Time | Output Time | Total Time | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | Row-based | 4.27s | 0.79s | 4.23s | 11.64s | Row-based is 3.9x faster |
| | Columnar | 0.28s | 0.84s | 42.30s | 45.90s | |
| 128 | Row-based | 20.25s | 1.11s | 5.49s | 31.43s | Row-based is 2.0x faster |
| | Columnar | 0.27s | 0.51s | 59.15s | 64.20s | |
| 256 | Row-based | 29.34s | 1.02s | 12.85s | 51.48s | Row-based is 3.0x faster |
| | Columnar | 0.87s | 1.10s | 144.00s | 154.80s | |


The benchmark results confirm that Row-based Sort is the superior strategy,
delivering a 2.0x to 3.9x overall speedup compared to Columnar Sort. While Row-based Sort
incurs a significantly higher upfront cost during the Input phase (peaking at 29.34s), it
maintains a highly stable and efficient Output phase (at most 12.85s). In contrast, Columnar
Sort suffers from severe performance degradation in the Output phase as the payload widens,
with execution times surging from 42.30s to 144.00s, resulting in a much slower total execution time
despite its negligible input overhead.


To identify the root cause of the performance divergence, we used `perf stat` to analyze
micro-architectural efficiency and `perf mem` to profile memory access patterns during the critical
Output phase.


| Metric | Row-based | Columnar | Notes |
| --- | --- | --- | --- |
| Total Instructions | 555.6 Billion | 475.6 Billion | Row +17% |
| IPC (Instructions Per Cycle) | 2.4 | 0.82 | Row 2.9x Higher |
| LLC Load Misses (Last Level Cache) | 0.14 Billion | 5.01 Billion | Columnar 35x Higher |

| Memory Level | Row-based Output | Columnar Output |
| --- | --- | --- |
| RAM Hit | 5.8% | 38.1% |
| LFB Hit | 1.7% | 18.9% |


The results reveal a stark contrast in CPU utilization. Although the Row-based approach
executes 17% more instructions (due to serialization overhead), it maintains a high IPC of 2.4,
indicating a fully utilized pipeline. In contrast, the Columnar approach suffers from a low IPC
of 0.82, meaning the CPU is stalled for the majority of cycles. This is directly driven by the
35x difference in LLC Load Misses, which forces the Columnar implementation to fetch data from
main memory repeatedly. The memory profile further confirms this bottleneck: the Columnar mode is
severely latency-bound, with 38.1% of its sampled loads served from DRAM (RAM Hit) and
significant congestion in the Line Fill Buffers (18.9% LFB Hit), while the Row-based
mode effectively utilizes the cache hierarchy.

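A quick cycle count makes the gap concrete. Since cycles = instructions / IPC, the Row-based run
consumes roughly 555.6B / 2.4 ≈ 231 billion cycles, while the Columnar run needs about
475.6B / 0.82 ≈ 580 billion cycles: around 2.5x more cycles despite executing 17% fewer instructions.
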
## The Memory Access Pattern

Why does the non-materializing sort, specifically its gather method, cause so many cache misses?
The answer lies in its memory access pattern. Since Velox is a columnar engine, the output is
constructed column by column. For each column in an output vector, the gather process does the
following (see the sketch after this list):

1. Iterates through all rows of the current output vector.
2. For each row, locates the corresponding input vector via the sorted vector index.
3. Locates the source row within that input vector.
4. Copies the data from that single source cell to the target cell.
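
Reusing the `Batch` and `IndexedKey` types from the sketch in the Non-Materialized Sort section,
the gather loop looks roughly like this; the comments mark where the random accesses occur. This is
an illustration of the access pattern, not the exact implementation in the PR.

```cpp
// Output stage sketch: build the output column by column, following the
// sorted (vectorIndex, rowIndex) pairs.
Batch gatherOutput(const std::vector<IndexedKey>& sortedKeys,
                   const std::vector<std::shared_ptr<Batch>>& inputs,
                   size_t numColumns) {
  Batch out;
  out.columns.resize(numColumns);
  for (size_t c = 0; c < numColumns; ++c) {
    for (const IndexedKey& k : sortedKeys) {
      // Random jump #1: an arbitrary input batch per output row.
      const Batch& src = *inputs[k.vectorIndex];
      // Random jump #2: an arbitrary row within that batch's column. A whole
      // cache line is fetched to deliver a single 8-byte value, and by the
      // time column c + 1 revisits this batch, the line is likely evicted.
      out.columns[c].push_back(src.columns[c][k.rowIndex]);
    }
  }
  return out;
}
```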

The sorted indices, by nature, offer low predictability. This forces the gather operation for a single
output column to jump unpredictably across as many as 1000 different input vectors, fetching just one
value from each. This random access pattern has two devastating consequences for performance.


First, at the micro level, every single data read becomes a "long-distance" memory jump.
The CPU's hardware prefetcher is rendered ineffective by this chaotic access pattern,
so almost every lookup yields a cache miss.


Second, at the macro level, the problem compounds with each column processed. The sheer
volume of data touched, potentially all 1000 input vectors, exceeds the size of the L3 cache.
This ensures that by the time we start processing the next payload column, the necessary vectors
have already been evicted from the cache. Consequently, the gather process must re-fetch the same
vector metadata and data from main memory over and over again for each of the 256 payload columns.
This results in 256 passes of cache-thrashing random memory access, leading to a catastrophic
number of cache misses and explaining the severe performance degradation.

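A rough back-of-envelope calculation, assuming 8-byte values: one payload column spread across
1000 input vectors of 4096 rows is about 1000 × 4096 × 8 B ≈ 31 MiB, scattered over 1000 separate
allocations, and gathering all 256 payload columns streams roughly 8 GiB through a 108 MiB L3 cache.
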
<figure>
<img src="/img/columnar-mem.png" height="100%" width="100%"/>
</figure>

In contrast, Velox’s current row-based approach serializes all input vectors into rows, with
each allocation producing a contiguous buffer that holds a subset of those rows. Despite the
serialization cost, the row layout preserves strong locality when materializing output
vectors: once rows are in the cache, they can be used to extract multiple output columns.
This leads to much better cache-line utilization and fewer cache misses than a columnar layout,
where each fetched line often yields only a single value per column. Moreover, the largely
sequential scans over contiguous buffers let the hardware prefetcher operate effectively,
boosting throughput even in the presence of serialization overhead.

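As a rough illustration of why the row layout is prefetcher-friendly, consider extracting one
output column from a contiguous buffer of fixed-size serialized rows. The flat layout below is an
assumption for the sketch; Velox's `RowContainer` layout is more elaborate.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative flat row layout: each serialized row occupies a fixed-size
// slot in one contiguous buffer, with column c at byte offset 8 * c.
// Extracting a column scans the buffer with a constant stride, a pattern
// that hardware prefetchers recognize and stream ahead of the CPU.
std::vector<int64_t> extractColumn(const std::vector<uint8_t>& rowBuffer,
                                   size_t rowSize, size_t numRows, size_t column) {
  std::vector<int64_t> out(numRows);
  for (size_t r = 0; r < numRows; ++r) {
    std::memcpy(&out[r], rowBuffer.data() + r * rowSize + column * 8, sizeof(int64_t));
  }
  return out;
}
```
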
<figure>
<img src="/img/row-mem.png" height="100%" width="100%"/>
</figure>


## Conclusion

This study reinforces the core principle of performance engineering: Hardware Sympathy.
Without understanding the characteristics of the memory hierarchy and optimizing for it,
simply reducing the instruction count does not guarantee better performance.

## References

- [1] https://velox-lib.io/blog/velox-primer-part-1/
- [2] https://duckdb.org/pdf/ICDE2023-kuiper-muehleisen-sorting.pdf
- [3] https://duckdb.org/2021/08/27/external-sorting
