---
slug: why-row-based-sort
title: "Why Sort is row-based in Velox — A Quantitative Assessment"
authors: [duanmeng, xiaoxmeng]
tags: [tech-blog, sort, operator]
---

## TL;DR

Velox is a fully vectorized execution engine[1]. Its internal columnar memory layout enhances cache
locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions,
significantly accelerating large-scale query processing.

However, some operators in Velox use a hybrid layout in which datasets are temporarily converted
to a row-oriented format. The `OrderBy` operator is one example: our implementation first
materializes the input vectors into rows containing both sort keys and payload columns, sorts
those rows, and then converts them back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental
evidence supporting it. We built a prototype of a hybrid sorting strategy that materializes only the
sort-key columns, aiming to avoid the overhead of materializing payload columns. Contrary to
expectations, the end-to-end performance did not improve; in fact, it was up to **3.9x** slower. We
present the two variants and discuss why one is counter-intuitively faster than the other.

## Row-based vs. Non-Materialized

### Row-based Sort

The `OrderBy` operator in Velox’s current implementation uses a utility called `SortBuffer` to perform
the sorting, which consists of three stages (a simplified sketch follows the figure below):

1. Input Stage: Serializes the input columnar vectors into a row format, stored in a `RowContainer`.
2. Sort Stage: Sorts the rows by their keys within the `RowContainer`.
3. Output Stage: Extracts output vectors column by column from the `RowContainer` in sorted order.

<figure>
<img src="/img/row-sort.png" height="100%" width="100%"/>
</figure>
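
To make the three stages concrete, here is a minimal, self-contained sketch. It is illustrative only:
it assumes fixed-width `int64_t` columns with a single sort key in column 0, and the names
`serializeInput` and `sortAndExtract` are ours, not the actual `SortBuffer` API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Column = std::vector<int64_t>;        // one contiguous array per column
using ColumnarBatch = std::vector<Column>;  // column 0 is the sort key

// 1. Input stage: serialize every input row into one flat buffer so that the
//    key and all payload values of a row sit side by side in memory.
std::vector<int64_t> serializeInput(
    const std::vector<ColumnarBatch>& inputs, size_t numColumns) {
  std::vector<int64_t> rowBuffer;
  for (const auto& batch : inputs) {
    for (size_t r = 0; r < batch[0].size(); ++r) {
      for (size_t c = 0; c < numColumns; ++c) {
        rowBuffer.push_back(batch[c][r]);
      }
    }
  }
  return rowBuffer;
}

// 2. Sort stage: sort row offsets by the key (the first value of each row).
// 3. Output stage: extract output columns one by one, as Velox does; the rows
//    pulled into cache while extracting one column are reused for the rest.
ColumnarBatch sortAndExtract(
    const std::vector<int64_t>& rowBuffer, size_t numColumns) {
  std::vector<size_t> rowOffsets;
  for (size_t o = 0; o < rowBuffer.size(); o += numColumns) {
    rowOffsets.push_back(o);
  }
  std::sort(rowOffsets.begin(), rowOffsets.end(), [&](size_t a, size_t b) {
    return rowBuffer[a] < rowBuffer[b];
  });

  ColumnarBatch output(numColumns);
  for (size_t c = 0; c < numColumns; ++c) {
    for (size_t o : rowOffsets) {
      output[c].push_back(rowBuffer[o + c]);
    }
  }
  return output;
}
```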

While row-based sorting is more efficient than column-based sorting[2,3], what if we only
materialized the sort key columns? We could then use the resulting sort indices to gather
the payload data into the output vectors directly. This would save the cost of converting
the payload columns to rows and back again. More importantly, it would allow us to spill
the original input vectors directly to disk rather than first converting rows back into
vectors for spilling.

### Non-Materialized Sort

We have implemented a [non-materializing sort strategy](https://github.com/facebookincubator/velox/pull/15157) designed
to improve sorting performance. The approach materializes only the sort key columns and their original vector indices,
which are then used to gather the corresponding rows from the original input vectors into the output vectors after the
sort completes. It replaces the `SortBuffer` with a `NonMaterizedSortBuffer`, which consists of three stages (a sketch of
the per-row state follows the figure below):

1. Input Stage: Holds each input vector (via its shared pointer) in a list, and serializes the key columns
plus two additional index columns (VectorIndex and RowIndex) into rows, stored in a `RowContainer`.
2. Sort Stage: Sorts the rows by their keys within the `RowContainer`.
3. Output Stage: Extracts the VectorIndex and RowIndex columns and uses them together to gather
the corresponding rows from the original input vectors into the output vectors.

<figure>
<img src="/img/column-sort.png" height="100%" width="100%"/>
</figure>
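
The per-row state of this variant is easy to see in a sketch. Again, this is a simplification rather
than the PR's actual code: we assume `int64_t` keys and use 32-bit indices, which comfortably cover
the benchmark's 1000 vectors of 4096 rows each.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Column = std::vector<int64_t>;
using ColumnarBatch = std::vector<Column>;  // column 0 is the sort key

// Only the key and two index columns are materialized per row; the payload
// columns stay behind in the retained input batches.
struct SortRow {
  int64_t key;           // materialized sort key
  uint32_t vectorIndex;  // which retained input batch the row came from
  uint32_t rowIndex;     // row position within that batch
};

// 1. Input stage: keep the batches alive, serialize only (key, indices).
std::vector<SortRow> buildSortRows(const std::vector<ColumnarBatch>& inputs) {
  std::vector<SortRow> rows;
  for (uint32_t v = 0; v < inputs.size(); ++v) {
    for (uint32_t r = 0; r < inputs[v][0].size(); ++r) {
      rows.push_back({inputs[v][0][r], v, r});
    }
  }
  return rows;
}

// 2. Sort stage: the same comparison as before, but far fewer bytes per row.
void sortRows(std::vector<SortRow>& rows) {
  std::sort(rows.begin(), rows.end(), [](const SortRow& a, const SortRow& b) {
    return a.key < b.key;
  });
}
```

Each sort row is 16 bytes no matter how wide the payload is, which is why the Input stage of this
variant is nearly free in the benchmark below.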

In theory, this should have significantly reduced the overhead of materializing payload columns,
especially for wide tables, since only the sorting keys are materialized. However, the benchmark results
were the exact opposite of our expectations: despite successfully eliminating the expensive serialization
overhead and reducing the total instruction count, the end-to-end performance was roughly **3x slower**.

## Benchmark Results

To validate the effectiveness of our new strategy, we designed a benchmark with a varying number of
payload columns:

- Inputs: 1000 input vectors, 4096 rows per vector.
- Number of payload columns: 64, 128, 256.
- L2 cache: 80 MiB, L3 cache: 108 MiB.

| numPayloadColumns | Mode | Input Time | Sorting Time | Output Time | Total Time | Desc |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | Row-based | 4.27s | 0.79s | 4.23s | 11.64s | Row-based is 3.9x faster |
| | Columnar | 0.28s | 0.84s | 42.30s | 45.90s | |
| 128 | Row-based | 20.25s | 1.11s | 5.49s | 31.43s | Row-based is 2.0x faster |
| | Columnar | 0.27s | 0.51s | 59.15s | 64.20s | |
| 256 | Row-based | 29.34s | 1.02s | 12.85s | 51.48s | Row-based is 3.0x faster |
| | Columnar | 0.87s | 1.10s | 144.00s | 154.80s | |

The benchmark results confirm that Row-based Sort is the superior strategy,
delivering a 2.0x to 3.9x overall speedup compared to Columnar Sort. While Row-based Sort
incurs a significantly higher upfront cost during the Input phase (peaking at about 29s), it
maintains a stable and efficient Output phase (at most about 13s). In contrast, Columnar
Sort suffers severe performance degradation in the Output phase as the payload widens,
with execution times surging from 42s to 144s, resulting in a much slower total execution time
despite its negligible input overhead.

To identify the root cause of the performance divergence, we utilized `perf stat` to analyze
micro-architectural efficiency and `perf mem` to profile memory access patterns during the critical
Output phase.

| Metrics | Row-based | Columnar | Desc |
| --- | --- | --- | --- |
| Total Instructions | 555.6 Billion | 475.6 Billion | Row +17% |
| IPC (Instructions Per Cycle) | 2.4 | 0.82 | Row 2.9x higher |
| LLC Load Misses (Last Level Cache) | 0.14 Billion | 5.01 Billion | Columnar 35x higher |

| Memory Level | Row-based Output | Columnar Output |
| --- | --- | --- |
| RAM Hit | 5.8% | 38.1% |
| LFB Hit | 1.7% | 18.9% |

The results reveal a stark contrast in CPU utilization. Although the Row-based approach
executes 17% more instructions (due to serialization overhead), it maintains a high IPC of 2.4,
indicating a fully utilized pipeline. In contrast, the Columnar approach suffers from a low IPC
of 0.82, meaning the CPU is stalled for the majority of cycles. This is directly driven by the
35x difference in LLC Load Misses, which forces the Columnar implementation to fetch data from
main memory repeatedly. The memory profile further confirms this bottleneck: Columnar mode is
severely latency-bound, with 38.1% of its sampled loads served from DRAM (RAM Hit) and
significant congestion in the Line Fill Buffer (18.9% LFB Hit), while Row-based mode
effectively utilizes the cache hierarchy.
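
Dividing instructions by IPC puts the gap in cycles: the Row-based output path retires 555.6 billion
instructions in roughly 232 billion cycles, while the Columnar path needs roughly 580 billion cycles
for its 475.6 billion instructions. In other words, the Columnar variant burns about 2.5x more CPU
time to execute about 14% fewer instructions.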

## The Memory Access Pattern

Why does the non-materializing sort, specifically its gather step, cause so many cache misses?
The answer lies in its memory access pattern. Since Velox is a columnar engine, the output is
constructed column by column. For each column in an output vector, the gather process does the
following (see the sketch after this list):

1. Iterates through all rows of the current output vector.
2. For each row, locates the corresponding input vector via the sorted vector index.
3. Locates the source row within that input vector.
4. Copies the data from that single source cell to the target cell.
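
Reusing the simplified types from the earlier sketch (repeated here so the snippet stands alone),
the output-stage gather boils down to the loop below. The function name `gather` is ours; the
comments mark where the memory access pattern goes wrong.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Column = std::vector<int64_t>;
using ColumnarBatch = std::vector<Column>;

struct SortRow {
  int64_t key;
  uint32_t vectorIndex;  // which input batch
  uint32_t rowIndex;     // row within that batch
};

// Output stage: build the result column by column; every element dereferences
// a (vectorIndex, rowIndex) pair into a different input batch.
ColumnarBatch gather(
    const std::vector<ColumnarBatch>& inputs,
    const std::vector<SortRow>& sorted,
    size_t numColumns) {
  ColumnarBatch output(numColumns);
  for (size_t c = 0; c < numColumns; ++c) {  // one full pass per column...
    for (const SortRow& row : sorted) {      // ...over every output row
      // Sorted order makes vectorIndex effectively random, so this load
      // jumps to an unrelated batch on each iteration: the prefetcher cannot
      // follow it, and most iterations pay a cache miss.
      output[c].push_back(inputs[row.vectorIndex][c][row.rowIndex]);
    }
  }
  return output;
}
```

The inner loop is the whole story: consecutive iterations touch unrelated batches, and consecutive
passes touch unrelated columns.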

The sorted indices, by nature, offer low predictability. This forces the gather operation for a single
output column to jump unpredictably across as many as 1000 different input vectors, fetching just one
value from each. This random access pattern has two devastating consequences for performance.

First, at the micro level, every single data read becomes a "long-distance" memory jump.
The CPU's hardware prefetcher is rendered completely ineffective by this chaotic access pattern,
so almost every lookup yields a cache miss.

Second, at the macro level, the problem compounds with each column processed. The sheer
volume of data touched far exceeds the size of the L3 cache (a back-of-the-envelope estimate
follows the figure below). This ensures that by the time we start processing the next payload
column, the necessary vectors have already been evicted from the cache. Consequently, the gather
process must re-fetch the same vector metadata and data from main memory over and over again
for each of the 256 payload columns. This results in 256 passes of cache-thrashing random memory
access, leading to a catastrophic number of cache misses and explaining the severe performance
degradation.

<figure>
<img src="/img/columnar-mem.png" height="100%" width="100%"/>
</figure>
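
To put rough numbers on the working set: assuming 8-byte values (the benchmark's column types are
our assumption here), the 256-payload run holds 1000 × 4096 ≈ 4.1 million rows across 257 columns,
roughly 7.9 GiB in total, dwarfing the 108 MiB L3 cache. Each per-column gather pass reads about
31 MiB of source data scattered across all 1000 input vectors, so the cache lines fetched for one
column are of no help for the next, and nearly every pass streams its data from DRAM in random order.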

In contrast, Velox’s current row-based approach serializes all input vectors into rows, with
each allocation producing a contiguous buffer that holds a subset of those rows. Despite the
serialization, the row layout preserves strong locality when materializing output
vectors: once rows are in the cache, they can be used to extract multiple output columns.
This leads to much better cache-line utilization and fewer cache misses than a columnar layout,
where each fetched line often yields only a single value per column. Moreover, the largely
sequential scans over contiguous buffers let the hardware prefetcher operate effectively,
boosting throughput even in the presence of serialization overhead.

<figure>
<img src="/img/row-mem.png" height="100%" width="100%"/>
</figure>

## Conclusion

This study reinforces a core principle of performance engineering: Hardware Sympathy.
Without understanding the characteristics of the memory hierarchy and optimizing for them,
simply reducing the instruction count does not guarantee better performance.

## References

- [1] https://velox-lib.io/blog/velox-primer-part-1/
- [2] https://duckdb.org/pdf/ICDE2023-kuiper-muehleisen-sorting.pdf
- [3] https://duckdb.org/2021/08/27/external-sorting
