---
slug: why-row-based-sort
title: "Why Sort is row-based in Velox — A Quantitative Assessment"
authors: [duanmeng, xiaoxmeng]
tags: [tech-blog, sort, operator]
---

## TL;DR

Velox is a fully vectorized execution engine [1]. Its internal columnar memory layout enhances cache
locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions,
significantly accelerating large-scale query processing.

However, some operators in Velox use a hybrid layout, where datasets are temporarily converted
to a row-oriented format. The `OrderBy` operator is one example: our implementation first
materializes the input vectors into rows containing both sort keys and payload columns, sorts them, and
converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence
for it. We show a prototype of a hybrid sorting strategy that materializes only the
sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the
end-to-end performance did not improve; in fact, it was up to **3x** slower. We present the two
variants and discuss why one is counter-intuitively faster than the other.

## Row-based vs. Non-Materialized

### Row-based Sort

The `OrderBy` operator in Velox’s current implementation uses a utility called `SortBuffer` to perform
the sorting, which consists of three stages (a simplified sketch follows the figure below):

1. Input Stage: Serializes the input columnar vectors into a row format, stored in a `RowContainer`.
2. Sort Stage: Sorts the rows based on the sort keys within the `RowContainer`.
3. Output Stage: Extracts output vectors column by column from the `RowContainer` in sorted order.

<figure>
<img src="/img/row-sort.png" height="100%" width="100%"/>
</figure>

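To make the three stages concrete, here is a minimal, self-contained C++ sketch. All the types and
method names (`Batch`, `Row`, `SortBufferSketch`) are simplified illustrations for this post, not the
actual Velox `SortBuffer` or `RowContainer` API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-ins for Velox types: a Batch is one column-major chunk of
// input, and a Row is one fully serialized row (sort key plus payloads).
using Column = std::vector<int64_t>;
struct Batch {
  std::vector<Column> columns; // columns[0] is the sort key.
};

struct Row {
  int64_t key;                  // Sort key, serialized first.
  std::vector<int64_t> payload; // All payload columns, serialized alongside.
};

class SortBufferSketch {
 public:
  // Input stage: serialize every incoming batch row by row.
  void addInput(const Batch& batch) {
    const size_t numRows = batch.columns[0].size();
    for (size_t r = 0; r < numRows; ++r) {
      Row row;
      row.key = batch.columns[0][r];
      for (size_t c = 1; c < batch.columns.size(); ++c) {
        row.payload.push_back(batch.columns[c][r]);
      }
      rows_.push_back(std::move(row));
    }
  }

  // Sort stage: order the serialized rows by key.
  void noMoreInput() {
    std::sort(rows_.begin(), rows_.end(),
              [](const Row& a, const Row& b) { return a.key < b.key; });
  }

  // Output stage: extract output vectors column by column, in sorted order.
  // The rows are revisited once per column, but they sit in compact buffers,
  // so these passes stay cache-friendly.
  Batch getOutput(size_t numPayloadColumns) const {
    Batch out;
    out.columns.resize(numPayloadColumns + 1);
    for (const Row& row : rows_) {
      out.columns[0].push_back(row.key);
    }
    for (size_t c = 0; c < numPayloadColumns; ++c) {
      for (const Row& row : rows_) {
        out.columns[c + 1].push_back(row.payload[c]);
      }
    }
    return out;
  }

 private:
  std::vector<Row> rows_; // Stand-in for Velox's RowContainer.
};
```
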
While row-based sorting is generally more efficient than column-based sorting [2,3], what if we only
materialized the sort key columns? We could then use the resulting sort indices to gather
the payload data into the output vectors directly. This would save the cost of converting
the payload columns to rows and back again. More importantly, it would allow us to spill
the original vectors directly to disk rather than first converting rows back into vectors
for spilling.

### Non-Materialized Sort

We have implemented a [non-materializing sort strategy](https://github.com/facebookincubator/velox/pull/15157) designed to improve sorting performance. The approach materializes only the sort key columns and their original vector indices, which are then used to gather the corresponding rows from the original input vectors into the output vector after the sort is complete. It replaces the `SortBuffer` with a `NonMaterizedSortBuffer`, which consists of three stages (see the sketch after the figure below):

1. Input Stage: Holds the input vector (via its shared pointer) in a list, and serializes the key columns
plus two additional index columns (VectorIndex and RowIndex) into rows, stored in a `RowContainer`.
2. Sort Stage: Sorts the rows based on the sort keys within the `RowContainer`.
3. Output Stage: Extracts the VectorIndex and RowIndex columns and uses them together to gather
the corresponding rows from the original input vectors into the output vector.

<figure>
<img src="/img/column-sort.png" height="100%" width="100%"/>
</figure>
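
The index-tracking idea can be sketched as follows. This is a minimal illustration using the same
simplified `Batch` type as the earlier sketch; `IndexedKey` and `NonMaterializedSortSketch` are
hypothetical names for this post, not the actual `NonMaterizedSortBuffer` implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

using Column = std::vector<int64_t>;
struct Batch {
  std::vector<Column> columns; // columns[0] is the sort key.
};

// Only the sort key plus two small index columns are materialized;
// the payload stays inside the original input batches.
struct IndexedKey {
  int64_t key;
  uint32_t vectorIndex; // Which input batch the row came from.
  uint32_t rowIndex;    // Row position within that batch.
};

class NonMaterializedSortSketch {
 public:
  // Input stage: hold a shared pointer to the batch and serialize only
  // the key and its (vectorIndex, rowIndex) pair.
  void addInput(std::shared_ptr<Batch> batch) {
    const auto vecIdx = static_cast<uint32_t>(inputs_.size());
    for (uint32_t r = 0; r < batch->columns[0].size(); ++r) {
      keys_.push_back({batch->columns[0][r], vecIdx, r});
    }
    inputs_.push_back(std::move(batch));
  }

  // Sort stage: sort the compact key rows; payloads are never touched.
  void noMoreInput() {
    std::sort(keys_.begin(), keys_.end(),
              [](const IndexedKey& a, const IndexedKey& b) { return a.key < b.key; });
  }

  const std::vector<IndexedKey>& sortedKeys() const { return keys_; }
  const std::vector<std::shared_ptr<Batch>>& inputs() const { return inputs_; }

 private:
  std::vector<std::shared_ptr<Batch>> inputs_;
  std::vector<IndexedKey> keys_;
};
```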

In theory, this should have significantly reduced the overhead of materializing payload columns,
especially for wide tables, since only the sort keys are materialized. However, the benchmark results
were the exact opposite of our expectations. Despite successfully eliminating the expensive serialization
overhead and reducing the total instruction count, the end-to-end performance was up to **3x slower**.

## Benchmark Results

To validate the effectiveness of the new strategy, we designed a benchmark with a varying number of
payload columns:

- Inputs: 1000 input vectors, 4096 rows per vector.
- Number of payload columns: 64, 128, 256.
- L2 cache: 80 MiB, L3 cache: 108 MiB.

| numPayloadColumns | Mode | Input Time | Sorting Time | Output Time | Total Time | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | Row-based | 4.27s | 0.79s | 4.23s | 11.64s | Row-based is 3.9x faster |
| | Columnar | 0.28s | 0.84s | 42.30s | 45.90s | |
| 128 | Row-based | 20.25s | 1.11s | 5.49s | 31.43s | Row-based is 2.0x faster |
| | Columnar | 0.27s | 0.51s | 59.15s | 64.20s | |
| 256 | Row-based | 29.34s | 1.02s | 12.85s | 51.48s | Row-based is 3.0x faster |
| | Columnar | 0.87s | 1.10s | 144.00s | 154.80s | |


The benchmark results confirm that Row-based Sort is the superior strategy,
delivering a 2.0x to 3.9x overall speedup compared to Columnar Sort. While Row-based Sort
incurs a significantly higher upfront cost during the Input phase (peaking at 29.34s), it
maintains a highly stable and efficient Output phase (at most 12.85s). In contrast, Columnar
Sort suffers from severe performance degradation in the Output phase as the payload widens,
with execution times surging from 42.30s to 144.00s, resulting in a much slower total execution time
despite its negligible input overhead.


To identify the root cause of the performance divergence, we used `perf stat` to analyze
micro-architectural efficiency and `perf mem` to profile memory access patterns during the critical
Output phase.


| Metric | Row-based | Columnar | Notes |
| --- | --- | --- | --- |
| Total Instructions | 555.6 Billion | 475.6 Billion | Row +17% |
| IPC (Instructions Per Cycle) | 2.4 | 0.82 | Row 2.9x Higher |
| LLC Load Misses (Last Level Cache) | 0.14 Billion | 5.01 Billion | Columnar 35x Higher |

| Memory Level | Row-based Output | Columnar Output |
| --- | --- | --- |
| RAM Hit | 5.8% | 38.1% |
| LFB Hit | 1.7% | 18.9% |


The results reveal a stark contrast in CPU utilization. Although the Row-based approach
executes 17% more instructions (due to serialization overhead), it maintains a high IPC of 2.4,
indicating a fully utilized pipeline. In contrast, the Columnar approach suffers from a low IPC
of 0.82, meaning the CPU is stalled for the majority of cycles. This is directly driven by the
35x difference in LLC Load Misses, which forces the Columnar implementation to fetch data from
main memory repeatedly. The memory profile further confirms this bottleneck: the Columnar mode is
severely latency-bound, with 38.1% of its sampled loads served from DRAM (RAM Hit) and
significant congestion in the Line Fill Buffers (18.9% LFB Hit), while the Row-based
mode effectively utilizes the cache hierarchy.

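A quick cycle count makes the gap concrete. Since cycles = instructions / IPC, the Row-based run
consumes roughly 555.6B / 2.4 ≈ 231 billion cycles, while the Columnar run needs about
475.6B / 0.82 ≈ 580 billion cycles: around 2.5x more cycles despite executing 17% fewer instructions.
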
## The Memory Access Pattern

Why does the non-materializing sort, specifically its gather method, cause so many cache misses?
The answer lies in its memory access pattern. Since Velox is a columnar engine, the output is
constructed column by column. For each column in an output vector, the gather process does the
following (see the sketch after this list):

1. Iterates through all rows of the current output vector.
2. For each row, locates the corresponding input vector via the sorted vector index.
3. Locates the source row within that input vector.
4. Copies the data from that single source cell to the target cell.
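
Reusing the `Batch` and `IndexedKey` types from the sketch in the Non-Materialized Sort section,
the gather loop looks roughly like this; the comments mark where the random accesses occur. This is
an illustration of the access pattern, not the exact implementation in the PR.

```cpp
// Output stage sketch: build the output column by column, following the
// sorted (vectorIndex, rowIndex) pairs.
Batch gatherOutput(const std::vector<IndexedKey>& sortedKeys,
                   const std::vector<std::shared_ptr<Batch>>& inputs,
                   size_t numColumns) {
  Batch out;
  out.columns.resize(numColumns);
  for (size_t c = 0; c < numColumns; ++c) {
    for (const IndexedKey& k : sortedKeys) {
      // Random jump #1: an arbitrary input batch per output row.
      const Batch& src = *inputs[k.vectorIndex];
      // Random jump #2: an arbitrary row within that batch's column. A whole
      // cache line is fetched to deliver a single 8-byte value, and by the
      // time column c + 1 revisits this batch, the line is likely evicted.
      out.columns[c].push_back(src.columns[c][k.rowIndex]);
    }
  }
  return out;
}
```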

The sorted indices, by nature, offer low predictability. This forces the gather operation for a single
output column to jump unpredictably across as many as 1000 different input vectors, fetching just one
value from each. This random access pattern has two devastating consequences for performance.


First, at the micro level, every single data read becomes a "long-distance" memory jump.
The CPU's hardware prefetcher is rendered ineffective by this chaotic access pattern,
so almost every lookup yields a cache miss.


Second, at the macro level, the problem compounds with each column processed. The sheer
volume of data touched, potentially all 1000 input vectors, exceeds the size of the L3 cache.
This ensures that by the time we start processing the next payload column, the necessary vectors
have already been evicted from the cache. Consequently, the gather process must re-fetch the same
vector metadata and data from main memory over and over again for each of the 256 payload columns.
This results in 256 passes of cache-thrashing random memory access, leading to a catastrophic
number of cache misses and explaining the severe performance degradation.

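A rough back-of-envelope calculation, assuming 8-byte values: one payload column spread across
1000 input vectors of 4096 rows is about 1000 × 4096 × 8 B ≈ 31 MiB, scattered over 1000 separate
allocations, and gathering all 256 payload columns streams roughly 8 GiB through a 108 MiB L3 cache.
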
<figure>
<img src="/img/columnar-mem.png" height="100%" width="100%"/>
</figure>

In contrast, Velox’s current row-based approach serializes all input vectors into rows, with
each allocation producing a contiguous buffer that holds a subset of those rows. Despite the
serialization cost, the row layout preserves strong locality when materializing output
vectors: once rows are in the cache, they can be used to extract multiple output columns.
This leads to much better cache-line utilization and fewer cache misses than a columnar layout,
where each fetched line often yields only a single value per column. Moreover, the largely
sequential scans over contiguous buffers let the hardware prefetcher operate effectively,
boosting throughput even in the presence of serialization overhead.

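As a rough illustration of why the row layout is prefetcher-friendly, consider extracting one
output column from a contiguous buffer of fixed-size serialized rows. The flat layout below is an
assumption for the sketch; Velox's `RowContainer` layout is more elaborate.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative flat row layout: each serialized row occupies a fixed-size
// slot in one contiguous buffer, with column c at byte offset 8 * c.
// Extracting a column scans the buffer with a constant stride, a pattern
// that hardware prefetchers recognize and stream ahead of the CPU.
std::vector<int64_t> extractColumn(const std::vector<uint8_t>& rowBuffer,
                                   size_t rowSize, size_t numRows, size_t column) {
  std::vector<int64_t> out(numRows);
  for (size_t r = 0; r < numRows; ++r) {
    std::memcpy(&out[r], rowBuffer.data() + r * rowSize + column * 8, sizeof(int64_t));
  }
  return out;
}
```
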
<figure>
<img src="/img/row-mem.png" height="100%" width="100%"/>
</figure>


## Conclusion

This study reinforces the core principle of performance engineering: Hardware Sympathy.
Without understanding the characteristics of the memory hierarchy and optimizing for it,
simply reducing the instruction count does not guarantee better performance.

## References

- [1] https://velox-lib.io/blog/velox-primer-part-1/
- [2] https://duckdb.org/pdf/ICDE2023-kuiper-muehleisen-sorting.pdf
- [3] https://duckdb.org/2021/08/27/external-sorting
