Commit 70b3dd5
phase 3a ggml-org#20 fix: strict-aliasing-safe outputs[] for parallel bitpack
Replaces `uint8_t * outputs = (uint8_t *)x;` (alias onto __shared__ float
x[128]) with a dedicated `__shared__ uint8_t s_outputs[128]` array.
Root cause (session-65-resume-cell-c-ppl bisect, 2026-05-17): the buun ggml-org#20
parallel 49-thread bitpack reads outputs[sym_idx] from sids 1..48 (sid==0
writes the canonical winning-path bytes during backtrack; the __syncthreads()
following the backtrack is meant to publish those writes to all sids). Under
HIP/ROCm, however, the uint8_t-into-float[] alias is a strict-aliasing
violation, and the compiler can hoist or cache cross-thread reads of
outputs[] above the __syncthreads(), so sids 1..48 observed stale
(non-winning) values. Bytes 1..48 of dst_blk->qs[] were therefore packed
from arbitrary leftover bits, corrupting the symbol bitstream past byte 0
and producing the +12.7% PPL regression session-64 measured.
The fix is purely a storage-type cleanup: declare s_outputs[] as a typed
uint8_t shared array (128 bytes) and point the local `outputs` pointer at
it. All backtrack writes, recon_norm reads, and parallel-bitpack reads now
land in a type-consistent shared buffer, eliminating the alias hazard. The
__syncthreads() between the backtrack section and the bitpack section is
unchanged; its publish-to-all-sids semantics are now actually honored.
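A minimal CUDA sketch of the storage pattern described above, not the actual kernel. The names `pack_demo` and `run_pack_demo`, the stand-in "backtrack" loop, and the two-4-bit-symbols-per-byte layout are hypothetical and chosen only for illustration; only the 128-entry buffers, the single-writer thread, and the barrier before the parallel pack come from the commit message.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

constexpr int N_SYMS  = 128;         // matches the 128-entry buffers in the commit
constexpr int N_BYTES = N_SYMS / 2;  // hypothetical layout: two 4-bit symbols per byte

__global__ void pack_demo(const float * src, uint8_t * dst) {
    __shared__ float   x[N_SYMS];          // float working buffer, no longer aliased
    __shared__ uint8_t s_outputs[N_SYMS];  // dedicated, typed symbol buffer
    uint8_t * outputs = s_outputs;         // previously: (uint8_t *)x

    const int sid = threadIdx.x;
    x[sid] = src[sid];
    __syncthreads();

    // Stand-in for the backtrack: thread 0 alone writes every symbol.
    if (sid == 0) {
        for (int i = 0; i < N_SYMS; ++i) {
            outputs[i] = (uint8_t)((int) x[i] & 0xF);
        }
    }
    __syncthreads();  // publish thread 0's writes before any other thread reads

    // Parallel pack: each of the first N_BYTES threads emits one output byte.
    if (sid < N_BYTES) {
        dst[sid] = (uint8_t)(outputs[2 * sid] | (outputs[2 * sid + 1] << 4));
    }
}

// Host helper (hypothetical): runs the kernel once and checks every packed byte.
bool run_pack_demo() {
    float   h_src[N_SYMS];
    uint8_t h_dst[N_BYTES] = {0};
    for (int i = 0; i < N_SYMS; ++i) h_src[i] = (float) i;

    float *   d_src = nullptr;
    uint8_t * d_dst = nullptr;
    if (cudaMalloc(&d_src, sizeof(h_src)) != cudaSuccess) return false;
    if (cudaMalloc(&d_dst, sizeof(h_dst)) != cudaSuccess) { cudaFree(d_src); return false; }
    cudaMemcpy(d_src, h_src, sizeof(h_src), cudaMemcpyHostToDevice);

    pack_demo<<<1, N_SYMS>>>(d_src, d_dst);
    // The blocking copy also waits for the kernel to finish.
    bool ok = cudaMemcpy(h_dst, d_dst, sizeof(h_dst), cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_src);
    cudaFree(d_dst);

    for (int b = 0; ok && b < N_BYTES; ++b) {
        const uint8_t lo = (uint8_t)((2 * b) & 0xF);
        const uint8_t hi = (uint8_t)((2 * b + 1) & 0xF);
        ok = h_dst[b] == (uint8_t)(lo | (hi << 4));
    }
    return ok;
}
```

Calling `run_pack_demo()` from a `main` and checking its return value is enough to exercise the sketch. The point of the layout is that every access to the symbol bytes goes through an object declared as `uint8_t`, so the readers after the barrier never touch storage that was declared as `float`.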
No algorithmic / arithmetic change. Buun's perf-gain intent (parallel
bitpack across 49 threads) is retained.
Expected PPL: bit-identical to anchor `[[phase-3-anchor-post-s60]]` =
6.9020 +/- 0.05337 at chunks 1-4 (Qwen3.5-9B-Q4_K_M turboq3_tcq KV, ai00
ROCm, n_seq=1, -c 4096 -ub 512 -b 512, GGML_CUDA_DISABLE_GRAPHS=1).
Untouched: k_set_rows_turboq2_tcq retains its (uint8_t *)x alias because
that kernel still uses the serial sid==0-only bitpack (writer and reader
are the same thread, so the strict-aliasing UB doesn't manifest as a
cross-thread visibility bug). If turboq2 ever gains a parallel bitpack,
it will need the same fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e8f2430 commit 70b3dd5
1 file changed
Lines changed: 12 additions & 2 deletions
@@ -1149,6 +1149,14 @@ (8 lines added at new lines 1152-1159)
@@ -1301,8 +1309,10 @@ (old lines 1304-1305 replaced by new lines 1312-1315)