Skip to content

Commit 70b3dd5

Browse files
jimbothigpenclaude
andcommitted
phase 3a ggml-org#20 fix: strict-aliasing-safe outputs[] for parallel bitpack
Replaces `uint8_t * outputs = (uint8_t *)x;` (alias onto __shared__ float x[128]) with a dedicated `__shared__ uint8_t s_outputs[128]` array. Root cause (session-65-resume-cell-c-ppl bisect, 2026-05-17): the buun ggml-org#20 parallel 49-thread bitpack reads outputs[sym_idx] from sids 1..48 (sid==0 writes the canonical winning-path bytes during backtrack, the __syncthreads() following the backtrack is meant to publish those writes to all sids). Under HIP/ROCm, however, the uint8_t-into-float[] alias is a strict-aliasing violation, and the compiler can hoist or cache cross-thread reads of outputs[] above the __syncthreads(), so sids 1..48 observed stale (non-winning) values. Bytes 1..48 of dst_blk->qs[] were therefore packed from arbitrary leftover bits, corrupting the symbol bitstream past byte 0 and producing the +12.7% PPL regression session-64 measured. The fix is purely a storage-type cleanup: declare s_outputs[] as a typed uint8_t shared array (128 bytes) and point the local `outputs` pointer at it. All backtrack writes, recon_norm reads, and parallel-bitpack reads now land in a type-consistent shared buffer, eliminating the alias hazard. The __syncthreads() between the backtrack section and the bitpack section is unchanged — the publish-to-all-sids semantics are now actually honored. No algorithmic / arithmetic change. Buun's perf-gain intent (parallel bitpack across 49 threads) is retained. Expected PPL: bit-identical to anchor `[[phase-3-anchor-post-s60]]` = 6.9020 +/- 0.05337 at chunks 1-4 (Qwen3.5-9B-Q4_K_M turboq3_tcq KV, ai00 ROCm, n_seq=1, -c 4096 -ub 512 -b 512, GGML_CUDA_DISABLE_GRAPHS=1). Untouched: k_set_rows_turboq2_tcq retains its (uint8_t *)x alias because that kernel still uses the serial sid==0-only bitpack (writer and reader are the same thread, so the strict-aliasing UB doesn't manifest as a cross-thread visibility bug). If turboq2 ever gains a parallel bitpack, it will need the same fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e8f2430 commit 70b3dd5

1 file changed

Lines changed: 12 additions & 2 deletions

File tree

ggml/src/ggml-cuda/set-rows.cu

Lines changed: 12 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1149,6 +1149,14 @@ static __global__ void __launch_bounds__(512, 1) k_set_rows_turboq3_tcq(
11491149
__shared__ float pred_min_cost[64];
11501150
__shared__ uint8_t pred_min_p[64];
11511151
__shared__ int shared_initial_state;
1152+
// Dedicated shared buffer for the Viterbi-backtrack output bytes. Previously
1153+
// aliased onto x[] via (uint8_t *)x, but writing uint8_t into a float-typed
1154+
// shared array is a strict-aliasing violation: under HIP/ROCm the compiler
1155+
// can hoist cross-thread reads of outputs[] above the __syncthreads() that
1156+
// follows the sid==0 backtrack write, so sids 1..48 in the parallel bitpack
1157+
// observed stale (non-winning) symbol bytes — root cause of the Phase 3a #20
1158+
// +12.7% PPL regression (session-65-resume-cell-c-ppl bisect, 2026-05-17).
1159+
__shared__ uint8_t s_outputs[128];
11521160

11531161
// Parallel pre-Viterbi: load (threads 0-127 each grab one element)
11541162
if (sid < 128) x[sid] = grp_src[sid];
@@ -1301,8 +1309,10 @@ static __global__ void __launch_bounds__(512, 1) k_set_rows_turboq3_tcq(
13011309
d_tcq_dump_x_buf[group * 128 + sid] = x[sid];
13021310

13031311
// Thread 0: backtrack (inherently sequential — each step depends on the next)
1304-
// Reads byte-packed bt from global memory (no nibble unpack).
1305-
uint8_t * outputs = (uint8_t *)x; // x[] no longer needed after forward pass
1312+
// Reads byte-packed bt from global memory (no nibble unpack). Writes the
1313+
// winning-path output bytes into __shared__ s_outputs[] (type-clean; see
1314+
// declaration comment above).
1315+
uint8_t * outputs = s_outputs;
13061316
if (sid == 0) {
13071317
int state = shared_initial_state;
13081318
for (int t = 127; t >= 0; t--) {

0 commit comments

Comments
 (0)