psema/shard: enable Windows COFF sharded build-obj + cut allocator/condvar contention by dylan-conway · Pull Request #20 · oven-sh/zig

dylan-conway · 2026-04-22T10:37:04Z

Summary

Enables build-obj --no-link --llvm-no-merge-shards --llvm-codegen-threads=N for x86_64-windows-msvc targets, and removes the host-side bottlenecks that kept the Windows compiler from scaling past ~2.7× under parallel sema + sharded emit.

bun debug zig step, 24-core x86_64 Windows host:

	serial	psema	psema + 24 shards
before	208.2s	165.0s	77.0s
after	165.1s	151.7s	35.1s

5.9× vs the original serial baseline (was 2.7× before this PR).

Changes

COFF shard emission (`4073e7160c`)

Shard naming uses target.ofmt.fileExt() instead of hardcoded .o in Compilation.zig, link.zig, and Build/Step/Compile.zig, so COFF targets get foo.{i}.obj. ELF/Mach-O behaviour unchanged.
Hash llvm_codegen_threads / llvm_no_merge_shards / no_link_obj into the cache key — these change the output file set, so a stale cache hit would otherwise produce the wrong layout.
std.c._msize: drop const to match the Windows SDK declaration so the C-backend bootstrap (zig2.c) compiles under clang-cl/msvc.
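
As a rough illustration of the naming scheme described above, the sketch below formats shard paths as `{stem}.{i}{ext}`. `shardName` is a hypothetical helper, not a function from this PR, and the extension is passed in as a plain string rather than taken from `target.ofmt.fileExt()`.

```zig
const std = @import("std");

/// Hypothetical helper (not from the PR): formats the i-th shard path
/// as "{stem}.{i}{ext}" into the caller's buffer.
fn shardName(buf: []u8, stem: []const u8, i: u32, ext: []const u8) ![]u8 {
    return std.fmt.bufPrint(buf, "{s}.{d}{s}", .{ stem, i, ext });
}

test "shard names follow the object extension" {
    var buf: [64]u8 = undefined;
    // COFF-style extension, as the PR describes for Windows targets.
    try std.testing.expectEqualStrings("foo.3.obj", try shardName(&buf, "foo", 3, ".obj"));
    // ELF/Mach-O-style extension stays ".o".
    try std.testing.expectEqualStrings("foo.3.o", try shardName(&buf, "foo", 3, ".o"));
}
```

Taking the extension as a parameter keeps the sketch independent of any particular `std.Target` API shape; the real code derives it from the target's object format.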

Host contention (`89647b5ee9`)

src/main.zig: when link_libc and not Debug, use smp_allocator instead of raw_c_allocator. The Windows CRT routes malloc → HeapAlloc(GetProcessHeap()) behind a single critical section, so 24 sema workers serialise on it.
std.Thread.Condition: use FutexImpl on Windows. The CONDITION_VARIABLE wrapper had no userspace "no waiters" fast-path, so every work_queue_cond.signal() and claim-shard cond.signal() was a kernel32 call. FutexImpl.wake() checks wakeable == 0 first; the underlying Futex already maps to RtlWaitOnAddress on Win8+.
tools/mimalloc_new_delete_override.cpp + windows-gnu CI splice: LLVM's operator new still hit the CRT heap. The override TU compiles mimalloc's unity static.c and provides the C++ replaceable global operators; mirrors the existing linux-musl MI_MALLOC_OVERRIDE step (which uses POSIX symbol interposition for malloc/free — that doesn't statically link on Windows, but operator new replacement does and is where LLVM's hot allocations go).
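
The "no waiters" fast path mentioned above can be modelled in a few lines. This is a toy model only, assuming a plain waiter counter; `WakeModel` and `kernel_wakes` are hypothetical names, it is not a working condition variable, and it is not how `FutexImpl` lays out its state. It shows only why a signal with nobody parked never needs to leave userspace.

```zig
const std = @import("std");

/// Toy model (hypothetical): counts how often a signal would reach the kernel.
const WakeModel = struct {
    waiters: std.atomic.Value(u32) = std.atomic.Value(u32).init(0),
    kernel_wakes: u32 = 0,

    fn signal(self: *WakeModel) void {
        // Fast path: nobody is parked, so skip the wake call entirely.
        if (self.waiters.load(.monotonic) == 0) return;
        // Slow path: a real implementation would issue a futex wake here.
        self.kernel_wakes += 1;
    }
};

test "signal with no waiters stays in userspace" {
    var m: WakeModel = .{};
    m.signal();
    try std.testing.expectEqual(@as(u32, 0), m.kernel_wakes);
}
```

In the workload the PR describes, most `signal()` calls happen while every worker is busy, so the fast path is the common case.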

Verified

bun run build (Debug) and bun run build:release both link 24 COFF shards via lld-link and pass smoke test on x86_64-windows-msvc.
zig test lib/std/std.zig passes with the Condition change.

coderabbitai · 2026-04-22T10:43:29Z

Walkthrough

Build workflow adds Windows-GNU mimalloc splice; object-file handling generalized to target-specific extensions (e.g. .obj); LLVM shard naming, shard-flush filenames, and compilation cache keys updated; Condition defaults to FutexImpl for non-single-threaded targets; startup allocator selection changed by builtin.mode; _msize C import signature fixed; navShard now hashes file shard key plus nav FQN.

Changes

Cohort / File(s)	Summary
Windows-GNU mimalloc integration `/.github/workflows/bun_build.yaml`, `tools/mimalloc_new_delete_override.cpp`	Added a Windows-GNU conditional step to splice a mimalloc object into the bootstrap build and adjusted the existing linux-musl step label/comment. Added `tools/mimalloc_new_delete_override.cpp` providing global `operator new`/`operator delete` forwarding to mimalloc.
Object extension, shard naming, and linking `lib/std/Build/Step/Compile.zig`, `src/Compilation.zig`, `src/link.zig`	Removed hardcoded `.o` assumptions: derive object extension via `ofmt.fileExt(...)`, strip that extension when computing stems, and generate shard filenames as `"{stem}.{i}{ext}"` (handles COFF `.obj`). Updated docs and filename-trimming logic accordingly.
Compilation cache & shard behavior `src/Compilation.zig`	Added `options.llvm_codegen_threads`, `options.llvm_no_merge_shards`, and `options.no_link_obj` to the compilation cache key; wired LLVM codegen thread count from resolved thread-pool size; updated shard flush logic to use the computed `obj_ext`.
Thread/Condition implementation change `lib/std/Thread/Condition.zig`	Removed the `.windows` WindowsImpl branch so `FutexImpl` is chosen for all non-`builtin.single_threaded` platforms; added comment explaining use on Windows.
Allocator selection change & build-runner wiring `src/main.zig`	When `builtin.link_libc` is enabled, allocator selection now conditions on `builtin.mode == .Debug` (Debug preserves alignment-sensitive previous behavior; non-Debug uses `std.heap.smp_allocator`). Also sets `llvm_codegen_threads` for the build-runner from the resolved thread-pool size.
C binding fix `lib/std/c.zig`	Changed `_msize` import parameter from `?const anyopaque` to `?anyopaque` to match the C signature.
Nav shard determinism `src/Zcu.zig`	`navShard` now computes shards by hashing the file's `shardKey` and the `Nav` FQN together (instead of delegating to `File.computeShard(n)`); updated doc comment on determinism and anonymous FQNs.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main objectives of the PR: enabling Windows COFF sharded build-obj and reducing allocator/condvar contention.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the changes, rationale, and measured improvements.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/std/Thread/Condition.zig`:
- Around line 116-119: The conditional alias Impl now only chooses
SingleThreadedImpl or FutexImpl, leaving WindowsImpl unreachable; remove
WindowsImpl or gate it behind an explicit fallback/feature flag. Update the
alias selection logic referencing Impl and the concrete types
SingleThreadedImpl, FutexImpl, and WindowsImpl: either delete the WindowsImpl
definition and any uses, or add a clear conditional branch (e.g., if
(builtin.windows or feature flag) then WindowsImpl else ...) so WindowsImpl is
intentionally reachable and documented. Ensure any tests or references to
WindowsImpl are updated or removed accordingly.

In `@src/Compilation.zig`:
- Around line 2191-2196: The cache key currently uses raw CLI flags
(options.llvm_no_merge_shards and options.no_link_obj) which are normalized
later, causing unnecessary cache misses; update the cache hashing to use the
canonicalized booleans (e.g., comp.no_merge_shards and comp.no_link_obj or
whatever normalized fields exist) together with options.llvm_codegen_threads so
shard/merge/link semantics are represented by the effective config rather than
the raw flags — locate the cache.hash.add calls around
options.llvm_codegen_threads and replace the additions of the raw option flags
with additions of the normalized comp.no_merge_shards and comp.no_link_obj
values.

In `@src/main.zig`:
- Around line 182-188: The comment above the Debug-path is incorrect: the code
checks builtin.mode == .Debug and prefers std.heap.raw_c_allocator, only falling
back to std.heap.c_allocator for over-aligned requests (when
`@alignOf`(std.c.max_align_t) < `@max`(`@alignOf`(i128), std.atomic.cache_line)); also
note the -Ddebug-gpa case is handled earlier. Update the comment to state that
in Debug mode the allocator chosen is raw_c_allocator by default with a fallback
to c_allocator for over-aligned allocations, and remove the misleading claim
about “keeping c_allocator” and the -Ddebug-gpa handling.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1c497521-9d62-490c-b73a-afd2db9cf3cd

📥 Commits

Reviewing files that changed from the base of the PR and between 0bcf4c3 and 89647b5.

📒 Files selected for processing (8)

.github/workflows/bun_build.yaml
lib/std/Build/Step/Compile.zig
lib/std/Thread/Condition.zig
lib/std/c.zig
src/Compilation.zig
src/link.zig
src/main.zig
tools/mimalloc_new_delete_override.cpp

coderabbitai · 2026-04-22T10:43:32Z

 const Impl = if (builtin.single_threaded)
    SingleThreadedImpl
-else if (builtin.os.tag == .windows)
-    WindowsImpl
 else
    FutexImpl;


🧹 Nitpick | 🔵 Trivial

WindowsImpl looks unreachable after this selection change.

Since Impl no longer selects WindowsImpl, consider removing it (or clearly parking it behind an explicit fallback gate) to reduce maintenance and avoid bit-rot.

coderabbitai · 2026-04-22T10:43:32Z

+        // Sharded codegen changes the output file *set* (one merged object vs.
+        // N shard objects), so the count and the merge/no-link knobs must be
+        // part of the cache key.
+        cache.hash.add(options.llvm_codegen_threads);
+        cache.hash.add(options.llvm_no_merge_shards);
+        cache.hash.add(options.no_link_obj);


🧹 Nitpick | 🔵 Trivial

Hash the normalized shard flags, not the raw CLI values.

comp.no_link_obj and comp.no_merge_shards are normalized later, so these raw option values can split the cache for configurations that produce the same artifact layout. For example, --no-link-obj --llvm-codegen-threads>1 without --llvm-no-merge-shards still behaves like linking. Hash the derived booleans instead to avoid unnecessary misses.

♻️ Proposed refactor

+        const no_merge_shards = options.llvm_no_merge_shards and options.llvm_codegen_threads > 1;
+        const no_link_obj = options.no_link_obj and
+            (options.llvm_codegen_threads <= 1 or no_merge_shards);
+
         // Sharded codegen changes the output file *set* (one merged object vs.
         // N shard objects), so the count and the merge/no-link knobs must be
         // part of the cache key.
         cache.hash.add(options.llvm_codegen_threads);
-        cache.hash.add(options.llvm_no_merge_shards);
-        cache.hash.add(options.no_link_obj);
+        cache.hash.add(no_merge_shards);
+        cache.hash.add(no_link_obj);

-        .no_link_obj = options.no_link_obj and
-            (options.llvm_codegen_threads <= 1 or options.llvm_no_merge_shards),
-        .no_merge_shards = options.llvm_no_merge_shards and options.llvm_codegen_threads > 1,
+        .no_link_obj = no_link_obj,
+        .no_merge_shards = no_merge_shards,
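
The normalization this suggestion relies on can be checked in isolation. The sketch below is a standalone model, assuming the two derived booleans are computed exactly as in the proposed refactor; `Effective` and `normalize` are hypothetical names and nothing here touches the real `Compilation` cache.

```zig
const std = @import("std");

/// Hypothetical container for the derived (effective) flags.
const Effective = struct { no_merge_shards: bool, no_link_obj: bool };

/// Mirrors the expressions in the proposed refactor: raw CLI values in,
/// effective shard/link behaviour out.
fn normalize(threads: u32, raw_no_merge: bool, raw_no_link: bool) Effective {
    const no_merge_shards = raw_no_merge and threads > 1;
    return .{
        .no_merge_shards = no_merge_shards,
        .no_link_obj = raw_no_link and (threads <= 1 or no_merge_shards),
    };
}

test "raw flags that differ can normalize to the same layout" {
    // With one codegen thread, the no-merge flag has no effect.
    try std.testing.expectEqual(normalize(1, false, false), normalize(1, true, false));
    // no-link with several threads but without no-merge behaves like linking.
    try std.testing.expectEqual(normalize(8, false, false), normalize(8, false, true));
}
```

Each pair in the test would hash differently if the raw flags went into the cache key, even though the effective configuration is identical.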

coderabbitai · 2026-04-22T10:43:32Z

+            // with libc malloc used elsewhere (LLVM, C++). Debug keeps
+            // c_allocator so leak tooling and -Ddebug-gpa stay accurate.
+            if (builtin.mode == .Debug) {
+                if (@alignOf(std.c.max_align_t) < @max(@alignOf(i128), std.atomic.cache_line)) {
+                    break :gpa .{ std.heap.c_allocator, false };
+                }
+                break :gpa .{ std.heap.raw_c_allocator, false };


⚠️ Potential issue | 🟡 Minor

Fix the debug-path comment.

Lines 183-188 do not “keep c_allocator” in Debug; they still prefer std.heap.raw_c_allocator and only fall back to std.heap.c_allocator for over-aligned requests. The -Ddebug-gpa case is also handled earlier at Lines 173-174, so this comment currently describes a different control path than the code below it.

Suggested comment update

-            // with libc malloc used elsewhere (LLVM, C++). Debug keeps
-            // c_allocator so leak tooling and -Ddebug-gpa stay accurate.
+            // with libc malloc used elsewhere (LLVM, C++). Debug stays on
+            // the libc-backed allocators here: prefer raw_c_allocator for the
+            // fast path, but fall back to c_allocator when over-aligned
+            // allocations would violate max_align_t. The -Ddebug-gpa path is
+            // handled above.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Zcu.zig`:
- Around line 4815-4836: The new navShard implementation breaks the “same-file
=> same shard” guarantee needed for module-level asm (global_assembly)
referenced by nav_val/nav_ty/func keys; update the sharding logic so asm-owning
units are assigned by file-only affinity: in either navShard or in
analUnitShard, detect when the nav/anal unit contains a global_assembly (or will
be used for global_assembly emission from nav_val/nav_ty/func) and compute the
shard using only the file_key (i.e., hash(file_key) % n) so all asm from the
same file land in the same shard, while leaving the existing combined
file_key+fqn hashing for non-asm navs. Ensure you reference navShard,
analUnitShard, and the global_assembly/nav_val/nav_ty/func paths when making the
change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5db05c4f-0a2c-4be8-ba45-45f2ded44f97

📥 Commits

Reviewing files that changed from the base of the PR and between 89647b5 and 3630eb8.

📒 Files selected for processing (1)

src/Zcu.zig

coderabbitai · 2026-04-22T22:14:41Z

+/// Shard assignment for `nav`. Keyed on the file's `shardKey` *plus* the
+/// nav's fully-qualified name so a single file with thousands of generic
+/// instantiations (e.g. printf-style formatters that monomorphise per call
+/// site) doesn't pin the entire emit wall-clock to one LLVM module.
+///
+/// Determinism: the shard key is content-derived (path + FQN bytes), but
+/// FQNs of anonymous types embed InternPool indices (`__anon_N`) which are
+/// not stable across parallel-sema runs. That's no regression — the
+/// per-shard *symbol names* already carry those indices via `shardedNavName`
+/// and the type-name suffix, so sharded `build-obj` output was never
+/// bit-reproducible under `ZIG_PARALLEL_SEMA`. CI release builds use
+/// `--llvm-codegen-threads=1` (no sharding) and remain reproducible. A
+/// proper fix needs structural type-hash naming; tracked separately.
 pub fn navShard(zcu: *Zcu, nav: InternPool.Nav.Index, n: u32) u32 {
    if (n <= 1) return 0;
-    return zcu.navFileScope(nav).computeShard(n);
+    const ip = &zcu.intern_pool;
+    var buf: [512]u8 = undefined;
+    const file_key = zcu.navFileScope(nav).shardKey(&buf);
+    var h: std.hash.Wyhash = .init(0);
+    h.update(file_key);
+    h.update(ip.getNav(nav).fqn.toSlice(ip));
+    return @intCast(h.final() % n);


⚠️ Potential issue | 🟠 Major

Keep same-file shard affinity for module-level asm users.

This change drops the old “same file => same shard” invariant, but analUnitShard still sends nav_val/nav_ty/func-keyed global_assembly through navShard. That clashes with the contract called out on Lines 4839-4842: asm strings that reference other same-file navs (for example .set aliases) no longer reliably land in the same LLVM module, so the integrated assembler can miss symbols or force duplicate emission. Please preserve file-level affinity for asm-owning units, or add a separate shard-affinity path for global_assembly.
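
One way to picture the suggested split is a router that keeps file-only affinity for asm-owning units and uses the combined key for everything else. This is a sketch of the reviewer's suggestion under stated assumptions, not code from the PR: `fileShard`, `fileFqnShard`, `routeShard`, and the `owns_global_asm` flag are all hypothetical, and the file key and FQN strings are made up for illustration.

```zig
const std = @import("std");

/// File-only routing: every unit from one file lands in the same shard.
fn fileShard(file_key: []const u8, n: u32) u32 {
    if (n <= 1) return 0;
    return @intCast(std.hash.Wyhash.hash(0, file_key) % n);
}

/// File + FQN routing, following the shape of the new navShard body.
fn fileFqnShard(file_key: []const u8, fqn: []const u8, n: u32) u32 {
    if (n <= 1) return 0;
    var h = std.hash.Wyhash.init(0);
    h.update(file_key);
    h.update(fqn);
    return @intCast(h.final() % n);
}

/// Hypothetical router: asm-owning units keep file affinity, others spread by FQN.
fn routeShard(file_key: []const u8, fqn: []const u8, owns_global_asm: bool, n: u32) u32 {
    return if (owns_global_asm) fileShard(file_key, n) else fileFqnShard(file_key, fqn, n);
}

test "asm-owning units from one file share a shard" {
    const a = routeShard("src/output.zig", "output.fmtA", true, 24);
    const b = routeShard("src/output.zig", "output.fmtB", true, 24);
    try std.testing.expectEqual(a, b);
    try std.testing.expect(a < 24);
}
```

How the compiler would know that a unit owns module-level asm is left open here; the sketch only shows that the two routing rules can coexist behind one entry point.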

coderabbitai

♻️ Duplicate comments (1)

src/main.zig (1)

176-188: ⚠️ Potential issue | 🟡 Minor

Update the Debug-path comment.

The comment still says Debug “keeps c_allocator”, but the fast path at Line 188 returns std.heap.raw_c_allocator; std.heap.c_allocator is only used for the over-aligned fallback at Lines 185-187, and -Ddebug-gpa is handled earlier at Line 173.

Suggested comment update

-            // with libc malloc used elsewhere (LLVM, C++). Debug keeps
-            // c_allocator so leak tooling and -Ddebug-gpa stay accurate.
+            // with libc malloc used elsewhere (LLVM, C++). Debug stays on
+            // libc-backed allocators here: prefer raw_c_allocator on the
+            // fast path, but fall back to c_allocator for over-aligned
+            // allocations. The -Ddebug-gpa path is handled above.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/main.zig` around lines 176 - 188, The comment above the GPA selection is
out of date: the Debug branch (builtin.mode == .Debug) usually returns
std.heap.raw_c_allocator with std.heap.c_allocator used only for the
over-aligned fallback (when `@alignOf`(std.c.max_align_t) < `@max`(`@alignOf`(i128),
std.atomic.cache_line)), and the -Ddebug-gpa behavior is handled earlier; update
the comment to state that Debug normally uses raw_c_allocator, that c_allocator
is only for the over-aligned case, and that debug-gpa is already addressed prior
to this gpa block (referencing the gpa selection and the symbols builtin.mode,
std.heap.raw_c_allocator, std.heap.c_allocator, and the -Ddebug-gpa handling).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 329cb1a7-3334-4fd2-bc9a-0420076e93ea

📥 Commits

Reviewing files that changed from the base of the PR and between 3630eb8 and 2bb76ed.

📒 Files selected for processing (1)

src/main.zig

- ZIG_PARALLEL_SEMA: Sema runs concurrently across worker threads with per-unit claim/wait, retry-on-dependency-cycle, and per-map mutexes replacing the global sema_lock for the non-incremental fast path.
- InternPool: thread-safe writers (locked single-field setters, seqlock on getNav, sorted-shard prelocking for getFunc*Ies, 256 hash shards).
- llvm backend: PartitionSet emits N independent llvm modules in parallel; cross-shard refs are linkonce_odr; --llvm-codegen-threads=N partitions by file path; --llvm-no-merge-shards leaves shard .o files unmerged.
- link.MachO -r: handle N shard inputs; emit hidden defs as private-extern; convert tentatives so Apple ld_new accepts the merged object.
- link.Elf: handle N shard inputs; batch preads in writeRelocatable to avoid per-atom syscall storm under heavy COMDAT section counts.
- link.Lld: pass all shard paths to lld for elf/coff/wasm.
- std.Build.Step.Compile: llvm_codegen_threads, llvm_no_merge_shards.

@Fence

Memory model (ARM64):
- getNav seqlock: payload loads .unordered -> .acquire so b2 cannot reorder before them (LDAR-before-LDAR is ordered; @fence is gone).
- setFieldTypesAlignsAll: memcpy [0..len-1) then release-store [len-1] inside the mutex; remove the post-mutex re-store in structFields.

.removed/.existing race cluster:
- awaitNamespaceTypeFinished returns {finished, cancelled}.
- 9 Sema .existing arms (zir*Decl, anon-struct-init, reify*) wrapped in gop:while(true) retry loops; cancelled re-runs get*Type.
- getOrPutKeyInner locked re-probe skips .removed (mirror lockless path).

Retry/requeue:
- codegen_func: reset tls_retry_loop before resolveTypesFully; on retry requeue the job instead of dropping the body.
- ensureMemoizedStateUpToDate: re-probe sentinel decl on .done.

Misc:
- deleteUnitReferences: capture parent + write self-loop marker before free-list append, all under inline_ref_mutex (fixes UAF on realloc).
- test_functions.contains: take test_functions_mutex.
- Lld coffLink/wasmLink: error on multi-shard build-obj instead of silently dropping shards 1..N.
- PartitionSet.emit: keep asm_path for shard 0.
- types_resolved: propagate OOM instead of swallowing as false.
- dumpLlvmShardStats: clamp n<=256; per-(file,shard) top-file key.

…len comptime guard

- build.zig + test/tests.zig: add llvm_codegen_threads option, set on all addModuleTests targets when LLVM backend is used.
- lib/std/mem.zig: gate strlen/wcslen extern fast-path on !@inComptime() (extern call at comptime is invalid; pre-existing fork bug).

…iases; IES yield

- llvm/Builder.zig + ir.zig: add COMDAT support (MODULE_CODE_COMDAT records, Variable.comdat field, addComdat). Required for COFF: without comdat any, linkonce_odr emits as a strong def per shard and lld-link rejects ~350 duplicate __anon_* symbols.
- codegen/llvm.zig resolveGlobalUav/updateExportedValue: setComdat(.any) on COFF for sharded linkonce_odr uavs.
- Zcu.navShard: switch from fqn-hash to file-hash via File.computeShard; add analUnitShard mapping comptime/nav/func units to their file's shard.
- codegen/llvm.zig genModuleLevelAssembly: route each global asm block to its source file's shard so .set aliases resolve against same-module defs.
- codegen/llvm.zig PartitionSet.updateExports: broadcast to all shards; Object.updateExports collapses non-owner extern globals onto one canonical decl so InstCombine cannot fold &a==&b to false pre-link.
- Zcu.isClaimedByOther + Sema.resolveInferredErrorSet: when the IES func is claimed by another thread, set tls_retry_loop and yield (cap 8) instead of parking in claimOrWait. Reuses existing requeue path.

@as

…types_wip; fork bugfixes

- Sema.analyzeNavRefInner: revert is_ref to .type-only resolve under parallel_sema (the .fully override created a structural self-dep on nav_val for 'const foo = .{ .self = &foo }'). The torn-read concern was unfounded: getNav returns by-value and isExternOrFn handles both status arms; the extern/fn branch already re-ensures .fully before dereferencing .fully_resolved.val.
- Type.hasRuntimeBitsInner/comptimeOnlyInner: gate the four .field_types_wip self-recursion shortcuts on !isClaimedByOther so a wip flag set by another worker falls through to claimOrWait instead of poisoning assumed_runtime_bits.
- main.zig: -fno-sanitize=address was hardcoded =true (pre-existing).
- lib/std/os/linux.zig: clock_getres/settime @intFromEnum on clockid_t (pre-existing; @as(isize, enum) is invalid).

@typeinfo

Under parallel sema, error-name InternPool indices (and thus the index-sorted @typeInfo order) depend on which thread interns first. The language does not specify error-set @typeInfo ordering; check membership instead.

@as

…fixes

ZIG_PARALLEL_SEMA on behavior.zig: ~18s serial → ~2.2s at j=16 (8.1x), 1.14x CPU overhead. 0/130 stress runs across j=8/16/32/64 + full exec. With -fllvm --llvm-codegen-threads=32: 9.9s → 2.15s.

parallel sema:
- Zcu: shard unit_claims into 256 {mutex,cond,map,deferred,waiters}; tryClaim/claimOrWait/releaseClaim/isClaimedByOther/deferOn lock only the unit's shard. claim_waits gets its own mutex; detectClaimCycle walks via tryLock peeks (skip on contended foreign shard).
- Zcu: tryClaim() non-blocking; ensureFuncBodyUpToDate top-level skips on busy instead of parking.
- Replace sema_lock under parallel non-incremental with fine-grained locks: embed_mutex, global_assembly_mutex, file_system_inputs_mutex, per-Namespace decls_mutex, comp.mutex for ensureFileAnalyzed. resolveStructInner/resolveUnionInner gated like the ensure* sites.
- Sema.resolveInferredErrorSet: drop the shared 8-retry yield cap; the nested ensureFuncBodyUpToDate blocks on the (now sharded) claim instead of re-running the caller body.
- awaitNamespaceTypeFinished: return .would_block instead of unbounded spin; callers yield-and-requeue. getNamespace/enumFieldIndex keep the spin variant per their finished-type contract.
- Compilation: work_queue_cond replaces the dispatch loop's Thread.yield() busy-spin; queueJob/workerAnalyzeFunc signal it.
- main: ReleaseSafe uses smp_allocator (debug_allocator's single mutex serialised every alloc and dominated wall time).
- ZIG_PSEMA_STATS counters.

races fixed:
- Type.comptimeOnlyInner .normal strat: .wip/.unknown observed under parallel sema → false (per documented contract) instead of unreachable.
- Zcu.maybeUnresolveIes: early-return under parallel non-incremental; the unlocked outdated.contains() raced scanDecl's writes.
- InternPool.getIfExists: skip .removed entries.

misc fixes from branch sweep:
- Package/Fetch: promoted lazy→eager dep is now appended to all_fetches (arena leak + double-fetch + dropped errors otherwise).
- std.fs.Dir.realpath windows ".": stack temp + NameTooLong, was slicing out_buffer to max_path_bytes unconditionally.
- link/Elf, link/MachO: use base.resolveZcuObjectPaths instead of open-coding the {stem}.{i}.o expansion.
- Compilation.dumpLlvmShardStats: use zcu.navShard (was hashing fqn, which doesn't match the file-path router).
- codegen/llvm: free the bin_filename_list gpa allocation.
- zig_llvm.cpp: delete dead getAsanOptions().
- target.zig: .@"async" → .async.
- libs/libcxx: @intFromBool instead of @as(u1, if ...).

Exposes the per-shard object paths when llvm_no_merge_shards is set, so a build.zig can install/consume them directly instead of waiting for the single-threaded relocatable -r merge into one object.

musl's malloc has a single global rwlock. With N parallel LLVM contexts (--llvm-codegen-threads=N) every operator new from the bitcode reader and pass pipeline serialises on it — 270M futex calls compiling bun at cg=64, ~120s wall in emit alone. Add -Dmimalloc-obj=PATH to build.zig and have the bun_build workflow compile oven-sh/mimalloc (bun-dev3-v2) static.c with MI_MALLOC_OVERRIDE for the target, then link the object into the final cross-compiled zig. mimalloc's per-thread heaps reduce the futex count ~630x; bun's zig step on Linux/64c goes ~132s → ~23s incremental.

…v_map+OptBisect races; hoist FuncInstance prep out of 4-shard lock claimOrWait now distinguishes same-thread reentry (.recursed -> dependency loop diagnostic) from cross-thread cycle (.cycle -> yield-and-requeue); the previous bare AnalysisFail silently markTransitiveFailed both units with no error and could surface as a processExportsInner unreachable. resolveNavType brackets type/linksection writes with bits.writing so getNav cannot tear (mirrors resolveNavValue). ensureExportFuncQueued locks the shard mutex around nav_map.contains. ensureNavValAnalysisQueued unlocks nav_queued_mutex before queueJob so it doesn't nest under work_queue_mutex. getFuncInstanceIes precomputes the instance nav's name/fqn/mods before lockShardsSorted so the 4-shard critical section holds only createNav+owner_nav write. OptBisect is per-thread.

…ap fixes

- Shard naming uses target.ofmt.fileExt() instead of hardcoded ".o" in Compilation.zig, link.zig, and Build/Step/Compile.zig so COFF targets get foo.{i}.obj. ELF/Mach-O behaviour unchanged (fileExt() returns ".o").
- Hash llvm_codegen_threads / llvm_no_merge_shards / no_link_obj into the cache key; these change the output file set (one merged object vs N shard objects), so a stale cache hit would otherwise produce the wrong layout.
- std.c: drop const from _msize to match the Windows SDK declaration so the C-backend bootstrap (zig2.c) compiles under clang-cl/msvc.

With these, build-obj --no-link --llvm-no-merge-shards --llvm-codegen-threads=N works for x86_64-windows-msvc; lld-link consumes the shards directly.

Parallel sema + sharded LLVM emit on Windows was wall-clock bound on two host primitives that the linux/macos paths don't hit:

- gpa = raw_c_allocator when link_libc, which on Windows is HeapAlloc(GetProcessHeap()) behind a single critical section. Switch release builds to smp_allocator (per-thread heaps backed by the page allocator) the same way the no-libc path already does. Debug keeps c_allocator so leak tooling stays accurate.
- std.Thread.Condition on Windows wrapped CONDITION_VARIABLE, whose Wake* has no userspace "no waiters" fast-path, so every work_queue_cond.signal() and claim-shard cond.signal() became a kernel32 call. Use FutexImpl everywhere; on Windows the Futex layer already maps to RtlWaitOnAddress (Win8+). The old WindowsImpl is left in place for reference.
- LLVM's own allocations go through C++ operator new, which still hits the CRT heap. Add tools/mimalloc_new_delete_override.cpp (mimalloc's unity static.c + the replaceable global operators) and a windows-gnu splice in the bootstrap workflow mirroring the existing linux-musl step. malloc/free themselves stay on the CRT (they can't be statically interposed on Windows), but LLVM's hot path is operator new, which is replaceable per the standard.

bun debug zig step on a 24-core x86_64-windows-msvc host:

	serial	psema	psema+24sh
before	208.2s	165.0s	77.0s
after	165.1s	151.7s	35.1s

(5.9x vs original serial)

Per-file partitioning meant a small source file that hosts thousands of generic instantiations (e.g. output.zig's printf-style formatters) landed entirely in one LLVM module, pinning emit wall-clock to that one shard. For bun on a 24-core host, shard 13 took 22.5s while the rest finished in 3-14s.

Hashing the fully qualified name (FQN) as well spreads instantiations across shards. Max shard drops from 22.5s to ~14s; cross-shard externs grow (CPU sum +30%) but wall-clock falls.

bun debug zig step:

  file-only: ~35s wall, 207s cpu-sum, max-shard 22.5s
  file+fqn:  ~27s wall, 274s cpu-sum, max-shard 13.9s

Determinism: anonymous-type FQNs embed InternPool indices, which are insertion-order dependent under parallel sema, so the shard set can vary between runs. This is not a regression: `shardedNavName` already embeds the same indices in cross-shard symbol names, so sharded build-obj output was never bit-reproducible under ZIG_PARALLEL_SEMA. cg=1 builds (CI releases) are unaffected. A structural-hash naming fix is tracked separately.
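The difference between the two partitioning schemes can be shown with a toy shard picker. This is a sketch under stated assumptions: FNV-1a stands in for the compiler's actual hash, `shardByFile` and `shardByFileAndFqn` are hypothetical names, and the FQN strings in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 64-bit FNV-1a, optionally continuing from a previous hash state.
inline std::uint64_t fnv1a(const std::string& bytes,
                           std::uint64_t h = 14695981039346656037ull) {
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// File-only partitioning: every declaration in a file maps to the same shard.
inline std::size_t shardByFile(const std::string& file, std::size_t shard_count) {
    assert(shard_count > 0);
    return static_cast<std::size_t>(fnv1a(file) % shard_count);
}

// File + FQN partitioning: the FQN is folded into the file hash, so
// instantiations that share a file can land in different shards.
inline std::size_t shardByFileAndFqn(const std::string& file,
                                     const std::string& fqn,
                                     std::size_t shard_count) {
    assert(shard_count > 0);
    return static_cast<std::size_t>(fnv1a(fqn, fnv1a(file)) % shard_count);
}
```

Under `shardByFile`, every formatter instantiation in `output.zig` gets the same index. Under `shardByFileAndFqn`, two hypothetical names such as `output.print__anon_0` and `output.print__anon_1` map to different shards out of 24. The trade-off is the one reported above: more cross-shard externs, but a shorter longest shard.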

`zig build` compiles build_runner.zig + the user's build.zig before any step runs. That compile pulled in ~10k navs of std.Build and emitted them through a single LLVM module, costing ~3.7s of cold-cache wall before the first user step starts. Pass llvm_codegen_threads (the same n_jobs the thread pool was sized to) so the runner emit shards like any other compile.

bun debug zig step, 24-core Windows, cold local cache:

  build-runner compile 3.84s -> 1.28s (emit 2.96s -> 0.37s)
  total 27s -> 24s

The FAST fork's Windows allocator/condvar fixes (oven-sh/zig#20) haven't landed yet, and historically the picker always returned STABLE for hostOs===windows. Windows local dev and Windows CI (PR or main) now stay on STABLE; only non-Windows local + PR CI use FAST.

coderabbitai Bot reviewed Apr 22, 2026


dylan-conway force-pushed the upgrade-0.15.2 branch from 04e7f6a to 356ddda April 23, 2026 23:45

dylan-conway changed the base branch from upgrade-0.15.2 to upgrade-0.15.2-fast April 23, 2026 23:46

dylan-conway force-pushed the claude/windows-parallel-shards branch from 2bb76ed to 2afac24 April 23, 2026 23:48

alii and others added 15 commits April 23, 2026 16:50

test/behavior/cast: check error-set membership not @typeInfo order

3298e3f

Under parallel sema, error-name InternPool indices (and thus the index-sorted @typeInfo order) depend on which thread interns first. The language does not specify error-set @typeInfo ordering, so the test checks membership instead.

std.Build.Step.Compile: add getEmittedBinShards()

bf19f0a

Exposes the per-shard object paths when llvm_no_merge_shards is set, so a build.zig can install/consume them directly instead of waiting for the single-threaded relocatable -r merge into one object.

ci: gate mimalloc splice to linux-musl targets only

4f58f6e

dylan-conway force-pushed the upgrade-0.15.2-fast branch from b2ed937 to af6e006 April 23, 2026 23:52

dylan-conway force-pushed the claude/windows-parallel-shards branch from 2afac24 to 125866c April 23, 2026 23:52

dylan-conway mentioned this pull request Apr 24, 2026

build: split zig compiler pin into STABLE/FAST oven-sh/bun#29661

Open

github-actions Bot mentioned this pull request Apr 26, 2026

build: bump zig (LLVM IR attribute improvements) oven-sh/bun#29753

Open

4 tasks

Jarred-Sumner force-pushed the upgrade-0.15.2-fast branch from af6e006 to 597cbfe April 26, 2026 13:14

dylan-conway wants to merge 15 commits into upgrade-0.15.2-fast from claude/windows-parallel-shards


2 participants
