psema/shard: enable Windows COFF sharded build-obj + cut allocator/condvar contention #20

Open
dylan-conway wants to merge 15 commits into upgrade-0.15.2-fast from claude/windows-parallel-shards

Conversation

@dylan-conway

Summary

Enables build-obj --no-link --llvm-no-merge-shards --llvm-codegen-threads=N for x86_64-windows-msvc targets, and removes the host-side bottlenecks that kept the Windows compiler from scaling past ~2.7× under parallel sema + sharded emit.

bun debug zig step, 24-core x86_64 Windows host:

| | serial | psema | psema + 24 shards |
|---|---|---|---|
| before | 208.2s | 165.0s | 77.0s |
| after | 165.1s | 151.7s | 35.1s |

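With psema + 24 shards, the build is now 5.9× faster than the original serial baseline (208.2s → 35.1s); before this PR the same configuration reached 2.7× (208.2s → 77.0s).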

Changes

COFF shard emission (4073e7160c)

  • Shard naming uses target.ofmt.fileExt() instead of hardcoded .o in Compilation.zig, link.zig, and Build/Step/Compile.zig, so COFF targets get foo.{i}.obj. ELF/Mach-O behaviour unchanged.
  • Hash llvm_codegen_threads / llvm_no_merge_shards / no_link_obj into the cache key — these change the output file set, so a stale cache hit would otherwise produce the wrong layout.
  • std.c._msize: drop const to match the Windows SDK declaration so the C-backend bootstrap (zig2.c) compiles under clang-cl/msvc.
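The shard naming rule in the first bullet can be illustrated with a small sketch. It is written in C++ rather than Zig and is not the compiler's code: `shardObjectPaths` is a hypothetical helper, and the extension is passed in as a plain string where the real change queries `target.ofmt.fileExt()`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the compiler's API): expand one object path into
// per-shard paths of the form "{stem}.{i}{ext}".
std::vector<std::string> shardObjectPaths(const std::string& obj_path,
                                          const std::string& obj_ext,
                                          std::size_t shard_count) {
    // Strip the target's object extension (".o" or ".obj") to get the stem.
    std::string stem = obj_path;
    if (stem.size() >= obj_ext.size() &&
        stem.compare(stem.size() - obj_ext.size(), obj_ext.size(), obj_ext) == 0) {
        stem.erase(stem.size() - obj_ext.size());
    }

    std::vector<std::string> paths;
    paths.reserve(shard_count);
    for (std::size_t i = 0; i < shard_count; ++i) {
        // The shard index goes between the stem and the extension.
        paths.push_back(stem + "." + std::to_string(i) + obj_ext);
    }
    return paths;
}
```

With `".obj"` this yields `foo.0.obj`, `foo.1.obj`, and so on; with `".o"` the names match the previous hardcoded form.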

Host contention (89647b5ee9)

  • src/main.zig: when link_libc and not Debug, use smp_allocator instead of raw_c_allocator. The Windows CRT routes malloc → HeapAlloc(GetProcessHeap()) behind a single critical section, so 24 sema workers serialise on it.
  • std.Thread.Condition: use FutexImpl on Windows. The CONDITION_VARIABLE wrapper had no userspace "no waiters" fast-path, so every work_queue_cond.signal() and claim-shard cond.signal() was a kernel32 call. FutexImpl.wake() checks wakeable == 0 first; the underlying Futex already maps to RtlWaitOnAddress on Win8+.
  • tools/mimalloc_new_delete_override.cpp + windows-gnu CI splice: LLVM's operator new still hit the CRT heap. The override TU compiles mimalloc's unity static.c and provides the C++ replaceable global operators; mirrors the existing linux-musl MI_MALLOC_OVERRIDE step (which uses POSIX symbol interposition for malloc/free — that doesn't statically link on Windows, but operator new replacement does and is where LLVM's hot allocations go).
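The `operator new` replacement described in the last bullet can be sketched roughly as follows. This is not the contents of `tools/mimalloc_new_delete_override.cpp`: `backend_alloc` and `backend_free` are hypothetical stand-ins for mimalloc's `mi_malloc`/`mi_free` so the sketch builds without mimalloc, the counter exists only to make the redirection observable, and the aligned and nothrow overloads are omitted.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical backend: a real override TU would call into mimalloc here.
static std::atomic<std::size_t> g_backend_allocs{0};

static void* backend_alloc(std::size_t n) {
    g_backend_allocs.fetch_add(1, std::memory_order_relaxed);
    return std::malloc(n ? n : 1);  // never request zero bytes
}

static void backend_free(void* p) { std::free(p); }

// Replaceable global allocation functions. Defining them in one translation
// unit of the program routes every new-expression through the backend; the
// default array forms forward to these scalar forms.
void* operator new(std::size_t n) {
    if (void* p = backend_alloc(n)) return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { backend_free(p); }

// Sized delete (C++14) must agree with the unsized form.
void operator delete(void* p, std::size_t) noexcept { backend_free(p); }
```

Because replacement happens at link time within the program, this works in a static link where interposing `malloc` itself does not, which is the distinction the bullet draws.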

Verified

  • bun run build (Debug) and bun run build:release both link 24 COFF shards via lld-link and pass smoke test on x86_64-windows-msvc.
  • zig test lib/std/std.zig passes with the Condition change.

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Walkthrough

Build workflow adds Windows-GNU mimalloc splice; object-file handling generalized to target-specific extensions (e.g. .obj); LLVM shard naming, shard-flush filenames, and compilation cache keys updated; Condition defaults to FutexImpl for non-single-threaded targets; startup allocator selection changed by builtin.mode; _msize C import signature fixed; navShard now hashes file shard key plus nav FQN.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Windows-GNU mimalloc integration: .github/workflows/bun_build.yaml, tools/mimalloc_new_delete_override.cpp | Added a Windows-GNU conditional step to splice a mimalloc object into the bootstrap build and adjusted the existing linux-musl step label/comment. Added tools/mimalloc_new_delete_override.cpp providing global operator new/operator delete forwarding to mimalloc. |
| Object extension, shard naming, and linking: lib/std/Build/Step/Compile.zig, src/Compilation.zig, src/link.zig | Removed hardcoded .o assumptions: derive object extension via ofmt.fileExt(...), strip that extension when computing stems, and generate shard filenames as "{stem}.{i}{ext}" (handles COFF .obj). Updated docs and filename-trimming logic accordingly. |
| Compilation cache & shard behavior: src/Compilation.zig | Added options.llvm_codegen_threads, options.llvm_no_merge_shards, and options.no_link_obj to the compilation cache key; wired LLVM codegen thread count from resolved thread-pool size; updated shard flush logic to use the computed obj_ext. |
| Thread/Condition implementation change: lib/std/Thread/Condition.zig | Removed the .windows WindowsImpl branch so FutexImpl is chosen for all non-builtin.single_threaded platforms; added comment explaining use on Windows. |
| Allocator selection change & build-runner wiring: src/main.zig | When builtin.link_libc is enabled, allocator selection now conditions on builtin.mode == .Debug (Debug preserves alignment-sensitive previous behavior; non-Debug uses std.heap.smp_allocator). Also sets llvm_codegen_threads for the build-runner from the resolved thread-pool size. |
| C binding fix: lib/std/c.zig | Changed _msize import parameter from ?*const anyopaque to ?*anyopaque to match the C signature. |
| Nav shard determinism: src/Zcu.zig | navShard now computes shards by hashing the file's shardKey and the Nav FQN together (instead of delegating to File.computeShard(n)); updated doc comment on determinism and anonymous FQNs. |
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main objectives of the PR: enabling Windows COFF sharded build-obj and reducing allocator/condvar contention. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the changes, rationale, and measured improvements. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/std/Thread/Condition.zig`:
- Around line 116-119: The conditional alias Impl now only chooses
SingleThreadedImpl or FutexImpl, leaving WindowsImpl unreachable; remove
WindowsImpl or gate it behind an explicit fallback/feature flag. Update the
alias selection logic referencing Impl and the concrete types
SingleThreadedImpl, FutexImpl, and WindowsImpl: either delete the WindowsImpl
definition and any uses, or add a clear conditional branch (e.g., if
(builtin.windows or feature flag) then WindowsImpl else ...) so WindowsImpl is
intentionally reachable and documented. Ensure any tests or references to
WindowsImpl are updated or removed accordingly.

In `@src/Compilation.zig`:
- Around line 2191-2196: The cache key currently uses raw CLI flags
(options.llvm_no_merge_shards and options.no_link_obj) which are normalized
later, causing unnecessary cache misses; update the cache hashing to use the
canonicalized booleans (e.g., comp.no_merge_shards and comp.no_link_obj or
whatever normalized fields exist) together with options.llvm_codegen_threads so
shard/merge/link semantics are represented by the effective config rather than
the raw flags — locate the cache.hash.add calls around
options.llvm_codegen_threads and replace the additions of the raw option flags
with additions of the normalized comp.no_merge_shards and comp.no_link_obj
values.

In `@src/main.zig`:
- Around line 182-188: The comment above the Debug-path is incorrect: the code
checks builtin.mode == .Debug and prefers std.heap.raw_c_allocator, only falling
back to std.heap.c_allocator for over-aligned requests (when
`@alignOf`(std.c.max_align_t) < `@max`(`@alignOf`(i128), std.atomic.cache_line)); also
note the -Ddebug-gpa case is handled earlier. Update the comment to state that
in Debug mode the allocator chosen is raw_c_allocator by default with a fallback
to c_allocator for over-aligned allocations, and remove the misleading claim
about “keeping c_allocator” and the -Ddebug-gpa handling.

📥 Commits

Reviewing files that changed from the base of the PR and between 0bcf4c3 and 89647b5.

📒 Files selected for processing (8)
  • .github/workflows/bun_build.yaml
  • lib/std/Build/Step/Compile.zig
  • lib/std/Thread/Condition.zig
  • lib/std/c.zig
  • src/Compilation.zig
  • src/link.zig
  • src/main.zig
  • tools/mimalloc_new_delete_override.cpp

Comment thread lib/std/Thread/Condition.zig
Comment on lines 116 to 119
 const Impl = if (builtin.single_threaded)
     SingleThreadedImpl
-else if (builtin.os.tag == .windows)
-    WindowsImpl
 else
     FutexImpl;

🧹 Nitpick | 🔵 Trivial

WindowsImpl looks unreachable after this selection change.

Since Impl no longer selects WindowsImpl, consider removing it (or clearly parking it behind an explicit fallback gate) to reduce maintenance and avoid bit-rot.

Comment thread src/Compilation.zig
Comment on lines +2191 to +2196
// Sharded codegen changes the output file *set* (one merged object vs.
// N shard objects), so the count and the merge/no-link knobs must be
// part of the cache key.
cache.hash.add(options.llvm_codegen_threads);
cache.hash.add(options.llvm_no_merge_shards);
cache.hash.add(options.no_link_obj);

🧹 Nitpick | 🔵 Trivial

Hash the normalized shard flags, not the raw CLI values.

comp.no_link_obj and comp.no_merge_shards are normalized later, so these raw option values can split the cache for configurations that produce the same artifact layout. For example, --no-link-obj --llvm-codegen-threads>1 without --llvm-no-merge-shards still behaves like linking. Hash the derived booleans instead to avoid unnecessary misses.

♻️ Proposed refactor
+        const no_merge_shards = options.llvm_no_merge_shards and options.llvm_codegen_threads > 1;
+        const no_link_obj = options.no_link_obj and
+            (options.llvm_codegen_threads <= 1 or no_merge_shards);
+
         // Sharded codegen changes the output file *set* (one merged object vs.
         // N shard objects), so the count and the merge/no-link knobs must be
         // part of the cache key.
         cache.hash.add(options.llvm_codegen_threads);
-        cache.hash.add(options.llvm_no_merge_shards);
-        cache.hash.add(options.no_link_obj);
+        cache.hash.add(no_merge_shards);
+        cache.hash.add(no_link_obj);
-            .no_link_obj = options.no_link_obj and
-                (options.llvm_codegen_threads <= 1 or options.llvm_no_merge_shards),
-            .no_merge_shards = options.llvm_no_merge_shards and options.llvm_codegen_threads > 1,
+            .no_link_obj = no_link_obj,
+            .no_merge_shards = no_merge_shards,

Suggested change
-// Sharded codegen changes the output file *set* (one merged object vs.
-// N shard objects), so the count and the merge/no-link knobs must be
-// part of the cache key.
-cache.hash.add(options.llvm_codegen_threads);
-cache.hash.add(options.llvm_no_merge_shards);
-cache.hash.add(options.no_link_obj);
+const no_merge_shards = options.llvm_no_merge_shards and options.llvm_codegen_threads > 1;
+const no_link_obj = options.no_link_obj and
+    (options.llvm_codegen_threads <= 1 or no_merge_shards);
+// Sharded codegen changes the output file *set* (one merged object vs.
+// N shard objects), so the count and the merge/no-link knobs must be
+// part of the cache key.
+cache.hash.add(options.llvm_codegen_threads);
+cache.hash.add(no_merge_shards);
+cache.hash.add(no_link_obj);
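The effect of hashing the derived flags rather than the raw ones can be shown with a small model. The sketch below is C++, not the compiler's Zig; `ShardOptions`, `EffectiveShardConfig`, and `normalize` are hypothetical names, and only the two boolean expressions are taken from the proposed refactor above.

```cpp
#include <cstdint>

// Hypothetical mirror of the raw CLI options.
struct ShardOptions {
    std::uint32_t llvm_codegen_threads;
    bool llvm_no_merge_shards;
    bool no_link_obj;
};

// What would be fed to the cache hash under the proposed refactor.
struct EffectiveShardConfig {
    std::uint32_t codegen_threads;
    bool no_merge_shards;
    bool no_link_obj;

    bool operator==(const EffectiveShardConfig& o) const {
        return codegen_threads == o.codegen_threads &&
               no_merge_shards == o.no_merge_shards &&
               no_link_obj == o.no_link_obj;
    }
};

EffectiveShardConfig normalize(const ShardOptions& o) {
    // Shards can only stay unmerged if there is more than one of them.
    const bool no_merge_shards =
        o.llvm_no_merge_shards && o.llvm_codegen_threads > 1;
    // Skipping the link is honoured only when there is a single object or
    // the shards are deliberately left unmerged.
    const bool no_link_obj =
        o.no_link_obj && (o.llvm_codegen_threads <= 1 || no_merge_shards);
    return EffectiveShardConfig{o.llvm_codegen_threads, no_merge_shards, no_link_obj};
}
```

Two raw configurations that differ only in a flag that normalization discards then compare equal, so they would share one cache entry instead of two.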

Comment thread src/main.zig
Comment on lines +182 to +188
// with libc malloc used elsewhere (LLVM, C++). Debug keeps
// c_allocator so leak tooling and -Ddebug-gpa stay accurate.
if (builtin.mode == .Debug) {
    if (@alignOf(std.c.max_align_t) < @max(@alignOf(i128), std.atomic.cache_line)) {
        break :gpa .{ std.heap.c_allocator, false };
    }
    break :gpa .{ std.heap.raw_c_allocator, false };

⚠️ Potential issue | 🟡 Minor

Fix the debug-path comment.

Lines 183-188 do not “keep c_allocator” in Debug; they still prefer std.heap.raw_c_allocator and only fall back to std.heap.c_allocator for over-aligned requests. The -Ddebug-gpa case is also handled earlier at Lines 173-174, so this comment currently describes a different control path than the code below it.

Suggested comment update
-            // with libc malloc used elsewhere (LLVM, C++). Debug keeps
-            // c_allocator so leak tooling and -Ddebug-gpa stay accurate.
+            // with libc malloc used elsewhere (LLVM, C++). Debug stays on
+            // the libc-backed allocators here: prefer raw_c_allocator for the
+            // fast path, but fall back to c_allocator when over-aligned
+            // allocations would violate max_align_t. The -Ddebug-gpa path is
+            // handled above.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Zcu.zig`:
- Around line 4815-4836: The new navShard implementation breaks the “same-file
=> same shard” guarantee needed for module-level asm (global_assembly)
referenced by nav_val/nav_ty/func keys; update the sharding logic so asm-owning
units are assigned by file-only affinity: in either navShard or in
analUnitShard, detect when the nav/anal unit contains a global_assembly (or will
be used for global_assembly emission from nav_val/nav_ty/func) and compute the
shard using only the file_key (i.e., hash(file_key) % n) so all asm from the
same file land in the same shard, while leaving the existing combined
file_key+fqn hashing for non-asm navs. Ensure you reference navShard,
analUnitShard, and the global_assembly/nav_val/nav_ty/func paths when making the
change.

📥 Commits

Reviewing files that changed from the base of the PR and between 89647b5 and 3630eb8.

📒 Files selected for processing (1)
  • src/Zcu.zig

Comment thread src/Zcu.zig
Comment on lines +4815 to +4836
/// Shard assignment for `nav`. Keyed on the file's `shardKey` *plus* the
/// nav's fully-qualified name so a single file with thousands of generic
/// instantiations (e.g. printf-style formatters that monomorphise per call
/// site) doesn't pin the entire emit wall-clock to one LLVM module.
///
/// Determinism: the shard key is content-derived (path + FQN bytes), but
/// FQNs of anonymous types embed InternPool indices (`__anon_N`) which are
/// not stable across parallel-sema runs. That's no regression — the
/// per-shard *symbol names* already carry those indices via `shardedNavName`
/// and the type-name suffix, so sharded `build-obj` output was never
/// bit-reproducible under `ZIG_PARALLEL_SEMA`. CI release builds use
/// `--llvm-codegen-threads=1` (no sharding) and remain reproducible. A
/// proper fix needs structural type-hash naming; tracked separately.
 pub fn navShard(zcu: *Zcu, nav: InternPool.Nav.Index, n: u32) u32 {
     if (n <= 1) return 0;
-    return zcu.navFileScope(nav).computeShard(n);
+    const ip = &zcu.intern_pool;
+    var buf: [512]u8 = undefined;
+    const file_key = zcu.navFileScope(nav).shardKey(&buf);
+    var h: std.hash.Wyhash = .init(0);
+    h.update(file_key);
+    h.update(ip.getNav(nav).fqn.toSlice(ip));
+    return @intCast(h.final() % n);

⚠️ Potential issue | 🟠 Major

Keep same-file shard affinity for module-level asm users.

This change drops the old “same file => same shard” invariant, but analUnitShard still sends nav_val/nav_ty/func-keyed global_assembly through navShard. That clashes with the contract called out on Lines 4839-4842: asm strings that reference other same-file navs (for example .set aliases) no longer reliably land in the same LLVM module, so the integrated assembler can miss symbols or force duplicate emission. Please preserve file-level affinity for asm-owning units, or add a separate shard-affinity path for global_assembly.
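One way to picture the requested behaviour is sketched below, in C++ rather than Zig. It is an illustration under assumptions, not the fix that was applied: `shardFor` and the `owns_global_asm` flag are hypothetical, and FNV-1a stands in for the Wyhash used in the diff, since only determinism matters for the illustration.

```cpp
#include <cstdint>
#include <string_view>

// FNV-1a, used here purely as a deterministic stand-in hash.
inline std::uint64_t fnv1a(std::string_view bytes,
                           std::uint64_t h = 14695981039346656037ull) {
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// Hypothetical shard router: units that own module-level asm are keyed on
// the file alone, everything else on file key plus fully-qualified name.
std::uint32_t shardFor(std::string_view file_key, std::string_view fqn,
                       bool owns_global_asm, std::uint32_t n) {
    if (n <= 1) return 0;
    std::uint64_t h = fnv1a(file_key);
    if (!owns_global_asm) {
        h = fnv1a(fqn, h);  // continue hashing with the FQN bytes
    }
    return static_cast<std::uint32_t>(h % n);
}
```

Under this scheme all asm-owning units of one file land in the same shard regardless of their names, while ordinary navs from a large file still spread across shards.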

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
src/main.zig (1)

176-188: ⚠️ Potential issue | 🟡 Minor

Update the Debug-path comment.

The comment still says Debug “keeps c_allocator”, but the fast path at Line 188 returns std.heap.raw_c_allocator; std.heap.c_allocator is only used for the over-aligned fallback at Lines 185-187, and -Ddebug-gpa is handled earlier at Line 173.

Suggested comment update
-            // with libc malloc used elsewhere (LLVM, C++). Debug keeps
-            // c_allocator so leak tooling and -Ddebug-gpa stay accurate.
+            // with libc malloc used elsewhere (LLVM, C++). Debug stays on
+            // libc-backed allocators here: prefer raw_c_allocator on the
+            // fast path, but fall back to c_allocator for over-aligned
+            // allocations. The -Ddebug-gpa path is handled above.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.zig` around lines 176 - 188, The comment above the GPA selection is
out of date: the Debug branch (builtin.mode == .Debug) usually returns
std.heap.raw_c_allocator with std.heap.c_allocator used only for the
over-aligned fallback (when `@alignOf`(std.c.max_align_t) < `@max`(`@alignOf`(i128),
std.atomic.cache_line)), and the -Ddebug-gpa behavior is handled earlier; update
the comment to state that Debug normally uses raw_c_allocator, that c_allocator
is only for the over-aligned case, and that debug-gpa is already addressed prior
to this gpa block (referencing the gpa selection and the symbols builtin.mode,
std.heap.raw_c_allocator, std.heap.c_allocator, and the -Ddebug-gpa handling).

📥 Commits

Reviewing files that changed from the base of the PR and between 3630eb8 and 2bb76ed.

📒 Files selected for processing (1)
  • src/main.zig

@dylan-conway dylan-conway changed the base branch from upgrade-0.15.2 to upgrade-0.15.2-fast April 23, 2026 23:46
@dylan-conway dylan-conway force-pushed the claude/windows-parallel-shards branch from 2bb76ed to 2afac24 Compare April 23, 2026 23:48
alii and others added 15 commits April 23, 2026 16:50
- ZIG_PARALLEL_SEMA: Sema runs concurrently across worker threads with
  per-unit claim/wait, retry-on-dependency-cycle, and per-map mutexes
  replacing the global sema_lock for the non-incremental fast path.
- InternPool: thread-safe writers (locked single-field setters, seqlock
  on getNav, sorted-shard prelocking for getFunc*Ies, 256 hash shards).
- llvm backend: PartitionSet emits N independent llvm modules in parallel;
  cross-shard refs are linkonce_odr; --llvm-codegen-threads=N partitions
  by file path; --llvm-no-merge-shards leaves shard .o files unmerged.
- link.MachO -r: handle N shard inputs; emit hidden defs as private-extern;
  convert tentatives so Apple ld_new accepts the merged object.
- link.Elf: handle N shard inputs; batch preads in writeRelocatable to
  avoid per-atom syscall storm under heavy COMDAT section counts.
- link.Lld: pass all shard paths to lld for elf/coff/wasm.
- std.Build.Step.Compile: llvm_codegen_threads, llvm_no_merge_shards.
Memory model (ARM64):
- getNav seqlock: payload loads .unordered -> .acquire so b2 cannot
  reorder before them (LDAR-before-LDAR is ordered; @fence is gone).
- setFieldTypesAlignsAll: memcpy [0..len-1) then release-store [len-1]
  inside the mutex; remove the post-mutex re-store in structFields.

.removed/.existing race cluster:
- awaitNamespaceTypeFinished returns {finished, cancelled}.
- 9 Sema .existing arms (zir*Decl, anon-struct-init, reify*) wrapped in
  gop:while(true) retry loops; cancelled re-runs get*Type.
- getOrPutKeyInner locked re-probe skips .removed (mirror lockless path).

Retry/requeue:
- codegen_func: reset tls_retry_loop before resolveTypesFully; on retry
  requeue the job instead of dropping the body.
- ensureMemoizedStateUpToDate: re-probe sentinel decl on .done.

Misc:
- deleteUnitReferences: capture parent + write self-loop marker before
  free-list append, all under inline_ref_mutex (fixes UAF on realloc).
- test_functions.contains: take test_functions_mutex.
- Lld coffLink/wasmLink: error on multi-shard build-obj instead of
  silently dropping shards 1..N.
- PartitionSet.emit: keep asm_path for shard 0.
- types_resolved: propagate OOM instead of swallowing as false.
- dumpLlvmShardStats: clamp n<=256; per-(file,shard) top-file key.
…len comptime guard

- build.zig + test/tests.zig: add llvm_codegen_threads option, set on all
  addModuleTests targets when LLVM backend is used.
- lib/std/mem.zig: gate strlen/wcslen extern fast-path on !@inComptime()
  (extern call at comptime is invalid; pre-existing fork bug).
…iases; IES yield

- llvm/Builder.zig + ir.zig: add COMDAT support (MODULE_CODE_COMDAT
  records, Variable.comdat field, addComdat). Required for COFF — without
  comdat any, linkonce_odr emits as a strong def per shard and lld-link
  rejects ~350 duplicate __anon_* symbols.
- codegen/llvm.zig resolveGlobalUav/updateExportedValue: setComdat(.any)
  on COFF for sharded linkonce_odr uavs.
- Zcu.navShard: switch from fqn-hash to file-hash via File.computeShard;
  add analUnitShard mapping comptime/nav/func units to their file's shard.
- codegen/llvm.zig genModuleLevelAssembly: route each global asm block to
  its source file's shard so .set aliases resolve against same-module defs.
- codegen/llvm.zig PartitionSet.updateExports: broadcast to all shards;
  Object.updateExports collapses non-owner extern globals onto one
  canonical decl so InstCombine cannot fold &a==&b to false pre-link.
- Zcu.isClaimedByOther + Sema.resolveInferredErrorSet: when the IES func
  is claimed by another thread, set tls_retry_loop and yield (cap 8)
  instead of parking in claimOrWait. Reuses existing requeue path.
…types_wip; fork bugfixes

- Sema.analyzeNavRefInner: revert is_ref to .type-only resolve under
  parallel_sema (the .fully override created a structural self-dep on
  nav_val for 'const foo = .{ .self = &foo }'). The torn-read concern
  was unfounded — getNav returns by-value and isExternOrFn handles both
  status arms; the extern/fn branch already re-ensures .fully before
  dereferencing .fully_resolved.val.
- Type.hasRuntimeBitsInner/comptimeOnlyInner: gate the four
  .field_types_wip self-recursion shortcuts on !isClaimedByOther so a
  wip flag set by another worker falls through to claimOrWait instead of
  poisoning assumed_runtime_bits.
- main.zig: -fno-sanitize=address was hardcoded =true (pre-existing).
- lib/std/os/linux.zig: clock_getres/settime @intFromEnum on clockid_t
  (pre-existing; @as(isize, enum) is invalid).
Under parallel sema, error-name InternPool indices (and thus the
index-sorted @typeInfo order) depend on which thread interns first.
The language does not specify error-set @typeInfo ordering; check
membership instead.
…fixes

ZIG_PARALLEL_SEMA on behavior.zig: ~18s serial → ~2.2s at j=16 (8.1x),
1.14x CPU overhead. 0 failures in 130 stress runs across j=8/16/32/64 + full exec.
With -fllvm --llvm-codegen-threads=32: 9.9s → 2.15s.

parallel sema:
- Zcu: shard unit_claims into 256 {mutex,cond,map,deferred,waiters};
  tryClaim/claimOrWait/releaseClaim/isClaimedByOther/deferOn lock only
  the unit's shard. claim_waits gets its own mutex; detectClaimCycle
  walks via tryLock peeks (skip on contended foreign shard).
- Zcu: tryClaim() non-blocking; ensureFuncBodyUpToDate top-level skips
  on busy instead of parking.
- Replace sema_lock under parallel non-incremental with fine-grained
  locks: embed_mutex, global_assembly_mutex, file_system_inputs_mutex,
  per-Namespace decls_mutex, comp.mutex for ensureFileAnalyzed.
  resolveStructInner/resolveUnionInner gated like the ensure* sites.
- Sema.resolveInferredErrorSet: drop the shared 8-retry yield cap; the
  nested ensureFuncBodyUpToDate blocks on the (now sharded) claim
  instead of re-running the caller body.
- awaitNamespaceTypeFinished: return .would_block instead of unbounded
  spin; callers yield-and-requeue. getNamespace/enumFieldIndex keep the
  spin variant per their finished-type contract.
- Compilation: work_queue_cond replaces the dispatch loop's
  Thread.yield() busy-spin; queueJob/workerAnalyzeFunc signal it.
- main: ReleaseSafe uses smp_allocator (debug_allocator's single mutex
  serialised every alloc and dominated wall time).
- ZIG_PSEMA_STATS counters.
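The claim sharding in the first bullet of this list is a form of lock striping. A much-reduced sketch follows, in C++ and with hypothetical names throughout (`ShardedClaims`, integer unit and worker ids). It keeps only a mutex and an owner map per shard and omits the condition variable, deferred list, waiter tracking, and cycle detection the commit message mentions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>

// Hypothetical lock-striped claim table: each operation locks only the
// shard that the unit hashes to, so unrelated units do not contend.
class ShardedClaims {
public:
    static constexpr std::size_t kShards = 256;

    // Non-blocking: true if `worker` became the owner of `unit`.
    bool tryClaim(std::uint64_t unit, std::uint32_t worker) {
        Shard& s = shardOf(unit);
        std::lock_guard<std::mutex> lock(s.mutex);
        return s.owner.emplace(unit, worker).second;
    }

    bool isClaimedByOther(std::uint64_t unit, std::uint32_t worker) {
        Shard& s = shardOf(unit);
        std::lock_guard<std::mutex> lock(s.mutex);
        auto it = s.owner.find(unit);
        return it != s.owner.end() && it->second != worker;
    }

    // Only the current owner can release a claim.
    void releaseClaim(std::uint64_t unit, std::uint32_t worker) {
        Shard& s = shardOf(unit);
        std::lock_guard<std::mutex> lock(s.mutex);
        auto it = s.owner.find(unit);
        if (it != s.owner.end() && it->second == worker) s.owner.erase(it);
    }

private:
    struct Shard {
        std::mutex mutex;
        std::unordered_map<std::uint64_t, std::uint32_t> owner;
    };

    Shard& shardOf(std::uint64_t unit) {
        return shards_[std::hash<std::uint64_t>{}(unit) % kShards];
    }

    std::array<Shard, kShards> shards_;
};
```

The trade-off is that any operation spanning several units has to take several shard locks in a consistent order, which is the kind of sorted prelocking the earlier InternPool commit describes.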

races fixed:
- Type.comptimeOnlyInner .normal strat: .wip/.unknown observed under
  parallel sema → false (per documented contract) instead of unreachable.
- Zcu.maybeUnresolveIes: early-return under parallel non-incremental;
  the unlocked outdated.contains() raced scanDecl's writes.
- InternPool.getIfExists: skip .removed entries.

misc fixes from branch sweep:
- Package/Fetch: promoted lazy→eager dep is now appended to all_fetches
  (arena leak + double-fetch + dropped errors otherwise).
- std.fs.Dir.realpath windows ".": stack temp + NameTooLong, was
  slicing out_buffer to max_path_bytes unconditionally.
- link/Elf, link/MachO: use base.resolveZcuObjectPaths instead of
  open-coding the {stem}.{i}.o expansion.
- Compilation.dumpLlvmShardStats: use zcu.navShard (was hashing fqn,
  which doesn't match the file-path router).
- codegen/llvm: free the bin_filename_list gpa allocation.
- zig_llvm.cpp: delete dead getAsanOptions().
- target.zig: .@"async" → .async.
- libs/libcxx: @intFromBool instead of @as(u1, if ...).
Exposes the per-shard object paths when llvm_no_merge_shards is set,
so a build.zig can install/consume them directly instead of waiting
for the single-threaded relocatable -r merge into one object.
musl's malloc has a single global rwlock. With N parallel LLVM contexts
(--llvm-codegen-threads=N) every operator new from the bitcode reader
and pass pipeline serialises on it — 270M futex calls compiling bun
at cg=64, ~120s wall in emit alone.

Add -Dmimalloc-obj=PATH to build.zig and have the bun_build workflow
compile oven-sh/mimalloc (bun-dev3-v2) static.c with MI_MALLOC_OVERRIDE
for the target, then link the object into the final cross-compiled zig.
mimalloc's per-thread heaps cut the futex count by ~630x; bun's zig
step on a 64-core Linux host drops from ~132s to ~23s (incremental).

…v_map+OptBisect races; hoist FuncInstance prep out of 4-shard lock

claimOrWait now distinguishes same-thread reentry (.recursed -> dependency
loop diagnostic) from a cross-thread cycle (.cycle -> yield-and-requeue).
The previous bare AnalysisFail silently ran markTransitiveFailed on both
units with no error and could surface as a processExportsInner unreachable.
resolveNavType brackets type/linksection writes with bits.writing so
getNav cannot tear (mirrors resolveNavValue). ensureExportFuncQueued
locks the shard mutex around nav_map.contains. ensureNavValAnalysisQueued
unlocks nav_queued_mutex before queueJob so it doesn't nest under
work_queue_mutex. getFuncInstanceIes precomputes the instance nav's
name/fqn/mods before lockShardsSorted so the 4-shard critical section
holds only createNav+owner_nav write. OptBisect is per-thread.
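
The .recursed / .cycle distinction can be sketched in Python as below. This is a sketch under stated assumptions, not the compiler's code: `ClaimTable`, `Claim`, and the wait-for-chain walk are hypothetical, and the commit message does not say how claimOrWait actually detects a cross-thread cycle.

```python
import threading
from enum import Enum


class Claim(Enum):
    CLAIMED = "claimed"    # caller now owns the unit
    RECURSED = "recursed"  # caller already owns it: report a dependency loop
    CYCLE = "cycle"        # owners wait on each other: yield and requeue
    WAIT = "wait"          # another thread owns it and blocking is safe


class ClaimTable:
    def __init__(self):
        self._lock = threading.Lock()
        self._owner = {}       # unit -> id of the thread analysing it
        self._waiting_on = {}  # thread id -> unit that thread is blocked on

    def claim_or_wait(self, tid, unit):
        with self._lock:
            owner = self._owner.get(unit)
            if owner is None:
                self._owner[unit] = tid
                return Claim.CLAIMED
            if owner == tid:
                # Same-thread reentry: a genuine dependency loop.
                return Claim.RECURSED
            # Assumed detection strategy: follow the wait-for chain from the
            # owner. Reaching `tid` means blocking here would deadlock.
            seen = set()
            cur = owner
            while cur is not None and cur not in seen:
                seen.add(cur)
                blocked_unit = self._waiting_on.get(cur)
                if blocked_unit is None:
                    break
                cur = self._owner.get(blocked_unit)
                if cur == tid:
                    return Claim.CYCLE
            self._waiting_on[tid] = unit
            return Claim.WAIT

    def release(self, tid, unit):
        with self._lock:
            if self._owner.get(unit) == tid:
                del self._owner[unit]
            # Anyone blocked on this unit is no longer waiting.
            for waiter, waited in list(self._waiting_on.items()):
                if waited == unit:
                    del self._waiting_on[waiter]
```

Keeping the two outcomes separate is what lets the caller emit a diagnostic for a real loop while retrying a cycle that only exists because of scheduling.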
…ap fixes

- Shard naming uses target.ofmt.fileExt() instead of hardcoded ".o" in
  Compilation.zig, link.zig, and Build/Step/Compile.zig so COFF targets
  get foo.{i}.obj. ELF/Mach-O behaviour unchanged (fileExt() returns ".o").
- Hash llvm_codegen_threads / llvm_no_merge_shards / no_link_obj into the
  cache key — these change the output file set (one merged object vs N
  shard objects), so a stale cache hit would otherwise produce the wrong
  layout.
- std.c: drop const from _msize to match the Windows SDK declaration so
  the C-backend bootstrap (zig2.c) compiles under clang-cl/msvc.

With these, build-obj --no-link --llvm-no-merge-shards --llvm-codegen-threads=N
works for x86_64-windows-msvc; lld-link consumes the shards directly.
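
The naming and cache-key changes above can be illustrated with a short Python sketch. It is illustrative only: the `OBJECT_EXT` table, the zero-based shard index, and the `cache_key` layout are assumptions, not the Zig implementation.

```python
import hashlib

# Assumed stand-in for target.ofmt.fileExt(): COFF uses ".obj", ELF/Mach-O ".o".
OBJECT_EXT = {"coff": ".obj", "elf": ".o", "macho": ".o"}


def shard_object_paths(stem, ofmt, n_shards):
    """Expand {stem}.{i}{ext} for each shard (index base is an assumption)."""
    ext = OBJECT_EXT[ofmt]
    return [f"{stem}.{i}{ext}" for i in range(n_shards)]


def cache_key(source_digest, llvm_codegen_threads, llvm_no_merge_shards, no_link_obj):
    """Mix every option that changes the output file set into the key."""
    h = hashlib.sha256()
    for field in (source_digest, llvm_codegen_threads, llvm_no_merge_shards, no_link_obj):
        h.update(repr(field).encode())
        h.update(b"\0")  # separator so adjacent fields cannot run together
    return h.hexdigest()
```

Two invocations that differ only in the shard count now get different keys, so a cached single-object result cannot be returned for a request that expects N shard objects.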
Parallel sema + sharded LLVM emit on Windows was wall-clock bound on
two host primitives that the Linux/macOS paths don't hit:

- gpa = raw_c_allocator when link_libc, which on Windows is
  HeapAlloc(GetProcessHeap()) behind a single critical section. Switch
  release builds to smp_allocator (per-thread heaps backed by the page
  allocator) the same way the no-libc path already does. Debug keeps
  c_allocator so leak tooling stays accurate.
- std.Thread.Condition on Windows wrapped CONDITION_VARIABLE, whose
  Wake* has no userspace "no waiters" fast-path — every
  work_queue_cond.signal() and claim-shard cond.signal() became a
  kernel32 call. Use FutexImpl everywhere; on Windows the Futex layer
  already maps to RtlWaitOnAddress (Win8+). The old WindowsImpl is left
  in place for reference.
- LLVM's own allocations go through C++ operator new, which still hits
  the CRT heap. Add tools/mimalloc_new_delete_override.cpp (mimalloc's
  unity static.c + the replaceable global operators) and a windows-gnu
  splice in the bootstrap workflow mirroring the existing linux-musl
  step. malloc/free themselves stay on the CRT — they can't be
  statically interposed on Windows — but LLVM's hot path is operator
  new, which is replaceable per the standard.
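
The "no waiters" fast-path from the Condition bullet can be sketched in Python. This shows the idea only: `FastPathCondition` and its counters are hypothetical, and where the real FutexImpl checks atomic state without a lock, this sketch leans on the caller's lock for simplicity.

```python
import threading
import time


class FastPathCondition:
    """Condition wrapper that skips the wake call when nobody is waiting."""

    def __init__(self, lock):
        self._cond = threading.Condition(lock)
        self.waiters = 0       # threads currently blocked in wait()
        self.slow_signals = 0  # signals that actually had to wake a thread

    def wait(self):
        # Caller must hold the lock.
        self.waiters += 1
        try:
            self._cond.wait()
        finally:
            self.waiters -= 1

    def signal(self):
        # Caller must hold the lock.
        if self.waiters == 0:
            return  # fast path: no wake call at all
        self.slow_signals += 1
        self._cond.notify()


def demo():
    lock = threading.Lock()
    cond = FastPathCondition(lock)

    # No waiters: every signal takes the fast path.
    with lock:
        for _ in range(1000):
            cond.signal()
    idle_wakes = cond.slow_signals

    ready = []

    def waiter():
        with lock:
            while not ready:
                cond.wait()

    t = threading.Thread(target=waiter)
    t.start()
    while True:
        with lock:
            if cond.waiters == 1:  # waiter is parked; a wake is needed now
                ready.append(True)
                cond.signal()
                break
        time.sleep(0.001)
    t.join()
    return idle_wakes, cond.slow_signals
```

In the demo the thousand signals sent while nobody waits never reach the wake call, and only the one signal sent to a parked thread takes the slow path.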

bun debug zig step on a 24-core x86_64-windows-msvc host:

              serial    psema    psema+24sh
  before      208.2s    165.0s   77.0s
  after       165.1s    151.7s   35.1s        (5.9x vs original serial)

Per-file partitioning meant a small source file that hosts thousands of
generic instantiations (e.g. output.zig's printf-style formatters) lands
entirely in one LLVM module, pinning emit wall-clock to that one shard.
For bun on a 24-core host, shard 13 took 22.5s while the rest finished
in 3-14s.

Hashing the FQN as well spreads instantiations across shards. Max shard
drops from 22.5s to ~14s; cross-shard externs grow (CPU sum +30%) but
wall-clock falls. bun debug zig step:

  file-only:   ~35s wall, 207s cpu-sum, max-shard 22.5s
  file+fqn:    ~27s wall, 274s cpu-sum, max-shard 13.9s
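
The difference between the two routers can be sketched in Python. The sketch makes several assumptions: SHA-256 stands in for whatever hash the compiler uses, the function names are hypothetical, and the nav names are made up.

```python
import hashlib
from collections import Counter


def _bucket(key, n_shards):
    # Stable hash so the result does not depend on Python's hash seed.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "little") % n_shards


def shard_by_file(file_path, fqn, n_shards):
    """File-only routing: every nav in a file lands in the same shard."""
    return _bucket(file_path, n_shards)


def shard_by_file_and_fqn(file_path, fqn, n_shards):
    """File+FQN routing: instantiations from one file spread across shards."""
    return _bucket(file_path + "\0" + fqn, n_shards)


def shard_load(router, navs, n_shards):
    """Count how many navs each shard receives under the given router."""
    return Counter(router(path, fqn, n_shards) for path, fqn in navs)


# Hypothetical workload: one small file hosting many generic instantiations.
navs = [("src/output.zig", f"output.print__anon_{i}") for i in range(2000)]
by_file = shard_load(shard_by_file, navs, 24)
by_file_and_fqn = shard_load(shard_by_file_and_fqn, navs, 24)
```

Here `by_file` has a single entry holding all 2000 navs, while `by_file_and_fqn` spreads them over many shards. The real change trades that spread for more cross-shard externs, which the sketch does not model.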

Determinism: anonymous-type FQNs embed InternPool indices which are
insertion-order dependent under parallel sema, so the shard set can vary
between runs. This is not a regression: `shardedNavName` already embeds
the same indices in cross-shard symbol names, so sharded build-obj output
was never bit-reproducible under ZIG_PARALLEL_SEMA. cg=1 builds (CI
releases) are unaffected. A structural-hash naming fix is tracked
separately.

`zig build` compiles build_runner.zig + the user's build.zig before any
step runs. That compile pulled in ~10k navs of std.Build and emitted
them through a single LLVM module, costing ~3.7s of cold-cache wall time
before the first user step starts. Pass llvm_codegen_threads (the same
n_jobs the thread pool was sized to) so the runner's emit is sharded like
any other compile.

bun debug zig step, 24-core Windows, cold local cache:

  build-runner compile  3.84s -> 1.28s  (emit 2.96s -> 0.37s)
  total                 27s   -> 24s

dylan-conway force-pushed the claude/windows-parallel-shards branch from 2afac24 to 125866c on April 23, 2026 23:52
dylan-conway added a commit to oven-sh/bun that referenced this pull request Apr 24, 2026
The FAST fork's Windows allocator/condvar fixes (oven-sh/zig#20) haven't
landed yet, and historically the picker always returned STABLE for
hostOs===windows. Windows local dev and Windows CI (PR or main) now stay
on STABLE; only non-Windows local + PR CI use FAST.
Jarred-Sumner pushed a commit to oven-sh/bun that referenced this pull request May 4, 2026