psema/shard: enable Windows COFF sharded build-obj + cut allocator/condvar contention #20
dylan-conway wants to merge 15 commits into
Walkthrough
Build workflow adds Windows-GNU mimalloc splice; object-file handling generalized to target-specific extensions (e.g.
Changes
🚥 Pre-merge checks | ✅ Passed checks (4 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/std/Thread/Condition.zig`:
- Around line 116-119: The conditional alias Impl now only chooses
SingleThreadedImpl or FutexImpl, leaving WindowsImpl unreachable; remove
WindowsImpl or gate it behind an explicit fallback/feature flag. Update the
alias selection logic referencing Impl and the concrete types
SingleThreadedImpl, FutexImpl, and WindowsImpl: either delete the WindowsImpl
definition and any uses, or add a clear conditional branch (e.g., if
(builtin.windows or feature flag) then WindowsImpl else ...) so WindowsImpl is
intentionally reachable and documented. Ensure any tests or references to
WindowsImpl are updated or removed accordingly.
In `@src/Compilation.zig`:
- Around line 2191-2196: The cache key currently uses raw CLI flags
(options.llvm_no_merge_shards and options.no_link_obj) which are normalized
later, causing unnecessary cache misses; update the cache hashing to use the
canonicalized booleans (e.g., comp.no_merge_shards and comp.no_link_obj or
whatever normalized fields exist) together with options.llvm_codegen_threads so
shard/merge/link semantics are represented by the effective config rather than
the raw flags — locate the cache.hash.add calls around
options.llvm_codegen_threads and replace the additions of the raw option flags
with additions of the normalized comp.no_merge_shards and comp.no_link_obj
values.
In `@src/main.zig`:
- Around line 182-188: The comment above the Debug-path is incorrect: the code
checks builtin.mode == .Debug and prefers std.heap.raw_c_allocator, only falling
back to std.heap.c_allocator for over-aligned requests (when
`@alignOf`(std.c.max_align_t) < `@max`(`@alignOf`(i128), std.atomic.cache_line)); also
note the -Ddebug-gpa case is handled earlier. Update the comment to state that
in Debug mode the allocator chosen is raw_c_allocator by default with a fallback
to c_allocator for over-aligned allocations, and remove the misleading claim
about “keeping c_allocator” and the -Ddebug-gpa handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1c497521-9d62-490c-b73a-afd2db9cf3cd
📒 Files selected for processing (8)
- .github/workflows/bun_build.yaml
- lib/std/Build/Step/Compile.zig
- lib/std/Thread/Condition.zig
- lib/std/c.zig
- src/Compilation.zig
- src/link.zig
- src/main.zig
- tools/mimalloc_new_delete_override.cpp
  const Impl = if (builtin.single_threaded)
      SingleThreadedImpl
- else if (builtin.os.tag == .windows)
-     WindowsImpl
  else
      FutexImpl;
🧹 Nitpick | 🔵 Trivial
WindowsImpl looks unreachable after this selection change.
Since Impl no longer selects WindowsImpl, consider removing it (or clearly parking it behind an explicit fallback gate) to reduce maintenance and avoid bit-rot.
// Sharded codegen changes the output file *set* (one merged object vs.
// N shard objects), so the count and the merge/no-link knobs must be
// part of the cache key.
cache.hash.add(options.llvm_codegen_threads);
cache.hash.add(options.llvm_no_merge_shards);
cache.hash.add(options.no_link_obj);
🧹 Nitpick | 🔵 Trivial
Hash the normalized shard flags, not the raw CLI values.
comp.no_link_obj and comp.no_merge_shards are normalized later, so these raw option values can split the cache for configurations that produce the same artifact layout. For example, --no-link-obj --llvm-codegen-threads>1 without --llvm-no-merge-shards still behaves like linking. Hash the derived booleans instead to avoid unnecessary misses.
♻️ Proposed refactor
+ const no_merge_shards = options.llvm_no_merge_shards and options.llvm_codegen_threads > 1;
+ const no_link_obj = options.no_link_obj and
+ (options.llvm_codegen_threads <= 1 or no_merge_shards);
+
// Sharded codegen changes the output file *set* (one merged object vs.
// N shard objects), so the count and the merge/no-link knobs must be
// part of the cache key.
cache.hash.add(options.llvm_codegen_threads);
- cache.hash.add(options.llvm_no_merge_shards);
- cache.hash.add(options.no_link_obj);
+ cache.hash.add(no_merge_shards);
+ cache.hash.add(no_link_obj);
- .no_link_obj = options.no_link_obj and
- (options.llvm_codegen_threads <= 1 or options.llvm_no_merge_shards),
- .no_merge_shards = options.llvm_no_merge_shards and options.llvm_codegen_threads > 1,
+ .no_link_obj = no_link_obj,
+ .no_merge_shards = no_merge_shards,
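The normalization this review comment asks for can be sketched in isolation. This is a minimal sketch, not the code in `src/Compilation.zig`: `ShardOptions`, `Effective`, `normalize`, and `cacheKey` are hypothetical names, and only the two boolean expressions are taken from the proposed refactor. It assumes a recent Zig standard library in which `std.hash.Wyhash` exposes `init`, `update`, and `final`.

```zig
const std = @import("std");

/// Hypothetical stand-in for the raw CLI options named in the review.
const ShardOptions = struct {
    llvm_codegen_threads: u32,
    llvm_no_merge_shards: bool,
    no_link_obj: bool,
};

/// Hypothetical "effective" configuration after normalization.
const Effective = struct {
    codegen_threads: u32,
    no_merge_shards: bool,
    no_link_obj: bool,
};

/// Derive the effective flags using the expressions from the proposed refactor.
fn normalize(o: ShardOptions) Effective {
    const no_merge_shards = o.llvm_no_merge_shards and o.llvm_codegen_threads > 1;
    const no_link_obj = o.no_link_obj and
        (o.llvm_codegen_threads <= 1 or no_merge_shards);
    return .{
        .codegen_threads = o.llvm_codegen_threads,
        .no_merge_shards = no_merge_shards,
        .no_link_obj = no_link_obj,
    };
}

/// Illustrative cache key: hashes the effective values, not the raw flags.
fn cacheKey(e: Effective) u64 {
    var h = std.hash.Wyhash.init(0);
    h.update(std.mem.asBytes(&e.codegen_threads));
    h.update(&[_]u8{ @intFromBool(e.no_merge_shards), @intFromBool(e.no_link_obj) });
    return h.final();
}
```

Under this sketch, a raw configuration with `no_link_obj = true`, four codegen threads, and no `llvm_no_merge_shards` normalizes to the same `Effective` value as the same configuration with `no_link_obj = false`, so both produce one key. Hashing the raw flags would give two.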
// with libc malloc used elsewhere (LLVM, C++). Debug keeps
// c_allocator so leak tooling and -Ddebug-gpa stay accurate.
if (builtin.mode == .Debug) {
    if (@alignOf(std.c.max_align_t) < @max(@alignOf(i128), std.atomic.cache_line)) {
        break :gpa .{ std.heap.c_allocator, false };
    }
    break :gpa .{ std.heap.raw_c_allocator, false };
Fix the debug-path comment.
Lines 183-188 do not “keep c_allocator” in Debug; they still prefer std.heap.raw_c_allocator and only fall back to std.heap.c_allocator for over-aligned requests. The -Ddebug-gpa case is also handled earlier at Lines 173-174, so this comment currently describes a different control path than the code below it.
Suggested comment update
- // with libc malloc used elsewhere (LLVM, C++). Debug keeps
- // c_allocator so leak tooling and -Ddebug-gpa stay accurate.
+ // with libc malloc used elsewhere (LLVM, C++). Debug stays on
+ // the libc-backed allocators here: prefer raw_c_allocator for the
+ // fast path, but fall back to c_allocator when over-aligned
+ // allocations would violate max_align_t. The -Ddebug-gpa path is
+ // handled above.
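The control flow this comment describes can be written as a small decision function. This is a sketch under stated assumptions, not the code in `src/main.zig`: `GpaChoice` and `chooseLibcGpa` are hypothetical names, only the libc-linked branch is modelled, and the `-Ddebug-gpa` path (handled earlier in the real code) is deliberately left out. The alignments are passed as parameters so the logic can be checked without linking libc.

```zig
const std = @import("std");

/// Hypothetical labels for the allocators named in the review.
const GpaChoice = enum { raw_c_allocator, c_allocator, smp_allocator };

/// Sketch of the libc-linked branch only:
/// - Debug prefers raw_c_allocator and falls back to c_allocator when the
///   required alignment exceeds what max_align_t guarantees.
/// - Non-Debug builds use smp_allocator, as the PR description states.
fn chooseLibcGpa(
    mode: std.builtin.OptimizeMode,
    max_align_t_align: usize,
    required_align: usize,
) GpaChoice {
    if (mode == .Debug) {
        if (max_align_t_align < required_align) return .c_allocator;
        return .raw_c_allocator;
    }
    return .smp_allocator;
}
```

In the quoted code the second and third arguments correspond to `@alignOf(std.c.max_align_t)` and `@max(@alignOf(i128), std.atomic.cache_line)`.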
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/Zcu.zig`:
- Around line 4815-4836: The new navShard implementation breaks the “same-file
=> same shard” guarantee needed for module-level asm (global_assembly)
referenced by nav_val/nav_ty/func keys; update the sharding logic so asm-owning
units are assigned by file-only affinity: in either navShard or in
analUnitShard, detect when the nav/anal unit contains a global_assembly (or will
be used for global_assembly emission from nav_val/nav_ty/func) and compute the
shard using only the file_key (i.e., hash(file_key) % n) so all asm from the
same file land in the same shard, while leaving the existing combined
file_key+fqn hashing for non-asm navs. Ensure you reference navShard,
analUnitShard, and the global_assembly/nav_val/nav_ty/func paths when making the
change.
  /// Shard assignment for `nav`. Keyed on the file's `shardKey` *plus* the
  /// nav's fully-qualified name so a single file with thousands of generic
  /// instantiations (e.g. printf-style formatters that monomorphise per call
  /// site) doesn't pin the entire emit wall-clock to one LLVM module.
  ///
  /// Determinism: the shard key is content-derived (path + FQN bytes), but
  /// FQNs of anonymous types embed InternPool indices (`__anon_N`) which are
  /// not stable across parallel-sema runs. That's no regression: the
  /// per-shard *symbol names* already carry those indices via `shardedNavName`
  /// and the type-name suffix, so sharded `build-obj` output was never
  /// bit-reproducible under `ZIG_PARALLEL_SEMA`. CI release builds use
  /// `--llvm-codegen-threads=1` (no sharding) and remain reproducible. A
  /// proper fix needs structural type-hash naming; tracked separately.
  pub fn navShard(zcu: *Zcu, nav: InternPool.Nav.Index, n: u32) u32 {
      if (n <= 1) return 0;
-     return zcu.navFileScope(nav).computeShard(n);
+     const ip = &zcu.intern_pool;
+     var buf: [512]u8 = undefined;
+     const file_key = zcu.navFileScope(nav).shardKey(&buf);
+     var h: std.hash.Wyhash = .init(0);
+     h.update(file_key);
+     h.update(ip.getNav(nav).fqn.toSlice(ip));
+     return @intCast(h.final() % n);
Keep same-file shard affinity for module-level asm users.
This change drops the old “same file => same shard” invariant, but analUnitShard still sends nav_val/nav_ty/func-keyed global_assembly through navShard. That clashes with the contract called out on Lines 4839-4842: asm strings that reference other same-file navs (for example .set aliases) no longer reliably land in the same LLVM module, so the integrated assembler can miss symbols or force duplicate emission. Please preserve file-level affinity for asm-owning units, or add a separate shard-affinity path for global_assembly.
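The shard-affinity rule requested here can be sketched as a pure function over strings. This is a minimal sketch, not the code in `src/Zcu.zig`: `fileShard` and the `owns_global_asm` parameter are hypothetical, and how the real compiler would detect an asm-owning unit is not shown. It assumes a recent Zig standard library with `std.hash.Wyhash.hash`, `init`, `update`, and `final`.

```zig
const std = @import("std");

/// File-only affinity: every unit from the same file maps to the same shard.
fn fileShard(file_key: []const u8, n: u32) u32 {
    if (n <= 1) return 0;
    return @intCast(std.hash.Wyhash.hash(0, file_key) % n);
}

/// Combined affinity for ordinary navs, file-only affinity for units that
/// own module-level asm. `owns_global_asm` is a hypothetical flag.
fn navShard(file_key: []const u8, fqn: []const u8, owns_global_asm: bool, n: u32) u32 {
    if (n <= 1) return 0;
    if (owns_global_asm) return fileShard(file_key, n);
    var h = std.hash.Wyhash.init(0);
    h.update(file_key);
    h.update(fqn);
    return @intCast(h.final() % n);
}
```

With this split, two asm-owning units from the same file always share a shard regardless of their fully-qualified names, while other navs from that file can still spread across shards.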
♻️ Duplicate comments (1)
src/main.zig (1)
176-188: ⚠️ Potential issue | 🟡 Minor
Update the Debug-path comment.
The comment still says Debug “keeps c_allocator”, but the fast path at Line 188 returns std.heap.raw_c_allocator; std.heap.c_allocator is only used for the over-aligned fallback at Lines 185-187, and -Ddebug-gpa is handled earlier at Line 173.
Suggested comment update
- // with libc malloc used elsewhere (LLVM, C++). Debug keeps
- // c_allocator so leak tooling and -Ddebug-gpa stay accurate.
+ // with libc malloc used elsewhere (LLVM, C++). Debug stays on
+ // libc-backed allocators here: prefer raw_c_allocator on the
+ // fast path, but fall back to c_allocator for over-aligned
+ // allocations. The -Ddebug-gpa path is handled above.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/main.zig` around lines 176 - 188, The comment above the GPA selection is out of date: the Debug branch (builtin.mode == .Debug) usually returns std.heap.raw_c_allocator with std.heap.c_allocator used only for the over-aligned fallback (when `@alignOf`(std.c.max_align_t) < `@max`(`@alignOf`(i128), std.atomic.cache_line)), and the -Ddebug-gpa behavior is handled earlier; update the comment to state that Debug normally uses raw_c_allocator, that c_allocator is only for the over-aligned case, and that debug-gpa is already addressed prior to this gpa block (referencing the gpa selection and the symbols builtin.mode, std.heap.raw_c_allocator, std.heap.c_allocator, and the -Ddebug-gpa handling).
04e7f6a to 356ddda
2bb76ed to 2afac24
- ZIG_PARALLEL_SEMA: Sema runs concurrently across worker threads with
  per-unit claim/wait, retry-on-dependency-cycle, and per-map mutexes
  replacing the global sema_lock for the non-incremental fast path.
- InternPool: thread-safe writers (locked single-field setters, seqlock on
  getNav, sorted-shard prelocking for getFunc*Ies, 256 hash shards).
- llvm backend: PartitionSet emits N independent llvm modules in parallel;
  cross-shard refs are linkonce_odr; --llvm-codegen-threads=N partitions by
  file path; --llvm-no-merge-shards leaves shard .o files unmerged.
- link.MachO -r: handle N shard inputs; emit hidden defs as private-extern;
  convert tentatives so Apple ld_new accepts the merged object.
- link.Elf: handle N shard inputs; batch preads in writeRelocatable to avoid
  per-atom syscall storm under heavy COMDAT section counts.
- link.Lld: pass all shard paths to lld for elf/coff/wasm.
- std.Build.Step.Compile: llvm_codegen_threads, llvm_no_merge_shards.
Memory model (ARM64):
- getNav seqlock: payload loads .unordered -> .acquire so b2 cannot reorder
  before them (LDAR-before-LDAR is ordered; @Fence is gone).
- setFieldTypesAlignsAll: memcpy [0..len-1) then release-store [len-1]
  inside the mutex; remove the post-mutex re-store in structFields.
.removed/.existing race cluster:
- awaitNamespaceTypeFinished returns {finished, cancelled}.
- 9 Sema .existing arms (zir*Decl, anon-struct-init, reify*) wrapped in
  gop:while(true) retry loops; cancelled re-runs get*Type.
- getOrPutKeyInner locked re-probe skips .removed (mirror lockless path).
Retry/requeue:
- codegen_func: reset tls_retry_loop before resolveTypesFully; on retry
  requeue the job instead of dropping the body.
- ensureMemoizedStateUpToDate: re-probe sentinel decl on .done.
Misc:
- deleteUnitReferences: capture parent + write self-loop marker before
  free-list append, all under inline_ref_mutex (fixes UAF on realloc).
- test_functions.contains: take test_functions_mutex.
- Lld coffLink/wasmLink: error on multi-shard build-obj instead of silently
  dropping shards 1..N.
- PartitionSet.emit: keep asm_path for shard 0.
- types_resolved: propagate OOM instead of swallowing as false.
- dumpLlvmShardStats: clamp n<=256; per-(file,shard) top-file key.
…len comptime guard
- build.zig + test/tests.zig: add llvm_codegen_threads option, set on all
  addModuleTests targets when LLVM backend is used.
- lib/std/mem.zig: gate strlen/wcslen extern fast-path on !@inComptime()
  (extern call at comptime is invalid; pre-existing fork bug).
…iases; IES yield
- llvm/Builder.zig + ir.zig: add COMDAT support (MODULE_CODE_COMDAT records,
  Variable.comdat field, addComdat). Required for COFF: without comdat any,
  linkonce_odr emits as a strong def per shard and lld-link rejects ~350
  duplicate __anon_* symbols.
- codegen/llvm.zig resolveGlobalUav/updateExportedValue: setComdat(.any) on
  COFF for sharded linkonce_odr uavs.
- Zcu.navShard: switch from fqn-hash to file-hash via File.computeShard; add
  analUnitShard mapping comptime/nav/func units to their file's shard.
- codegen/llvm.zig genModuleLevelAssembly: route each global asm block to its
  source file's shard so .set aliases resolve against same-module defs.
- codegen/llvm.zig PartitionSet.updateExports: broadcast to all shards;
  Object.updateExports collapses non-owner extern globals onto one canonical
  decl so InstCombine cannot fold &a==&b to false pre-link.
- Zcu.isClaimedByOther + Sema.resolveInferredErrorSet: when the IES func is
  claimed by another thread, set tls_retry_loop and yield (cap 8) instead of
  parking in claimOrWait. Reuses existing requeue path.
…types_wip; fork bugfixes
- Sema.analyzeNavRefInner: revert is_ref to .type-only resolve under
parallel_sema (the .fully override created a structural self-dep on
nav_val for 'const foo = .{ .self = &foo }'). The torn-read concern
was unfounded — getNav returns by-value and isExternOrFn handles both
status arms; the extern/fn branch already re-ensures .fully before
dereferencing .fully_resolved.val.
- Type.hasRuntimeBitsInner/comptimeOnlyInner: gate the four
.field_types_wip self-recursion shortcuts on !isClaimedByOther so a
wip flag set by another worker falls through to claimOrWait instead of
poisoning assumed_runtime_bits.
- main.zig: -fno-sanitize=address was hardcoded =true (pre-existing).
- lib/std/os/linux.zig: clock_getres/settime @intFromEnum on clockid_t
(pre-existing; @as(isize, enum) is invalid).
…fixes
ZIG_PARALLEL_SEMA on behavior.zig: ~18s serial → ~2.2s at j=16 (8.1x),
1.14x CPU overhead. 0/130 stress runs across j=8/16/32/64 + full exec.
With -fllvm --llvm-codegen-threads=32: 9.9s → 2.15s.
parallel sema:
- Zcu: shard unit_claims into 256 {mutex,cond,map,deferred,waiters};
tryClaim/claimOrWait/releaseClaim/isClaimedByOther/deferOn lock only
the unit's shard. claim_waits gets its own mutex; detectClaimCycle
walks via tryLock peeks (skip on contended foreign shard).
- Zcu: tryClaim() non-blocking; ensureFuncBodyUpToDate top-level skips
on busy instead of parking.
- Replace sema_lock under parallel non-incremental with fine-grained
locks: embed_mutex, global_assembly_mutex, file_system_inputs_mutex,
per-Namespace decls_mutex, comp.mutex for ensureFileAnalyzed.
resolveStructInner/resolveUnionInner gated like the ensure* sites.
- Sema.resolveInferredErrorSet: drop the shared 8-retry yield cap; the
nested ensureFuncBodyUpToDate blocks on the (now sharded) claim
instead of re-running the caller body.
- awaitNamespaceTypeFinished: return .would_block instead of unbounded
spin; callers yield-and-requeue. getNamespace/enumFieldIndex keep the
spin variant per their finished-type contract.
- Compilation: work_queue_cond replaces the dispatch loop's
Thread.yield() busy-spin; queueJob/workerAnalyzeFunc signal it.
- main: ReleaseSafe uses smp_allocator (debug_allocator's single mutex
serialised every alloc and dominated wall time).
- ZIG_PSEMA_STATS counters.
races fixed:
- Type.comptimeOnlyInner .normal strat: .wip/.unknown observed under
parallel sema → false (per documented contract) instead of unreachable.
- Zcu.maybeUnresolveIes: early-return under parallel non-incremental;
the unlocked outdated.contains() raced scanDecl's writes.
- InternPool.getIfExists: skip .removed entries.
misc fixes from branch sweep:
- Package/Fetch: promoted lazy→eager dep is now appended to all_fetches
(arena leak + double-fetch + dropped errors otherwise).
- std.fs.Dir.realpath windows ".": stack temp + NameTooLong, was
slicing out_buffer to max_path_bytes unconditionally.
- link/Elf, link/MachO: use base.resolveZcuObjectPaths instead of
open-coding the {stem}.{i}.o expansion.
- Compilation.dumpLlvmShardStats: use zcu.navShard (was hashing fqn,
which doesn't match the file-path router).
- codegen/llvm: free the bin_filename_list gpa allocation.
- zig_llvm.cpp: delete dead getAsanOptions().
- target.zig: .@"async" → .async.
- libs/libcxx: @intFromBool instead of @as(u1, if ...).
Exposes the per-shard object paths when llvm_no_merge_shards is set, so a build.zig can install/consume them directly instead of waiting for the single-threaded relocatable -r merge into one object.
musl's malloc has a single global rwlock. With N parallel LLVM contexts (--llvm-codegen-threads=N) every operator new from the bitcode reader and pass pipeline serialises on it — 270M futex calls compiling bun at cg=64, ~120s wall in emit alone. Add -Dmimalloc-obj=PATH to build.zig and have the bun_build workflow compile oven-sh/mimalloc (bun-dev3-v2) static.c with MI_MALLOC_OVERRIDE for the target, then link the object into the final cross-compiled zig. mimalloc's per-thread heaps reduce the futex count ~630x; bun's zig step on Linux/64c goes ~132s → ~23s incremental.
…v_map+OptBisect races; hoist FuncInstance prep out of 4-shard lock
- claimOrWait now distinguishes same-thread reentry (.recursed -> dependency
  loop diagnostic) from cross-thread cycle (.cycle -> yield-and-requeue); the
  previous bare AnalysisFail silently markTransitiveFailed both units with no
  error and could surface as a processExportsInner unreachable.
- resolveNavType brackets type/linksection writes with bits.writing so getNav
  cannot tear (mirrors resolveNavValue).
- ensureExportFuncQueued locks the shard mutex around nav_map.contains.
- ensureNavValAnalysisQueued unlocks nav_queued_mutex before queueJob so it
  doesn't nest under work_queue_mutex.
- getFuncInstanceIes precomputes the instance nav's name/fqn/mods before
  lockShardsSorted so the 4-shard critical section holds only
  createNav+owner_nav write.
- OptBisect is per-thread.
…ap fixes
- Shard naming uses target.ofmt.fileExt() instead of hardcoded ".o" in
Compilation.zig, link.zig, and Build/Step/Compile.zig so COFF targets
get foo.{i}.obj. ELF/Mach-O behaviour unchanged (fileExt() returns ".o").
- Hash llvm_codegen_threads / llvm_no_merge_shards / no_link_obj into the
cache key — these change the output file set (one merged object vs N
shard objects), so a stale cache hit would otherwise produce the wrong
layout.
- std.c: drop const from _msize to match the Windows SDK declaration so
the C-backend bootstrap (zig2.c) compiles under clang-cl/msvc.
With these, build-obj --no-link --llvm-no-merge-shards --llvm-codegen-threads=N
works for x86_64-windows-msvc; lld-link consumes the shards directly.
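The `foo.{i}.obj` naming described in this commit message amounts to joining a stem, a shard index, and a target-specific extension. The following is a minimal sketch only: `shardPath` is a hypothetical helper, and the extension is passed in as a plain string rather than obtained from `target.ofmt.fileExt()`, whose exact signature is not shown in this PR.

```zig
const std = @import("std");

/// Format "{stem}.{index}{ext}" into `buf`, e.g. "foo.3.obj" for COFF or
/// "foo.3.o" for ELF/Mach-O. `ext` includes the leading dot.
fn shardPath(buf: []u8, stem: []const u8, index: u32, ext: []const u8) ![]u8 {
    return std.fmt.bufPrint(buf, "{s}.{d}{s}", .{ stem, index, ext });
}
```

Keeping the extension a parameter is what lets one code path serve both `.o` and `.obj` targets instead of hardcoding either.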
Parallel sema + sharded LLVM emit on Windows was wall-clock bound on two
host primitives that the linux/macos paths don't hit:
- gpa = raw_c_allocator when link_libc, which on Windows is
  HeapAlloc(GetProcessHeap()) behind a single critical section. Switch
  release builds to smp_allocator (per-thread heaps backed by the page
  allocator) the same way the no-libc path already does. Debug keeps
  c_allocator so leak tooling stays accurate.
- std.Thread.Condition on Windows wrapped CONDITION_VARIABLE, whose Wake*
  has no userspace "no waiters" fast-path, so every work_queue_cond.signal()
  and claim-shard cond.signal() became a kernel32 call. Use FutexImpl
  everywhere; on Windows the Futex layer already maps to RtlWaitOnAddress
  (Win8+). The old WindowsImpl is left in place for reference.
- LLVM's own allocations go through C++ operator new, which still hits the
  CRT heap. Add tools/mimalloc_new_delete_override.cpp (mimalloc's unity
  static.c + the replaceable global operators) and a windows-gnu splice in
  the bootstrap workflow mirroring the existing linux-musl step. malloc/free
  themselves stay on the CRT (they can't be statically interposed on
  Windows), but LLVM's hot path is operator new, which is replaceable per
  the standard.
bun debug zig step on a 24-core x86_64-windows-msvc host:
  serial   psema    psema+24sh
  208.2s   165.0s   77.0s        before
  165.1s   151.7s   35.1s        after (5.9x vs original serial)
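The "no waiters" fast path mentioned for the condition variable can be illustrated with a small single-threaded model. This is a sketch under loud assumptions, not `std.Thread.Condition`'s `FutexImpl`: `WakeModel`, `addWaiter`, and `kernel_calls` are hypothetical, no real futex call is made, and the relaxed orderings are only adequate because the model never runs concurrently. It assumes Zig 0.12 or later, where `std.atomic.Value` and the lowercase `.monotonic` ordering exist.

```zig
const std = @import("std");

/// Hypothetical model of a futex-style signal path. `kernel_calls` counts
/// how often a real implementation would have to enter the kernel.
const WakeModel = struct {
    waiters: std.atomic.Value(u32) = std.atomic.Value(u32).init(0),
    kernel_calls: u32 = 0,

    /// Record that one thread is about to block.
    fn addWaiter(self: *WakeModel) void {
        _ = self.waiters.fetchAdd(1, .monotonic);
    }

    /// Wake at most one waiter. The userspace check comes first, so a
    /// signal with nobody waiting costs one atomic load and no syscall.
    fn signal(self: *WakeModel) void {
        if (self.waiters.load(.monotonic) == 0) return;
        _ = self.waiters.fetchSub(1, .monotonic);
        self.kernel_calls += 1; // stand-in for the real wake syscall
    }
};
```

In a work queue where most signals find no sleeping worker, this check is what keeps `signal()` out of the kernel on the common path.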
Per-file partitioning meant a small source file that hosts thousands of
generic instantiations (e.g. output.zig's printf-style formatters) lands
entirely in one LLVM module, pinning emit wall-clock to that one shard. For
bun on a 24-core host, shard 13 took 22.5s while the rest finished in 3-14s.
Hashing the FQN as well spreads instantiations across shards. Max shard
drops from 22.5s to ~14s; cross-shard externs grow (CPU sum +30%) but
wall-clock falls.
bun debug zig step:
  file-only: ~35s wall, 207s cpu-sum, max-shard 22.5s
  file+fqn:  ~27s wall, 274s cpu-sum, max-shard 13.9s
Determinism: anonymous-type FQNs embed InternPool indices which are
insertion-order dependent under parallel sema, so the shard set can vary
between runs. This is no regression: `shardedNavName` already embeds the
same indices in cross-shard symbol names, so sharded build-obj output was
never bit-reproducible under ZIG_PARALLEL_SEMA. cg=1 builds (CI releases)
are unaffected. A structural-hash naming fix is tracked separately.
`zig build` compiles build_runner.zig + the user's build.zig before any
step runs. That compile pulled in ~10k navs of std.Build and emitted them
through a single LLVM module: ~3.7s of cold-cache wall before the first
user step starts. Pass llvm_codegen_threads (the same n_jobs the thread
pool was sized to) so the runner emit shards like any other compile.
bun debug zig step, 24-core Windows, cold local cache:
  build-runner compile 3.84s -> 1.28s (emit 2.96s -> 0.37s)
  total 27s -> 24s
b2ed937 to af6e006
2afac24 to 125866c
The FAST fork's Windows allocator/condvar fixes (oven-sh/zig#20) haven't landed yet, and historically the picker always returned STABLE for hostOs===windows. Windows local dev and Windows CI (PR or main) now stay on STABLE; only non-Windows local + PR CI use FAST.
af6e006 to 597cbfe
Summary
Enables `build-obj --no-link --llvm-no-merge-shards --llvm-codegen-threads=N` for `x86_64-windows-msvc` targets, and removes the host-side bottlenecks that kept the Windows compiler from scaling past ~2.7× under parallel sema + sharded emit.
bun debug zig step, 24-core x86_64 Windows host:
5.9× vs the original serial baseline (was 2.7× before this PR).
Changes
COFF shard emission (4073e7160c)
- Shard naming uses `target.ofmt.fileExt()` instead of hardcoded `.o` in `Compilation.zig`, `link.zig`, and `Build/Step/Compile.zig`, so COFF targets get `foo.{i}.obj`. ELF/Mach-O behaviour unchanged.
- Hash `llvm_codegen_threads` / `llvm_no_merge_shards` / `no_link_obj` into the cache key. These change the output file set, so a stale cache hit would otherwise produce the wrong layout.
- `std.c._msize`: drop `const` to match the Windows SDK declaration so the C-backend bootstrap (`zig2.c`) compiles under clang-cl/msvc.
Host contention (89647b5ee9)
- `src/main.zig`: when `link_libc` and not Debug, use `smp_allocator` instead of `raw_c_allocator`. The Windows CRT routes malloc → `HeapAlloc(GetProcessHeap())` behind a single critical section, so 24 sema workers serialise on it.
- `std.Thread.Condition`: use `FutexImpl` on Windows. The CONDITION_VARIABLE wrapper had no userspace "no waiters" fast-path, so every `work_queue_cond.signal()` and claim-shard `cond.signal()` was a kernel32 call. `FutexImpl.wake()` checks `wakeable == 0` first; the underlying `Futex` already maps to `RtlWaitOnAddress` on Win8+.
- `tools/mimalloc_new_delete_override.cpp` + windows-gnu CI splice: LLVM's `operator new` still hit the CRT heap. The override TU compiles mimalloc's unity `static.c` and provides the C++ replaceable global operators. It mirrors the existing linux-musl `MI_MALLOC_OVERRIDE` step, which uses POSIX symbol interposition for malloc/free; that doesn't statically link on Windows, but operator new replacement does, and it is where LLVM's hot allocations go.
Verified
- `bun run build` (Debug) and `bun run build:release` both link 24 COFF shards via lld-link and pass smoke test on x86_64-windows-msvc.
- `zig test lib/std/std.zig` passes with the Condition change.