Improve pure Go speed in low shard count #306
Conversation
Walkthrough
Adds benchmark profiling flags and a -fast1 option; introduces deterministic RNG prefill for benchmarks; adds generic unsafe/safe byte load/store helpers with updated build tags; implements 8-byte/16-bit table-based Galois multiply/XOR paths and a runtime skip2B switch; and adds an internal generic integer constraint (indexer).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant Bench as benchmark/main
    participant Prof as pprof/trace
    participant RS as reedsolomon
    User->>Bench: run with flags (-cpu.profile, -mem.profile, -trace.profile, -fast1, ...)
    alt cpu.profile set
        Bench->>Prof: StartCPUProfile(file)
        note right of Prof #DDEEFF: deferred StopCPUProfile
    end
    alt trace.profile set
        Bench->>Prof: trace.Start(file)
        note right of Prof #DDEEFF: deferred trace.Stop
    end
    Bench->>RS: getOptions(..., include WithFastOneParityMatrix if -fast1)
    loop benchmark
        Bench->>RS: Encode/Decode shards (prefilled via deterministic RNG)
        RS-->>Bench: results
    end
    alt mem.profile set
        Bench->>Prof: WriteHeapProfile on exit
    end
```
```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant Galois as galMulSlice/galMulSliceXor
    participant Table as mt / mt16
    participant SIMD
    Caller->>Galois: request multiply/XOR
    alt AVX2 or SSSE3 available
        Galois->>SIMD: use SIMD path
        SIMD-->>Galois: write outputs
    else neither SIMD nor skip2B true
        Galois->>Table: use 16-bit table (mt16) for 8-byte blocks
        Table-->>Galois: produce 16-bit results -> store16 writes
    else skip2B == true
        Galois->>Table: use 8-bit scalar table per byte
        Table-->>Galois: write bytes
    end
    Galois-->>Caller: done (tail bytes handled by byte fallback)
```
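For orientation, here is a minimal Go sketch of the pure-Go dispatch the second diagram describes. It is a sketch under assumptions, not the PR's exact code: the SIMD call is a placeholder, getMulTable16 is assumed to return the 64K-entry table for multiplier c, and the 8-bit lookup uses an assumed mulTable name.

```go
// Sketch of the dispatch between SIMD, the 16-bit (mt16) path, and the
// per-byte fallback described in the diagram above.
func galMulSliceSketch(c byte, in, out []byte, o *options) {
	if o.useAVX2 || o.useSSSE3 {
		galMulSliceSIMD(c, in, out, o) // placeholder for the assembly kernels
		return
	}
	if !o.skip2B {
		mt16 := getMulTable16(c) // assumed: 65536 uint16 entries for multiplier c
		for len(in) >= 8 {
			store16(out, mt16[load16(in, 0)], 0)
			store16(out, mt16[load16(in, 2)], 2)
			store16(out, mt16[load16(in, 4)], 4)
			store16(out, mt16[load16(in, 6)], 6)
			in, out = in[8:], out[8:]
		}
	}
	mt := mulTable[c][:256] // assumed 8-bit table; handles skip2B and the tail
	for n, input := range in {
		store8(out, mt[input], n)
	}
}
```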
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 1
🧹 Nitpick comments (11)
reedsolomon.go (1)
1692-1695: Generic constraint should be more permissive (+consider uintptr)
Current type set excludes user-defined integer types and uintptr. Since these helpers are used with unsafe.Add (which takes a uintptr), allowing uintptr (and using tilde to accept aliases) improves ergonomics and future-proofs the constraint.
Apply:

```diff
-type indexer interface {
-	int | int8 | int16 | int32 | int64 | uint | uint8 | uint16 | uint32 | uint64
-}
+type indexer interface {
+	~int | ~int8 | ~int16 | ~int32 | ~int64 |
+		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 |
+		~uintptr
+}
```

benchmark/main.go (2)
115-122: Close CPU profile file and check StartCPUProfile error
Minor robustness fix: StartCPUProfile returns an error; also make sure the profile file is closed. Defer order should ensure StopCPUProfile runs before Close.
```diff
-	if cpuprofile != "" {
-		f, err := os.Create(cpuprofile)
-		if err != nil {
-			log.Fatal(err)
-		}
-		pprof.StartCPUProfile(f)
-		defer pprof.StopCPUProfile()
-	}
+	if cpuprofile != "" {
+		f, err := os.Create(cpuprofile)
+		if err != nil {
+			log.Fatal(err)
+		}
+		if err := pprof.StartCPUProfile(f); err != nil {
+			f.Close()
+			log.Fatal(err)
+		}
+		defer pprof.StopCPUProfile()
+		defer f.Close()
+	}
```
457-459: Flag wiring for fast1 is good; confirm behavior/documentation
The fast1 flag correctly appends WithFastOneParityMatrix. Consider printing a note in non-CSV mode when fast1 is active to aid reproducibility of benchmark output.
unsafe.go (2)
17-43: AllocAligned implementation looks solid; minor clarity nit
The alignment math is correct. If you want to make intent clearer, name the misalignment variable "misalign" and document that each shard starts at a 64B boundary due to eachAligned stride.
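For readers without the diff at hand, here is a minimal sketch of 64-byte-aligned shard allocation along these lines; the rounding arithmetic and the eachAligned/misalign names are illustrative, not the PR's exact code.

```go
import "unsafe"

// allocAlignedSketch returns n shards of `each` bytes, each starting on a
// 64-byte boundary, carved out of one over-allocated backing slice.
func allocAlignedSketch(n, each int) [][]byte {
	const align = 64
	eachAligned := ((each + align - 1) / align) * align // round the stride up
	total := make([]byte, eachAligned*n+align)
	// Trim the front so the first shard is 64B-aligned.
	addr := uintptr(unsafe.Pointer(&total[0]))
	if misalign := int(addr & (align - 1)); misalign != 0 {
		total = total[align-misalign:]
	}
	res := make([][]byte, n)
	for i := range res {
		res[i] = total[:each:eachAligned]
		total = total[eachAligned:]
	}
	return res
}
```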
45-50: Unnecessary commented code can be removed
The commented alternatives using binary/unsafe are no longer needed and can be removed to reduce noise.
galois_nopshufb_amd64.go (1)
74-87: XOR variant: same benefits; validate table initialization costs
The XOR path mirrors the non-XOR case well. Given the 32 MB table, consider noting (in code or PR description) the amortization threshold for mt16 creation to guide users on memory/latency trade-offs when assembly isn't available.
galois_noasm.go (4)
26-29: Tail loop: tiny readability nit — direct indexing is simpler here.
If store8/load8 resolve to plain indexing on pure-Go builds, consider direct assignment for clarity in this simple per-byte tail. If the helpers are needed for build-tag uniformity, feel free to ignore.
```diff
-	for n, input := range in {
-		store8(out, mt[input], n)
-	}
+	for n, input := range in {
+		out[n] = mt[input]
+	}
```
38-47: XOR 8-byte mt16 path looks solid; please add an in-place test.
Semantics preserve in-place correctness (in==out) since each lane is read (load16(out, off), load16(in, off)) before being written. To guard against future refactors, add a test that exercises in==out for various lengths (including <8, =8, 8k+tail).
I can provide a table-driven unit test covering in==out with random data and multiple c values (including c=0,1, and random) if you’d like.
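A sketch of such a table-driven test follows; it assumes in-package access to galMulSliceXor(c, in, out, *options) and a scalar galMultiply reference, so the exact names and the zero options value may need adjusting to the package's API.

```go
import (
	"bytes"
	"math/rand"
	"testing"
)

func TestGalMulSliceXorInPlace(t *testing.T) {
	rng := rand.New(rand.NewSource(0))
	lengths := []int{0, 1, 7, 8, 9, 16, 65, 1024 + 5} // <8, =8, and 8k+tail cases
	cs := []byte{0, 1, byte(rng.Intn(256))}
	var o options // zero options: exercises the pure-Go path on noasm builds
	for _, n := range lengths {
		for _, c := range cs {
			in := make([]byte, n)
			rng.Read(in)
			// Reference: byte-by-byte XOR of the product into the original data.
			want := make([]byte, n)
			copy(want, in)
			for i := range want {
				want[i] ^= galMultiply(c, in[i])
			}
			// Exercise in == out by passing the same slice twice.
			buf := make([]byte, n)
			copy(buf, in)
			galMulSliceXor(c, buf, buf, &o)
			if !bytes.Equal(buf, want) {
				t.Errorf("len=%d c=%d: in-place result mismatch", n, c)
			}
		}
	}
}
```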
49-51: Tail XOR: optional simplification for readability.
Same nit as above: if helpers aren't strictly required here, the direct form is a bit easier to read.
```diff
-	for n := range in {
-		store8(out, load8(out, n)^mt[load8(in, n)], n)
-	}
+	for n := range in {
+		out[n] ^= mt[in[n]]
+	}
```
15-24: Fast-path c==0 to avoid table work (pure-Go path).
Two small short-circuit wins:
- galMulSlice: when c==0, the output is all zeros. Zero the slice and return.
- galMulSliceXor: when c==0, out ^= 0 → no-op. Just return.
These save the mt16 fetch and loops on common code paths.
Example patch (outside the shown changed lines):
```go
// At the top of galMulSlice, after the c==1 check.
if c == 0 {
	out = out[:len(in)]
	for i := range out {
		out[i] = 0
	}
	return
}

// At the top of galMulSliceXor, after the c==1 check.
if c == 0 {
	// out ^= 0 → no-op
	return
}
```

Also applies to: 38-47
galois_amd64.go (1)
139-149: XOR non-SIMD mt16 fallback mirrors noasm path correctly; consider dedup.
Logic matches the noasm variant and remains safe for in-place use. To reduce duplication across files, consider factoring the 8-byte mt16 loops into a small shared helper (guarded by build tags) that both amd64 and noasm variants can call. This keeps behavior in one place.
Example sketch:
```go
//go:build !pshufb || (amd64 && !avx2_ssse3_runtime)

func galMulSliceGo(c byte, in, out []byte) { /* mt16 loop + tail */ }

// ... in both files ...
if !o.useAVX2 && !o.useSSSE3 {
	galMulSliceGo(c, in, out) // then tail or return depending on helper contract
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
benchmark/main.go (5 hunks)
galois.go (2 hunks)
galois_amd64.go (2 hunks)
galois_noasm.go (2 hunks)
galois_nopshufb_amd64.go (2 hunks)
reedsolomon.go (1 hunks)
unsafe.go (3 hunks)
unsafe_disabled.go (3 hunks)
🔥 Files not summarized due to errors (1)
- galois.go: Error: Server error: no LLM provider could handle the message
👮 Files not reviewed due to content moderation or server errors (1)
- galois.go
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/main.go (1)
options.go (1)
WithFastOneParityMatrix(258-262)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: build (1.25.x, ubuntu-latest)
- GitHub Check: build (1.24.x, windows-latest)
- GitHub Check: build (1.25.x, windows-latest)
- GitHub Check: build (1.25.x, macos-latest)
- GitHub Check: build (1.24.x, macos-latest)
- GitHub Check: build (1.23.x, ubuntu-latest)
- GitHub Check: build (1.24.x, ubuntu-latest)
- GitHub Check: build (1.23.x, macos-latest)
- GitHub Check: build (1.23.x, windows-latest)
- GitHub Check: build-special
🔇 Additional comments (4)
unsafe_disabled.go (1)
11-14: Fallback helpers look correct and endian-consistent
Little-endian loads/stores and direct 8-bit access are straightforward and safe for all architectures. Signatures align with the new indexer constraint.
Also applies to: 29-55
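For context, the portable (non-unsafe) versions of these helpers can be written with plain indexing; the sketch below shows that shape (little-endian 16-bit load/store plus byte access) using the indexer constraint discussed earlier, and is illustrative rather than the file's exact code.

```go
// Little-endian 16-bit load/store and single-byte access, generic over the
// integer type used for the offset (the `indexer` constraint from reedsolomon.go).
func load16[I indexer](b []byte, idx I) uint16 {
	return uint16(b[idx]) | uint16(b[idx+1])<<8
}

func store16[I indexer](b []byte, v uint16, idx I) {
	b[idx] = byte(v)
	b[idx+1] = byte(v >> 8)
}

func load8[I indexer](b []byte, idx I) byte { return b[idx] }

func store8[I indexer](b []byte, v byte, idx I) { b[idx] = v }
```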
galois_nopshufb_amd64.go (1)
51-65: 8-byte, 16-bit table path is a nice win
Switching to getMulTable16 + 8-byte stride should materially reduce loop overhead on the pure Go path. Tail handling remains correct. LGTM.
galois_noasm.go (1)
15-24: 8-byte mt16 path is correct and alias-safe (in-place ok).
The 4× store16 with prior load16 per 2-byte lane ensures we always read before we overwrite the same lane, so in==out in-place operation remains correct. Chunking by 8 bytes with in/out advancement is clean and bounds-safe.
galois_amd64.go (1)
84-95: Non-SIMD fallback using mt16 is correct; race instrumentation not needed.
Good addition. The pure-Go loop is only used when AVX2/SSSE3 aren't active, and being Go code, it's already race-instrumented by the toolchain (unlike the asm paths), so the absence of raceReadSlice/raceWriteSlice here is appropriate. The in/out advancement and 8-byte chunking mirror the noasm implementation.
Add a 32MB lookup table that allows looking up 2 values at once. Will only be used when no assembly functions are available. Benchmarks: (with no assembly) ``` benchmark old ns/op new ns/op delta BenchmarkGalois128K-32 24802 15883 -35.96% BenchmarkGalois1M-32 202043 123711 -38.77% BenchmarkGaloisXor128K-32 29152 18226 -37.48% BenchmarkGaloisXor1M-32 230820 148170 -35.81% BenchmarkEncode2x1x1M-32 25016 25261 +0.98% BenchmarkEncode800x200/64-32 127666 103062 -19.27% BenchmarkEncode800x200/256-32 470906 354406 -24.74% BenchmarkEncode800x200/1024-32 1835598 1377017 -24.98% BenchmarkEncode800x200/4096-32 7423581 5238705 -29.43% BenchmarkEncode800x200/16384-32 30795690 21454123 -30.33% BenchmarkEncode800x200/65536-32 120715900 96913158 -19.72% BenchmarkEncode800x200/262144-32 490054033 368235100 -24.86% BenchmarkEncode800x200/1048576-32 2011852300 1549867400 -22.96% BenchmarkEncode1K/4+4/cauchy-32 4894 2764 -43.52% BenchmarkEncode1K/8+8/cauchy-32 17621 11426 -35.16% BenchmarkEncode1K/16+16/cauchy-32 66220 58349 -11.89% BenchmarkEncode1K/32+32/cauchy-32 257469 239701 -6.90% BenchmarkEncode1K/64+64/cauchy-32 1041104 967883 -7.03% BenchmarkEncode1K/128+128/cauchy-32 4065196 4178904 +2.80% BenchmarkDecode1K/4+4/cauchy-32 5928 3833 -35.34% BenchmarkDecode1K/4+4/cauchy-inv-32 5853 3487 -40.42% BenchmarkDecode1K/4+4/cauchy-single-32 1790 1193 -33.35% BenchmarkDecode1K/4+4/cauchy-single-inv-32 1448 904 -37.58% BenchmarkDecode1K/8+8/cauchy-32 20207 13253 -34.41% BenchmarkDecode1K/8+8/cauchy-inv-32 19062 11996 -37.07% BenchmarkDecode1K/8+8/cauchy-single-32 3579 2252 -37.08% BenchmarkDecode1K/8+8/cauchy-single-inv-32 2609 1605 -38.48% BenchmarkDecode1K/16+16/cauchy-32 119255 70093 -41.22% BenchmarkDecode1K/16+16/cauchy-inv-32 92758 59753 -35.58% BenchmarkDecode1K/16+16/cauchy-single-32 7827 5863 -25.09% BenchmarkDecode1K/16+16/cauchy-single-inv-32 6115 3981 -34.90% BenchmarkDecode1K/32+32/cauchy-32 320032 294357 -8.02% BenchmarkDecode1K/32+32/cauchy-inv-32 277815 239258 -13.88% BenchmarkDecode1K/32+32/cauchy-single-32 15596 13458 -13.71% BenchmarkDecode1K/32+32/cauchy-single-inv-32 9237 7938 -14.06% BenchmarkDecode1K/64+64/cauchy-32 1401801 1369309 -2.32% BenchmarkDecode1K/64+64/cauchy-inv-32 1067298 955657 -10.46% BenchmarkDecode1K/64+64/cauchy-single-32 41705 35822 -14.11% BenchmarkDecode1K/64+64/cauchy-single-inv-32 19323 15517 -19.70% BenchmarkDecode1K/128+128/cauchy-32 7225691 7160856 -0.90% BenchmarkDecode1K/128+128/cauchy-inv-32 4159486 3818808 -8.19% BenchmarkDecode1K/128+128/cauchy-single-32 124440 109424 -12.07% BenchmarkDecode1K/128+128/cauchy-single-inv-32 36524 31136 -14.75% BenchmarkEncode10x2x10000-32 50440 42083 -16.57% BenchmarkEncode100x20x10000-32 1801528 1106006 -38.61% BenchmarkEncode17x3x1M-32 1288167 1001932 -22.22% BenchmarkEncode10x4x16M-32 13472620 10327386 -23.35% BenchmarkEncode5x2x1M-32 310911 185120 -40.46% BenchmarkEncode10x2x1M-32 589206 410960 -30.25% BenchmarkEncode10x4x1M-32 1015838 866408 -14.71% BenchmarkEncode50x20x1M-32 22557354 19183256 -14.96% BenchmarkEncode17x3x16M-32 16104608 13499597 -16.18% BenchmarkEncode_8x4x8M-32 5519244 3932924 -28.74% BenchmarkEncode_12x4x12M-32 12346553 9079752 -26.46% BenchmarkEncode_16x4x16M-32 21485559 17850644 -16.92% BenchmarkEncode_16x4x32M-32 42595970 35871009 -15.79% BenchmarkEncode_16x4x64M-32 82783715 69571462 -15.96% BenchmarkEncode_8x5x8M-32 7074967 5185001 -26.71% BenchmarkEncode_8x6x8M-32 8509701 6225369 -26.84% BenchmarkEncode_8x7x8M-32 9515548 7190973 -24.43% BenchmarkEncode_8x9x8M-32 12594933 10125813 -19.60% 
BenchmarkEncode_8x10x8M-32 13865411 11618479 -16.21% BenchmarkEncode_8x11x8M-32 14907551 13268491 -10.99% BenchmarkEncode_8x8x05M-32 969947 684114 -29.47% BenchmarkEncode_8x8x1M-32 1626195 1284539 -21.01% BenchmarkEncode_8x8x8M-32 11010901 8424452 -23.49% BenchmarkEncode_8x8x32M-32 42695493 32035889 -24.97% BenchmarkEncode_24x8x24M-32 96542217 79041486 -18.13% BenchmarkEncode_24x8x48M-32 188859267 155747871 -17.53% BenchmarkVerify800x200/64-32 135567 109423 -19.28% BenchmarkVerify800x200/256-32 481634 347095 -27.93% BenchmarkVerify800x200/1024-32 1869884 1322751 -29.26% BenchmarkVerify800x200/4096-32 7491368 5123890 -31.60% BenchmarkVerify800x200/16384-32 29535987 20752858 -29.74% BenchmarkVerify800x200/65536-32 125985833 85297623 -32.30% BenchmarkVerify800x200/262144-32 508379267 355721767 -30.03% BenchmarkVerify800x200/1048576-32 2068388300 1449036800 -29.94% BenchmarkVerify10x2x10000-32 53561 42944 -19.82% BenchmarkVerify50x5x100000-32 907225 783960 -13.59% BenchmarkVerify10x2x1M-32 640749 536002 -16.35% BenchmarkVerify5x2x1M-32 396362 304214 -23.25% BenchmarkVerify10x4x1M-32 1212746 1091746 -9.98% BenchmarkVerify50x20x1M-32 24401410 20788040 -14.81% BenchmarkVerify10x4x16M-32 17572573 14192901 -19.23% BenchmarkReconstruct10x2x10000-32 35819 40261 +12.40% BenchmarkReconstruct800x200/64-32 910172 847324 -6.91% BenchmarkReconstruct800x200/256-32 1950439 1606723 -17.62% BenchmarkReconstruct800x200/1024-32 5781554 4567305 -21.00% BenchmarkReconstruct800x200/4096-32 21559582 16228687 -24.73% BenchmarkReconstruct800x200/16384-32 84484767 61475983 -27.23% BenchmarkReconstruct800x200/65536-32 352869233 257830060 -26.93% BenchmarkReconstruct800x200/262144-32 1505845200 1137826500 -24.44% BenchmarkReconstruct800x200/1048576-32 6090028900 4613878500 -24.24% BenchmarkReconstruct50x5x50000-32 689686 465919 -32.44% BenchmarkReconstruct10x2x1M-32 514134 376113 -26.85% BenchmarkReconstruct5x2x1M-32 309217 163760 -47.04% BenchmarkReconstruct10x4x1M-32 751008 634086 -15.57% BenchmarkReconstruct50x20x1M-32 11040695 8815629 -20.15% BenchmarkReconstruct10x4x16M-32 8192061 7168362 -12.50% BenchmarkReconstructData10x2x10000-32 37946 38289 +0.90% BenchmarkReconstructData800x200/64-32 981698 846979 -13.72% BenchmarkReconstructData800x200/256-32 1962800 1593880 -18.80% BenchmarkReconstructData800x200/1024-32 5818608 4503410 -22.60% BenchmarkReconstructData800x200/4096-32 22954233 15814243 -31.11% BenchmarkReconstructData800x200/16384-32 88836892 63010456 -29.07% BenchmarkReconstructData800x200/65536-32 384783267 246166020 -36.02% BenchmarkReconstructData800x200/262144-32 1265891200 1127835600 -10.91% BenchmarkReconstructData800x200/1048576-32 5463190100 4630147200 -15.25% BenchmarkReconstructData50x5x50000-32 667775 465496 -30.29% BenchmarkReconstructData10x2x1M-32 513945 369514 -28.10% BenchmarkReconstructData5x2x1M-32 282178 151657 -46.25% BenchmarkReconstructData10x4x1M-32 724144 593335 -18.06% BenchmarkReconstructData50x20x1M-32 11100261 9621467 -13.32% BenchmarkReconstructData10x4x16M-32 8161166 6423180 -21.30% BenchmarkReconstructP10x2x10000-32 2975 2759 -7.26% BenchmarkReconstructP10x5x20000-32 10975 11836 +7.85% BenchmarkSplit10x4x160M-32 2171627 2003794 -7.73% BenchmarkSplit5x2x5M-32 68340 62255 -8.90% BenchmarkSplit10x2x1M-32 11724 10878 -7.22% BenchmarkSplit10x4x10M-32 127757 126308 -1.13% BenchmarkSplit50x20x50M-32 631860 606853 -3.96% BenchmarkSplit17x3x272M-32 1603306 1504980 -6.13% BenchmarkParallel_8x8x64K-32 81963 62322 -23.96% BenchmarkParallel_8x8x05M-32 657356 493039 -25.00% 
BenchmarkParallel_20x10x05M-32 2035771 1701756 -16.41% BenchmarkParallel_8x8x1M-32 1312086 989650 -24.57% BenchmarkParallel_8x8x8M-32 12395383 8855818 -28.56% BenchmarkParallel_8x8x32M-32 145119543 46900712 -67.68% BenchmarkParallel_8x3x1M-32 488483 346840 -29.00% BenchmarkParallel_8x4x1M-32 652976 466834 -28.51% BenchmarkParallel_8x5x1M-32 847350 599749 -29.22% BenchmarkStreamEncode10x2x10000-32 53168 44803 -15.73% BenchmarkStreamEncode100x20x10000-32 1967911 1206515 -38.69% BenchmarkStreamEncode17x3x1M-32 1825471 1604931 -12.08% BenchmarkStreamEncode10x4x16M-32 20421472 17869211 -12.50% BenchmarkStreamEncode5x2x1M-32 512446 434971 -15.12% BenchmarkStreamEncode10x2x1M-32 931308 847344 -9.02% BenchmarkStreamEncode10x4x1M-32 1385173 1276586 -7.84% BenchmarkStreamEncode50x20x1M-32 24626658 22235222 -9.71% BenchmarkStreamEncode17x3x16M-32 28870364 25197182 -12.72% BenchmarkStreamVerify10x2x10000-32 59623 46435 -22.12% BenchmarkStreamVerify50x5x50000-32 1349335 1316518 -2.43% BenchmarkStreamVerify10x2x1M-32 1037307 934257 -9.93% BenchmarkStreamVerify5x2x1M-32 649599 551300 -15.13% BenchmarkStreamVerify10x4x1M-32 1694152 1497405 -11.61% BenchmarkStreamVerify50x20x1M-32 26931902 23227298 -13.76% BenchmarkStreamVerify10x4x16M-32 25068808 20562626 -17.98% benchmark old MB/s new MB/s speedup BenchmarkGalois128K-32 5284.82 8252.31 1.56x BenchmarkGalois1M-32 5189.86 8475.99 1.63x BenchmarkGaloisXor128K-32 4496.17 7191.61 1.60x BenchmarkGaloisXor1M-32 4542.83 7076.83 1.56x BenchmarkEncode2x1x1M-32 125747.73 124529.14 0.99x BenchmarkEncode800x200/64-32 501.31 620.99 1.24x BenchmarkEncode800x200/256-32 543.63 722.34 1.33x BenchmarkEncode800x200/1024-32 557.86 743.64 1.33x BenchmarkEncode800x200/4096-32 551.76 781.87 1.42x BenchmarkEncode800x200/16384-32 532.02 763.68 1.44x BenchmarkEncode800x200/65536-32 542.89 676.23 1.25x BenchmarkEncode800x200/262144-32 534.93 711.89 1.33x BenchmarkEncode800x200/1048576-32 521.20 676.56 1.30x BenchmarkEncode1K/4+4/cauchy-32 1673.81 2963.98 1.77x BenchmarkEncode1K/8+8/cauchy-32 929.79 1433.96 1.54x BenchmarkEncode1K/16+16/cauchy-32 494.84 561.59 1.13x BenchmarkEncode1K/32+32/cauchy-32 254.54 273.41 1.07x BenchmarkEncode1K/64+64/cauchy-32 125.90 135.42 1.08x BenchmarkEncode1K/128+128/cauchy-32 64.48 62.73 0.97x BenchmarkDecode1K/4+4/cauchy-32 1381.93 2137.44 1.55x BenchmarkDecode1K/4+4/cauchy-inv-32 1399.72 2349.22 1.68x BenchmarkDecode1K/4+4/cauchy-single-32 4575.70 6869.25 1.50x BenchmarkDecode1K/4+4/cauchy-single-inv-32 5657.40 9062.93 1.60x BenchmarkDecode1K/8+8/cauchy-32 810.82 1236.21 1.52x BenchmarkDecode1K/8+8/cauchy-inv-32 859.53 1365.73 1.59x BenchmarkDecode1K/8+8/cauchy-single-32 4577.48 7275.14 1.59x BenchmarkDecode1K/8+8/cauchy-single-inv-32 6279.80 10205.55 1.63x BenchmarkDecode1K/16+16/cauchy-32 274.77 467.49 1.70x BenchmarkDecode1K/16+16/cauchy-inv-32 353.26 548.39 1.55x BenchmarkDecode1K/16+16/cauchy-single-32 4186.28 5588.65 1.33x BenchmarkDecode1K/16+16/cauchy-single-inv-32 5358.52 8230.70 1.54x BenchmarkDecode1K/32+32/cauchy-32 204.78 222.64 1.09x BenchmarkDecode1K/32+32/cauchy-inv-32 235.90 273.91 1.16x BenchmarkDecode1K/32+32/cauchy-single-32 4202.16 4869.76 1.16x BenchmarkDecode1K/32+32/cauchy-single-inv-32 7095.27 8255.89 1.16x BenchmarkDecode1K/64+64/cauchy-32 93.50 95.72 1.02x BenchmarkDecode1K/64+64/cauchy-inv-32 122.81 137.15 1.12x BenchmarkDecode1K/64+64/cauchy-single-32 3142.82 3658.95 1.16x BenchmarkDecode1K/64+64/cauchy-single-inv-32 6783.14 8446.98 1.25x BenchmarkDecode1K/128+128/cauchy-32 36.28 36.61 1.01x 
BenchmarkDecode1K/128+128/cauchy-inv-32 63.02 68.65 1.09x BenchmarkDecode1K/128+128/cauchy-single-32 2106.58 2395.66 1.14x BenchmarkDecode1K/128+128/cauchy-single-inv-32 7177.38 8419.32 1.17x BenchmarkEncode10x2x10000-32 2379.08 2851.50 1.20x BenchmarkEncode100x20x10000-32 666.10 1084.98 1.63x BenchmarkEncode17x3x1M-32 16280.13 20931.08 1.29x BenchmarkEncode10x4x16M-32 17433.95 22743.51 1.30x BenchmarkEncode5x2x1M-32 23608.18 39650.03 1.68x BenchmarkEncode10x2x1M-32 21355.70 30618.31 1.43x BenchmarkEncode10x4x1M-32 14451.19 16943.58 1.17x BenchmarkEncode50x20x1M-32 3253.94 3826.27 1.18x BenchmarkEncode17x3x16M-32 20835.30 24855.88 1.19x BenchmarkEncode_8x4x8M-32 18238.60 25595.03 1.40x BenchmarkEncode_12x4x12M-32 16306.30 22173.14 1.36x BenchmarkEncode_16x4x16M-32 15617.20 18797.32 1.20x BenchmarkEncode_16x4x32M-32 15754.74 18708.38 1.19x BenchmarkEncode_16x4x64M-32 16213.06 19292.07 1.19x BenchmarkEncode_8x5x8M-32 15413.77 21032.18 1.36x BenchmarkEncode_8x6x8M-32 13800.78 18864.83 1.37x BenchmarkEncode_8x7x8M-32 13223.53 17498.20 1.32x BenchmarkEncode_8x9x8M-32 11322.52 14083.45 1.24x BenchmarkEncode_8x10x8M-32 10890.04 12996.10 1.19x BenchmarkEncode_8x11x8M-32 10691.46 12012.18 1.12x BenchmarkEncode_8x8x05M-32 8648.52 12262.00 1.42x BenchmarkEncode_8x8x1M-32 10316.85 13060.88 1.27x BenchmarkEncode_8x8x8M-32 12189.53 15931.92 1.31x BenchmarkEncode_8x8x32M-32 12574.42 16758.42 1.33x BenchmarkEncode_24x8x24M-32 8341.49 10188.40 1.22x BenchmarkEncode_24x8x48M-32 8528.11 10341.15 1.21x BenchmarkVerify800x200/64-32 472.09 584.89 1.24x BenchmarkVerify800x200/256-32 531.52 737.55 1.39x BenchmarkVerify800x200/1024-32 547.63 774.14 1.41x BenchmarkVerify800x200/4096-32 546.76 799.39 1.46x BenchmarkVerify800x200/16384-32 554.71 789.48 1.42x BenchmarkVerify800x200/65536-32 520.19 768.32 1.48x BenchmarkVerify800x200/262144-32 515.65 736.94 1.43x BenchmarkVerify800x200/1048576-32 506.95 723.64 1.43x BenchmarkVerify10x2x10000-32 2240.42 2794.32 1.25x BenchmarkVerify50x5x100000-32 6062.44 7015.67 1.16x BenchmarkVerify10x2x1M-32 19637.81 23475.50 1.20x BenchmarkVerify5x2x1M-32 18518.52 24127.82 1.30x BenchmarkVerify10x4x1M-32 12104.82 13446.41 1.11x BenchmarkVerify50x20x1M-32 3008.04 3530.89 1.17x BenchmarkVerify10x4x16M-32 13366.34 16549.19 1.24x BenchmarkReconstruct10x2x10000-32 3350.18 2980.57 0.89x BenchmarkReconstruct800x200/64-32 70.32 75.53 1.07x BenchmarkReconstruct800x200/256-32 131.25 159.33 1.21x BenchmarkReconstruct800x200/1024-32 177.12 224.20 1.27x BenchmarkReconstruct800x200/4096-32 189.99 252.39 1.33x BenchmarkReconstruct800x200/16384-32 193.93 266.51 1.37x BenchmarkReconstruct800x200/65536-32 185.72 254.18 1.37x BenchmarkReconstruct800x200/262144-32 174.08 230.39 1.32x BenchmarkReconstruct800x200/1048576-32 172.18 227.27 1.32x BenchmarkReconstruct50x5x50000-32 7974.64 11804.62 1.48x BenchmarkReconstruct10x2x1M-32 24473.99 33455.17 1.37x BenchmarkReconstruct5x2x1M-32 23737.47 44822.02 1.89x BenchmarkReconstruct10x4x1M-32 19547.15 23151.54 1.18x BenchmarkReconstruct50x20x1M-32 6648.16 8326.16 1.25x BenchmarkReconstruct10x4x16M-32 28671.79 32766.35 1.14x BenchmarkReconstructData10x2x10000-32 3162.37 3134.05 0.99x BenchmarkReconstructData800x200/64-32 65.19 75.56 1.16x BenchmarkReconstructData800x200/256-32 130.43 160.61 1.23x BenchmarkReconstructData800x200/1024-32 175.99 227.38 1.29x BenchmarkReconstructData800x200/4096-32 178.44 259.01 1.45x BenchmarkReconstructData800x200/16384-32 184.43 260.02 1.41x BenchmarkReconstructData800x200/65536-32 170.32 266.23 1.56x 
BenchmarkReconstructData800x200/262144-32 207.08 232.43 1.12x BenchmarkReconstructData800x200/1048576-32 191.93 226.47 1.18x BenchmarkReconstructData50x5x50000-32 8236.31 11815.36 1.43x BenchmarkReconstructData10x2x1M-32 24482.98 34052.56 1.39x BenchmarkReconstructData5x2x1M-32 26012.05 48398.94 1.86x BenchmarkReconstructData10x4x1M-32 20272.31 24741.62 1.22x BenchmarkReconstructData50x20x1M-32 6612.49 7628.81 1.15x BenchmarkReconstructData10x4x16M-32 28780.33 36567.72 1.27x BenchmarkReconstructP10x2x10000-32 40334.81 43493.13 1.08x BenchmarkReconstructP10x5x20000-32 27335.36 25347.36 0.93x BenchmarkParallel_8x8x64K-32 12793.36 16825.03 1.32x BenchmarkParallel_8x8x05M-32 12761.14 17014.08 1.33x BenchmarkParallel_20x10x05M-32 7726.13 9242.59 1.20x BenchmarkParallel_8x8x1M-32 12786.67 16952.67 1.33x BenchmarkParallel_8x8x8M-32 10828.04 15155.88 1.40x BenchmarkParallel_8x8x32M-32 3699.51 11446.97 3.09x BenchmarkParallel_8x3x1M-32 23612.54 33255.54 1.41x BenchmarkParallel_8x4x1M-32 19270.09 26953.70 1.40x BenchmarkParallel_8x5x1M-32 16087.19 22728.65 1.41x BenchmarkStreamEncode10x2x10000-32 1880.84 2231.98 1.19x BenchmarkStreamEncode100x20x10000-32 508.15 828.83 1.63x BenchmarkStreamEncode17x3x1M-32 9765.04 11106.89 1.14x BenchmarkStreamEncode10x4x16M-32 8215.48 9388.90 1.14x BenchmarkStreamEncode5x2x1M-32 10231.09 12053.39 1.18x BenchmarkStreamEncode10x2x1M-32 11259.17 12374.86 1.10x BenchmarkStreamEncode10x4x1M-32 7570.00 8213.91 1.09x BenchmarkStreamEncode50x20x1M-32 2128.95 2357.92 1.11x BenchmarkStreamEncode17x3x16M-32 9879.08 11319.23 1.15x BenchmarkStreamVerify10x2x10000-32 1677.21 2153.54 1.28x BenchmarkStreamVerify50x5x50000-32 3705.53 3797.90 1.02x BenchmarkStreamVerify10x2x1M-32 10108.64 11223.64 1.11x BenchmarkStreamVerify5x2x1M-32 8070.95 9510.04 1.18x BenchmarkStreamVerify10x4x1M-32 6189.39 7002.62 1.13x BenchmarkStreamVerify50x20x1M-32 1946.72 2257.21 1.16x BenchmarkStreamVerify10x4x16M-32 6692.47 8159.08 1.22x ```
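To make the table's memory footprint concrete: one 65,536-entry uint16 table per multiplier, for all 256 multipliers, is 256 x 65536 x 2 bytes = 32 MiB, and each lookup multiplies two input bytes at once. A build sketch is below; it assumes a scalar GF(2^8) multiply (galMultiply) and uses an eagerly initialized package variable for brevity, whereas the real code presumably builds the table lazily behind getMulTable16.

```go
// mulTable16[c][x] packs galMultiply(c, low byte of x) and
// galMultiply(c, high byte of x) little-endian, so one lookup covers 2 bytes.
var mulTable16 [256][65536]uint16 // 256 * 65536 * 2 bytes = 32 MiB

func buildMulTable16() {
	for c := 0; c < 256; c++ {
		for x := 0; x < 65536; x++ {
			lo := galMultiply(byte(c), byte(x))    // product for the low byte
			hi := galMultiply(byte(c), byte(x>>8)) // product for the high byte
			mulTable16[c][x] = uint16(lo) | uint16(hi)<<8
		}
	}
}
```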
Force-pushed from 5d7e142 to fd9ea30
This reverts commit fd9ea30.
Actionable comments posted: 0
🧹 Nitpick comments (5)
options.go (1)
28-28: New skip2B flag: confirm default semantics and add a tuning hook (optional).
- Default false enables the 2-byte (16-bit) table path in pure-Go. Please confirm this is intended for all shard sizes unless overridden at runtime.
- Consider adding an Option to toggle this explicitly for benchmarking/regressions without touching internal code.
Proposed Option for explicit control:
```diff
+// WithTwoByteTables enables/disables the 2-byte lookup-table path on pure-Go.
+// Default is enabled (skip2B=false).
+func WithTwoByteTables(enabled bool) Option {
+	return func(o *options) {
+		o.skip2B = !enabled
+	}
+}
```

If you already gate this at runtime based on shard counts (e.g., large K+M), ignore the Option and consider documenting the threshold near the field.
galois_noasm.go (2)
15-26: 8-byte path looks correct; pre-load to improve alias safety and ILP (optional).
The logic is sound and will significantly cut per-byte overhead. For extra robustness when in/out alias in unusual ways and to improve instruction-level parallelism, pre-load the 4 x 16-bit lookups before any stores.
Apply this within the loop:
```diff
-	for len(in) >= 8 {
-		store16(out, mt16[load16(in, 0)], 0)
-		store16(out, mt16[load16(in, 2)], 2)
-		store16(out, mt16[load16(in, 4)], 4)
-		store16(out, mt16[load16(in, 6)], 6)
+	for len(in) >= 8 {
+		v0 := mt16[load16(in, 0)]
+		v1 := mt16[load16(in, 2)]
+		v2 := mt16[load16(in, 4)]
+		v3 := mt16[load16(in, 6)]
+		store16(out, v0, 0)
+		store16(out, v1, 2)
+		store16(out, v2, 4)
+		store16(out, v3, 6)
```

Minor style nit: using an index loop (i += 8) can avoid repeated slice header adjustments, but slicing is fine and clear.
40-51: XOR 8-byte path correct; consider pre-loading to limit read-after-write concerns (optional).
Semantics match the scalar path. As above, pre-loading out/in values before any stores can slightly improve ILP and makes aliasing unquestionably safe.
Suggested tweak inside the loop:
```diff
-	for len(in) >= 8 {
-		store16(out, load16(out, 0)^mt16[load16(in, 0)], 0)
-		store16(out, load16(out, 2)^mt16[load16(in, 2)], 2)
-		store16(out, load16(out, 4)^mt16[load16(in, 4)], 4)
-		store16(out, load16(out, 6)^mt16[load16(in, 6)], 6)
+	for len(in) >= 8 {
+		o0, o1 := load16(out, 0), load16(out, 2)
+		o2, o3 := load16(out, 4), load16(out, 6)
+		v0, v1 := mt16[load16(in, 0)], mt16[load16(in, 2)]
+		v2, v3 := mt16[load16(in, 4)], mt16[load16(in, 6)]
+		store16(out, o0^v0, 0)
+		store16(out, o1^v1, 2)
+		store16(out, o2^v2, 4)
+		store16(out, o3^v3, 6)
```

benchmark/main.go (2)
43-43: New -fast1 flag: nice toggle for 1-parity optimization.
Note: When codec=xor, WithFastOneParityMatrix() is already included; adding -fast1 is redundant but harmless.
115-142: Close CPU profile file to avoid descriptor leak (ordering matters).
You start CPU profiling but never close the file. While StopCPUProfile flushes data, closing the file is good hygiene. Ensure StopCPUProfile runs before Close.
Apply this diff:
```diff
 	if cpuprofile != "" {
 		f, err := os.Create(cpuprofile)
 		if err != nil {
 			log.Fatal(err)
 		}
 		pprof.StartCPUProfile(f)
-		defer pprof.StopCPUProfile()
+		// Ensure Stop runs before closing the file.
+		defer f.Close()
+		defer pprof.StopCPUProfile()
 	}
```

Trace and heap profile blocks already defer Stop/Write before Close correctly. No change needed there.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (9)
benchmark/main.go (5 hunks)
galois.go (2 hunks)
galois_amd64.go (2 hunks)
galois_noasm.go (2 hunks)
galois_nopshufb_amd64.go (2 hunks)
options.go (1 hunks)
reedsolomon.go (2 hunks)
unsafe.go (3 hunks)
unsafe_disabled.go (3 hunks)
🔥 Files not summarized due to errors (1)
- galois.go: Error: Server error: no LLM provider could handle the message
🚧 Files skipped from review as they are similar to previous changes (5)
- reedsolomon.go
- galois_nopshufb_amd64.go
- galois_amd64.go
- unsafe_disabled.go
- unsafe.go
👮 Files not reviewed due to content moderation or server errors (1)
- galois.go
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/main.go (1)
options.go (1)
WithFastOneParityMatrix(259-263)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: build (1.25.x, macos-latest)
- GitHub Check: build (1.24.x, macos-latest)
- GitHub Check: build (1.25.x, ubuntu-latest)
- GitHub Check: build (1.23.x, windows-latest)
- GitHub Check: build (1.24.x, windows-latest)
- GitHub Check: build (1.25.x, windows-latest)
- GitHub Check: build (1.24.x, ubuntu-latest)
- GitHub Check: build (1.23.x, ubuntu-latest)
- GitHub Check: build (1.23.x, macos-latest)
- GitHub Check: build-special
🔇 Additional comments (6)
galois_noasm.go (2)
28-31: Tail write with store8: LGTM.
Replacing the per-byte assignment with store8 maintains semantics and consistency with the unsafe helpers. No concerns.
53-55: Tail XOR store with load8/store8: LGTM.
The change preserves behavior and aligns with the helper API. Looks good.
benchmark/main.go (4)
9-16: Profiling imports: good addition.
The added imports are used correctly below. No issues.
52-52: Grouping profiling vars: LGTM.
Clear and conventional naming.
71-75: Flag plumbing for profiles: LGTM.
Flag names are sensible and discoverable.
457-459: Hooking fast1 into getOptions: LGTM.
Option application is idempotent; works across codecs. No concerns.
Actionable comments posted: 0
🧹 Nitpick comments (5)
benchmark/main.go (5)
9-10: Tidy imports: you can drop io/log after the refactors below
If you switch the prefill to rng.Read and unify error handling to use exitErr, the io and log imports become unused. Safe to remove.
- "io" - "log" "math" "math/rand" "os" "runtime" "runtime/debug" - "runtime/pprof" - "runtime/trace" + "runtime/pprof" + "runtime/trace"Also applies to: 16-17
44-44: Clarify -fast1 flag help to set user expectations
Fast-one-parity only applies when m=1. Suggest making that explicit in the help to avoid confusion.
```diff
-	fast1 = flag.Bool("fast1", false, "Use fast 1 parity encoder (if available)")
+	fast1 = flag.Bool("fast1", false, "Use fast 1 parity encoder (only when m=1; ignored otherwise)")
```
116-143: Profile lifecycle polish: close files, handle Start errors, and GC before heap profile
- CPU: currently the file handle isn’t explicitly closed; close it after Stop to avoid descriptor leaks and ensure flush.
- CPU/Trace: check Start errors and close the file on failure.
- Mem: call runtime.GC() before writing the heap profile so it reflects up-to-date stats.
- Consistency: prefer exitErr over log.Fatal to match the rest of the tool’s error handling.
```diff
 	if cpuprofile != "" {
-		f, err := os.Create(cpuprofile)
-		if err != nil {
-			log.Fatal(err)
-		}
-		pprof.StartCPUProfile(f)
-		defer pprof.StopCPUProfile()
+		f, err := os.Create(cpuprofile)
+		if err != nil {
+			exitErr(err)
+		}
+		if err := pprof.StartCPUProfile(f); err != nil {
+			_ = f.Close()
+			exitErr(err)
+		}
+		defer func() {
+			pprof.StopCPUProfile()
+			_ = f.Close()
+		}()
 	}
 	if memprofile != "" {
-		f, err := os.Create(memprofile)
-		if err != nil {
-			log.Fatal(err)
-		}
-		defer f.Close()
-		defer pprof.WriteHeapProfile(f)
+		f, err := os.Create(memprofile)
+		if err != nil {
+			exitErr(err)
+		}
+		defer func() { _ = f.Close() }()
+		defer func() {
+			// Ensure the profile reflects up-to-date heap stats.
+			runtime.GC()
+			_ = pprof.WriteHeapProfile(f)
+		}()
 	}
 	if traceprofile != "" {
-		f, err := os.Create(traceprofile)
-		if err != nil {
-			log.Fatal(err)
-		}
-		defer f.Close()
-		err = trace.Start(f)
-		if err != nil {
-			log.Fatal(err)
-		}
-		defer trace.Stop()
+		f, err := os.Create(traceprofile)
+		if err != nil {
+			exitErr(err)
+		}
+		if err := trace.Start(f); err != nil {
+			_ = f.Close()
+			exitErr(err)
+		}
+		defer func() {
+			trace.Stop()
+			_ = f.Close()
+		}()
 	}
```
150-154: Simplify prefill: use rng.Read and drop io.ReadFull
math/rand.Rand.Read always fills the buffer and returns a nil error. You can simplify the loop and, if you adopt this, remove the io import.
```diff
-	for j := range data[i] {
-		// We fill with random data, otherwise lookups may not use the full tables
-		_, err := io.ReadFull(rng, data[i][j])
-		exitErr(err)
-	}
+	for j := range data[i] {
+		// We fill with random data, otherwise lookups may not use the full tables
+		_, _ = rng.Read(data[i][j])
+	}
```
464-466: Gate -fast1 to m=1 and warn when ignored
The option is a no-op for m>1 in reedsolomon.New, but making it explicit avoids user confusion. Optional: print a warning when ignored.
```diff
-	if *fast1 {
-		o = append(o, reedsolomon.WithFastOneParityMatrix())
-	}
+	if *fast1 && *mShards == 1 {
+		o = append(o, reedsolomon.WithFastOneParityMatrix())
+	} else if *fast1 && *mShards != 1 {
+		fmt.Fprintln(os.Stderr, "WARN: -fast1 only applies when -m=1; ignoring")
+	}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
benchmark/main.go (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/main.go (4)
reedsolomon.go (1)
New(413-582)
unsafe.go (1)
AllocAligned(19-43)
unsafe_disabled.go (1)
AllocAligned(17-27)
options.go (1)
WithFastOneParityMatrix(259-263)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: build (1.23.x, macos-latest)
- GitHub Check: build (1.25.x, ubuntu-latest)
- GitHub Check: build (1.25.x, windows-latest)
- GitHub Check: build (1.23.x, windows-latest)
- GitHub Check: build (1.24.x, macos-latest)
- GitHub Check: build (1.25.x, macos-latest)
- GitHub Check: build (1.24.x, windows-latest)
- GitHub Check: build (1.24.x, ubuntu-latest)
- GitHub Check: build (1.23.x, ubuntu-latest)
- GitHub Check: build-special
🔇 Additional comments (3)
benchmark/main.go (3)
53-53: LGTM: profile flag variables
Declaring cpuprofile, memprofile, and traceprofile as package-level strings is fine for flag wiring.
73-76: LGTM: profiling flags are wired correctly
Flags parse into the declared vars and are optional. Names are consistent with common conventions (cpu.profile, mem.profile, trace.profile).
147-147: LGTM: deterministic RNG for reproducible benches
Seeding with 0 is a good call for reproducibility across runs and platforms.
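For illustration, a deterministic prefill along the lines discussed here can be as small as the sketch below; the 10+4 shard count and 1 MiB shard size are assumed values, not taken from the benchmark.

```go
rng := rand.New(rand.NewSource(0)) // fixed seed: identical shard data every run
shards := reedsolomon.AllocAligned(10+4, 1<<20)
for _, shard := range shards {
	rng.Read(shard) // math/rand's Read always fills the slice and returns a nil error
}
```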
Add a 32MB lookup table that allows looking up 2 values at once. Will only be used when no assembly functions are available.
Benchmarks: (with no assembly)
Disabled when shard count gets high as speed degenerates.