UPSTREAM PR #17639: ggml-cuda: reorder only relevant nodes by loci-dev · Pull Request #385 · auroralabs-loci/llama.cpp

loci-dev · 2025-12-01T08:43:39Z

GGML_CUDA_GRAPH_OPT=1 is broken with any tensor offloading option, such as n-cpu-moe, because we just copy the graph back to enable fusion within streams. This PR re-orders only the nodes within streams.

Also, because we don't use CUDA graphs for hybrid inference at the moment, GGML_CUDA_GRAPH_OPT=1 is slower than not using it. This may change in the future.
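To illustrate the idea of re-ordering only within streams, here is a minimal sketch. It is not the ggml-cuda implementation: `Node`, `Region`, and `reorder_within_regions` are hypothetical names, and grouping by stream id with a stable sort is an assumed stand-in for whatever ordering the backend actually applies. The point is only that nodes outside a concurrent region keep their original positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph node; not a ggml type.
struct Node {
    int id;      // position in the original graph order
    int stream;  // assigned stream, or -1 when outside any concurrent region
};

// Hypothetical half-open range [begin, end) of nodes that run concurrently.
struct Region {
    std::size_t begin;
    std::size_t end;
};

// Group nodes by stream inside each region, preserving relative order per stream.
// Nodes that lie outside every region are never moved.
std::vector<Node> reorder_within_regions(std::vector<Node> nodes,
                                         const std::vector<Region>& regions) {
    for (const Region& r : regions) {
        assert(r.begin <= r.end && r.end <= nodes.size());
        auto first = nodes.begin() + static_cast<std::ptrdiff_t>(r.begin);
        auto last  = nodes.begin() + static_cast<std::ptrdiff_t>(r.end);
        std::stable_sort(first, last, [](const Node& a, const Node& b) {
            return a.stream < b.stream;
        });
    }
    return nodes;
}
```

Because the permutation is confined to each region, any part of the graph that is split off to another backend (for example, layers kept on the CPU) sees the same node order it would have seen without the optimization.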

loci-review · 2025-12-01T09:26:30Z

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #385

Overview

This PR modifies CUDA graph optimization logic across 2 files in the GGML backend. Analysis shows 0.0% performance change across all measured functions and binaries. The code changes implement per-concurrent-region node reordering instead of global graph reordering to fix broken hybrid CPU-GPU inference when GGML_CUDA_GRAPH_OPT=1 is enabled.

Measured Performance Impact:

llama_decode: 733,095 ns (base: 733,093 ns, +2 ns, 0.0%)
llama_tokenize: 393,964 ns (base: 393,963 ns, +1 ns, 0.0%)
llama_encode: 293,279 ns (base: 293,279 ns, 0 ns, 0.0%)
ggml_backend_graph_compute: 128 ns (base: 128 ns, 0 ns, 0.0%)
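The percentages above follow from simple arithmetic on the reported timings. The sketch below shows that calculation; the function names are hypothetical, and rounding to one decimal place is an assumption based on how the figures are printed.

```cpp
#include <cassert>
#include <cmath>

// Relative change of `current` against `base`, in percent.
double relative_change_percent(double base, double current) {
    assert(base != 0.0);
    return (current - base) / base * 100.0;
}

// Round to one decimal place, the precision used in the figures above.
double round_to_tenth(double value) {
    return std::round(value * 10.0) / 10.0;
}
```

For llama_decode, `relative_change_percent(733093.0, 733095.0)` is roughly 0.0003%, which rounds to the reported 0.0%.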

Power Consumption:

build.bin.libllama.so: 193,066 nJ (base: 193,066 nJ, +0.03 nJ, 0.0%)
build.bin.libggml-cpu.so: 116,811 nJ (base: 116,811 nJ, 0.0 nJ, 0.0%)
build.bin.libggml-base.so: 59,158 nJ (base: 59,158 nJ, 0.0 nJ, 0.0%)
All other binaries: 0.0% change

Tokens Per Second Impact: None. The inference functions llama_decode, llama_encode, and llama_tokenize show no measurable change in response time. The +2 ns change in llama_decode is negligible compared to the 2 ms threshold that would cause a 7% degradation in tokens per second on the reference system.
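To put the 2 ms threshold in perspective, the sketch below uses a deliberately simple model in which throughput is the inverse of per-token time. That model, and both function names, are assumptions made for illustration; the report does not say how its threshold was derived.

```cpp
#include <cassert>

// Fraction of throughput lost when each token takes `added` longer,
// assuming throughput = 1 / per-token time (an assumption, not from the report).
double throughput_loss_fraction(double per_token_time, double added) {
    assert(per_token_time > 0.0 && added >= 0.0);
    return added / (per_token_time + added);
}

// Per-token time implied by a known (added latency, loss fraction) pair
// under the same model. Units follow whatever unit `added` is given in.
double implied_per_token_time(double added, double loss_fraction) {
    assert(loss_fraction > 0.0 && loss_fraction < 1.0);
    return added * (1.0 - loss_fraction) / loss_fraction;
}
```

Under this model, a 7% loss from 2 ms of added latency implies a baseline of about 26.6 ms per token, and an extra 2 ns against that baseline costs less than one millionth of the throughput.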

The changes affect graph optimization setup code, not the inference hot path, which explains the absence of any measured performance impact.

ggml-cuda: reorder only relevant nodes

532ca7f

loci-dev temporarily deployed to PROD__AL_DEMO with GitHub Actions on December 1, 2025 08:43 (Inactive)

loci-dev force-pushed the main branch from 9368c2d to 50d76f4 on December 1, 2025 09:13

loci-dev force-pushed the main branch 26 times, most recently from 6eae205 to 565a9d5 on December 3, 2025 12:15

loci-dev force-pushed the main branch 30 times, most recently from 105e379 to 1bd5bdc on December 8, 2025 10:10

loci-dev wants to merge 1 commit into main from upstream-PR17639-branch_am17an-cuda_graph_opt_cpu_moe
