UPSTREAM PR #17639: ggml-cuda: reorder only relevant nodes #385

Open: loci-dev wants to merge 1 commit into main from upstream-PR17639-branch_am17an-cuda_graph_opt_cpu_moe

loci-dev commented Dec 1, 2025

Mirrored from ggml-org/llama.cpp#17639

GGML_CUDA_GRAPH_OPT=1 is broken with any tensor offloading option such as n-cpu-moe, because we just copy the graph back to enable fusion within streams. This PR re-orders only the nodes within streams.

Also, because we don't use CUDA graphs for hybrid inference at the moment, GGML_CUDA_GRAPH_OPT=1 is slower than leaving it disabled. This may change in the future.
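
To make the idea of region-scoped reordering concrete, here is a minimal toy sketch. It is not the PR's code and does not use ggml types: `Node`, `Region`, `reorder_within_regions`, and `node_ids` are hypothetical names, and grouping nodes by stream with a stable sort is an assumed stand-in for whatever ordering the backend actually applies. The only point it illustrates is that nodes outside a concurrent region are left exactly where they were.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph node (not a ggml type).
struct Node {
    int id;      // position in the original graph order
    int stream;  // stream the node is assigned to inside a region
};

// Hypothetical half-open index range [begin, end) marking one concurrent region.
struct Region {
    std::size_t begin;
    std::size_t end;
};

// Reorder nodes only inside the given regions, grouping them by stream.
// std::stable_sort keeps the original relative order within each stream,
// and nodes outside every region are never moved.
inline void reorder_within_regions(std::vector<Node> & nodes,
                                   const std::vector<Region> & regions) {
    for (const Region & r : regions) {
        if (r.begin >= r.end || r.end > nodes.size()) {
            continue;  // ignore empty or out-of-bounds regions
        }
        const auto first = nodes.begin() + static_cast<std::ptrdiff_t>(r.begin);
        const auto last  = nodes.begin() + static_cast<std::ptrdiff_t>(r.end);
        std::stable_sort(first, last, [](const Node & a, const Node & b) {
            return a.stream < b.stream;
        });
    }
}

// Helper for inspecting the resulting order.
inline std::vector<int> node_ids(const std::vector<Node> & nodes) {
    std::vector<int> out;
    out.reserve(nodes.size());
    for (const Node & n : nodes) {
        out.push_back(n.id);
    }
    return out;
}
```

With six nodes and a single region covering indices 1 through 4, only those four nodes can change position; node 0 and node 5 keep their slots, which is the property that matters when some of the surrounding nodes belong to another backend.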

loci-review Bot commented Dec 1, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #385

Overview

This PR modifies CUDA graph optimization logic across 2 files in the GGML backend. Analysis shows 0.0% performance change across all measured functions and binaries. The code changes implement per-concurrent-region node reordering instead of global graph reordering to fix broken hybrid CPU-GPU inference when GGML_CUDA_GRAPH_OPT=1 is enabled.

Measured Performance Impact:

  • llama_decode: 733,095 ns (base: 733,093 ns, +2 ns, 0.0%)
  • llama_tokenize: 393,964 ns (base: 393,963 ns, +1 ns, 0.0%)
  • llama_encode: 293,279 ns (base: 293,279 ns, 0 ns, 0.0%)
  • ggml_backend_graph_compute: 128 ns (base: 128 ns, 0 ns, 0.0%)

Power Consumption:

  • build.bin.libllama.so: 193,066 nJ (base: 193,066 nJ, +0.03 nJ, 0.0%)
  • build.bin.libggml-cpu.so: 116,811 nJ (base: 116,811 nJ, 0.0 nJ, 0.0%)
  • build.bin.libggml-base.so: 59,158 nJ (base: 59,158 nJ, 0.0 nJ, 0.0%)
  • All other binaries: 0.0% change

Tokens Per Second Impact: None. The inference functions llama_decode, llama_encode, and llama_tokenize show no measurable change in response time. The +2 ns change in llama_decode is negligible compared to the 2 ms threshold that would cause a 7% tokens-per-second degradation on the reference system.
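
The rounded 0.0% figures can be checked with a short calculation. The sketch below is an illustration only: `relative_change_percent` and `fraction_of_threshold` are hypothetical helper names, and the inputs are the llama_decode timings and the 2 ms threshold quoted above.

```cpp
#include <cassert>
#include <cmath>

// Relative change of a new measurement against its base, in percent.
inline double relative_change_percent(double base_ns, double new_ns) {
    return (new_ns - base_ns) / base_ns * 100.0;
}

// How large a delta is as a fraction of a given threshold (same units).
inline double fraction_of_threshold(double delta_ns, double threshold_ns) {
    return delta_ns / threshold_ns;
}
```

For llama_decode, `relative_change_percent(733093.0, 733095.0)` is roughly 0.00027%, which rounds to the reported 0.0%, and `fraction_of_threshold(2.0, 2.0e6)` shows the +2 ns delta is about one millionth of the 2 ms threshold.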

The changes affect graph optimization setup code rather than the inference hot path, which explains why no performance impact was measured.

loci-dev force-pushed the main branch 26 times, most recently from 6eae205 to 565a9d5, on December 3, 2025 12:15
loci-dev force-pushed the main branch 30 times, most recently from 105e379 to 1bd5bdc, on December 8, 2025 10:10