ggml-cuda: reorder only relevant nodes by am17an · Pull Request #17639 · ggml-org/llama.cpp

am17an · 2025-12-01T08:07:59Z

GGML_CUDA_GRAPH_OPT=1 is broken with any tensor offloading options like n-cpu-moe because we just copy the graph back to enable fusion within streams. This PR only re-orders nodes within streams.

Also, because we don't currently use CUDA graphs for hybrid inference, GGML_CUDA_GRAPH_OPT=1 is slower than not using it. This may change in the future.

wishstudio · 2025-12-01T09:32:08Z

I'm trying to merge master into my PR #16548 and got hit hard by this 😭

Is it possible to avoid changing cgraph in evaluate_and_capture_cuda_graph? The graph plan API generally expects to use const struct ggml_cgraph and does not expect the backend to change cgraph halfway (which is good design IMO).

Do the node changes need to persist after returning from this function? If not, isn't it easier to just create a temporary cgraph to use inside this function and discard it before returning? struct ggml_cgraph is just a few pointers, so duplicating it won't affect performance at all.

It is of course possible to hack over this in my PR, but I feel the current solution is already a bit hacky and I would like to avoid a hack over a hack.
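The temporary-copy idea in the comment above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from either PR: `toy_tensor`, `toy_cgraph`, and `compute_with_scratch_copy` are hypothetical stand-ins rather than the real ggml types or functions, and the reversal is only a placeholder for whatever reordering a backend would actually apply.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the real ggml types.
struct toy_tensor {
    int id;
};

struct toy_cgraph {
    int           n_nodes;
    toy_tensor ** nodes;  // array of node pointers, owned by the caller
};

// Reorders nodes in a scratch copy and returns the visiting order.
// The caller's graph is only read, never written.
static std::vector<int> compute_with_scratch_copy(const toy_cgraph * cgraph) {
    assert(cgraph != nullptr && cgraph->n_nodes >= 0);

    // Copy only the pointer array, not the tensors themselves.
    std::vector<toy_tensor *> scratch(cgraph->nodes, cgraph->nodes + cgraph->n_nodes);

    toy_cgraph local = *cgraph;    // shallow copy of the small struct
    local.nodes = scratch.data();  // point the copy at the scratch storage

    // Placeholder reordering (assumption): a real backend would apply a
    // dependency-respecting order here instead of a plain reversal.
    std::reverse(local.nodes, local.nodes + local.n_nodes);

    std::vector<int> visited;
    for (int i = 0; i < local.n_nodes; ++i) {
        visited.push_back(local.nodes[i]->id);
    }
    return visited;  // scratch is discarded when the function returns
}
```

Calling this on a three-node graph visits the nodes in the reordered sequence while the caller's `nodes` array keeps its original order, which is the property the comment asks for.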

am17an · 2025-12-01T09:35:40Z

@wishstudio for now you should just not use this path in your PR (it is not enabled by default)

am17an · 2025-12-01T10:19:01Z

BTW @wishstudio, do you still face an issue after this PR? I think it solves your particular issue.

wishstudio · 2025-12-01T10:50:36Z

I’m not sure if my graph handling code is semantically correct with the nodes changing halfway, but if you are still working on it, it’s ok for me to skip this codepath rn. Without this PR it won’t even compile because the cgraph itself is being changed. I think this PR should make it work.

JohannesGaessler · 2025-12-01T21:12:59Z

@wishstudio sorry, I was not aware of your efforts or that the recently merged approach would cause problems for you. As I'm sure you're aware, Diego is currently on hiatus. For this reason, pull requests such as yours that have to do with his work will probably not receive good attention from maintainers. As it is, I'm already operating at full capacity, but I intend to also do work having to do with ggml backends. When I do, I'll take a look at the ggml graph plan API and see if I can review your work (definitely no promises though). FWIW, I don't consider the current implementation for concurrent CUDA streams to be something that we should be using long-term, and I agree that ggml backends should not be changing ggml graphs if at all possible.

am17an · 2025-12-02T04:35:49Z

@wishstudio I'll take a look at your PR and try to review it as well

wishstudio · 2025-12-02T07:36:53Z

@am17an After you merged this PR I updated mine and it is working fine now. Thanks!

@JohannesGaessler Thank you for the explanation and I appreciate your stance on code quality. Please take your time! I think that PR also enables more graph based optimizations to be done in all the backends, but it still needs some design work. The annoying part atm is I did some refactoring and the diff all got messed up, causing conflicts every time evaluate_and_capture_cuda_graph is changed.

am17an · 2025-12-03T02:54:48Z

BTW, to be clear so that there's no confusion: the nodes don't change "halfway through". They are captured by the CUDA graph and remain the same until the graph is captured again.

This PR fixed a bug that occurs when the size of the cgraph changes because splits are added for cpu moe. I will add code to disable this when CUDA graphs cannot be used, as it leads to a performance drop anyway.

Reordering or skipping nodes is allowed in graph compute under the correct conditions. The earlier PR plus this PR does not break this paradigm.

wishstudio · 2025-12-07T10:24:02Z

> BTW, to be clear so that there's no confusion: the nodes don't change "halfway through". They are captured by the CUDA graph and remain the same until the graph is captured again.

Yes I do understand that.

> Reordering or skipping nodes is allowed in graph compute under the correct conditions.

This is exactly what I'm arguing. Since you added the graph optimize function, I would think all graph modifications done by a backend should be moved there. My intuition is that an "optimize" operation is obviously allowed to change the thing it optimizes, but a "compute" operation is like calling a function, and in most cases one would not expect it to change the function itself.

Of course in this case we can argue that it's more like a mutable cache inside an immutable function so the changes are invisible to the user. But based on the work on my PR I don't remember seeing any code doing actual graph changes in the CUDA backend (correct me if I'm wrong). The only similar thing is perhaps fusing but technically speaking it is done on the fly and the changes are never written to the graph. So I don't think this is strictly necessary.

Reordering and skipping nodes is analogous to a superscalar CPU converting instructions to uops, optimizing them in a uop buffer, and executing them in any order it likes. I'm perfectly ok with it as long as it uses separate buffers and the original instruction memory is kept intact.
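One way to picture the "separate buffer" analogy is an execution order stored as indices into a graph that compute only sees as const. The sketch below is a hypothetical illustration, not the CUDA backend's implementation: `toy_node`, `toy_graph`, `build_execution_order`, and `compute` are invented names, and grouping by stream ignores data dependencies, which a real scheduler would have to respect.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the ggml API.
struct toy_node {
    std::string name;
    int         stream;  // hypothetical stream the node is assigned to
};

struct toy_graph {
    std::vector<toy_node> nodes;
};

// Builds the execution order in a separate buffer of indices.
// The graph is taken by const reference, so it cannot be modified here.
static std::vector<std::size_t> build_execution_order(const toy_graph & graph) {
    std::vector<std::size_t> order(graph.nodes.size());
    std::iota(order.begin(), order.end(), std::size_t{0});

    // Assumption: group nodes by stream. stable_sort keeps the original
    // relative order of nodes that share a stream.
    std::stable_sort(order.begin(), order.end(),
                     [&graph](std::size_t a, std::size_t b) {
                         return graph.nodes[a].stream < graph.nodes[b].stream;
                     });
    return order;
}

// "Compute" walks the index buffer and leaves the graph untouched.
static std::vector<std::string> compute(const toy_graph & graph) {
    const std::vector<std::size_t> order = build_execution_order(graph);
    assert(order.size() == graph.nodes.size());

    std::vector<std::string> trace;
    for (std::size_t i : order) {
        trace.push_back(graph.nodes[i].name);
    }
    return trace;
}
```

Because the order lives in its own vector, discarding or rebuilding it never affects the graph the caller passed in, and the const signature makes that guarantee visible at the API boundary.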

am17an · 2025-12-07T11:16:09Z

> Since you added the graph optimize function, I would think all graph modifications done by a backend should be moved there.

Yes, I would prefer that as well. If you read the original PR, there is a path forward for getting proper support for these kinds of out-of-order operations, so eventually we would not require this hack.

That being said, fusion currently also modifies the cgraph in a few ways: not the memory buffer, but the ggml_tensors. That is also a bad pattern, which I plan to remove imminently.

ggml-cuda: reorder only relevant nodes

532ca7f

github-actions bot added the "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels Dec 1, 2025

am17an requested a review from JohannesGaessler December 1, 2025 08:08

loci-dev mentioned this pull request Dec 1, 2025

UPSTREAM PR #17639: ggml-cuda: reorder only relevant nodes auroralabs-loci/llama.cpp#385

Open

JohannesGaessler approved these changes Dec 1, 2025

taronaeo mentioned this pull request Dec 1, 2025

release: fix duplicate libs, store symbolic links #17299

Merged

am17an merged commit ed32089 into ggml-org:master Dec 2, 2025
72 of 74 checks passed

am17an deleted the cuda_graph_opt_cpu_moe branch December 2, 2025 04:36

taronaeo mentioned this pull request Dec 2, 2025

ci : skip winget update when not in ggml-org #17465

Merged

gabe-l-hart mentioned this pull request Dec 10, 2025

feat: llama.cpp bump (17f7f4) for SSM performance improvements ollama/ollama#13408

Merged

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Dec 20, 2025

Revert "ggml-cuda: reorder only relevant nodes (ggml-org#17639)"

81194bf

This reverts commit ed32089.

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

ggml-cuda: reorder only relevant nodes (ggml-org#17639)

181d953

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

ggml-cuda: reorder only relevant nodes (#17639)

cda22d7

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

ggml-cuda: reorder only relevant nodes (ggml-org#17639)

f472f2d

ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026

ggml-cuda: reorder only relevant nodes (ggml-org#17639)

35eac1b

my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026

ggml-cuda: reorder only relevant nodes (ggml-org#17639)

5788ce5

my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026

ggml-cuda: reorder only relevant nodes (ggml-org#17639)

f575f84

phibya pushed a commit to ziee-ai/llama.cpp that referenced this pull request May 29, 2026

ggml-cuda: reorder only relevant nodes (ggml-org#17639)

bfc5efb

fewtarius pushed a commit to fewtarius/CachyLLama that referenced this pull request May 30, 2026

ggml-cuda: reorder only relevant nodes (ggml-org#17639)

08cf495
