UPSTREAM PR #17639: ggml-cuda: reorder only relevant nodes #385
Performance Analysis Summary: PR #385

Overview

This PR modifies CUDA graph optimization logic across 2 files in the GGML backend. Analysis shows a 0.0% performance change across all measured functions and binaries. The code changes implement per-concurrent-region node reordering instead of global graph reordering to fix broken hybrid CPU-GPU inference.

Tokens Per Second Impact: None. The changes affect graph optimization setup code, not the inference hot path, which explains the absence of performance impact in the measurements.
Mirrored from ggml-org/llama.cpp#17639
GGML_CUDA_GRAPH_OPT=1 is broken with any tensor offloading option such as n-cpu-moe, because we just copy the graph back to enable fusion within streams. This PR only reorders nodes within streams.

Also, because we don't use CUDA graphs for hybrid inference at the moment, GGML_CUDA_GRAPH_OPT=1 is slower than not using it. This may change in the future.
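
To illustrate the idea of reordering nodes only within streams, here is a minimal C++ sketch. It is not the code from this PR: the `ConcurrentRegion` struct, the `reorder_within_regions` function, and the use of plain integer node ids are all hypothetical stand-ins chosen for illustration. It assumes regions are non-overlapping half-open ranges into the node list, and it shows that nodes outside any region keep their original position.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one concurrent region: a half-open range
// [begin, end) into the node list, plus the order that the nodes in that
// range should take (for example, interleaved across streams).
struct ConcurrentRegion {
    std::size_t      begin;
    std::size_t      end;
    std::vector<int> new_order;  // must be a permutation of nodes[begin..end)
};

// Reorder nodes only inside the given regions. Every node outside a region
// keeps its original position. Regions are assumed not to overlap.
std::vector<int> reorder_within_regions(std::vector<int> nodes,
                                        const std::vector<ConcurrentRegion> & regions) {
    for (const ConcurrentRegion & r : regions) {
        // Sanity checks: the range is valid and new_order only permutes it.
        assert(r.begin <= r.end && r.end <= nodes.size());
        assert(r.new_order.size() == r.end - r.begin);
        assert(std::is_permutation(r.new_order.begin(), r.new_order.end(),
                                   nodes.begin() + static_cast<std::ptrdiff_t>(r.begin)));

        // Overwrite just this slice of the node list.
        for (std::size_t i = 0; i < r.new_order.size(); ++i) {
            nodes[r.begin + i] = r.new_order[i];
        }
    }
    return nodes;
}
```

For example, with nodes `{0, 1, 2, 3, 4, 5, 6}` and a single region covering indices 2 to 5 with new order `{2, 4, 3, 5}`, only the middle slice changes. Nodes 0, 1, and 6, which in a hybrid setup might belong to work outside the concurrent region, stay where they were.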