Open
Labels
General perf — Broad performance issues not specific to a particular component
Performance — TRTLLM model inference speed, throughput, efficiency. Latency, benchmarks, regressions, opts.
Description
Proposal to improve performance
Hi NVIDIA team,
I’m looking to get routed to the right engineering owner for a short 20–30 min technical screen.
Public-safe evidence from our H100 runs:
- Explicit N×N fp16 materialization becomes infeasible at large N (measured CUDA OOM at N=500,000; the attempted allocation is roughly 466 GiB, far beyond a single H100's 80 GiB).
- An indexed O(N) retrieval path continues to operate at the same N (no N^2 matrix construction).
- Memoization on repeated queries yields a large hot-path speedup (example: 863×).
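A quick sanity check of the numbers in these bullets (a minimal sketch, not the actual benchmark code from the evidence bundle; the `index`/`lookup` names are hypothetical stand-ins for the indexed retrieval path):

```python
from functools import lru_cache

def dense_matrix_bytes(n: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to materialize an explicit N x N matrix (fp16 = 2 bytes/elem)."""
    return n * n * dtype_bytes

n = 500_000
gib = dense_matrix_bytes(n) / 2**30
# ~465.7 GiB for N=500,000 -- consistent with the CUDA OOM above.

# Indexed O(N) retrieval: keep rows addressable by key instead of building
# any N^2 structure; memoization then caches repeated queries on the hot path.
index = {i: i * 2 for i in range(n)}  # stand-in for an embedding/row index

@lru_cache(maxsize=None)
def lookup(key: int) -> int:
    return index[key]
```

Memory for the dict itself stays O(N), so this path keeps working at N values where the dense matrix cannot even be allocated.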
Under NDA we can provide a reproducible runbook + evidence bundle (logs/scripts) for a controlled review (no repo handover).
Could you route this to the right CUDA/perf + (Triton/TensorRT-LLM) owner for a 20–30 minute technical screen this week?
Thanks,
Stanislav Byriukov
@NVIDIA/trt-llm-triton-backend-devs , @NVIDIA/trt-llm-qa-perf , @QiJune
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.