
Request: 20–30 min technical screen + routing (H100 retrieval/memory primitive; reproducible under NDA) #10164

@StanByriukov02

Description


Proposal to improve performance

Hi NVIDIA team,

I’m looking to be routed to the right engineering owner for a short 20–30 minute technical screen.

Public-safe evidence from our H100 runs:

  • Explicit N×N fp16 materialization becomes infeasible at large N: we measure a CUDA OOM at N = 500,000, where the requested allocation is roughly 466 GiB (500,000² × 2 bytes).
  • An indexed, O(N)-memory retrieval path keeps operating at the same N, because the N×N matrix is never constructed.
  • Memoizing repeated queries yields a large hot-path speedup (measured example: 863×); an illustrative sketch of these three points follows this list.
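
To keep this public-safe, here is a minimal, illustrative PyTorch sketch (not the NDA’d implementation; all names, shapes, and chunk sizes are placeholder assumptions) of the contrast the bullets describe: the arithmetic behind the N×N fp16 allocation, a chunked top-k retrieval that never materializes the full matrix, and a simple memoization cache for repeated queries.

```python
# Illustrative sketch only (not the NDA'd code). Names, shapes, and chunk sizes
# are placeholder assumptions chosen so the script runs on any machine; the
# production primitive and its measured 863x figure are not reproduced by this toy.
import torch


def full_matrix_gib(n: int, bytes_per_elem: int = 2) -> float:
    """GiB needed to materialize an n x n matrix (fp16 = 2 bytes/element)."""
    return n * n * bytes_per_elem / 2**30


# At N = 500,000 the explicit fp16 matrix needs ~466 GiB, which is why the
# allocation fails with CUDA OOM on a single device.
print(f"N=500,000 full fp16 matrix: {full_matrix_gib(500_000):.0f} GiB")


def topk_retrieve(query: torch.Tensor, keys: torch.Tensor,
                  k: int = 10, chunk: int = 65_536) -> torch.Tensor:
    """Indices of the k rows of `keys` most similar to `query` (dot product),
    scanned in fixed-size chunks so peak extra memory is O(chunk), never O(N^2)."""
    best_scores = torch.full((k,), float("-inf"), device=keys.device)
    best_idx = torch.zeros(k, dtype=torch.long, device=keys.device)
    for start in range(0, keys.shape[0], chunk):
        block = keys[start:start + chunk]                      # (chunk, d)
        scores = (block @ query).float()                       # (chunk,)
        idx = torch.arange(start, start + block.shape[0], device=keys.device)
        merged_scores = torch.cat([best_scores, scores])
        merged_idx = torch.cat([best_idx, idx])
        top = torch.topk(merged_scores, k)
        best_scores, best_idx = top.values, merged_idx[top.indices]
    return best_idx


# Memoization: repeated queries are served from a cache instead of re-scanning.
_cache: dict = {}


def cached_retrieve(query: torch.Tensor, keys: torch.Tensor, k: int = 10) -> torch.Tensor:
    key = (k, tuple(query.flatten().tolist()))
    if key not in _cache:
        _cache[key] = topk_retrieve(query, keys, k)
    return _cache[key]


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    n, d = 100_000, 128                                        # small demo sizes
    keys = torch.randn(n, d, dtype=dtype, device=device)
    q = torch.randn(d, dtype=dtype, device=device)
    first = cached_retrieve(q, keys)       # cold path: chunked O(N) scan
    second = cached_retrieve(q, keys)      # hot path: cache hit, no scan
    assert torch.equal(first, second)
```

The point of the sketch is that the O(N) path never allocates more than one (chunk × d) block plus a length-k running top-k, so the same code shape survives at N = 500,000 where the full matrix cannot be allocated.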

Under NDA, we can provide a reproducible runbook and evidence bundle (logs and scripts) for a controlled review (no repository handover).

Could you route this to the appropriate CUDA/perf and Triton/TensorRT-LLM owner for a 20–30 minute technical screen this week?

Thanks,

Stanislav Byriukov

@NVIDIA/trt-llm-triton-backend-devs, @NVIDIA/trt-llm-qa-perf, @QiJune

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata


Labels

  • General perf<NV>: broad performance issues not specific to a particular component
  • Performance: TRTLLM model inference speed, throughput, efficiency; latency, benchmarks, regressions, optimizations
