@xuanzic xuanzic commented Jun 27, 2025

PR title

[TRTLLM-6104] feat: add request_perf_metrics to triton LLMAPI backend based on PR #5497

Description

Add per-request KV cache, timing, and speculative-decoding metrics to the Triton backend using the LLMAPI PyTorch runtime.
Usage:

1. cp -R triton_backend/all_models/llmapi/ llmapi_repo/
2. python3 triton_backend/scripts/launch_triton_server.py --model_repo=llmapi_repo/
3. curl -X POST localhost:8000/v2/models/tensorrt_llm/generate -d '{"text_input": "Please explain to me what is machine learning? ", "max_tokens":10, "sampling_param_return_perf_metrics":true}' | jq

Response will look like:


{
  "acceptance_rate": "0.0",
  "arrival_time_ns": "76735247746000",
  "first_scheduled_time_ns": "76735248284000",
  "first_token_time_ns": "76735374300000",
  "kv_cache_alloc_new_blocks": "1",
  "kv_cache_alloc_total_blocks": "1",
  "kv_cache_hit_rate": "0.0",
  "kv_cache_missed_block": "1",
  "kv_cache_reused_block": "0",
  "last_token_time_ns": "76736545324000",
  "model_name": "tensorrt_llm",
  "model_version": "1",
  "text_output": "Please explain to me what is machine learning? \n\nMachine learning is a field of computer science that involves the development of algorithms and models that can learn from data without being explicitly programmed. It is a",
  "total_accepted_draft_tokens": "0",
  "total_draft_tokens": "0"
}
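The `*_time_ns` fields are nanosecond timestamps taken from the same clock, so client-side latencies such as queue time, time-to-first-token, and end-to-end latency can be derived by subtraction. A minimal sketch using the values from the response above (the derivation itself is illustrative, not part of the backend):

```python
import json

# Timing fields from the sample /generate response above,
# as nanosecond timestamps returned as strings.
response = json.loads("""
{
  "arrival_time_ns": "76735247746000",
  "first_scheduled_time_ns": "76735248284000",
  "first_token_time_ns": "76735374300000",
  "last_token_time_ns": "76736545324000"
}
""")

# Convert the string-encoded timestamps to integers.
ns = {k: int(v) for k, v in response.items()}

# Derived latencies, in milliseconds.
queue_ms = (ns["first_scheduled_time_ns"] - ns["arrival_time_ns"]) / 1e6
ttft_ms = (ns["first_token_time_ns"] - ns["arrival_time_ns"]) / 1e6
e2e_ms = (ns["last_token_time_ns"] - ns["arrival_time_ns"]) / 1e6

print(f"queue: {queue_ms:.3f} ms, TTFT: {ttft_ms:.3f} ms, e2e: {e2e_ms:.3f} ms")
```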

Test Coverage

Verify the request_perf_metrics status in the SamplingParams.
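The `sampling_param_` prefix on the request field suggests a mapping from request-level keys to SamplingParams keyword arguments. A hypothetical sketch of that prefix stripping (the `extract_sampling_kwargs` helper and its behavior are illustrative assumptions, not the backend's actual code):

```python
SAMPLING_PARAM_PREFIX = "sampling_param_"

def extract_sampling_kwargs(request: dict) -> dict:
    """Collect request fields carrying the sampling_param_ prefix and strip it,
    yielding keyword arguments suitable for SamplingParams (illustrative only)."""
    return {
        key[len(SAMPLING_PARAM_PREFIX):]: value
        for key, value in request.items()
        if key.startswith(SAMPLING_PARAM_PREFIX)
    }

request = {
    "text_input": "Please explain to me what is machine learning? ",
    "max_tokens": 10,
    "sampling_param_return_perf_metrics": True,
}
print(extract_sampling_kwargs(request))
```

Under this assumption, `sampling_param_return_perf_metrics: true` in the request JSON becomes `return_perf_metrics=True` on the SamplingParams, which is what the test asserts.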

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.


@achartier achartier left a comment


lgtm

@achartier

/bot run --stage-list "A30-Triton-[Post-Merge]-2"

@tensorrt-cicd

PR_Github #10376 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #10376 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7672 (Partly Tested) completed with status: 'SUCCESS'

@achartier

/bot run

@tensorrt-cicd

PR_Github #10393 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #10393 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7685 completed with status: 'SUCCESS'

@xuanzic xuanzic force-pushed the triton-llmapi-perfmetrics branch 2 times, most recently from 6273131 to 00d3edb Compare July 1, 2025 02:46
@achartier achartier force-pushed the triton-llmapi-perfmetrics branch from 00d3edb to 349b011 Compare July 1, 2025 02:48
@achartier achartier enabled auto-merge (squash) July 1, 2025 02:50
@achartier

/bot reuse-pipeline

@tensorrt-cicd

PR_Github #10421 [ reuse-pipeline ] triggered by Bot

auto-merge was automatically disabled July 1, 2025 02:58

Head branch was pushed to by a user without write access

@xuanzic xuanzic force-pushed the triton-llmapi-perfmetrics branch from 349b011 to 2a05e71 Compare July 1, 2025 02:58
@tensorrt-cicd

PR_Github #10421 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #10393 for commit 349b011

@achartier achartier force-pushed the triton-llmapi-perfmetrics branch from 2a05e71 to e91dd26 Compare July 1, 2025 03:19
@achartier achartier enabled auto-merge (squash) July 1, 2025 03:19
@achartier

/bot reuse-pipeline

@tensorrt-cicd

PR_Github #10427 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd

PR_Github #10427 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #10393 for commit e91dd26

auto-merge was automatically disabled July 1, 2025 04:08

Head branch was pushed to by a user without write access

@xuanzic xuanzic force-pushed the triton-llmapi-perfmetrics branch from e91dd26 to fab2f90 Compare July 1, 2025 04:08
@achartier

/bot reuse-pipeline

@tensorrt-cicd

PR_Github #10435 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd

PR_Github #10435 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #10393 for commit fab2f90

@achartier achartier merged commit 34212e2 into NVIDIA:main Jul 1, 2025
3 checks passed
Shunkangz pushed a commit to Shunkangz/TensorRT-LLM that referenced this pull request Jul 2, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 9, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025