[None][doc] Update gpt-oss deployment guide to latest release image #7101
Conversation
Walkthrough

Documentation updates: the README tech-blog date and Latest News entries were adjusted; the GPT-OSS deployment tech blog was revised to use release-oriented NGC image tags and pip-wheel guidance, and to update model references.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CI
    participant CLI as trtllm-serve CLI
    participant Serve as trtllm-serve
    participant Backend as MoE Backend (TRITON)
    rect rgb(232, 245, 233)
        Note over User,CLI: Prepare config & model
    end
    User->>CLI: start serve --model openai/gpt-oss-120b --extra_llm_api_options "moe_config: {backend: TRITON}"
    CLI->>Serve: launch with extra_llm_api_options
    Serve->>Backend: initialize TRITON MoE kernels (select backend=TRITON)
    Backend-->>Serve: ready
    Serve-->>User: API ready (serving openai/gpt-oss-120b)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
(Force-pushed from bb9b804 to 842d457)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)
105-110: Remove stray "s-" bullet prefix. There's a leading "s-" before the first bullet in Key takeaways.

```diff
-Key takeaways:
-- `enable_attention_dp` is set to `false` to use TP instead of DP for attention.
-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+Key takeaways:
+- `enable_attention_dp` is set to `false` to use TP instead of DP for attention.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
```
213-223: Line continuation broken by an inline comment after the backslash. The backslash must be the last non-whitespace character on the line. As written, it escapes the following space, not the newline, breaking the multiline command. Move the comment above the command (a comment line placed between continuation lines would also break it):

```diff
-trtllm-serve \
-    openai/gpt-oss-120b \ # Or ${local_model_path}
+# Or use ${local_model_path}
+trtllm-serve \
+    openai/gpt-oss-120b \
```
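The failure mode is easy to demonstrate without `trtllm-serve` itself; a minimal sketch, with `printf` standing in for the real command and illustrative values only:

```shell
# Correct form: each backslash is the very last character on its line, so
# the shell removes the backslash-newline pairs and parses one command.
# Comments belong on their own line *before* the command; a comment placed
# after (or between) continuation lines would comment out the rest of the
# joined command.

# Or use a local model path instead of the HF repo
model="openai/gpt-oss-120b"

cmd=$(printf '%s ' \
  trtllm-serve \
  "$model" \
  --port 8000)
echo "$cmd"
```

With the continuations intact, the three argument lines are joined into a single `trtllm-serve ...` invocation.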
231-241: Same line-continuation issue in the max-throughput serve command.

```diff
-trtllm-serve \
-    openai/gpt-oss-120b \ # Or ${local_model_path}
+# Or use ${local_model_path}
+trtllm-serve \
+    openai/gpt-oss-120b \
```
🧹 Nitpick comments (6)
README.md (1)
21-22: Hyphenate "High-Performance" in the blog title for consistency. Update the Tech Blogs entry to use the compound adjective form.

```diff
-* [08/05] Running a High Performance GPT-OSS-120B Inference Server with TensorRT-LLM
+* [08/05] Running a High-Performance GPT-OSS-120B Inference Server with TensorRT-LLM
```

docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)
22-24: Fix heading spacing and minor grammar.
- Remove the extra space after the hashes (MD019).
- Add "the" in "status of the latest releases."

```diff
-###  NGC Docker Image
+### NGC Docker Image
@@
-Visit the [NGC TensorRT-LLM Release page](...) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](...) to keep track of the support status of latest releases.
+Visit the [NGC TensorRT-LLM Release page](...) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](...) to keep track of the support status of the latest releases.
```
56-59: Polish the section title and phrasing ("pip" lowercase; clearer wording).

```diff
-### TensorRT-LLM PIP Wheel Install
+### TensorRT-LLM pip wheel installation
@@
-Regular releases of TensorRT-LLM are also provided as [pip Python wheels](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
+Regular releases of TensorRT-LLM are also provided as [pip wheels](https://pypi.org/project/tensorrt-llm/#history). You can find installation instructions for pip [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
```
265-266: Add the apostrophe: "NVIDIA's". Keep the request text polished and consistent with the example output.

```diff
-            "content": "What is NVIDIAs advantage for inference?"
+            "content": "What is NVIDIA's advantage for inference?"
```
36-37: Align the release tag in blog9 with the README badge. The blog9_Deploying_GPT_OSS_on_TRTLLM.md example still uses `1.1.0rc0`, but our README badge and NGC release page have moved to `1.1.0rc1`. To avoid confusion and drift, please update or parameterize this tag.

- File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
- Line 36: change to the latest tag or use a variable

Option A, bump to rc1:

```diff
-    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc1 \
```

Option B, parameterize for future-proofing:

```diff
+export TRTLLM_TAG=1.1.0rc1  # update to match latest from NGC
@@
-    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+    nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_TAG} \
```
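Option B can be smoke-tested without pulling anything; a minimal sketch, where `TRTLLM_TAG` is the variable name assumed in the suggestion above (not an established convention) and `echo` stands in for the `docker run` invocation:

```shell
# Build the image reference from a single variable so the guide needs only
# one line updated per release; fall back to 1.1.0rc1 when unset.
TRTLLM_TAG="${TRTLLM_TAG:-1.1.0rc1}"
image="nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_TAG}"
echo "$image"
```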
22-26: Refine documentation formatting and correct typos. Please address the following in docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (and similar occurrences at lines 36, 213-214, 231-232, 265):

- Heading spacing (line 22):

```diff
-###  NGC Docker Image
+### NGC Docker Image
```

- Typographical errors: line 26, "matach" → "match"; line 265, "NVIDIAs" → "NVIDIA's".
- Trailing backslashes with inline comments (these break the shell line continuation): lines 213 and 231; remove the comments after "\" or move them above the command, e.g.:

```diff
-trtllm-serve \ # Or ${local_model_path}
+# Use a local model if preferred:
+trtllm-serve \
```

- Hard-coded image tag (line 36): replace `nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0` with a placeholder or template variable, e.g.:

```diff
-nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
+nvcr.io/nvidia/tensorrt-llm/release:${LATEST_TRTLLM_RELEASE}
```

These updates will improve consistency, readability, and future-proof the guide.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- README.md (2 hunks)
- docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (7 hunks)
🧰 Additional context used
🪛 LanguageTool
README.md
[grammar] ~21-~21: There might be a mistake here.
Context: ...-120B Inference Server with TensorRT-LLM ✨ [➡️ link](./docs/source/blogs/tech_blo...
(QB_NEW_EN)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
[grammar] ~24-~24: There might be a mistake here.
Context: ...es) to keep track of the support status of latest releases. Run the following doc...
(QB_NEW_EN)
[grammar] ~26-~26: There might be a mistake here.
Context: ...orRT-LLM container in interactive mode (change image tag to matach latest release): `...
(QB_NEW_EN)
[grammar] ~26-~26: Ensure spelling is correct
Context: ...n interactive mode (change image tag to matach latest release): ```bash docker run --...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
22-22: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
README.md (1)
46-46: Looks good; links and readability are clear. The new Latest News item reads well and matches the Tech Blogs date. No action needed.
Looks good pending typo fixes
(Force-pushed from 3af8f19 to 4a59592)
Actionable comments posted: 1
🧹 Nitpick comments (11)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (10)

24-24: Minor grammar: add "the".

```diff
-Visit the ... [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of latest releases.
+Visit the ... [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of the latest releases.
```
26-26: Clarify the "latest release" substitution. Since the example may be copy-pasted, explicitly indicate the placeholder for the image tag.

```diff
-Run the following docker command to start the TensorRT-LLM container in interactive mode (change image tag to match latest release):
+Run the following docker command to start the TensorRT-LLM container in interactive mode (replace <latest_tag> with the tag from the NGC release page):
```
33-37: Optional: allow gated HF models by passing the HF token into the container. If the HF repo requires acceptance/auth, users will hit 401/403 without a token. Consider documenting this toggle (the explanatory comment goes outside the command so the line continuation stays intact):

```diff
     -p 8000:8000 \
     -e TRTLLM_ENABLE_PDL=1 \
+    -e HF_TOKEN=$HF_TOKEN \
     -v ~/.cache:/root/.cache:rw \
     nvcr.io/nvidia/tensorrt-llm/release:<latest_tag> \
```

Add a note below the block: "If required, set HF_TOKEN in your shell (e.g., export HF_TOKEN=...) or run huggingface-cli login inside the container."
107-107: Typo: stray "s-" bullet marker.

```diff
-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
```
137-137: Wording nit: "10 times of" → "10 times".

```diff
-`--num_requests` is set to 10 times of `--concurrency` to run enough number of requests.
+`--num_requests` is set to 10 times `--concurrency` to run enough requests.
```
213-214: Bash gotcha: comment after the line continuation. Having `\ # Or ${local_model_path}` on the same line breaks the continuation (the backslash escapes the space, and the comment swallows the rest of the joined command), and causes copy-paste issues. Move the comment above the command:

```diff
-trtllm-serve \
-    openai/gpt-oss-120b \ # Or ${local_model_path}
+# Or use ${local_model_path} instead of the HF repo
+trtllm-serve \
+    openai/gpt-oss-120b \
```

Repeat the same change in the max-throughput command block below.
231-232: Mirror the serve-command comment fix here as well.

```diff
-trtllm-serve \
-    openai/gpt-oss-120b \ # Or ${local_model_path}
+# Or use ${local_model_path} instead of the HF repo
+trtllm-serve \
+    openai/gpt-oss-120b \
```
3-3: Terminology: "open-source" → "open-weights". The README's Latest News uses "open-weights models"; align here for consistency and accuracy.

```diff
-NVIDIA has announced day-0 support for OpenAI's new open-source model series,
+NVIDIA has announced day-0 support for OpenAI's new open-weights model series,
```
170-174: Hopper note for max-throughput: remind users to use the TRITON backend on H200/H100. Earlier you note TRITON is recommended on Hopper; mirror that guidance here to prevent users from copying CUTLASS on H200/H100.

```diff
 Compared to the low-latency configuration, we:
 - set `enable_attention_dp` to `true` to use attention DP which is better for high throughput.
 - set `stream_interval` to 10 to stream results to the client every 10 tokens. At high concurrency, the detokenization overhead of streaming mode cannot be hidden under GPU execution time, so `stream_interval` serves as a workaround to reduce this overhead.
 - set `moe_config.backend` to `CUTLASS` to use the `CUTLASS` MoE kernels which are optimized for high throughput.
+
+> Note (H200/H100): On Hopper GPUs, prefer the `TRITON` MoE backend as described below. If you are on H200/H100, replace `CUTLASS` with `TRITON` here.
```

If this guidance is not universally accurate for current releases, please update or scope it (e.g., "as of 1.1.0rc1").
260-271: Alternative curl payload quoting to allow natural apostrophes. The current example avoids apostrophes to prevent shell-quoting pitfalls. Consider a variant using a double-quoted payload with escaped quotes, so users can include apostrophes naturally. For example:

```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{
    \"model\": \"openai/gpt-oss-120b\",
    \"messages\": [{\"role\": \"user\", \"content\": \"What is NVIDIA's advantage for inference?\"}],
    \"max_tokens\": 1024,
    \"top_p\": 0.9
}" -w "\n"
```

README.md (1)
45-47: Cross-check the announcement wording and versioning with the blog. Latest News uses "open-weights models" while the blog previously said "open-source"; the blog's Docker snippet also pins `release:1.1.0rc0` while the badge is `1.1.0rc1`. Please align terminology and version references across the README and the blog. If rc1 is the intended canonical version for this PR, update the blog snippet per my other comment.
🔇 Additional comments (1)
README.md (1)
21-23: Date update looks good; aligns with the Latest News entry.
Signed-off-by: Farshad Ghodsian <[email protected]>
(Force-pushed from 4a59592 to 845696c)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)
107-107: Stray "s-" before a bullet point. The typo breaks formatting: remove the extra "s-". Apply this diff:

```diff
 - `enable_attention_dp` is set to `false` to use TP instead of DP for attention.
-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
```
24-37: Fix trailing comments on backslashes in shell code blocks. The scan uncovered several inline comments immediately following a backslash, which break shell line continuations. Please remove these comments from the continuation lines (or move them to a separate preceding line).

Flagged instances in docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:
- Line 213: `openai/gpt-oss-120b \ # Or ${local_model_path}`
- Line 221: `--max_batch_size ${max_batch_size} \ # E.g., 1`
- Line 231: `openai/gpt-oss-120b \ # Or ${local_model_path}`
- Line 239: `--max_batch_size ${max_batch_size} \ # E.g., 640`

Similar patterns were also detected in examples/models/core/bert/README.md, examples/models/core/llama/README.md, and docs/source/performance/perf-analysis.md. Please audit all backslash continuations across the repo and ensure no trailing spaces or comments follow the "\" characters.
🧹 Nitpick comments (8)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (8)
36-36: Container tag pin is fine; consider a placeholder to avoid drift (optional). Keeping `release:1.1.0rc0` is correct if that is the latest published image on NGC. To reduce future churn, you could swap in a placeholder like `<latest_tag>` and direct users to NGC above. Apply this optional diff:

```diff
-    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+    nvcr.io/nvidia/tensorrt-llm/release:<latest_tag> \
```

Given our prior learning on published tags, keep the pinned tag if `<latest_tag>` might confuse users. Your call.
265-265: Keep the possessive apostrophe without breaking curl by using a here-doc. "NVIDIAs" is ungrammatical. Prefer "NVIDIA's" and avoid shell-quoting pitfalls by feeding the JSON via a here-doc. Apply this diff to replace the curl example:

```diff
-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-    "model": "openai/gpt-oss-120b",
-    "messages": [
-        {
-            "role": "user",
-            "content": "What is NVIDIAs advantage for inference?"
-        }
-    ],
-    "max_tokens": 1024,
-    "top_p": 0.9
-}' -w "\n"
+curl localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    --data @- -w "\n" <<'JSON'
+{
+    "model": "openai/gpt-oss-120b",
+    "messages": [
+        { "role": "user", "content": "What is NVIDIA's advantage for inference?" }
+    ],
+    "max_tokens": 1024,
+    "top_p": 0.9
+}
+JSON
```
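The here-doc behavior can be checked without a running server; a minimal sketch with `cat` standing in for `curl --data @-` (the quoted `'JSON'` delimiter disables all expansion, so the apostrophe and the inner double quotes pass through untouched):

```shell
# cat reads the here-doc from stdin exactly as curl --data @- would.
payload=$(cat <<'JSON'
{"content": "What is NVIDIA's advantage for inference?"}
JSON
)
echo "$payload"
```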
351-351: Qualify the backend-support statement with version context. Add "as of release 1.1.0rc0" (or similar) so the note ages gracefully if support lands later. Apply this diff:

```diff
-OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for optimal performance. `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still ongoing.
+OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for optimal performance. As of the 1.1.0rc0 release, the `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still ongoing.
```
1-1: Optional: align with the markdownlint list style (asterisks). One list in this file uses dashes; most others use asterisks. Consider standardizing list markers to avoid MD004 warnings.
273-341: The sample response block includes internal reasoning tokens. The example output shows meta markers like `<|channel|>analysis`, which may confuse readers; typical OpenAI-compatible responses don't contain these. Consider trimming to a concise, realistic assistant message.
22-37: Optional: show env substitution for the image tag. To help users update tags easily while still referencing published images, consider an env var:

```diff
+TRTLLM_TAG=1.1.0rc0  # Replace with the latest published tag from NGC
 docker run --rm --ipc=host -it \
@@
-    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+    nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_TAG} \
```
118-135: Terminology: clarify the "max_batch_size" vs "concurrency" sentence. Minor phrasing to improve clarity ("could serve" → "can serve"; "is set to 10 times of" → "is set to 10×").

```diff
-`--max_batch_size` controls the maximum batch size that the inference engine could serve, while `--concurrency` is the number of concurrent requests that the benchmarking client is sending. `--num_requests` is set to 10 times of `--concurrency` to run enough number of requests.
+`--max_batch_size` controls the maximum batch size that the inference engine can serve, while `--concurrency` is the number of concurrent requests that the benchmarking client sends. `--num_requests` is set to 10× `--concurrency` to run a sufficient number of requests.
```
170-174: Parallelism note: small style tweaks. Add articles and consistent code formatting.

```diff
-- set `enable_attention_dp` to `true` to use attention DP which is better for high throughput.
-- set `stream_interval` to 10 to stream results to the client every 10 tokens. At high concurrency, the detokenization overhead of streaming mode cannot be hidden under GPU execution time, so `stream_interval` serves as a workaround to reduce this overhead.
-- set `moe_config.backend` to `CUTLASS` to use the `CUTLASS` MoE kernels which are optimized for high throughput.
+- Set `enable_attention_dp` to `true` to use attention DP, which is better for high throughput.
+- Set `stream_interval` to `10` to stream results to the client every 10 tokens. At high concurrency, the detokenization overhead of streaming mode cannot be hidden under GPU execution time, so `stream_interval` reduces this overhead.
+- Set `moe_config.backend` to `CUTLASS` to use the CUTLASS MoE kernels, which are optimized for high throughput.
```
🧰 Additional context used
🧠 Learnings (2)
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.427Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
- docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
231-231: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
🔇 Additional comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)
141-147: Verification ask: public performance numbers. The stated 420 tps/user and 19.5k–20k tps/gpu are strong claims. Please ensure they reflect the latest published benchmarks for the specified configs, or add "measured internally" with date and context.
Would you like me to scan the repo for benchmark references and align phrasing?
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Farshad Ghodsian <[email protected]>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)

107-107: Typo in bullet ("s-"). This renders oddly in markdown and should be a regular list dash.

```diff
-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
```
211-226: Shell line-continuation bugs and a duplicate command break copy/paste.
- The "Note:" text sits inside the code block mid-command, which breaks the backslash continuation.
- There is a duplicate `trtllm-serve \` line.
- An inline comment after a trailing backslash (`\ # E.g., 1`) is invalid; the backslash must be the last character.

Fix by removing the inline note and the duplicate, and by moving the example comment outside the multiline command.

```diff
 trtllm-serve \
-Note: You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
-
-trtllm-serve \
     openai/gpt-oss-120b \
     --host 0.0.0.0 \
     --port 8000 \
     --backend pytorch \
     --tp_size ${num_gpus} \
     --ep_size 1 \
     --extra_llm_api_options low_latency.yaml \
     --kv_cache_free_gpu_memory_fraction 0.9 \
-    --max_batch_size ${max_batch_size} \ # E.g., 1
+    --max_batch_size ${max_batch_size} \
     --trust_remote_code
```

Add this explanatory note as plain text above the block (outside the command) to retain the guidance without breaking the shell:

```markdown
Note: You can point to a local path containing the model weights instead of the HF repo (for example, `${local_model_path}`).
```

Optionally clarify the example value right below the code block: set `max_batch_size=1` for the low-latency case.
233-245: Repeat of the multiline shell issues in the max-throughput block. The duplicate `trtllm-serve \` and the inline comment after a trailing backslash will break the command. Apply the same fix pattern.

```diff
 trtllm-serve \
-trtllm-serve \
     openai/gpt-oss-120b \
     --host 0.0.0.0 \
     --port 8000 \
     --backend pytorch \
     --tp_size ${num_gpus} \
     --ep_size ${num_gpus} \
     --extra_llm_api_options max_throughput.yaml \
     --kv_cache_free_gpu_memory_fraction 0.9 \
-    --max_batch_size ${max_batch_size} \ # E.g., 640
+    --max_batch_size ${max_batch_size} \
     --trust_remote_code
```

Suggest adding a clarifying sentence below the block instead: set `max_batch_size=640` for the max-throughput case.
♻️ Duplicate comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

24-26: Copyedits: articles and "Docker" capitalization. These small nits improve clarity and follow standard terminology. This was flagged earlier and still applies.

```diff
-Visit the [NGC TensorRT-LLM Release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of latest releases.
-Run the following docker command to start the TensorRT-LLM container in interactive mode (change image tag to match latest release):
+Visit the [NGC TensorRT-LLM Release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of the latest releases.
+Run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match the latest release):
```
56-59: Tighten the section title and wording ("pip wheels"). Use "pip wheels" (or "Python wheels"), not both; also prefer "installation instructions."

```diff
-### TensorRT-LLM PIP Wheel Install
+### TensorRT-LLM pip wheel install
@@
-Regular releases of TensorRT-LLM are also provided as [pip Python wheels](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
+Regular releases of TensorRT-LLM are also provided as [pip wheels](https://pypi.org/project/tensorrt-llm/#history). You can find installation instructions [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
```
🧹 Nitpick comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

22-22: Fix heading spacing (markdownlint MD019). Remove the extra space after the hash marks in the heading.

```diff
-###  NGC Docker Image
+### NGC Docker Image
```
264-275: Use robust quoting for the sample curl payload; reintroduce the proper apostrophe. The apostrophe in "NVIDIA's" was removed to avoid breaking the single-quoted shell string. Prefer a here-document with `--data @-` so the JSON can contain quotes and apostrophes without shell escaping. This is copy/paste safe.

```diff
-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-    "model": "openai/gpt-oss-120b",
-    "messages": [
-        {
-            "role": "user",
-            "content": "What is NVIDIAs advantage for inference?"
-        }
-    ],
-    "max_tokens": 1024,
-    "top_p": 0.9
-}' -w "\n"
+curl localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    --data @- <<'JSON'
+{
+    "model": "openai/gpt-oss-120b",
+    "messages": [
+        { "role": "user", "content": "What is NVIDIA's advantage for inference?" }
+    ],
+    "max_tokens": 1024,
+    "top_p": 0.9
+}
+JSON
```
(MD019, no-multiple-space-atx)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)
`36-36`: Please verify the NGC container release tag is up-to-date.

It looks like docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md is pinned to `release:1.1.0rc0`, but the README badge shows `1.1.0rc1`. Since the badge may list a version ahead of the actual NGC publish, please:

- Confirm on NGC that `nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0` is indeed the latest published container image.
- If `1.1.0rc1` has been published on NGC, update the image tag in blog9_Deploying_GPT_OSS_on_TRTLLM.md (line 36) to `release:1.1.0rc1`.
- Ensure consistency across all docs, e.g., the quick-start guides currently reference `1.0.0rc6`.

Diff suggestion (if updating to `rc1`):

```diff
-    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc1 \
```
Signed-off-by: Farshad Ghodsian <[email protected]>
efeebd2 to c0188ab
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)
`107-107`: Typo in bullet list.

Extraneous “s-” prefix breaks the list formatting.

```diff
-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
```
`211-226`: Broken bash block: stray prose inside command, duplicate command, and invalid line continuation.

- Plain “Note:” line inside a bash block breaks execution.
- Duplicate `trtllm-serve \` lines.
- `\ # E.g., 1` has characters after the backslash; line continuation fails.

Fix by moving the note and the example value into comments above the command (a comment between continued lines would itself end the command) and removing the duplicate:

```diff
-trtllm-serve \
-Note: You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
-
-trtllm-serve \
+# Note: You can also point to a local path containing the model weights instead of the HF repo (e.g., ${local_model_path}).
+# Example: set max_batch_size=1 for low latency.
+trtllm-serve \
   openai/gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size 1 \
   --extra_llm_api_options low_latency.yaml \
   --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \ # E.g., 1
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code
```

`232-245`: Repeat of the bash issues in the max-throughput example.

- Duplicate `trtllm-serve \`.
- Line continuation after `--max_batch_size` has a trailing comment; this breaks the command.

```diff
-trtllm-serve \
-trtllm-serve \
+# Example: set max_batch_size=640 for max throughput.
+trtllm-serve \
   openai/gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size ${num_gpus} \
   --extra_llm_api_options max_throughput.yaml \
   --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \ # E.g., 640
+  --max_batch_size ${max_batch_size} \
   --trust_remote_code
```
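The continuation rule behind this fix is easy to verify: a backslash must be the very last character on the line, so example values belong in a comment above the command. A minimal stand-in (using `echo` instead of the real server command):

```shell
# Write a script using the safe comment placement, then run it.
cat > /tmp/continuation_demo.sh <<'EOF'
# Example: set max_batch_size=1 for low latency (comment on its own line)
echo trtllm-serve \
  --max_batch_size 1 \
  --trust_remote_code
EOF
bash /tmp/continuation_demo.sh
```

The script prints the fully joined command line, confirming that every continuation survived.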
🧹 Nitpick comments (6)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (6)
`22-22`: Fix markdownlint MD019: single space after heading hash.

There are two spaces after the atx heading markers.

```diff
-###  NGC Docker Image
+### NGC Docker Image
```
`26-26`: Grammar: add missing article (“match latest release” → “match the latest release”).

```diff
-Run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match latest release):
+Run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match the latest release):
```
`56-59`: Tighten terminology and heading casing for pip wheels.

Avoid the “Python wheels” vs “pip install” mismatch; standardize on “pip wheels” and use “installation instructions.”

```diff
-### TensorRT-LLM Python Wheel Install
-
-Regular releases of TensorRT-LLM are also provided as [Python wheels](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on the pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
+### TensorRT-LLM pip wheel install
+
+Regular releases of TensorRT-LLM are also provided as [pip wheels](https://pypi.org/project/tensorrt-llm/#history). You can find installation instructions [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
```
`141-141`: Qualify benchmark claims with test context and date.

Performance numbers age quickly. Add the container tag, GPU model, driver/CUDA/TensorRT versions, and the date measured to avoid future confusion.

For example:

- “Measured on nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0, 8×B200 (HBM3e), CUDA 12.6, TensorRT-LLM 1.1.0rc0, August 2025.”
- Link the emitted `low_latency_benchmark.json`/`max_throughput_benchmark.json` files as artifacts, or include the exact command used.

Also applies to: 203-204
`353-365`: Hopper MoE backend statement likely outdated and conflicts with earlier guidance.

Above you recommend `CUTLASS` for max-throughput and `TRTLLM` for low-latency; here it says the `TRTLLM` MoE backend is not supported on Hopper and `CUTLASS` support is still ongoing. These conflict and may no longer reflect current Hopper support.

Consider replacing with release-agnostic guidance and point to the official support matrix, e.g.:

```diff
-OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for optimal performance. `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still ongoing. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe) to install and enable the `TRITON` MoE kernels on Hopper GPUs.
+OpenAI ships Triton kernels optimized for its MoE models. On Hopper (H100/H200), we currently recommend using the `TRITON` MoE backend when targeting GPT-OSS MXFP4 for best out-of-the-box performance. TensorRT-LLM and CUTLASS backends continue to evolve; consult the latest TensorRT-LLM release notes for Hopper MoE support and recommended backends. Follow the instructions in this [guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe) to install and enable the `TRITON` MoE kernels on Hopper GPUs.
```

If helpful, I can fetch and reconcile the exact support status from the latest release notes.
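The backend selection discussed above is supplied through the `--extra_llm_api_options` YAML file. A minimal sketch of such a file, assuming the `moe_config.backend` key used elsewhere in this PR (the file name is hypothetical; verify the key against the current LLM API schema):

```yaml
# Hypothetical triton_moe.yaml, passed via: trtllm-serve ... --extra_llm_api_options triton_moe.yaml
moe_config:
  backend: TRITON
```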
`264-275`: Optional: preserve the correct apostrophe in the sample prompt without breaking quoting.

The current workaround removes the apostrophe (NVIDIAs). Consider a here-doc to keep natural text and valid JSON.

```diff
-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-  "model": "openai/gpt-oss-120b",
-  "messages": [
-    {
-      "role": "user",
-      "content": "What is NVIDIAs advantage for inference?"
-    }
-  ],
-  "max_tokens": 1024,
-  "top_p": 0.9
-}' -w "\n"
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d @- -w "\n" <<'JSON'
+{
+  "model": "openai/gpt-oss-120b",
+  "messages": [
+    {
+      "role": "user",
+      "content": "What is NVIDIA's advantage for inference?"
+    }
+  ],
+  "max_tokens": 1024,
+  "top_p": 0.9
+}
+JSON
```

📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
(7 hunks)🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.427Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.427Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.427Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
[grammar] ~26-~26: There might be a mistake here.
Context: ...teractive mode (change the image tag to match latest release): ```bash docker run --...

(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
22-22: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)
`36-36`: Audit and align NGC container tags across documentation

Please confirm that every hard-coded `nvcr.io/nvidia/tensorrt-llm/release:<tag>` reference corresponds to a published NGC image, and update any unreleased or placeholder tags to their correct, published values. In particular, the following occurrences were found:

- `docker/release.md:21`: `nvcr.io/nvidia/tensorrt-llm/release:x.y.z` (placeholder)
- `docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md:50`: `…/release:1.0.0rc6`
- `docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md:41`: `…/release:1.0.0rc6`
- `docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md:42`: `…/release:1.0.0rc6`
- `docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36`: `…/release:1.1.0rc0`

Action items:

- Verify each `<tag>` above is available on NGC; if not, replace it with the correct, available version.
- Replace the `x.y.z` placeholder in docker/release.md with the actual released tag.
- Ensure consistency in README badges and all other docs to avoid referencing unreleased tags.
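To make this audit mechanical, a grep over the docs tree surfaces every hard-coded tag. This sketch creates a throwaway stand-in so it runs anywhere; in the real repo you would run only the final `grep` from the repository root:

```shell
# Set up a tiny stand-in repo so the command is runnable in isolation.
mkdir -p /tmp/trtllm-docs-audit/docs && cd /tmp/trtllm-docs-audit
echo 'nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \' > docs/blog9.md
# In the actual repo, run this over docs/ and docker/ from the root.
grep -rnoE 'nvcr\.io/nvidia/tensorrt-llm/release:[A-Za-z0-9.]+' docs docker 2>/dev/null || true
```

Each hit prints as `file:line:match`, which maps directly onto the occurrence list above.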
/bot skip --comment "No need to run full CI"
PR_Github #16005 [ skip ] triggered by Bot
PR_Github #16005 [ skip ] completed with state
…VIDIA#7101)

Signed-off-by: Farshad Ghodsian <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Yuxin <[email protected]>
Summary by CodeRabbit
Description
Updated the GPT-OSS deployment guide to use the latest TensorRT-LLM release image. Also updated the main README to fix the GPT-OSS release date.
Test Coverage
None required as these are just doc changes.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
`--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

`--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

`--disable-fail-fast` (OPTIONAL): Disable fail-fast on build/tests/infra failures.

`--skip-test` (OPTIONAL): Skip all test stages, but still run build, package, and sanity-check stages. Note: Does NOT update GitHub check status.

`--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

`--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

`--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

`--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

`--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

`--add-multi-gpu-test` (OPTIONAL): Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

`--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

`--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

`--detailed-log` (OPTIONAL): Enable flushing all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

`--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with the pull request.
skip

`skip --comment COMMENT`

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.