
Conversation

farshadghodsian
Contributor

@farshadghodsian farshadghodsian commented Aug 20, 2025

Summary by CodeRabbit

  • Documentation
    • README: corrected Tech Blogs date and added 08/05 Latest News announcing Day‑0 support for GPT‑OSS‑120b and GPT‑OSS‑20b with links.
    • Deploying GPT‑OSS on TensorRT‑LLM guide: shifted to release-oriented instructions, updated image/tag and distribution guidance (favor Python wheels), updated model references to openai/gpt-oss-120b, simplified Triton MoE setup and added explicit Triton selection; minor chat payload wording fix.

Description

Updated the GPT-OSS deployment guide to point to the latest TensorRT-LLM release image. Also updated the main README to fix the GPT-OSS release date.

Test Coverage

None required as these are just doc changes.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
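
Putting the options above together, a typical pipeline-launch comment might look like this (the stage and GPU names are illustrative, reusing the examples from the option descriptions; they are not guaranteed to exist in the current CI configuration):

```
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe" --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"
```

Posting this as a PR comment kills any previously running jobs, then launches a pipeline restricted to the listed GPU types with the extra post-merge stage appended, and does not stop early on the first failure.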

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping validation without care can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without care and validation can break the top of tree.

@farshadghodsian farshadghodsian requested a review from a team as a code owner August 20, 2025 22:02
Contributor

coderabbitai bot commented Aug 20, 2025

📝 Walkthrough

Walkthrough

Documentation updates: README tech-blog date and Latest News were adjusted; the GPT‑OSS deployment tech blog was revised to use release-oriented NGC image tags and pip‑wheel guidance, update model references to openai/gpt-oss-120b, simplify Triton MoE instructions, and add a concise TRITON backend YAML snippet and usage via --extra_llm_api_options.

Changes

Cohort / File(s) Summary
README updates
README.md
Updated Tech Blogs date [08/06] → [08/05], removed an extra blank line, and added a Latest News [08/05] entry announcing Day‑0 support for OpenAI gpt-oss-120b and gpt-oss-20b with HuggingFace links.
GPT‑OSS deployment blog revisions
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
Replaced dev‑branch NGC Docker references with release‑oriented guidance and example release tag (e.g., release:1.1.0rc0); updated docker run examples to use latest release tag; emphasized TensorRT‑LLM pip wheel install instead of container release; updated serve/benchmark examples and test payloads to openai/gpt-oss-120b and adjusted sample prompt text; removed detailed OpenAI Triton install steps and Triton root-path note; added concise "Selecting Triton as the MoE backend" subsection with moe_config: backend: TRITON YAML and instruction to apply via --extra_llm_api_options.
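
The Triton MoE selection described above can be sketched as follows; the file name extra-config.yml is a hypothetical example, and the exact trtllm-serve invocation is inferred from this summary rather than taken from the blog itself:

```yaml
# extra-config.yml (hypothetical file name)
# Selects OpenAI's Triton kernels as the MoE backend, per the blog's
# "Selecting Triton as the MoE backend" subsection.
moe_config:
  backend: TRITON
```

The file would then be passed to the server via the documented flag, e.g. `trtllm-serve openai/gpt-oss-120b --extra_llm_api_options extra-config.yml`.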

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CI
    participant CLI as trtllm-serve CLI
    participant Serve as trtllm-serve
    participant Backend as MoE Backend (TRITON)

    rect #e8f5e9
      Note over User,CLI: Prepare config & model
    end

    User->>CLI: start serve --model openai/gpt-oss-120b --extra_llm_api_options "moe_config: {backend: TRITON}"
    CLI->>Serve: launch with extra_llm_api_options
    Serve->>Backend: initialize TRITON MoE kernels (select backend=TRITON)
    Backend-->>Serve: ready
    Serve-->>User: API ready (serving openai/gpt-oss-120b)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Documentation

Suggested reviewers

  • QiJune
  • nv-guomingz
  • chzblych
  • juney-nvidia


@farshadghodsian farshadghodsian force-pushed the feat/gpt-oss-guide-update branch from bb9b804 to 842d457 Compare August 20, 2025 22:03
@farshadghodsian farshadghodsian changed the title [None][doc] Update GPT-OSS deployment guide to point to latest release image [None] [doc] Update GPT-OSS deployment guide to point to latest release image Aug 20, 2025
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Aug 20, 2025
@farshadghodsian farshadghodsian changed the title [None] [doc] Update GPT-OSS deployment guide to point to latest release image [None][doc] Update GPT-OSS deployment guide to point to latest release image Aug 20, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)

105-110: Remove stray “s-” bullet prefix.

There’s a leading “s-” before the second bullet in Key takeaways.

-Key takeaways:
-- `enable_attention_dp` is set to `false` to use TP instead of DP for attention.
-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+Key takeaways:
+- `enable_attention_dp` is set to `false` to use TP instead of DP for attention.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.

213-223: Line continuation broken by inline comment after backslash.

The backslash must be the last non-whitespace character on the line. As written, it escapes the following space, not the newline, breaking the multiline command. Move the comment to its own line or after the command.

-trtllm-serve \
-  openai/gpt-oss-120b \  # Or ${local_model_path}
+trtllm-serve \
+  # Or use ${local_model_path}
+  openai/gpt-oss-120b \
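
The rule can be checked in any POSIX shell; a minimal sketch (the variable names are arbitrary):

```shell
# A backslash joins two physical lines only when it is the last character
# before the newline. In "cmd \  # comment", the backslash escapes the
# following space and "# comment" ends the command, so the next line runs
# as a separate command.

# Correct form: the comment sits on its own line, and each backslash is
# the final character on its line.
joined=$(printf '%s %s' \
  hello \
  world)
echo "$joined"   # prints "hello world"
```

The same fix applies verbatim to the docker run and trtllm-bench examples elsewhere in the guide.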

231-241: Same line-continuation issue in max-throughput serve command.

-trtllm-serve \
-  openai/gpt-oss-120b \  # Or ${local_model_path}
+trtllm-serve \
+  # Or use ${local_model_path}
+  openai/gpt-oss-120b \
🧹 Nitpick comments (6)
README.md (1)

21-22: Hyphenate “High-Performance” in blog title for consistency.

Update the Tech Blogs entry to use the compound adjective form.

-* [08/05] Running a High Performance GPT-OSS-120B Inference Server with TensorRT-LLM
+* [08/05] Running a High-Performance GPT-OSS-120B Inference Server with TensorRT-LLM
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (5)

22-24: Fix heading spacing and minor grammar.

  • Remove the extra space after the hashes (MD019).
  • Add “the” in “status of the latest releases.”
-###  NGC Docker Image
+### NGC Docker Image
@@
-Visit the [NGC TensorRT-LLM Release page](...) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](...) to keep track of the support status of latest releases.
+Visit the [NGC TensorRT-LLM Release page](...) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](...) to keep track of the support status of the latest releases.

56-59: Polish section title and phrasing (“pip” lowercase; clearer wording).

-### TensorRT-LLM PIP Wheel Install
+### TensorRT-LLM pip wheel installation
@@
-Regular releases of TensorRT-LLM are also provided as [pip Python wheels](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
+Regular releases of TensorRT-LLM are also provided as [pip wheels](https://pypi.org/project/tensorrt-llm/#history). You can find installation instructions for pip [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).

265-266: Add apostrophe: “NVIDIA’s”.

Keep request text polished and consistent with the example output.

-            "content": "What is NVIDIAs advantage for inference?"
+            "content": "What is NVIDIA's advantage for inference?"

36-37: Align release tag in blog9 with README badge

The blog9_Deploying_GPT_OSS_on_TRTLLM.md example is still using 1.1.0rc0, but our README badge and NGC release page have moved to 1.1.0rc1. To avoid confusion and drift, please update or parameterize this tag.

• File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
Line 36: change to the latest tag or use a variable

Option A — bump to rc1:

-  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc1 \

Option B — parameterize for future-proofing:

+ export TRTLLM_TAG=1.1.0rc1   # update to match latest from NGC
@@
-  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+  nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_TAG} \

22-26: Refine documentation formatting and correct typos

Please address the following in docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (and similar occurrences at lines 36, 213–214, 231–232, 265):

• Heading spacing

  • Line 22: change
    - ###  NGC Docker Image
    + ### NGC Docker Image

• Typographical errors

  • Line 26: “matach” → “match”
  • Line 265: “NVIDIAs” → “NVIDIA’s”

• Trailing backslashes with inline comments (breaks Markdown code fences)

  • Lines 213, 231: remove comments after “\” or move them above, e.g.:
    -   trtllm-serve \  # Or ${local_model_path}
    +   # Use a local model if preferred:
    +   trtllm-serve \

• Hard-coded image tag

  • Line 36: replace nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 with a placeholder or template variable, e.g.:
    -  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
    +  nvcr.io/nvidia/tensorrt-llm/release:${LATEST_TRTLLM_RELEASE}

These updates will improve consistency, readability, and future-proof the guide.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e5e4170 and 842d457.

📒 Files selected for processing (2)
  • README.md (2 hunks)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (7 hunks)
🧰 Additional context used
🪛 LanguageTool
README.md

[grammar] ~21-~21: There might be a mistake here.
Context: ...-120B Inference Server with TensorRT-LLM ✨ [➡️ link](./docs/source/blogs/tech_blo...

(QB_NEW_EN)

docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~24-~24: There might be a mistake here.
Context: ...es) to keep track of the support status of latest releases. Run the following doc...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...orRT-LLM container in interactive mode (change image tag to matach latest release): `...

(QB_NEW_EN)


[grammar] ~26-~26: Ensure spelling is correct
Context: ...n interactive mode (change image tag to matach latest release): ```bash docker run --...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

22-22: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
README.md (1)

46-46: Looks good; links/readability are clear.

New Latest News item reads well and matches the Tech Blogs date. No action needed.

Collaborator

@laikhtewari laikhtewari left a comment


Looks good pending typo fixes

@farshadghodsian farshadghodsian changed the title [None][doc] Update GPT-OSS deployment guide to point to latest release image [None][doc] Update gpt-oss deployment guide to latest release image Aug 20, 2025
@farshadghodsian farshadghodsian force-pushed the feat/gpt-oss-guide-update branch from 3af8f19 to 4a59592 Compare August 20, 2025 22:52
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (11)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (10)

24-24: Minor grammar: add “the”.

-Visit the ... release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of latest releases.
+Visit the ... release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of the latest releases.

26-26: Clarify “latest release” substitution.

Since the example may be copy-pasted, explicitly indicate the placeholder for the image tag.

-Run the following docker command to start the TensorRT-LLM container in interactive mode (change image tag to match latest release):
+Run the following docker command to start the TensorRT-LLM container in interactive mode (replace <latest_tag> with the tag from the NGC release page):

33-37: Optional: allow gated HF models by passing HF token into the container.

If the HF repo requires acceptance/auth, users will hit 401/403 without a token. Consider documenting this toggle.

   -p 8000:8000 \
   -e TRTLLM_ENABLE_PDL=1 \
+  # Optional: pass your HF token if the model repo is gated
+  -e HF_TOKEN=$HF_TOKEN \
   -v ~/.cache:/root/.cache:rw \
   nvcr.io/nvidia/tensorrt-llm/release:<latest_tag> \

Add a note below the block: “If required, set HF_TOKEN in your shell (e.g., export HF_TOKEN=...) or run huggingface-cli login inside the container.”


107-107: Typo: stray “s-” bullet marker.

-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.

137-137: Wording nit: “10 times of” → “10 times”.

-`--num_requests` is set to 10 times of `--concurrency` to run enough number of requests.
+`--num_requests` is set to 10 times `--concurrency` to run enough requests.

213-214: Bash gotcha: comment after line-continuation.

Having “\ # Or ${local_model_path}” on the same line can confuse readers and, in some shells/editors, cause copy-paste issues. Move the comment to its own line.

-trtllm-serve \
-  openai/gpt-oss-120b \  # Or ${local_model_path}
+trtllm-serve \
+  # Or use ${local_model_path} instead of the HF repo
+  openai/gpt-oss-120b \

Repeat the same change in the max-throughput command block below.


231-232: Mirror the serve command comment fix here as well.

-trtllm-serve \
-  openai/gpt-oss-120b \  # Or ${local_model_path}
+trtllm-serve \
+  # Or use ${local_model_path} instead of the HF repo
+  openai/gpt-oss-120b \

3-3: Terminology: “open-source” → “open-weights”.

README’s Latest News uses “open-weights models”; align here for consistency and accuracy.

-NVIDIA has announced day-0 support for OpenAI's new open-source model series,
+NVIDIA has announced day-0 support for OpenAI's new open-weights model series,

170-174: Hopper note for max-throughput: remind to use TRITON backend on H200/H100.

Earlier you note TRITON is recommended on Hopper; mirror that guidance here to prevent users from copying CUTLASS on H200/H100.

 Compared to the low-latency configuration, we:
 - set `enable_attention_dp` to `true` to use attention DP which is better for high throughput.
 - set `stream_interval` to 10 to stream results to the client every 10 tokens. At high concurrency, the detokenization overhead of streaming mode cannot be hidden under GPU execution time, so `stream_interval` serves as a workaround to reduce this overhead.
-- set `moe_config.backend` to `CUTLASS` to use the `CUTLASS` MoE kernels which are optimized for high throughput.
+- set `moe_config.backend` to `CUTLASS` to use the `CUTLASS` MoE kernels which are optimized for high throughput.
+
+> Note (H200/H100): On Hopper GPUs, prefer the `TRITON` MoE backend as described below. If you are on H200/H100, replace `CUTLASS` with `TRITON` here.

If this guidance is not universally accurate for current releases, please update or scope it (e.g., “as of 1.1.0rc1”).


260-271: Alternative curl payload quoting to allow natural apostrophes.

Current example avoids apostrophes to prevent shell-quoting pitfalls. Consider a variant using double-quoted payload with escaped quotes, so users can include apostrophes naturally.

For example:

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{
  \"model\": \"openai/gpt-oss-120b\",
  \"messages\": [{\"role\": \"user\", \"content\": \"What is NVIDIA's advantage for inference?\"}],
  \"max_tokens\": 1024, \"top_p\": 0.9
}" -w "\n"
README.md (1)

45-47: Cross-check announcement wording and versioning with the blog.

Latest News uses “open-weights models” and the blog previously said “open-source”; also the blog’s Docker snippet pins release:1.1.0rc0 while the badge is 1.1.0rc1. Please align terminology and version references across README and the blog.

If rc1 is the intended canonical version for this PR, update the blog snippet per my other comment.


📥 Commits

Reviewing files that changed from the base of the PR and between 842d457 and 4a59592.

📒 Files selected for processing (2)
  • README.md (2 hunks)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (7 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~24-~24: There might be a mistake here.
Context: ...es) to keep track of the support status of latest releases. Run the following doc...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...orRT-LLM container in interactive mode (change image tag to match latest release): ``...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...n interactive mode (change image tag to match latest release): ```bash docker run --...

(QB_NEW_EN)

README.md

[grammar] ~21-~21: There might be a mistake here.
Context: ...-120B Inference Server with TensorRT-LLM ✨ [➡️ link](./docs/source/blogs/tech_blo...

(QB_NEW_EN)

🔇 Additional comments (1)
README.md (1)

21-23: Date update looks good; aligns with the Latest News entry.

@farshadghodsian farshadghodsian force-pushed the feat/gpt-oss-guide-update branch from 4a59592 to 845696c Compare August 21, 2025 01:02
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

107-107: Stray “s-” before bullet point.

Typo breaks formatting: remove the extra “s-”.

Apply this diff:

-- `enable_attention_dp` is set to `false` to use TP instead of DP for attention.
-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.

24-37: Fix trailing comments on backslashes in shell code blocks

The scan uncovered several inline comments immediately following a backslash, which will break shell line continuations. Please remove these comments from the continuation lines (or move them to a separate preceding line).

Flagged instances in docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:

  • Line 213:
      openai/gpt-oss-120b \  # Or ${local_model_path}
  • Line 221:
      --max_batch_size ${max_batch_size} \  # E.g., 1
  • Line 231:
      openai/gpt-oss-120b \  # Or ${local_model_path}
  • Line 239:
      --max_batch_size ${max_batch_size} \  # E.g., 640 

Similar patterns were also detected in examples/models/core/bert/README.md, examples/models/core/llama/README.md, and docs/source/performance/perf-analysis.md. Please audit all backslash continuations across the repo and ensure no trailing spaces or comments follow the “\” characters.

🧹 Nitpick comments (8)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (8)

36-36: Container tag pin is fine; consider a placeholder to avoid drift (optional).

Keeping release:1.1.0rc0 is correct if that’s the latest published image on NGC. To reduce future churn, you could swap in a placeholder like <latest_tag> and direct users to NGC above.

Apply this optional diff:

-  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+  nvcr.io/nvidia/tensorrt-llm/release:<latest_tag> \

Given our prior learning on published tags, keep the pinned tag if <latest_tag> might confuse users. Your call.


265-265: Keep possessive apostrophe without breaking curl by using a here‑doc.

“NVIDIAs” is ungrammatical. Prefer “NVIDIA’s” and avoid shell quoting pitfalls by feeding JSON via a here-doc.

Apply this diff to replace the curl example:

-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-    "model": "openai/gpt-oss-120b",
-    "messages": [
-        {
-            "role": "user",
-            "content": "What is NVIDIAs advantage for inference?"
-        }
-    ],
-    "max_tokens": 1024,
-    "top_p": 0.9
-}' -w "\n"
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  --data @- -w "\n" <<'JSON'
+{
+  "model": "openai/gpt-oss-120b",
+  "messages": [
+    { "role": "user", "content": "What is NVIDIA's advantage for inference?" }
+  ],
+  "max_tokens": 1024,
+  "top_p": 0.9
+}
+JSON

351-351: Qualify backend support statement with version context.

Add “as of release 1.1.0rc0” (or similar) so the note ages gracefully if support lands later.

Apply this diff:

-OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for optimal performance. `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still ongoing.
+OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for optimal performance. As of the 1.1.0rc0 release, the `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still ongoing.

1-1: Optional: align with markdownlint list style (asterisks).

One list in this file appears with dashes; most elsewhere use asterisks. Consider standardizing list markers to avoid MD004 warnings.


273-341: Sample response block includes internal reasoning tokens.

The example output shows meta markers like “<|channel|>analysis” which may confuse readers; typical OpenAI-compatible responses don’t contain these. Consider trimming to a concise, realistic assistant message.


22-37: Optional: show env-substitution for image tag.

To help users update tags easily while still referencing published images, consider an env var:

-docker run --rm --ipc=host -it \
+TRTLLM_TAG=1.1.0rc0  # Replace with the latest published tag from NGC
+docker run --rm --ipc=host -it \
@@
-  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+  nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_TAG} \

118-135: Terminology: clarify “max_batch_size” vs “concurrency” sentence.

Minor phrasing to improve clarity (“could serve” → “can serve”; “is set to 10 times of” → “is set to 10×”).

-`--max_batch_size` controls the maximum batch size that the inference engine could serve, while `--concurrency` is the number of concurrent requests that the benchmarking client is sending. `--num_requests` is set to 10 times of `--concurrency` to run enough number of requests.
+`--max_batch_size` controls the maximum batch size that the inference engine can serve, while `--concurrency` is the number of concurrent requests that the benchmarking client sends. `--num_requests` is set to 10× `--concurrency` to run a sufficient number of requests.

170-174: Parallelism note: small style tweaks.

Add articles and code formatting consistency.

-- set `enable_attention_dp` to `true` to use attention DP which is better for high throughput.
-- set `stream_interval` to 10 to stream results to the client every 10 tokens. At high concurrency, the detokenization overhead of streaming mode cannot be hidden under GPU execution time, so `stream_interval` serves as a workaround to reduce this overhead.
-- set `moe_config.backend` to `CUTLASS` to use the `CUTLASS` MoE kernels which are optimized for high throughput.
+- Set `enable_attention_dp` to `true` to use attention DP, which is better for high throughput.
+- Set `stream_interval` to `10` to stream results to the client every 10 tokens. At high concurrency, the detokenization overhead of streaming mode cannot be hidden under GPU execution time, so `stream_interval` reduces this overhead.
+- Set `moe_config.backend` to `CUTLASS` to use the CUTLASS MoE kernels, which are optimized for high throughput.

📥 Commits

Reviewing files that changed from the base of the PR and between 4a59592 and 845696c.

📒 Files selected for processing (2)
  • README.md (2 hunks)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (7 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.427Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.427Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.427Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~24-~24: There might be a mistake here.
Context: ...es) to keep track of the support status of latest releases. Run the following doc...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...orRT-LLM container in interactive mode (change image tag to match latest release): ``...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...n interactive mode (change image tag to match latest release): ```bash docker run --...

(QB_NEW_EN)

README.md

[grammar] ~21-~21: There might be a mistake here.
Context: ...-120B Inference Server with TensorRT-LLM ✨ [➡️ link](./docs/source/blogs/tech_blo...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

231-231: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)

141-147: Verification ask: public performance numbers.

The stated 420 tps/user and 19.5k–20k tps/gpu are strong claims. Please ensure they reflect the latest published benchmarks for the specified configs or add “measured internally” with date/context.

Would you like me to scan the repo for benchmark references and align phrasing?

farshadghodsian and others added 2 commits August 20, 2025 21:42
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Farshad Ghodsian <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Farshad Ghodsian <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)

107-107: Typo in bullet (“s-”).

This renders oddly in markdown and should be a regular list dash.

-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.

211-226: Shell line-continuation bugs and duplicate command break copy/paste.

  • “Note:” text is inside the code block mid-command; this breaks the backslash continuation.
  • There’s a duplicate trtllm-serve \ line.
  • Inline comment after a trailing backslash (\ # E.g., 1) is invalid; the backslash must be the last character.

Fix by removing the inline note/duplicate and by moving the example comment outside the multiline command.

 ```bash
 trtllm-serve \
-Note: You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
-
-trtllm-serve \
   openai/gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size 1  \
   --extra_llm_api_options low_latency.yaml \
   --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 1
+  --max_batch_size ${max_batch_size} \
  --trust_remote_code
 ```

Add this explanatory note as plain text above the block (outside the command) to retain the guidance without breaking the shell:

```markdown
Note: You can point to a local path containing the model weights instead of the HF repo (for example, `${local_model_path}`).
```

Optionally clarify the example value right below the code block:

Example: set `max_batch_size=1` for the low-latency case.
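The continuation rule behind this fix can be shown in isolation: a backslash joins lines only when it is the very last character, which is why a trailing comment breaks the command. A minimal, self-contained illustration:

```shell
# Works: the backslash is the final character on its line, so the shell
# treats the next line as part of the same command.
words=$(printf '%s\n' one \
  two)
echo "$words"
# A comment after the backslash (as in the original doc) would instead end
# the command early and try to run the following line on its own.
```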

233-245: Repeat of multiline shell issues in max-throughput block.

Duplicate trtllm-serve \ and inline comment after a trailing backslash will break the command. Apply the same fix pattern.

 ```bash
 trtllm-serve \
-trtllm-serve \
   openai/gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size ${num_gpus} \
   --extra_llm_api_options max_throughput.yaml \
   --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 640 
+  --max_batch_size ${max_batch_size} \
  --trust_remote_code
 ```

Suggest adding a clarifying sentence below the block instead:

```markdown
Example: set `max_batch_size=640` for the max-throughput case.
```
♻️ Duplicate comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

24-26: Copyedits: articles and “Docker” capitalization.

These small nits improve clarity and follow standard terminology. This was flagged earlier and still applies.

-Visit the [NGC TensorRT-LLM Release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of latest releases.
-Run the following docker command to start the TensorRT-LLM container in interactive mode (change image tag to match latest release):
+Visit the [NGC TensorRT-LLM Release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of the latest releases.
+Run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match the latest release):

56-59: Tighten section title and wording (“pip wheels”).

Use “pip wheels” (or “Python wheels”), not both; also prefer “installation instructions.”

-### TensorRT-LLM PIP Wheel Install
+### TensorRT-LLM pip wheel install

-Regular releases of TensorRT-LLM are also provided as [pip Python wheels](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
+Regular releases of TensorRT-LLM are also provided as [pip wheels](https://pypi.org/project/tensorrt-llm/#history). You can find installation instructions [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
🧹 Nitpick comments (2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)

22-22: Fix heading spacing (markdownlint MD019).

Remove the extra space after the hash marks in the heading.

-###  NGC Docker Image
+### NGC Docker Image

264-275: Use robust quoting for sample curl payload; reintroduce proper apostrophe.

The apostrophe in “NVIDIA’s” was removed to avoid breaking the single-quoted shell string. Prefer a here-document with -d @- so the JSON can contain quotes/apostrophes without shell-escaping. This is copy/paste safe.

-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-    "model": "openai/gpt-oss-120b",
-    "messages": [
-        {
-            "role": "user",
-            "content": "What is NVIDIAs advantage for inference?"
-        }
-    ],
-    "max_tokens": 1024,
-    "top_p": 0.9
-}' -w "\n"
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  --data @- <<'JSON'
+{
+  "model": "openai/gpt-oss-120b",
+  "messages": [
+    { "role": "user", "content": "What is NVIDIA's advantage for inference?" }
+  ],
+  "max_tokens": 1024,
+  "top_p": 0.9
+}
+JSON
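The quoting benefit of the here-document approach can be checked without a running server by substituting `cat` for `curl` (illustrative only):

```shell
# The single-quoted delimiter ('JSON') disables shell expansion, so the
# apostrophe passes through to the payload untouched.
payload=$(cat <<'JSON'
{"content": "What is NVIDIA's advantage for inference?"}
JSON
)
echo "$payload"
```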
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 845696c and b1102d5.

📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (7 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.427Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.427Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.427Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~24-~24: There might be a mistake here.
Context: ...es) to keep track of the support status of latest releases. Run the following doc...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...orRT-LLM container in interactive mode (change image tag to match latest release): ``...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...n interactive mode (change image tag to match latest release): ```bash docker run --...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

22-22: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)

36-36: Please verify the NGC container release tag is up-to-date

It looks like docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md is pinned to release:1.1.0rc0, but the README badge shows 1.1.0rc1. Since the badge may list a version ahead of the actual NGC publish, please:

  • Confirm on NGC that nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 is indeed the latest published container image.
  • If 1.1.0rc1 has been published on NGC, update the image tag in blog9_Deploying_GPT_OSS_on_TRTLLM.md (line 36) to release:1.1.0rc1.
  • Ensure consistency across all docs—e.g., the quick-start guides currently reference 1.0.0rc6.

Diff suggestion (if updating to rc1):

-   nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+   nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc1 \

@farshadghodsian farshadghodsian force-pushed the feat/gpt-oss-guide-update branch from efeebd2 to c0188ab Compare August 21, 2025 01:54
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (3)

107-107: Typo in bullet list.

Extraneous “s-” prefix breaks the list formatting.

-s- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.
+- `cuda_graph_config.max_batch_size` is the maximum batch size for CUDA graph.

211-226: Broken bash block: stray prose inside command, duplicate command, and invalid line continuation.

  • Plain “Note:” line inside a bash block breaks execution.
  • Duplicate trtllm-serve \ lines.
  • \ # E.g., 1 has characters after backslash; line continuation fails.

Fix by keeping the note as a bash comment, removing the duplicate command, and moving the example comment to its own line.

 ```bash
-trtllm-serve \
-Note: You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
-
-trtllm-serve \
+# Note: You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
+trtllm-serve \
   openai/gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size 1  \
   --extra_llm_api_options low_latency.yaml \
   --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 1
+  # Example: set max_batch_size=1 for low-latency
+  --max_batch_size ${max_batch_size} \
  --trust_remote_code
 ```

---

`232-245`: **Repeat of the bash issues in max‑throughput example.**

- Duplicate `trtllm-serve \`.
- Line continuation after `--max_batch_size` has trailing comment; breaks the command.


 ```bash
-trtllm-serve \
-trtllm-serve \
+trtllm-serve \
   openai/gpt-oss-120b \
   --host 0.0.0.0 \
   --port 8000 \
   --backend pytorch \
   --tp_size ${num_gpus} \
   --ep_size ${num_gpus} \
   --extra_llm_api_options max_throughput.yaml \
   --kv_cache_free_gpu_memory_fraction 0.9 \
-  --max_batch_size ${max_batch_size} \  # E.g., 640 
+  # Example: set max_batch_size=640 for max-throughput
+  --max_batch_size ${max_batch_size} \
  --trust_remote_code
 ```

🧹 Nitpick comments (6)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (6)

22-22: Fix markdownlint MD019: single space after heading hash.

There are two spaces after the atx heading markers.

-###  NGC Docker Image
+### NGC Docker Image

26-26: Grammar: add missing article.

“Tighten copy: ‘match latest release’ → ‘match the latest release’.”

-Run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match latest release):
+Run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match the latest release):

56-59: Tighten terminology and heading casing for pip wheels.

Avoid “Python wheels” vs “pip install” mismatch; standardize on “pip wheels” and use “installation instructions.”

-### TensorRT-LLM Python Wheel Install
-
-Regular releases of TensorRT-LLM are also provided as [Python wheels](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on the pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
+### TensorRT-LLM pip wheel install
+
+Regular releases of TensorRT-LLM are also provided as [pip wheels](https://pypi.org/project/tensorrt-llm/#history). You can find installation instructions [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).

141-141: Qualify benchmark claims with test context and date.

Performance numbers age quickly. Add container tag, GPU model, driver/CUDA/TensorRT versions, and the date measured to avoid future confusion.

For example:

  • “Measured on nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0, 8×B200 (HBM3e), CUDA 12.6, TensorRT‑LLM 1.1.0rc0, August 2025.”
  • Link the emitted low_latency_benchmark.json / max_throughput_benchmark.json as artifacts or include the exact command used.

Also applies to: 203-204


353-365: Hopper MoE backend statement likely outdated and conflicts with earlier guidance.

Above you recommend CUTLASS for max‑throughput and TRTLLM for low‑latency; here it says “TRTLLM MoE backend is not supported on Hopper” and “CUTLASS support is still ongoing.” These conflict and may no longer reflect current Hopper support.

Consider replacing with release‑agnostic guidance and point to the official support matrix, e.g.:

-OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels for Hopper-based GPUs like NVIDIA's H200 for optimal performance. `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still ongoing. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe) to install and enable the `TRITON` MoE kernels on Hopper GPUs.
+OpenAI ships Triton kernels optimized for its MoE models. On Hopper (H100/H200), we currently recommend using the `TRITON` MoE backend when targeting GPT‑OSS MXFP4 for best out‑of‑the‑box performance. TensorRT‑LLM and CUTLASS backends continue to evolve; consult the latest TensorRT‑LLM release notes for Hopper MoE support and recommended backends. Follow the instructions in this [guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe) to install and enable the `TRITON` MoE kernels on Hopper GPUs.

If helpful, I can fetch and reconcile the exact support status from the latest release notes.


264-275: Optional: preserve correct apostrophe in sample prompt without breaking quoting.

Current workaround removes the apostrophe (NVIDIAs). Consider a here‑doc to keep natural text and valid JSON.

-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-    "model": "openai/gpt-oss-120b",
-    "messages": [
-        {
-            "role": "user",
-            "content": "What is NVIDIAs advantage for inference?"
-        }
-    ],
-    "max_tokens": 1024,
-    "top_p": 0.9
-}' -w "\n"
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d @- -w "\n" <<'JSON'
+{
+  "model": "openai/gpt-oss-120b",
+  "messages": [
+    {
+      "role": "user",
+      "content": "What is NVIDIA's advantage for inference?"
+    }
+  ],
+  "max_tokens": 1024,
+  "top_p": 0.9
+}
+JSON
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between efeebd2 and c0188ab.

📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (7 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.427Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.427Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.427Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md
🪛 LanguageTool
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

[grammar] ~26-~26: There might be a mistake here.
Context: ...teractive mode (change the image tag to match latest release): ```bash docker run --...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md

22-22: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)

36-36: Audit and align NGC container tags across documentation

Please confirm that every hard-coded nvcr.io/nvidia/tensorrt-llm/release:<tag> reference corresponds to a published NGC image, and update any unreleased or placeholder tags to their correct, published values. In particular, the following occurrences were found:

  • docker/release.md:21 — nvcr.io/nvidia/tensorrt-llm/release:x.y.z (placeholder)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md:50 — …/release:1.0.0rc6
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md:41 — …/release:1.0.0rc6
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md:42 — …/release:1.0.0rc6
  • docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36 — …/release:1.1.0rc0

Action items:

  • Verify each <tag> above is available on NGC; if not, replace with the correct, available version.
  • Replace the x.y.z placeholder in docker/release.md with the actual released tag.
  • Ensure consistency in README badges and all other docs to avoid referencing unreleased tags.
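One way to carry out such an audit is to extract the version tag from each hard-coded image reference and compare them. A self-contained sketch of the extraction step (the sample line mirrors the doc under review):

```shell
# Pull the tag out of an NGC image reference so versions can be diffed.
line='nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \'
tag=$(printf '%s\n' "$line" | sed -n 's#.*release:\([0-9A-Za-z.]*\).*#\1#p')
echo "$tag"
```

To apply it across the files listed above, feed the matching lines in with something like `grep -rn 'tensorrt-llm/release:' docs/ docker/` and compare the extracted tags.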

@juney-nvidia
Collaborator

/bot skip --comment "No need to run full CI"

@juney-nvidia juney-nvidia enabled auto-merge (squash) August 21, 2025 06:06
@tensorrt-cicd
Collaborator

PR_Github #16005 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16005 [ skip ] completed with state SUCCESS
Skipping testing for commit c0188ab

@juney-nvidia juney-nvidia merged commit 2d40e87 into NVIDIA:main Aug 21, 2025
5 checks passed
zhou-yuxin pushed a commit to zhou-yuxin/TensorRT-LLM that referenced this pull request Aug 21, 2025
…VIDIA#7101)

Signed-off-by: Farshad Ghodsian <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Yuxin <[email protected]>
Labels
Community want to contribute PRs initiated from Community