[TRTLLM-7245][feat] add test_multi_nodes_eval tests #7108
📝 Walkthrough

Adds a parser for MMLU accuracy, a parameterized multi-node MMLU evaluation test that launches trtllm-llmapi-launch + trtllm-eval with kv-cache memory-fraction and batch settings, updates the multinode QA test list to use the new test, and appends a SKIP waiver for one FP8 multi-node test.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant PyTest as PyTest (ranks 0..N-1)
    participant Launcher as trtllm-llmapi-launch
    participant Eval as trtllm-eval
    participant Parser as get_mmlu_accuracy
    rect rgba(200,230,255,0.20)
        note left of PyTest: Parametrize model_path, tp/pp/ep, eval_task="mmlu"
        PyTest->>Launcher: launch with tp/pp/ep, kv_cache_free_gpu_memory_fraction=0.8, max_batch_size=32
        Launcher->>Eval: invoke evaluation across nodes
        Eval-->>Launcher: stdout containing "MMLU weighted average accuracy: <value> (%)"
        Launcher-->>PyTest: relay captured stdout
    end
    alt SLURM rank 0
        PyTest->>Parser: get_mmlu_accuracy(stdout)
        Parser-->>PyTest: parsed accuracy (float)
        PyTest->>PyTest: assert accuracy > 81.5
    else other ranks
        PyTest->>PyTest: skip accuracy parsing/assertion
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/integration/defs/accuracy/accuracy_core.py (1)

345-363: Rename the duplicated MMLU class to MMLUMultimodal and update dispatch.

To avoid shadowing the original MMLU class and ensure both variants are correctly handled:

- In tests/integration/defs/accuracy/accuracy_core.py (lines 345-363), rename the second class definition:

```diff
-class MMLU(AccuracyTask):
+class MMLUMultimodal(AccuracyTask):
     DATASET = "MMLU"
     DATASET_DIR = f"{llm_models_root()}/datasets/MMLU"
     ALPHA = 0.05
     BETA = 0.2
     SIGMA = 50
     NUM_SAMPLES = 900
     MAX_BATCH_SIZE = 128
     MAX_INPUT_LEN = 8192
     MAX_OUTPUT_LEN = 512
     EVALUATOR_CLS = tensorrt_llm.evaluate.MMLU
     EVALUATOR_KWARGS = dict(dataset_path=DATASET_DIR,
                             random_seed=0,
                             is_multimodal=True,
                             apply_chat_template=True)
```

- Also align the dataset name/directory casing to lowercase if that matches the on-disk structure:

```diff
-    DATASET = "MMLU"
-    DATASET_DIR = f"{llm_models_root()}/datasets/MMLU"
+    DATASET = "mmlu"
+    DATASET_DIR = f"{llm_models_root()}/datasets/mmlu"
```

- In the same file (around lines 709-714), update the task dispatcher so both classes route to mmlu():

```diff
 def evaluate(self):
     for task in self.tasks:
-        elif isinstance(task, MMLU):
-            self.mmlu(task)
+        elif isinstance(task, (MMLU, MMLUMultimodal)):
+            self.mmlu(task)
```

These changes will:

- Prevent the second definition from overwriting the first.
- Maintain backward compatibility of the existing MMLU task.
- Clearly distinguish between the standard and multimodal variants.
tensorrt_llm/commands/eval.py (1)

148-156: Duplicate MMLU command registration (remove the second one).

MMLU is added twice to the CLI, which will either be a no-op or cause confusing help output. Apply this diff to remove the duplicate registration:

```diff
 main.add_command(CnnDailymail.command)
 main.add_command(MMLU.command)
 main.add_command(GSM8K.command)
 main.add_command(GPQADiamond.command)
 main.add_command(GPQAMain.command)
 main.add_command(GPQAExtended.command)
 main.add_command(JsonModeEval.command)
-main.add_command(MMLU.command)
```
tensorrt_llm/evaluate/lm_eval.py (1)

661-667: MMLU is not a multimodal task; remove the multimodal/chat overrides and the forced stop token.

The comment and flags appear to be carried over from MMMU. Setting is_multimodal=True forces the multimodal wrapper and can break MMLU evaluation. Forcing chat templates and a hardcoded stop token is also unnecessary and may skew results. Apply this diff to correct the behavior:

```diff
 @staticmethod
 def command(ctx, **kwargs) -> None:
-    # NOTE: MMLU is a multimodal task, so we need to set the is_multimodal and apply_chat_template flags to True
-    kwargs["is_multimodal"] = True
-    kwargs["apply_chat_template"] = True
-    kwargs[
-        "stop"] = "<|endoftext|>"  # NOTE: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/MMLU/_template_yaml#L10
-    MMLU.command_harness(ctx, **kwargs)
+    # MMLU is a text-only task; use the default (non-multimodal) wrapper and defaults.
+    MMLU.command_harness(ctx, **kwargs)
```
🧹 Nitpick comments (6)

docs/source/performance/perf-benchmarking.md (1)

556-556: Dataset rename looks correct; consider a minor CLI consistency nit.

The switch to lmms-lab/MMLU aligns with the PR intent. As a minor consistency nit, elsewhere we sometimes quote keys (e.g., --dataset-prompt-key "question"); consider quoting consistently across docs to avoid shell-interpretation edge cases.
tests/integration/defs/test_e2e.py (2)
2577-2584
: Consider adding NVLink guard to match other multi-node tests.Other multi-node tests include @skip_nvlink_inactive to avoid failures on nodes without NVLink. This test likely benefits from the same guard.
Suggested change:
+@skip_nvlink_inactive @pytest.mark.skip_less_device_memory(80000) @pytest.mark.skip_less_device(4) @pytest.mark.parametrize("eval_task", ["mmlu"]) @pytest.mark.parametrize("tp_size,pp_size,ep_size", [(16, 1, 8)], ids=["tp16"])
2589-2590: Sanity-check threshold stability.

mmlu_threshold = 80 may be tight given minor model-version variance or eval-harness updates. If this runs on diverse CI farms, consider a slightly lower bound or making it configurable via an environment variable to reduce flakiness.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

2703-2716: Avoid hardcoding the stop token; derive it from the tokenizer instead.

Using a literal "<|endoftext|>" risks a mismatch across tokenizers (e.g., Qwen variants may use a different EOS string). Construct SamplingParams after LLM is instantiated so you can pull llm.tokenizer.eos_token. Apply this diff to move sampling_params creation into the method and fetch EOS dynamically:

```diff
@@
-    # NOTE: MMLU adds <|endoftext|> to the stop token.
-    sampling_params = SamplingParams(max_tokens=MMLU.MAX_OUTPUT_LEN,
-                                     truncate_prompt_tokens=MMLU.MAX_INPUT_LEN,
-                                     stop="<|endoftext|>")
+    # NOTE: Build SamplingParams at runtime to use the model's EOS token.
@@
 def test_auto_dtype(self):
-    with LLM(self.MODEL_PATH,
-             max_num_tokens=16384,
-             kv_cache_config=self.kv_cache_config) as llm:
-        task = MMLU(self.MODEL_NAME)
-        task.evaluate(llm, sampling_params=self.sampling_params)
+    with LLM(self.MODEL_PATH,
+             max_num_tokens=16384,
+             kv_cache_config=self.kv_cache_config) as llm:
+        sampling_params = SamplingParams(
+            max_tokens=MMLU.MAX_OUTPUT_LEN,
+            truncate_prompt_tokens=MMLU.MAX_INPUT_LEN,
+            stop=getattr(llm.tokenizer, "eos_token", "<|endoftext|>"))
+        task = MMLU(self.MODEL_NAME)
+        task.evaluate(llm, sampling_params=sampling_params)
```

tensorrt_llm/evaluate/lm_eval.py (2)
620-626: Use a lowercase CLI command name for consistency (mmlu).

All other commands use lowercase (e.g., gsm8k, gpqa_diamond). Uppercase "MMLU" is inconsistent and surprising. Apply this diff:

```diff
 class MMLU(LmEvalEvaluator):
@@
-    @click.command("MMLU")
+    @click.command("mmlu")
```

Note: registration remains main.add_command(MMLU.command) and will pick up the new command name automatically.
653-657: Nit: update the comment/link casing to match lm-eval's layout.

The reference path uses "MMLU", but the upstream repo typically uses lowercase task directories (e.g., tasks/mmlu). This is a doc comment only, but correcting the casing helps future readers.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)

- docs/source/performance/perf-benchmarking.md (1 hunks)
- examples/llm-api/out_of_tree_example/readme.md (1 hunks)
- tensorrt_llm/commands/eval.py (2 hunks)
- tensorrt_llm/evaluate/__init__.py (1 hunks)
- tensorrt_llm/evaluate/lm_eval.py (2 hunks)
- tests/integration/defs/accuracy/accuracy_core.py (2 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (3 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/test_e2e.py
tensorrt_llm/evaluate/__init__.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/defs/accuracy/accuracy_core.py
tensorrt_llm/evaluate/lm_eval.py
tensorrt_llm/commands/eval.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/test_e2e.py
tensorrt_llm/evaluate/__init__.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/defs/accuracy/accuracy_core.py
tensorrt_llm/evaluate/lm_eval.py
tensorrt_llm/commands/eval.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/test_e2e.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/commands/eval.py
🪛 Ruff (0.12.2)
tensorrt_llm/evaluate/__init__.py

19-19: Redefinition of unused MMLU from line 18 (F811)

tests/integration/defs/accuracy/accuracy_core.py

345-345: Redefinition of unused MMLU from line 276 (F811)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
examples/llm-api/out_of_tree_example/readme.md (1)

45-45: Good update; mirrors the docs change.

The dataset name is updated to lmms-lab/MMLU and the rest of the invocation remains the same. Looks good.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

34-35: MMLU import swap looks good.

Switching to MMLU and importing JsonModeEval/LlmapiAccuracyTestHarness is consistent with the broader rename and test-harness usage.
tensorrt_llm/commands/eval.py (1)

23-24: Import updates align with the rename.

Replacing MMMU with MMLU in the imports is consistent with the rest of the changes.
Actionable comments posted: 1
♻️ Duplicate comments (1)

tests/integration/defs/common.py (1)

961-979: Harden MMLU accuracy parsing; accept bytes and fail fast with a clear error.

The chained splits are brittle and will break if the report format changes slightly (or if the output is bytes). Use a single regex, handle bytes, and raise a precise error. Apply this diff:

```diff
 def get_mmlu_accuracy(output):
-    mmlu_line = None
-    for line in output.split('\n'):
-        if "MMLU weighted average accuracy:" in line:
-            mmlu_line = line
-            break
-
-    if mmlu_line is None:
-        raise Exception(
-            f"Could not find 'MMLU weighted average accuracy:' in output. Full output:\n{output}"
-        )
-
-    mmlu_accuracy = float(
-        mmlu_line.split("MMLU weighted average accuracy: ")[1].split(" (")[0])
-
-    print(f"MMLU weighted average accuracy is: {mmlu_accuracy}")
-
-    return mmlu_accuracy
+    """Extract the MMLU weighted average accuracy from evaluator output.
+
+    The function is robust to minor formatting changes and handles byte outputs.
+
+    Args:
+        output: The captured stdout from `trtllm-eval` (str or bytes).
+
+    Returns:
+        float: The parsed MMLU weighted average accuracy.
+
+    Raises:
+        ValueError: If the expected accuracy line cannot be found/parsed.
+    """
+    if isinstance(output, (bytes, bytearray)):
+        output = output.decode(errors="replace")
+    match = re.search(
+        r"MMLU weighted average accuracy:\s*([0-9]+(?:\.[0-9]+)?)", output)
+    if not match:
+        raise ValueError(
+            "Failed to parse 'MMLU weighted average accuracy' from output.\n"
+            f"Full output:\n{output}")
+    mmlu_accuracy = float(match.group(1))
+    print(f"MMLU weighted average accuracy is: {mmlu_accuracy}")
+    return mmlu_accuracy
```
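The regex-based parser suggested above can be exercised standalone; a minimal self-contained sketch mirroring that suggestion (not the committed code):

```python
import re


def get_mmlu_accuracy(output):
    """Extract the MMLU weighted average accuracy from trtllm-eval output.

    Accepts str or bytes and raises ValueError when the expected
    accuracy line is missing, per the suggested diff.
    """
    if isinstance(output, (bytes, bytearray)):
        output = output.decode(errors="replace")
    match = re.search(
        r"MMLU weighted average accuracy:\s*([0-9]+(?:\.[0-9]+)?)", output)
    if not match:
        raise ValueError(
            "Failed to parse 'MMLU weighted average accuracy' from output.\n"
            f"Full output:\n{output}")
    return float(match.group(1))
```

This tolerates surrounding rank-interleaved log noise, since `re.search` scans the whole capture rather than expecting an exact line.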
🧹 Nitpick comments (3)

tests/integration/defs/common.py (1)

1-2: Update the copyright year to include 2025.

The coding guidelines ask to prepend the NVIDIA copyright header with the current year. Please extend the range to 2025.

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

tests/integration/defs/test_e2e.py (2)
2579-2619: Solid multi-node eval flow; add small robustness/debug nits.

The test is well-structured and uses the shared parser. A couple of small improvements would help triage flakes and make output clearer:

- Print the composed command and topology to the log before launching.
- Rely entirely on the captured output for parsing (your helper already handles mixed-rank logs), so the SLURM_PROCID gate is unnecessary in most CI contexts. If you keep it, consider documenting why.

Minimal debug-friendly tweak:

```diff
 def test_multi_nodes_eval(llm_venv, model_path, tp_size, pp_size, ep_size,
                           eval_task):
     if "Llama-4" in model_path and tp_size == 16:
         pytest.skip("Llama-4 with tp16 is not supported")

     mmlu_threshold = 81.5
     run_cmd = [
         "trtllm-llmapi-launch",
         "trtllm-eval",
         f"--model={llm_models_root()}/{model_path}",
         f"--ep_size={ep_size}",
         f"--tp_size={tp_size}",
         f"--pp_size={pp_size}",
         f"--kv_cache_free_gpu_memory_fraction={_MEM_FRACTION_80}",
         "--max_batch_size=32",
         eval_task,
     ]
+    print(f"Launching eval for {model_path} with tp={tp_size}, pp={pp_size}, ep={ep_size}")
+    print("Command:", " ".join(run_cmd))
     output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)

-    if os.environ.get("SLURM_PROCID", '0') == '0':
-        mmlu_accuracy = get_mmlu_accuracy(output)
-        assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    # Parse and assert on captured output (rank-agnostic).
+    mmlu_accuracy = get_mmlu_accuracy(output)
+    assert mmlu_accuracy > mmlu_threshold, (
+        f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}")
```

Note: if different models have materially different expected MMLU, consider per-model thresholds to reduce flakes. Happy to help wire that if desired.
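Per-model thresholds could be wired as a simple lookup with a conservative fallback; a sketch assuming the parametrize model paths serve as keys (the threshold numbers below are placeholders for illustration, not validated expectations):

```python
# Hypothetical per-model MMLU thresholds; values are placeholders.
MMLU_THRESHOLDS = {
    "llama4-models/Llama-4-Maverick-17B-128E-Instruct": 81.5,
    "llama4-models/Llama-4-Maverick-17B-128E-Instruct-FP8": 81.0,
}
DEFAULT_MMLU_THRESHOLD = 80.0


def mmlu_threshold_for(model_path: str) -> float:
    """Look up a model-specific threshold, falling back to a default."""
    return MMLU_THRESHOLDS.get(model_path, DEFAULT_MMLU_THRESHOLD)
```

The test would then call `mmlu_threshold_for(model_path)` instead of hardcoding 81.5, so a single flaky model can be tuned without loosening the bound for all of them.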
1-2: Update the copyright year to include 2025.

Align with the coding guideline that headers reflect the current year.

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
📒 Files selected for processing (3)

- tests/integration/defs/common.py (1 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/integration/test_lists/qa/llm_function_multinode.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/qa/llm_function_multinode.txt
🪛 Ruff (0.12.2)

tests/integration/defs/test_e2e.py

31-31: Redefinition of unused convert_weights from line 27 (F811)
🔇 Additional comments (1)

tests/integration/defs/test_e2e.py (1)

46-46: LGTM: memory fraction constant.

Adding _MEM_FRACTION_80 is consistent with existing constants and improves clarity where 0.8 is used.
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (2)

tests/integration/defs/test_e2e.py (2)

2616-2618: Rank-0-only assertion is fine; make sure get_mmlu_accuracy is robust (already addressed).

Only asserting on SLURM_PROCID==0 avoids duplicate failures, which is good. Ensure get_mmlu_accuracy handles bytes vs. str and format drift via regex (looks addressed by moving parsing into common.get_mmlu_accuracy per the prior fix).

31-33: Remove the duplicate import of convert_weights; satisfy Ruff F811 and unify import style.

convert_weights is imported on line 27 (absolute) and again here (relative), triggering Ruff F811 and confusing readers. Keep the relative block and drop the absolute import. While here, make trt_test_alternative a relative import for consistency. Apply this diff:

```diff
-from defs.common import convert_weights
-from defs.trt_test_alternative import (check_call, check_call_negative_test,
-                                       check_output)
+from .trt_test_alternative import (check_call, check_call_negative_test,
+                                   check_output)
```
🧹 Nitpick comments (2)

tests/integration/defs/test_e2e.py (2)

2599-2601: Prefer a parametrized skip over a runtime skip for unsupported combos (optional).

Instead of an in-test pytest.skip, you can pre-mark the unsupported combination (Llama-4 with tp16) as skipped at collection time using pytest.param(..., marks=pytest.mark.skip(...)). This reduces noise in reports and avoids scheduling work that will be skipped. Example (illustrative; adapt to your pattern):

```python
@pytest.mark.parametrize(
    "model_path,tp_size,pp_size,ep_size",
    [
        pytest.param("llama4-models/Llama-4-Maverick-17B-128E-Instruct", 8, 2, 8),
        pytest.param("llama4-models/Llama-4-Maverick-17B-128E-Instruct-FP8", 8, 2, 8),
        # Skip unsupported tp16 combos at collection time:
        pytest.param("llama4-models/Llama-4-Maverick-17B-128E-Instruct", 16, 1, 8,
                     marks=pytest.mark.skip(reason="Llama-4 with tp16 is not supported")),
    ],
)
```
1-14: Optional: update the copyright year range.

The repo guideline says to prepend the NVIDIA copyright header with the current year. If the project wants ranges updated annually, consider bumping to 2022-2025 in the next touch.
📒 Files selected for processing (3)

- tests/integration/defs/common.py (1 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/integration/test_lists/qa/llm_function_multinode.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/defs/common.py
🔇 Additional comments (5)

tests/integration/test_lists/qa/llm_function_multinode.txt (1)

2-7: Unable to verify test IDs in sandbox; please confirm locally.

The sandbox environment lacks pytest, so I couldn't collect the actual test_multi_nodes_eval node IDs. To ensure the duplicated "Instruct" token is truly invalid, please run this locally and verify the list of collected tests:

```shell
pytest --collect-only -q tests/integration/defs/test_e2e.py \
  | sed -n 's/^\(.*test_multi_nodes_eval.*\)$/\1/p'
```

Once you've confirmed that the FP8 variant is registered without the extra "Instruct", apply the following change:

```diff
-test_e2e.py::test_multi_nodes_eval[llama4-models/Llama-4-Maverick-17B-128E-Instruct-FP8-Instruct-tp8pp2-mmlu]
+test_e2e.py::test_multi_nodes_eval[llama4-models/Llama-4-Maverick-17B-128E-Instruct-FP8-tp8pp2-mmlu]
```

tests/integration/defs/test_e2e.py (4)
46-46: Good: add _MEM_FRACTION_80 for eval runs.

Using a dedicated 0.8 fraction clarifies intent and avoids magic numbers later in the eval command. No action needed.

2579-2585: Param ids produce stable test node ids; ensure the test list uses the same order.

Stacked parametrization and the function signature (llm_venv, model_path, tp_size, pp_size, ep_size, eval_task) will yield ids like [model_path]-[tp id]-[eval_task], e.g., DeepSeek-R1/...-tp16-mmlu. Your test list follows this pattern; just be aware that changing the argument order in the signature will change the ids. You can re-run the collection command from the previous comment to sanity-check the final ids emitted by pytest.
2586-2596: Model set looks right; aligns with the new test list.

The model paths enumerate Llama 3.3 70B, Llama-4 Maverick (BF16/FP8), Qwen3 235B (BF16/NVFP4), and DeepSeek-R1 FP4, with appropriate Hopper/Blackwell gating. Matches the entries in the multi-node list once the FP8 id typo is fixed.

2603-2613: Command construction LGTM; explicit kv-cache fraction and batch size improve reproducibility.

The trtllm-llmapi-launch trtllm-eval ... mmlu invocation is clear, parameters are explicit, and _MEM_FRACTION_80 is applied consistently.
PR_Github #16042 [ run ] triggered by Bot

PR_Github #16042 [ run ] completed with state
Actionable comments posted: 0
♻️ Duplicate comments (1)

tests/integration/defs/test_e2e.py (1)

31-33: Fix the duplicate import of convert_weights (Ruff F811) and unify import style.

You re-import convert_weights via a relative import while it's already imported absolutely at line 27, triggering F811 and confusing readers. Remove the absolute import and switch trt_test_alternative to a relative import for consistency with the rest of this block. Apply this diff:

```diff
-from defs.common import convert_weights
-from defs.trt_test_alternative import (check_call, check_call_negative_test,
-                                       check_output)
+from .trt_test_alternative import (check_call, check_call_negative_test,
+                                   check_output)
```
🧹 Nitpick comments (3)

tests/integration/defs/test_e2e.py (3)

2719-2721: Make rank detection robust beyond Slurm.

Relying solely on SLURM_PROCID may cause all ranks to parse/assert when launched under non-Slurm MPI (OMPI/PMI) or PyTorch (RANK). Probe common env vars and default to rank 0 only if none are set. Apply this diff:

```diff
-    if os.environ.get("SLURM_PROCID", '0') == '0':
+    rank = (os.environ.get("SLURM_PROCID")
+            or os.environ.get("OMPI_COMM_WORLD_RANK")
+            or os.environ.get("PMI_RANK")
+            or os.environ.get("RANK")
+            or "0")
+    if str(rank) == "0":
         mmlu_accuracy = get_mmlu_accuracy(output)
         assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
```
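The env-probing order could also live in a small reusable helper; a sketch under the assumption that these four launcher variables are the ones that matter (matching the probing order suggested above):

```python
import os

# Common launcher rank variables, probed in priority order (an assumption).
_RANK_ENV_VARS = ("SLURM_PROCID", "OMPI_COMM_WORLD_RANK", "PMI_RANK", "RANK")


def is_rank_zero(env=os.environ) -> bool:
    """Return True on the first rank, probing common launcher env vars.

    Defaults to rank 0 when no launcher variable is set, so single-process
    runs still parse and assert.
    """
    for var in _RANK_ENV_VARS:
        value = env.get(var)
        if value is not None:
            return str(value) == "0"
    return True
```

Accepting `env` as a parameter keeps the helper trivially unit-testable without mutating the process environment.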
2705-2705: Nit: log the parsed accuracy to aid triage.

Printing the observed accuracy helps quickly diagnose borderline failures without re-running the job. Example:

```diff
-    mmlu_threshold = 81.5
+    mmlu_threshold = 81.5
+    print(f"[test_multi_nodes_eval] threshold={mmlu_threshold}")
```

And after parsing:

```diff
-    assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    print(f"[test_multi_nodes_eval] observed MMLU accuracy={mmlu_accuracy}")
+    assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
```
2706-2718: Use llm_venv.run_cmd for this evaluation step.

Switching from a shell-joined check_output(..., shell=True) call to llm_venv.run_cmd will:

- Avoid quoting and escaping pitfalls when arguments contain spaces or special characters
- Centralize environment and subprocess handling in one place
- Match the established pattern used throughout tests/integration/defs/test_e2e.py and other tests

Affected location: tests/integration/defs/test_e2e.py, around lines 2706-2718. Suggested diff:

```diff
-    output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)
+    output = llm_venv.run_cmd(run_cmd, caller=check_output)
```

No additional changes to get_mmlu_accuracy are required; its parsing of the bytes-decoded output remains valid when using run_cmd.
📒 Files selected for processing (4)

- tests/integration/defs/common.py (1 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/integration/test_lists/qa/llm_function_multinode.txt (1 hunks)
- tests/integration/test_lists/waives.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/defs/common.py
- tests/integration/test_lists/qa/llm_function_multinode.txt
🪛 Ruff (0.12.2)
tests/integration/defs/test_e2e.py
31-31: Redefinition of unused convert_weights
from line 27
(F811)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/defs/test_e2e.py (1)
46-46: LGTM: well-scoped constant and consistent naming.

_MEM_FRACTION_80 follows existing convention (_MEM_FRACTION_50/_95), improves readability, and avoids magic numbers.
tests/integration/test_lists/waives.txt (1)
324-324: Waive entry matches generated pytest node id; OK to land.

The node id ordering aligns with the function signature (model_path, tp/pp/ep id, eval_task). The scope is narrow (FP8 tp8pp2 mmlu), with a valid nvbugs reference.
/bot run
PR_Github #16103 [ run ] triggered by Bot
ab91f8e to a6b4728 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/integration/defs/test_e2e.py (1)
31-33: Fix duplicate import of convert_weights (Ruff F811) — keep the relative import, drop the absolute one.

convert_weights is imported both absolutely (Line 27) and relatively here, triggering F811 and breaking CI. Remove the absolute import and keep this relative import for consistency with other local imports.

Apply outside-this-hunk change (Line 27 removal):

```python
# Remove this absolute import at Line 27
from defs.common import convert_weights
```

Optional consistency cleanup (outside this hunk): switch defs.trt_test_alternative to a relative import like the rest of the file:

```python
from .trt_test_alternative import (check_call, check_call_negative_test,
                                   check_output)
```

Quick check:

```shell
#!/bin/bash
# Expect no output after the fix
rg -nP 'from\s+defs\.common\s+import\s+convert_weights' tests/integration/defs/test_e2e.py || true
```
🧹 Nitpick comments (3)
tests/integration/defs/test_e2e.py (3)
2682-2688: Add skip_nvlink_inactive to align gating with other multi-node tests.

Other multi-node tests in this file are guarded with @skip_nvlink_inactive. This eval path likely uses similar interconnect assumptions; adding the marker prevents spurious failures on hosts without NVLink.

Apply this diff:

```diff
 @pytest.mark.timeout(5400)
 @pytest.mark.skip_less_device_memory(80000)
 @pytest.mark.skip_less_device(4)
+@skip_nvlink_inactive
 @pytest.mark.parametrize("eval_task", ["mmlu"])
 @pytest.mark.parametrize("tp_size,pp_size,ep_size", [(16, 1, 8), (8, 2, 8)],
                          ids=["tp16", "tp8pp2"])
```
2705-2705: Make the MMLU threshold configurable to reduce flakiness across models/runs.

A single hard-coded threshold for diverse models/precisions can be brittle. Allow an env override to fine-tune per model/CI lane without code changes.

Apply this diff:

```diff
-    mmlu_threshold = 81.5
+    # Allow CI lanes to override if needed (default stays strict).
+    default_threshold = 81.5
+    mmlu_threshold = float(os.getenv("MMLU_THRESHOLD", default_threshold))
```
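The env-override pattern in that suggestion behaves like this minimal sketch (MMLU_THRESHOLD is the hypothetical variable name from the diff, not an existing TensorRT-LLM setting):

```python
import os

def resolve_threshold(default=81.5):
    """Strict default unless the CI lane overrides it via MMLU_THRESHOLD."""
    return float(os.getenv("MMLU_THRESHOLD", default))

os.environ.pop("MMLU_THRESHOLD", None)
assert resolve_threshold() == 81.5      # unset: strict default applies

os.environ["MMLU_THRESHOLD"] = "75.0"
assert resolve_threshold() == 75.0      # set: lane-specific override wins
print("override works")
```

Note that os.getenv returns a string when the variable is set, so the float() conversion is what keeps both branches type-consistent.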
2685-2685: Minor: single-value parameterization for eval_task is fine, but optional to inline for clarity.

If there's no near-term plan to add more tasks, inlining "mmlu" can simplify the parametrization matrix slightly. If expansion is imminent, keeping it is fine.
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/defs/common.py
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_multinode.txt
🔇 Additional comments (2)
tests/integration/defs/test_e2e.py (2)
46-46: LGTM: dedicated 80% KV-cache free-memory fraction constant.

Naming and placement are consistent with existing _MEM_FRACTION_* constants, and this improves readability where the value is used.
2713-2716: No change needed: CLI flag exists

The trtllm-eval command defines the --kv_cache_free_gpu_memory_fraction option in tensorrt_llm/commands/eval.py, so the integration test's use of that flag is correct and will parse successfully.

• tensorrt_llm/commands/eval.py – defines @click.option("--kv_cache_free_gpu_memory_fraction", …)
• tests/integration/defs/test_e2e.py – uses --kv_cache_free_gpu_memory_fraction={_MEM_FRACTION_80}
PR_Github #16103 [ run ] completed with state
a6b4728 to 85ad823 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/integration/defs/test_e2e.py (1)
2719-2721: Don't gate accuracy check on pytest's SLURM_PROCID — causes false passes.

SLURM_PROCID here refers to the pytest process environment, not ranks launched by trtllm-llmapi-launch. This can skip the assertion entirely when pytest runs under SLURM with a non-zero proc id. Always parse and assert.

Apply this diff:

```diff
-    if os.environ.get("SLURM_PROCID", '0') == '0':
-        mmlu_accuracy = get_mmlu_accuracy(output)
-        assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    mmlu_accuracy = get_mmlu_accuracy(output)
+    assert mmlu_accuracy > mmlu_threshold, (
+        f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    )
```
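The false-pass mechanism is easy to reproduce in isolation: when the guarding process itself carries a non-zero SLURM_PROCID, the guarded branch never runs. A toy sketch (not the actual test code):

```python
def maybe_check(env, runs):
    # Mirrors the guarded branch: only "rank 0" would assert accuracy.
    if env.get("SLURM_PROCID", "0") == "0":
        runs.append("asserted")

runs = []
maybe_check({}, runs)                      # no SLURM vars: check executes
maybe_check({"SLURM_PROCID": "3"}, runs)   # pytest launched as rank 3: check silently skipped

assert runs == ["asserted"]  # only 1 of 2 invocations actually verified accuracy
print(len(runs))
```

Dropping the guard makes the assertion unconditional, which is what the suggested diff does.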
🧹 Nitpick comments (1)
tests/integration/defs/test_e2e.py (1)
2682-2699: Parameter grid looks fine; consider per-model timeouts to avoid flakiness.

5400s can be tight for very large models (e.g., Qwen3-235B). If you see timeouts, bump timeout via per-param marks rather than globally.

Example tweak:

```diff
-@pytest.mark.timeout(5400)
+@pytest.mark.timeout(5400)
 @pytest.mark.skip_less_device_memory(80000)
 @pytest.mark.skip_less_device(4)
 @pytest.mark.parametrize("eval_task", ["mmlu"])
 @pytest.mark.parametrize("tp_size,pp_size,ep_size", [(16, 1, 8), (8, 2, 8)],
                          ids=["tp16", "tp8pp2"])
 @pytest.mark.parametrize("model_path", [
-    pytest.param('llama-3.3-models/Llama-3.3-70B-Instruct', marks=skip_pre_hopper),
-    pytest.param('llama4-models/Llama-4-Maverick-17B-128E-Instruct', marks=skip_pre_hopper),
-    pytest.param('llama4-models/nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8', marks=skip_pre_hopper),
-    pytest.param('Qwen3/Qwen3-235B-A22B', marks=skip_pre_hopper),
+    pytest.param('llama-3.3-models/Llama-3.3-70B-Instruct', marks=skip_pre_hopper),
+    pytest.param('llama4-models/Llama-4-Maverick-17B-128E-Instruct', marks=skip_pre_hopper),
+    pytest.param('llama4-models/nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8', marks=skip_pre_hopper),
+    pytest.param('Qwen3/Qwen3-235B-A22B', marks=(skip_pre_hopper, pytest.mark.timeout(7200))),
     pytest.param('Qwen3/saved_models_Qwen3-235B-A22B_nvfp4_hf', marks=skip_pre_blackwell),
     pytest.param('DeepSeek-R1/DeepSeek-R1-0528-FP4', marks=skip_pre_blackwell),
 ])
```
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/defs/common.py
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_multinode.txt
🔇 Additional comments (2)
tests/integration/defs/test_e2e.py (2)
46-46: LGTM: new KV cache memory fraction constant.

_MEM_FRACTION_80 = 0.8 is named consistently with existing constants and scoped for reuse.
2705-2717: Command construction is reasonable; option set matches eval intent.

Using trtllm-llmapi-launch trtllm-eval ... mmlu with --kv_cache_free_gpu_memory_fraction=0.8 and capped --max_batch_size=32 aligns with the goal of reproducible multi-node accuracy checks.
85ad823 to ad2fee0 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/integration/defs/test_e2e.py (2)
31-33: Resolve Ruff F811 (duplicate convert_weights) and unify imports to relative.

convert_weights is imported absolutely on Line 27 and again relatively here, tripping Ruff F811 and breaking CI. Drop the absolute import and convert the trt_test_alternative import to relative to match the module style.

Apply:

```diff
@@
-from defs.common import convert_weights
-from defs.trt_test_alternative import (check_call, check_call_negative_test,
-                                       check_output)
+from .trt_test_alternative import (check_call, check_call_negative_test,
+                                   check_output)
```

After this change, keep the relative import here as-is:

```python
from .common import (PluginOptions, convert_weights, get_mmlu_accuracy,
                     prune_checkpoint, quantize_data, refit_model,
                     venv_check_call)
```
2719-2721: Don't gate MMLU accuracy assertion on SLURM_PROCID; always assert.

SLURM_PROCID here is for the pytest process, not ranks launched by trtllm-llmapi-launch. This can skip the check entirely and produce false passes.

Apply:

```diff
-    if os.environ.get("SLURM_PROCID", '0') == '0':
-        mmlu_accuracy = get_mmlu_accuracy(output)
-        assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    mmlu_accuracy = get_mmlu_accuracy(output)
+    assert mmlu_accuracy > mmlu_threshold, (
+        f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    )
```
🧹 Nitpick comments (2)
tests/integration/defs/test_e2e.py (2)
2682-2699: Consider adding NVLink guard and revisiting device gating for multinode configs.

- Other multinode tests in this file use @skip_nvlink_inactive; if these TP/PP/EP configs require NVLink for correctness/perf, add the same guard for consistency.
- The world size implied by (tp, pp, ep) = (16,1,8) or (8,2,8) is large. If CI scheduling expects external cluster resources via trtllm-llmapi-launch, that's fine; otherwise, the current @skip_less_device(4) may misleadingly permit local runs that can never satisfy the world size.

Proposed decorator addition:

```diff
@@
+@skip_nvlink_inactive
 @pytest.mark.timeout(5400)
 @pytest.mark.skip_less_device_memory(80000)
 @pytest.mark.skip_less_device(4)
```
2705-2718: Prefer llm_venv.run_cmd over shell-joined strings when capturing output.

Using llm_venv.run_cmd([...], caller=check_output) avoids shell-quoting pitfalls and keeps parity with the rest of this module.

```diff
-    output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)
+    output = llm_venv.run_cmd(run_cmd, caller=check_output)
```
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/defs/common.py
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_multinode.txt
🔇 Additional comments (1)
tests/integration/defs/test_e2e.py (1)
46-46: LGTM: memory-fraction constant added consistently.

_MEM_FRACTION_80 aligns with _MEM_FRACTION_50 and _MEM_FRACTION_95 naming.
Signed-off-by: Xin He (SW-GPU) <[email protected]>
ad2fee0 to 35846ae (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (3)
tests/integration/defs/common.py (1)
961-979: Harden MMLU accuracy parsing (regex, bytes-safe, last occurrence) and add a brief docstring.

The split-chain is brittle and will break if the formatter adds a percent sign or extra text; it also grabs only the first match, which can be wrong in multi-rank logs. Decode bytes defensively, parse with a regex, and pick the last occurrence.

Apply this diff:

```diff
-def get_mmlu_accuracy(output):
-    mmlu_line = None
-    for line in output.split('\n'):
-        if "MMLU weighted average accuracy:" in line:
-            mmlu_line = line
-            break
-
-    if mmlu_line is None:
-        raise Exception(
-            f"Could not find 'MMLU weighted average accuracy:' in output. Full output:\n{output}"
-        )
-
-    mmlu_accuracy = float(
-        mmlu_line.split("MMLU weighted average accuracy: ")[1].split(" (")[0])
-
-    print(f"MMLU weighted average accuracy is: {mmlu_accuracy}")
-
-    return mmlu_accuracy
+def get_mmlu_accuracy(output):
+    """Extract the MMLU weighted average accuracy from trtllm-eval output."""
+    if isinstance(output, (bytes, bytearray)):
+        output = output.decode(errors="replace")
+    # Robustly match and take the last occurrence (multi-rank logs may print multiple times).
+    matches = re.findall(
+        r"MMLU weighted average accuracy:\s*([0-9]+(?:\.[0-9]+)?)%?", output
+    )
+    if not matches:
+        tail = "\n".join(output.splitlines()[-200:])
+        raise Exception(
+            "Could not find 'MMLU weighted average accuracy:' in output. "
+            f"Last 200 lines:\n{tail}"
+        )
+    mmlu_accuracy = float(matches[-1].rstrip("%"))
+    print(f"MMLU weighted average accuracy is: {mmlu_accuracy}")
+    return mmlu_accuracy
```

tests/integration/defs/test_e2e.py (2)
27-33: Fix Ruff F811 and unify on relative imports.

convert_weights is imported twice (absolute on Line 27 and relative on Lines 31–33), tripping Ruff F811 and causing CI failures. Drop the absolute imports and keep the relative ones for consistency.

Apply this diff:

```diff
@@
-from defs.common import convert_weights
-from defs.trt_test_alternative import (check_call, check_call_negative_test,
-                                       check_output)
+from .trt_test_alternative import (check_call, check_call_negative_test,
+                                   check_output)
@@
-from .common import (PluginOptions, convert_weights, get_mmlu_accuracy,
+from .common import (PluginOptions, convert_weights, get_mmlu_accuracy,
                      prune_checkpoint, quantize_data, refit_model,
                      venv_check_call)
```

After the change, verify no remaining absolute imports in this module:

```shell
#!/bin/bash
rg -n -e 'from\s+defs\.' tests/integration/defs/test_e2e.py || true
```
2682-2721: Always assert accuracy; don't gate on SLURM_PROCID.

Gating the assertion on SLURM_PROCID can cause false passes when pytest itself runs under SLURM. Always parse and assert from the captured output.

Apply this diff:

```diff
@@
-    output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)
-
-    if os.environ.get("SLURM_PROCID", '0') == '0':
-        mmlu_accuracy = get_mmlu_accuracy(output)
-        assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)
+    mmlu_accuracy = get_mmlu_accuracy(output)
+    assert mmlu_accuracy > mmlu_threshold, (
+        f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    )
```
🧹 Nitpick comments (1)
tests/integration/defs/test_e2e.py (1)
2702-2704: Prefer param-level skip over runtime skip for unsupported TP/PP combos.

Instead of a runtime pytest.skip, consider expressing the unsupported combination directly in the parametrization (e.g., pytest.param(..., marks=pytest.mark.skip(reason="..."))), which yields clearer collection-time reporting and avoids executing setup for doomed configs.
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_multinode.txt
🔇 Additional comments (1)
tests/integration/defs/test_e2e.py (1)
46-46: LGTM: add 0.8 KV-cache fraction constant.

Naming aligns with existing constants (_MEM_FRACTION_50, _MEM_FRACTION_95). No action needed.
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.