[TRTLLM-7245][feat] add test_multi_nodes_eval tests #7108
📝 Walkthrough

Adds a parser for MMLU accuracy, a parameterized multi-node MMLU evaluation test that launches trtllm-llmapi-launch + trtllm-eval with kv-cache memory-fraction and batch settings, updates the multinode QA test list to use the new test, and appends a SKIP waiver for one FP8 multi-node test.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant PyTest as PyTest (ranks 0..N-1)
    participant Launcher as trtllm-llmapi-launch
    participant Eval as trtllm-eval
    participant Parser as get_mmlu_accuracy
    rect rgba(200,230,255,0.20)
        note left of PyTest: Parametrize model_path, tp/pp/ep, eval_task="mmlu"
        PyTest->>Launcher: launch with tp/pp/ep, kv_cache_free_gpu_memory_fraction=0.8, max_batch_size=32
        Launcher->>Eval: invoke evaluation across nodes
        Eval-->>Launcher: stdout containing "MMLU weighted average accuracy: <value> (%)"
        Launcher-->>PyTest: relay captured stdout
    end
    alt SLURM rank 0
        PyTest->>Parser: get_mmlu_accuracy(stdout)
        Parser-->>PyTest: parsed accuracy (float)
        PyTest->>PyTest: assert accuracy > 81.5
    else other ranks
        PyTest->>PyTest: skip accuracy parsing/assertion
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/integration/defs/accuracy/accuracy_core.py (1)

345-363: Rename the duplicated MMLU class to MMLUMultimodal and update dispatch.

To avoid shadowing the original MMLU class and ensure both variants are correctly handled:

- In tests/integration/defs/accuracy/accuracy_core.py (lines 345-363), rename the second class definition:

```diff
-class MMLU(AccuracyTask):
+class MMLUMultimodal(AccuracyTask):
     DATASET = "MMLU"
     DATASET_DIR = f"{llm_models_root()}/datasets/MMLU"
     ALPHA = 0.05
     BETA = 0.2
     SIGMA = 50
     NUM_SAMPLES = 900
     MAX_BATCH_SIZE = 128
     MAX_INPUT_LEN = 8192
     MAX_OUTPUT_LEN = 512
     EVALUATOR_CLS = tensorrt_llm.evaluate.MMLU
     EVALUATOR_KWARGS = dict(dataset_path=DATASET_DIR,
                             random_seed=0,
                             is_multimodal=True,
                             apply_chat_template=True)
```

- Also align the dataset name/directory casing to lowercase if that matches the on-disk structure:

```diff
-    DATASET = "MMLU"
-    DATASET_DIR = f"{llm_models_root()}/datasets/MMLU"
+    DATASET = "mmlu"
+    DATASET_DIR = f"{llm_models_root()}/datasets/mmlu"
```

- In the same file (around lines 709-714), update the task dispatcher so both classes route to mmlu():

```diff
 def evaluate(self):
     for task in self.tasks:
-        elif isinstance(task, MMLU):
-            self.mmlu(task)
+        elif isinstance(task, (MMLU, MMLUMultimodal)):
+            self.mmlu(task)
```

These changes will:

- Prevent the second definition from overwriting the first.
- Maintain backward compatibility of the existing MMLU task.
- Clearly distinguish between the standard and multimodal variants.
tensorrt_llm/commands/eval.py (1)

148-156: Duplicate MMLU command registration (remove the second one).

MMLU is added twice to the CLI, which will either be a no-op or cause confusing help output. Apply this diff to remove the duplicate registration:

```diff
 main.add_command(CnnDailymail.command)
 main.add_command(MMLU.command)
 main.add_command(GSM8K.command)
 main.add_command(GPQADiamond.command)
 main.add_command(GPQAMain.command)
 main.add_command(GPQAExtended.command)
 main.add_command(JsonModeEval.command)
-main.add_command(MMLU.command)
```
tensorrt_llm/evaluate/lm_eval.py (1)

661-667: MMLU is not a multimodal task; remove the multimodal/chat overrides and the forced stop token.

The comment and flags appear to be carried over from MMMU. Setting is_multimodal=True forces the multimodal wrapper and can break MMLU evaluation. Forcing chat templates and a hardcoded stop token is also unnecessary and may skew results. Apply this diff to correct the behavior:

```diff
 @staticmethod
 def command(ctx, **kwargs) -> None:
-    # NOTE: MMLU is a multimodal task, so we need to set the is_multimodal and apply_chat_template flags to True
-    kwargs["is_multimodal"] = True
-    kwargs["apply_chat_template"] = True
-    kwargs[
-        "stop"] = "<|endoftext|>"  # NOTE: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/MMLU/_template_yaml#L10
-    MMLU.command_harness(ctx, **kwargs)
+    # MMLU is a text-only task; use the default (non-multimodal) wrapper and defaults.
+    MMLU.command_harness(ctx, **kwargs)
```
🧹 Nitpick comments (6)

docs/source/performance/perf-benchmarking.md (1)

556-556: Dataset rename looks correct; consider a minor CLI consistency nit.

The switch to lmms-lab/MMLU aligns with the PR intent. As a minor consistency nit, elsewhere we sometimes quote keys (e.g., --dataset-prompt-key "question"); consider quoting consistently across docs to avoid shell-interpretation edge cases.
tests/integration/defs/test_e2e.py (2)
2577-2584
: Consider adding NVLink guard to match other multi-node tests.Other multi-node tests include @skip_nvlink_inactive to avoid failures on nodes without NVLink. This test likely benefits from the same guard.
Suggested change:
+@skip_nvlink_inactive @pytest.mark.skip_less_device_memory(80000) @pytest.mark.skip_less_device(4) @pytest.mark.parametrize("eval_task", ["mmlu"]) @pytest.mark.parametrize("tp_size,pp_size,ep_size", [(16, 1, 8)], ids=["tp16"])
2589-2590: Sanity-check threshold stability.

mmlu_threshold = 80 may be tight given minor model-version variance or eval-harness updates. If this runs on diverse CI farms, consider a slightly lower bound or making it configurable via an environment variable to reduce flakiness.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

2703-2716: Avoid hardcoding the stop token; derive it from the tokenizer instead.

Using a literal "<|endoftext|>" risks a mismatch across tokenizers (e.g., Qwen variants may use a different EOS string). Construct SamplingParams after LLM is instantiated so you can pull llm.tokenizer.eos_token. Apply this diff to move sampling_params creation into the method and fetch EOS dynamically:

```diff
@@
-    # NOTE: MMLU adds <|endoftext|> to the stop token.
-    sampling_params = SamplingParams(max_tokens=MMLU.MAX_OUTPUT_LEN,
-                                     truncate_prompt_tokens=MMLU.MAX_INPUT_LEN,
-                                     stop="<|endoftext|>")
+    # NOTE: Build SamplingParams at runtime to use the model's EOS token.
@@
 def test_auto_dtype(self):
-    with LLM(self.MODEL_PATH,
-             max_num_tokens=16384,
-             kv_cache_config=self.kv_cache_config) as llm:
-        task = MMLU(self.MODEL_NAME)
-        task.evaluate(llm, sampling_params=self.sampling_params)
+    with LLM(self.MODEL_PATH,
+             max_num_tokens=16384,
+             kv_cache_config=self.kv_cache_config) as llm:
+        sampling_params = SamplingParams(
+            max_tokens=MMLU.MAX_OUTPUT_LEN,
+            truncate_prompt_tokens=MMLU.MAX_INPUT_LEN,
+            stop=getattr(llm.tokenizer, "eos_token", "<|endoftext|>"))
+        task = MMLU(self.MODEL_NAME)
+        task.evaluate(llm, sampling_params=sampling_params)
```

tensorrt_llm/evaluate/lm_eval.py (2)
620-626: Use a lowercase CLI command name for consistency (mmlu).

All other commands use lowercase (e.g., gsm8k, gpqa_diamond). Uppercase "MMLU" is inconsistent and surprising. Apply this diff:

```diff
 class MMLU(LmEvalEvaluator):
@@
-    @click.command("MMLU")
+    @click.command("mmlu")
```

Note: registration remains main.add_command(MMLU.command) and will pick up the new command name automatically.
653-657: Nit: update the comment/link casing to match lm-eval's layout.

The reference path uses "MMLU", but the upstream repo typically uses lowercase task directories (e.g., tasks/mmlu). This is a doc comment only, but correcting the casing helps future readers.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)

- docs/source/performance/perf-benchmarking.md (1 hunks)
- examples/llm-api/out_of_tree_example/readme.md (1 hunks)
- tensorrt_llm/commands/eval.py (2 hunks)
- tensorrt_llm/evaluate/__init__.py (1 hunks)
- tensorrt_llm/evaluate/lm_eval.py (2 hunks)
- tests/integration/defs/accuracy/accuracy_core.py (2 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (3 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/test_e2e.py
tensorrt_llm/evaluate/__init__.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/defs/accuracy/accuracy_core.py
tensorrt_llm/evaluate/lm_eval.py
tensorrt_llm/commands/eval.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/test_e2e.py
tensorrt_llm/evaluate/__init__.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/defs/accuracy/accuracy_core.py
tensorrt_llm/evaluate/lm_eval.py
tensorrt_llm/commands/eval.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/test_e2e.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/commands/eval.py
🪛 Ruff (0.12.2)
tensorrt_llm/evaluate/__init__.py

19-19: Redefinition of unused MMLU from line 18 (F811)

tests/integration/defs/accuracy/accuracy_core.py

345-345: Redefinition of unused MMLU from line 276 (F811)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
examples/llm-api/out_of_tree_example/readme.md (1)

45-45: Good update; mirrors the docs change.

The dataset name is updated to lmms-lab/MMLU and the rest of the invocation remains the same. Looks good.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

34-35: MMLU import swap looks good.

Switching to MMLU and importing JsonModeEval/LlmapiAccuracyTestHarness is consistent with the broader rename and test-harness usage.
tensorrt_llm/commands/eval.py (1)

23-24: Import updates align with the rename.

Replacing MMMU with MMLU in the imports is consistent with the rest of the changes.
Actionable comments posted: 1
♻️ Duplicate comments (1)

tests/integration/defs/common.py (1)

961-979: Harden MMLU accuracy parsing; accept bytes and fail fast with a clear error.

The chained splits are brittle and will break if the report format changes slightly (or if the output is bytes). Use a single regex, handle bytes, and raise a precise error. Apply this diff:

```diff
 def get_mmlu_accuracy(output):
-    mmlu_line = None
-    for line in output.split('\n'):
-        if "MMLU weighted average accuracy:" in line:
-            mmlu_line = line
-            break
-
-    if mmlu_line is None:
-        raise Exception(
-            f"Could not find 'MMLU weighted average accuracy:' in output. Full output:\n{output}"
-        )
-
-    mmlu_accuracy = float(
-        mmlu_line.split("MMLU weighted average accuracy: ")[1].split(" (")[0])
-
-    print(f"MMLU weighted average accuracy is: {mmlu_accuracy}")
-
-    return mmlu_accuracy
+    """Extract the MMLU weighted average accuracy from evaluator output.
+
+    The function is robust to minor formatting changes and handles byte outputs.
+
+    Args:
+        output: The captured stdout from `trtllm-eval` (str or bytes).
+
+    Returns:
+        float: The parsed MMLU weighted average accuracy.
+
+    Raises:
+        ValueError: If the expected accuracy line cannot be found/parsed.
+    """
+    if isinstance(output, (bytes, bytearray)):
+        output = output.decode(errors="replace")
+    match = re.search(
+        r"MMLU weighted average accuracy:\s*([0-9]+(?:\.[0-9]+)?)", output)
+    if not match:
+        raise ValueError(
+            "Failed to parse 'MMLU weighted average accuracy' from output.\n"
+            f"Full output:\n{output}")
+    mmlu_accuracy = float(match.group(1))
+    print(f"MMLU weighted average accuracy is: {mmlu_accuracy}")
+    return mmlu_accuracy
```
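The regex-based parser suggested above can be exercised standalone; a minimal self-contained sketch mirroring that suggestion (not the committed code):

```python
import re


def get_mmlu_accuracy(output):
    """Extract the MMLU weighted average accuracy from trtllm-eval output.

    Accepts str or bytes and raises ValueError when the expected
    accuracy line is missing, per the suggested diff.
    """
    if isinstance(output, (bytes, bytearray)):
        output = output.decode(errors="replace")
    match = re.search(
        r"MMLU weighted average accuracy:\s*([0-9]+(?:\.[0-9]+)?)", output)
    if not match:
        raise ValueError(
            "Failed to parse 'MMLU weighted average accuracy' from output.\n"
            f"Full output:\n{output}")
    return float(match.group(1))
```

This tolerates surrounding rank-interleaved log noise, since `re.search` scans the whole capture rather than expecting an exact line.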
🧹 Nitpick comments (3)

tests/integration/defs/common.py (1)

1-2: Update the copyright year to include 2025.

The coding guidelines ask to prepend the NVIDIA copyright header with the current year. Please extend the range to 2025.

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

tests/integration/defs/test_e2e.py (2)
2579-2619: Solid multi-node eval flow; add small robustness/debug nits.

The test is well-structured and uses the shared parser. A couple of small improvements would help triage flakes and make output clearer:

- Print the composed command and topology to the log before launching.
- Rely entirely on the captured output for parsing (your helper already handles mixed-rank logs), so the SLURM_PROCID gate is unnecessary in most CI contexts. If you keep it, consider documenting why.

Minimal debug-friendly tweak:

```diff
 def test_multi_nodes_eval(llm_venv, model_path, tp_size, pp_size, ep_size,
                           eval_task):
     if "Llama-4" in model_path and tp_size == 16:
         pytest.skip("Llama-4 with tp16 is not supported")

     mmlu_threshold = 81.5
     run_cmd = [
         "trtllm-llmapi-launch",
         "trtllm-eval",
         f"--model={llm_models_root()}/{model_path}",
         f"--ep_size={ep_size}",
         f"--tp_size={tp_size}",
         f"--pp_size={pp_size}",
         f"--kv_cache_free_gpu_memory_fraction={_MEM_FRACTION_80}",
         "--max_batch_size=32",
         eval_task,
     ]
+    print(f"Launching eval for {model_path} with tp={tp_size}, pp={pp_size}, ep={ep_size}")
+    print("Command:", " ".join(run_cmd))
     output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)

-    if os.environ.get("SLURM_PROCID", '0') == '0':
-        mmlu_accuracy = get_mmlu_accuracy(output)
-        assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    # Parse and assert on captured output (rank-agnostic).
+    mmlu_accuracy = get_mmlu_accuracy(output)
+    assert mmlu_accuracy > mmlu_threshold, (
+        f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}")
```

Note: if different models have materially different expected MMLU, consider per-model thresholds to reduce flakes. Happy to help wire that if desired.
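Per-model thresholds could be wired as a simple lookup with a conservative fallback; a sketch assuming the parametrize model paths serve as keys (the threshold numbers below are placeholders for illustration, not validated expectations):

```python
# Hypothetical per-model MMLU thresholds; values are placeholders.
MMLU_THRESHOLDS = {
    "llama4-models/Llama-4-Maverick-17B-128E-Instruct": 81.5,
    "llama4-models/Llama-4-Maverick-17B-128E-Instruct-FP8": 81.0,
}
DEFAULT_MMLU_THRESHOLD = 80.0


def mmlu_threshold_for(model_path: str) -> float:
    """Look up a model-specific threshold, falling back to a default."""
    return MMLU_THRESHOLDS.get(model_path, DEFAULT_MMLU_THRESHOLD)
```

The test would then call `mmlu_threshold_for(model_path)` instead of hardcoding 81.5, so a single flaky model can be tuned without loosening the bound for all of them.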
1-2: Update the copyright year to include 2025.

Align with the coding guideline that headers reflect the current year.

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
📒 Files selected for processing (3)

- tests/integration/defs/common.py (1 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/integration/test_lists/qa/llm_function_multinode.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/qa/llm_function_multinode.txt
🪛 Ruff (0.12.2)

tests/integration/defs/test_e2e.py

31-31: Redefinition of unused convert_weights from line 27 (F811)
🔇 Additional comments (1)

tests/integration/defs/test_e2e.py (1)

46-46: LGTM: memory fraction constant.

Adding _MEM_FRACTION_80 is consistent with existing constants and improves clarity where 0.8 is used.
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (2)

tests/integration/defs/test_e2e.py (2)

2616-2618: Rank-0-only assertion is fine; make sure get_mmlu_accuracy is robust (already addressed).

Only asserting on SLURM_PROCID==0 avoids duplicate failures, which is good. Ensure get_mmlu_accuracy handles bytes vs. str and format drift via regex (looks addressed by moving parsing into common.get_mmlu_accuracy per the prior fix).

31-33: Remove the duplicate import of convert_weights; satisfy Ruff F811 and unify import style.

convert_weights is imported on line 27 (absolute) and again here (relative), triggering Ruff F811 and confusing readers. Keep the relative block and drop the absolute import. While here, make trt_test_alternative a relative import for consistency. Apply this diff:

```diff
-from defs.common import convert_weights
-from defs.trt_test_alternative import (check_call, check_call_negative_test,
-                                       check_output)
+from .trt_test_alternative import (check_call, check_call_negative_test,
+                                   check_output)
```
🧹 Nitpick comments (2)

tests/integration/defs/test_e2e.py (2)

2599-2601: Prefer a parametrized skip over a runtime skip for unsupported combos (optional).

Instead of an in-test pytest.skip, you can pre-mark the unsupported combination (Llama-4 with tp16) as skipped at collection time using pytest.param(..., marks=pytest.mark.skip(...)). This reduces noise in reports and avoids scheduling work that will be skipped. Example (illustrative; adapt to your pattern):

```python
@pytest.mark.parametrize(
    "model_path,tp_size,pp_size,ep_size",
    [
        pytest.param("llama4-models/Llama-4-Maverick-17B-128E-Instruct", 8, 2, 8),
        pytest.param("llama4-models/Llama-4-Maverick-17B-128E-Instruct-FP8", 8, 2, 8),
        # Skip unsupported tp16 combos at collection time:
        pytest.param("llama4-models/Llama-4-Maverick-17B-128E-Instruct", 16, 1, 8,
                     marks=pytest.mark.skip(reason="Llama-4 with tp16 is not supported")),
    ],
)
```
1-14: Optional: update the copyright year range.

The repo guideline says to prepend the NVIDIA copyright header with the current year. If the project wants ranges updated annually, consider bumping to 2022-2025 in the next touch.
📒 Files selected for processing (3)

- tests/integration/defs/common.py (1 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/integration/test_lists/qa/llm_function_multinode.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/defs/common.py
🔇 Additional comments (5)

tests/integration/test_lists/qa/llm_function_multinode.txt (1)

2-7: Unable to verify test IDs in sandbox; please confirm locally.

The sandbox environment lacks pytest, so I couldn't collect the actual test_multi_nodes_eval node IDs. To ensure the duplicated "Instruct" token is truly invalid, please run this locally and verify the list of collected tests:

```shell
pytest --collect-only -q tests/integration/defs/test_e2e.py \
  | sed -n 's/^\(.*test_multi_nodes_eval.*\)$/\1/p'
```

Once you've confirmed that the FP8 variant is registered without the extra "Instruct", apply the following change:

```diff
-test_e2e.py::test_multi_nodes_eval[llama4-models/Llama-4-Maverick-17B-128E-Instruct-FP8-Instruct-tp8pp2-mmlu]
+test_e2e.py::test_multi_nodes_eval[llama4-models/Llama-4-Maverick-17B-128E-Instruct-FP8-tp8pp2-mmlu]
```

tests/integration/defs/test_e2e.py (4)
46-46: Good: add _MEM_FRACTION_80 for eval runs.

Using a dedicated 0.8 fraction clarifies intent and avoids magic numbers later in the eval command. No action needed.

2579-2585: Param ids produce stable test node ids; ensure the test list uses the same order.

Stacked parametrization and the function signature (llm_venv, model_path, tp_size, pp_size, ep_size, eval_task) will yield ids like [model_path]-[tp id]-[eval_task], e.g., DeepSeek-R1/...-tp16-mmlu. Your test list follows this pattern; just be aware that changing the argument order in the signature will change the ids. You can re-run the collection command from the previous comment to sanity-check the final ids emitted by pytest.
2586-2596: Model set looks right; aligns with the new test list.

The model paths enumerate Llama 3.3 70B, Llama-4 Maverick (BF16/FP8), Qwen3 235B (BF16/NVFP4), and DeepSeek-R1 FP4, with appropriate Hopper/Blackwell gating. Matches the entries in the multi-node list once the FP8 id typo is fixed.

2603-2613: Command construction LGTM; explicit kv-cache fraction and batch size improve reproducibility.

The trtllm-llmapi-launch trtllm-eval ... mmlu invocation is clear, parameters are explicit, and _MEM_FRACTION_80 is applied consistently.
PR_Github #16042 [ run ] triggered by Bot

PR_Github #16042 [ run ] completed with state
Actionable comments posted: 0
♻️ Duplicate comments (1)

tests/integration/defs/test_e2e.py (1)

31-33: Fix the duplicate import of convert_weights (Ruff F811) and unify import style.

You re-import convert_weights via a relative import while it's already imported absolutely at line 27, triggering F811 and confusing readers. Remove the absolute import and switch trt_test_alternative to a relative import for consistency with the rest of this block. Apply this diff:

```diff
-from defs.common import convert_weights
-from defs.trt_test_alternative import (check_call, check_call_negative_test,
-                                       check_output)
+from .trt_test_alternative import (check_call, check_call_negative_test,
+                                   check_output)
```
🧹 Nitpick comments (3)

tests/integration/defs/test_e2e.py (3)

2719-2721: Make rank detection robust beyond Slurm.

Relying solely on SLURM_PROCID may cause all ranks to parse/assert when launched under non-Slurm MPI (OMPI/PMI) or PyTorch (RANK). Probe common env vars and default to rank 0 only if none are set. Apply this diff:

```diff
-    if os.environ.get("SLURM_PROCID", '0') == '0':
+    rank = (os.environ.get("SLURM_PROCID")
+            or os.environ.get("OMPI_COMM_WORLD_RANK")
+            or os.environ.get("PMI_RANK")
+            or os.environ.get("RANK")
+            or "0")
+    if str(rank) == "0":
         mmlu_accuracy = get_mmlu_accuracy(output)
         assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
```
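The env-probing order could also live in a small reusable helper; a sketch under the assumption that these four launcher variables are the ones that matter (matching the probing order suggested above):

```python
import os

# Common launcher rank variables, probed in priority order (an assumption).
_RANK_ENV_VARS = ("SLURM_PROCID", "OMPI_COMM_WORLD_RANK", "PMI_RANK", "RANK")


def is_rank_zero(env=os.environ) -> bool:
    """Return True on the first rank, probing common launcher env vars.

    Defaults to rank 0 when no launcher variable is set, so single-process
    runs still parse and assert.
    """
    for var in _RANK_ENV_VARS:
        value = env.get(var)
        if value is not None:
            return str(value) == "0"
    return True
```

Accepting `env` as a parameter keeps the helper trivially unit-testable without mutating the process environment.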
2705-2705: Nit: log the parsed accuracy to aid triage.

Printing the observed accuracy helps quickly diagnose borderline failures without re-running the job. Example:

```diff
-    mmlu_threshold = 81.5
+    mmlu_threshold = 81.5
+    print(f"[test_multi_nodes_eval] threshold={mmlu_threshold}")
```

And after parsing:

```diff
-    assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    print(f"[test_multi_nodes_eval] observed MMLU accuracy={mmlu_accuracy}")
+    assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
```
2706-2718: Use llm_venv.run_cmd for this evaluation step.

Switching from a shell-joined check_output(..., shell=True) call to llm_venv.run_cmd will:

- Avoid quoting and escaping pitfalls when arguments contain spaces or special characters
- Centralize environment and subprocess handling in one place
- Match the established pattern used throughout tests/integration/defs/test_e2e.py and other tests

Affected location: tests/integration/defs/test_e2e.py, around lines 2706-2718. Suggested diff:

```diff
-    output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)
+    output = llm_venv.run_cmd(run_cmd, caller=check_output)
```

No additional changes to get_mmlu_accuracy are required; its parsing of the bytes-decoded output remains valid when using run_cmd.
📒 Files selected for processing (4)

- tests/integration/defs/common.py (1 hunks)
- tests/integration/defs/test_e2e.py (3 hunks)
- tests/integration/test_lists/qa/llm_function_multinode.txt (1 hunks)
- tests/integration/test_lists/waives.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/defs/common.py
- tests/integration/test_lists/qa/llm_function_multinode.txt
🪛 Ruff (0.12.2)
tests/integration/defs/test_e2e.py
31-31: Redefinition of unused convert_weights
from line 27
(F811)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/defs/test_e2e.py (1)
46-46: LGTM: well-scoped constant and consistent naming.

_MEM_FRACTION_80 follows existing convention (_MEM_FRACTION_50/_95), improves readability, and avoids magic numbers.
tests/integration/test_lists/waives.txt (1)
324-324: Waive entry matches generated pytest node id; OK to land.

The node id ordering aligns with the function signature (model_path, tp/pp/ep id, eval_task). The scope is narrow (FP8 tp8pp2 mmlu), with a valid nvbugs reference.
/bot run
PR_Github #16103 [ run ] triggered by Bot
ab91f8e to a6b4728 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/integration/defs/test_e2e.py (1)
31-33: Fix duplicate import of convert_weights (Ruff F811) — keep the relative import, drop the absolute one.

convert_weights is imported both absolutely (Line 27) and relatively here, triggering F811 and breaking CI. Remove the absolute import and keep this relative import for consistency with other local imports.

Apply outside-this-hunk change (Line 27 removal):

```python
# Remove this absolute import at Line 27
from defs.common import convert_weights
```

Optional consistency cleanup (outside this hunk): switch defs.trt_test_alternative to a relative import like the rest of the file:

```python
from .trt_test_alternative import (check_call, check_call_negative_test,
                                   check_output)
```

Quick check:

```shell
#!/bin/bash
# Expect no output after the fix
rg -nP 'from\s+defs\.common\s+import\s+convert_weights' tests/integration/defs/test_e2e.py || true
```
🧹 Nitpick comments (3)
tests/integration/defs/test_e2e.py (3)
2682-2688: Add skip_nvlink_inactive to align gating with other multi-node tests.

Other multi-node tests in this file are guarded with @skip_nvlink_inactive. This eval path likely uses similar interconnect assumptions; adding the marker prevents spurious failures on hosts without NVLink.

Apply this diff:

```diff
 @pytest.mark.timeout(5400)
 @pytest.mark.skip_less_device_memory(80000)
 @pytest.mark.skip_less_device(4)
+@skip_nvlink_inactive
 @pytest.mark.parametrize("eval_task", ["mmlu"])
 @pytest.mark.parametrize("tp_size,pp_size,ep_size", [(16, 1, 8), (8, 2, 8)],
                          ids=["tp16", "tp8pp2"])
```
2705-2705: Make the MMLU threshold configurable to reduce flakiness across models/runs.

A single hard-coded threshold for diverse models/precisions can be brittle. Allow an env override to fine-tune per model/CI lane without code changes.

Apply this diff:

```diff
-    mmlu_threshold = 81.5
+    # Allow CI lanes to override if needed (default stays strict).
+    default_threshold = 81.5
+    mmlu_threshold = float(os.getenv("MMLU_THRESHOLD", default_threshold))
```
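The env-override pattern in that suggestion behaves like this minimal sketch (MMLU_THRESHOLD is the hypothetical variable name from the diff, not an existing TensorRT-LLM setting):

```python
import os

def resolve_threshold(default=81.5):
    """Strict default unless the CI lane overrides it via MMLU_THRESHOLD."""
    return float(os.getenv("MMLU_THRESHOLD", default))

os.environ.pop("MMLU_THRESHOLD", None)
assert resolve_threshold() == 81.5      # unset: strict default applies

os.environ["MMLU_THRESHOLD"] = "75.0"
assert resolve_threshold() == 75.0      # set: lane-specific override wins
print("override works")
```

Note that os.getenv returns a string when the variable is set, so the float() conversion is what keeps both branches type-consistent.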
2685-2685: Minor: single-value parameterization for eval_task is fine, but optional to inline for clarity.

If there's no near-term plan to add more tasks, inlining "mmlu" can simplify the parametrization matrix slightly. If expansion is imminent, keeping it is fine.
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/defs/common.py
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_multinode.txt
🔇 Additional comments (2)
tests/integration/defs/test_e2e.py (2)
46-46: LGTM: dedicated 80% KV-cache free-memory fraction constant.

Naming and placement are consistent with existing _MEM_FRACTION_* constants, and this improves readability where the value is used.
2713-2716: No change needed: CLI flag exists

The trtllm-eval command defines the --kv_cache_free_gpu_memory_fraction option in tensorrt_llm/commands/eval.py, so the integration test's use of that flag is correct and will parse successfully.

• tensorrt_llm/commands/eval.py – defines @click.option("--kv_cache_free_gpu_memory_fraction", …)
• tests/integration/defs/test_e2e.py – uses --kv_cache_free_gpu_memory_fraction={_MEM_FRACTION_80}
PR_Github #16103 [ run ] completed with state
a6b4728 to 85ad823 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/integration/defs/test_e2e.py (1)
2719-2721: Don't gate accuracy check on pytest's SLURM_PROCID — causes false passes.

SLURM_PROCID here refers to the pytest process environment, not ranks launched by trtllm-llmapi-launch. This can skip the assertion entirely when pytest runs under SLURM with a non-zero proc id. Always parse and assert.

Apply this diff:

```diff
-    if os.environ.get("SLURM_PROCID", '0') == '0':
-        mmlu_accuracy = get_mmlu_accuracy(output)
-        assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    mmlu_accuracy = get_mmlu_accuracy(output)
+    assert mmlu_accuracy > mmlu_threshold, (
+        f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    )
```
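The false-pass mechanism is easy to reproduce in isolation: when the guarding process itself carries a non-zero SLURM_PROCID, the guarded branch never runs. A toy sketch (not the actual test code):

```python
def maybe_check(env, runs):
    # Mirrors the guarded branch: only "rank 0" would assert accuracy.
    if env.get("SLURM_PROCID", "0") == "0":
        runs.append("asserted")

runs = []
maybe_check({}, runs)                      # no SLURM vars: check executes
maybe_check({"SLURM_PROCID": "3"}, runs)   # pytest launched as rank 3: check silently skipped

assert runs == ["asserted"]  # only 1 of 2 invocations actually verified accuracy
print(len(runs))
```

Dropping the guard makes the assertion unconditional, which is what the suggested diff does.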
🧹 Nitpick comments (1)
tests/integration/defs/test_e2e.py (1)
2682-2699: Parameter grid looks fine; consider per-model timeouts to avoid flakiness.

5400s can be tight for very large models (e.g., Qwen3-235B). If you see timeouts, bump timeout via per-param marks rather than globally.

Example tweak:

```diff
-@pytest.mark.timeout(5400)
+@pytest.mark.timeout(5400)
 @pytest.mark.skip_less_device_memory(80000)
 @pytest.mark.skip_less_device(4)
 @pytest.mark.parametrize("eval_task", ["mmlu"])
 @pytest.mark.parametrize("tp_size,pp_size,ep_size", [(16, 1, 8), (8, 2, 8)],
                          ids=["tp16", "tp8pp2"])
 @pytest.mark.parametrize("model_path", [
-    pytest.param('llama-3.3-models/Llama-3.3-70B-Instruct', marks=skip_pre_hopper),
-    pytest.param('llama4-models/Llama-4-Maverick-17B-128E-Instruct', marks=skip_pre_hopper),
-    pytest.param('llama4-models/nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8', marks=skip_pre_hopper),
-    pytest.param('Qwen3/Qwen3-235B-A22B', marks=skip_pre_hopper),
+    pytest.param('llama-3.3-models/Llama-3.3-70B-Instruct', marks=skip_pre_hopper),
+    pytest.param('llama4-models/Llama-4-Maverick-17B-128E-Instruct', marks=skip_pre_hopper),
+    pytest.param('llama4-models/nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8', marks=skip_pre_hopper),
+    pytest.param('Qwen3/Qwen3-235B-A22B', marks=(skip_pre_hopper, pytest.mark.timeout(7200))),
     pytest.param('Qwen3/saved_models_Qwen3-235B-A22B_nvfp4_hf', marks=skip_pre_blackwell),
     pytest.param('DeepSeek-R1/DeepSeek-R1-0528-FP4', marks=skip_pre_blackwell),
 ])
```
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/defs/common.py
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_multinode.txt
🔇 Additional comments (2)
tests/integration/defs/test_e2e.py (2)
46-46: LGTM: new KV cache memory fraction constant.

_MEM_FRACTION_80 = 0.8 is named consistently with existing constants and scoped for reuse.
2705-2717: Command construction is reasonable; option set matches eval intent.

Using trtllm-llmapi-launch trtllm-eval ... mmlu with --kv_cache_free_gpu_memory_fraction=0.8 and capped --max_batch_size=32 aligns with the goal of reproducible multi-node accuracy checks.
85ad823 to ad2fee0 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/integration/defs/test_e2e.py (2)
31-33: Resolve Ruff F811 (duplicate convert_weights) and unify imports to relative.

convert_weights is imported absolutely on Line 27 and again relatively here, tripping Ruff F811 and breaking CI. Drop the absolute import and convert the trt_test_alternative import to relative to match the module style.

Apply:

```diff
@@
-from defs.common import convert_weights
-from defs.trt_test_alternative import (check_call, check_call_negative_test,
-                                       check_output)
+from .trt_test_alternative import (check_call, check_call_negative_test,
+                                   check_output)
```

After this change, keep the relative import here as-is:

```python
from .common import (PluginOptions, convert_weights, get_mmlu_accuracy,
                     prune_checkpoint, quantize_data, refit_model,
                     venv_check_call)
```
2719-2721: Don't gate MMLU accuracy assertion on SLURM_PROCID; always assert.

SLURM_PROCID here is for the pytest process, not ranks launched by trtllm-llmapi-launch. This can skip the check entirely and produce false passes.

Apply:

```diff
-    if os.environ.get("SLURM_PROCID", '0') == '0':
-        mmlu_accuracy = get_mmlu_accuracy(output)
-        assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    mmlu_accuracy = get_mmlu_accuracy(output)
+    assert mmlu_accuracy > mmlu_threshold, (
+        f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    )
```
🧹 Nitpick comments (2)
tests/integration/defs/test_e2e.py (2)
2682-2699: Consider adding NVLink guard and revisiting device gating for multinode configs.

- Other multinode tests in this file use @skip_nvlink_inactive; if these TP/PP/EP configs require NVLink for correctness/perf, add the same guard for consistency.
- The world size implied by (tp, pp, ep) = (16,1,8) or (8,2,8) is large. If CI scheduling expects external cluster resources via trtllm-llmapi-launch, that's fine; otherwise, the current @skip_less_device(4) may misleadingly permit local runs that can never satisfy the world size.

Proposed decorator addition:

```diff
@@
+@skip_nvlink_inactive
 @pytest.mark.timeout(5400)
 @pytest.mark.skip_less_device_memory(80000)
 @pytest.mark.skip_less_device(4)
```
2705-2718: Prefer llm_venv.run_cmd over shell-joined strings when capturing output.

Using llm_venv.run_cmd([...], caller=check_output) avoids shell-quoting pitfalls and keeps parity with the rest of this module.

```diff
-    output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)
+    output = llm_venv.run_cmd(run_cmd, caller=check_output)
```
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/defs/common.py
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_multinode.txt
🔇 Additional comments (1)
tests/integration/defs/test_e2e.py (1)
46-46: LGTM: memory-fraction constant added consistently.

_MEM_FRACTION_80 aligns with _MEM_FRACTION_50 and _MEM_FRACTION_95 naming.
Signed-off-by: Xin He (SW-GPU) <[email protected]>
ad2fee0 to 35846ae (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (3)
tests/integration/defs/common.py (1)
961-979: Harden MMLU accuracy parsing (regex, bytes-safe, last occurrence) and add a brief docstring.

The split-chain is brittle and will break if the formatter adds a percent sign or extra text; it also grabs only the first match, which can be wrong in multi-rank logs. Decode bytes defensively, parse with a regex, and pick the last occurrence.

Apply this diff:

```diff
-def get_mmlu_accuracy(output):
-    mmlu_line = None
-    for line in output.split('\n'):
-        if "MMLU weighted average accuracy:" in line:
-            mmlu_line = line
-            break
-
-    if mmlu_line is None:
-        raise Exception(
-            f"Could not find 'MMLU weighted average accuracy:' in output. Full output:\n{output}"
-        )
-
-    mmlu_accuracy = float(
-        mmlu_line.split("MMLU weighted average accuracy: ")[1].split(" (")[0])
-
-    print(f"MMLU weighted average accuracy is: {mmlu_accuracy}")
-
-    return mmlu_accuracy
+def get_mmlu_accuracy(output):
+    """Extract the MMLU weighted average accuracy from trtllm-eval output."""
+    if isinstance(output, (bytes, bytearray)):
+        output = output.decode(errors="replace")
+    # Robustly match and take the last occurrence (multi-rank logs may print multiple times).
+    matches = re.findall(
+        r"MMLU weighted average accuracy:\s*([0-9]+(?:\.[0-9]+)?)%?", output
+    )
+    if not matches:
+        tail = "\n".join(output.splitlines()[-200:])
+        raise Exception(
+            "Could not find 'MMLU weighted average accuracy:' in output. "
+            f"Last 200 lines:\n{tail}"
+        )
+    mmlu_accuracy = float(matches[-1].rstrip("%"))
+    print(f"MMLU weighted average accuracy is: {mmlu_accuracy}")
+    return mmlu_accuracy
```

tests/integration/defs/test_e2e.py (2)
27-33: Fix Ruff F811 and unify on relative imports.

convert_weights is imported twice (absolute on Line 27 and relative on Lines 31–33), tripping Ruff F811 and causing CI failures. Drop the absolute imports and keep the relative ones for consistency.

Apply this diff:

```diff
@@
-from defs.common import convert_weights
-from defs.trt_test_alternative import (check_call, check_call_negative_test,
-                                       check_output)
+from .trt_test_alternative import (check_call, check_call_negative_test,
+                                   check_output)
@@
-from .common import (PluginOptions, convert_weights, get_mmlu_accuracy,
+from .common import (PluginOptions, convert_weights, get_mmlu_accuracy,
                      prune_checkpoint, quantize_data, refit_model,
                      venv_check_call)
```

After the change, verify no remaining absolute imports in this module:

```shell
#!/bin/bash
rg -n -e 'from\s+defs\.' tests/integration/defs/test_e2e.py || true
```
2682-2721: Always assert accuracy; don't gate on SLURM_PROCID.

Gating the assertion on SLURM_PROCID can cause false passes when pytest itself runs under SLURM. Always parse and assert from the captured output.

Apply this diff:

```diff
@@
-    output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)
-
-    if os.environ.get("SLURM_PROCID", '0') == '0':
-        mmlu_accuracy = get_mmlu_accuracy(output)
-        assert mmlu_accuracy > mmlu_threshold, f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    output = check_output(" ".join(run_cmd), shell=True, env=llm_venv._new_env)
+    mmlu_accuracy = get_mmlu_accuracy(output)
+    assert mmlu_accuracy > mmlu_threshold, (
+        f"MMLU accuracy {mmlu_accuracy} is less than threshold {mmlu_threshold}"
+    )
```
🧹 Nitpick comments (1)
tests/integration/defs/test_e2e.py (1)
2702-2704: Prefer param-level skip over runtime skip for unsupported TP/PP combos.

Instead of a runtime pytest.skip, consider expressing the unsupported combination directly in the parametrization (e.g., pytest.param(..., marks=pytest.mark.skip(reason="..."))), which yields clearer collection-time reporting and avoids executing setup for doomed configs.
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_multinode.txt
🔇 Additional comments (1)
tests/integration/defs/test_e2e.py (1)
46-46: LGTM: add 0.8 KV-cache fraction constant.

Naming aligns with existing constants (_MEM_FRACTION_50, _MEM_FRACTION_95). No action needed.
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.