Skip to content

Conversation

@reasonsolo
Copy link
Collaborator

@reasonsolo reasonsolo commented Sep 2, 2025

Summary by CodeRabbit

  • Tests

    • Added multi-node disaggregated end-to-end tests covering multiple context/generation configurations with resource-based skips and coordinated health checks.
  • Utilities

    • Added helpers to discover local IPs/interfaces and expand SLURM nodelists for multi-node orchestration.
  • Test servers

    • Added a disaggregated test server harness and improved remote server lifecycle, startup robustness, and support for custom host/env/extra configuration.
  • Style

    • Minor formatting cleanup with no behavioral impact.

Description

Test Coverage

My test cmd is


prepare_cmd="command -v trtllm-serve > /dev/null 2>&1 || { pip install -r requirements-dev.txt || true; python3 -m pip install -e . || true; python3 -m pip install --force-reinstall pytest || true; }"

test_cmd="echo \`hostname\` ; cd /code/tensorrt_llm; ls ${LLM_MODES_ROOT}/llama-3.1-model/Llama-3.1-8B-Instruct; ${prepare_cmd}; python3 -m pytest -s ${TEST}"

srun -l --mpi=pmix -t 4:00:00 ${ACCOUNT}  ${PARTITION} ${NODELIST} -N 2 -n 2 --ntasks-per-node=2  --oversubscribe -J ${JOBNAME} \
		--container-image=${IMAGE} \
        --container-remap-root \
        --container-mounts=${MOUNT} \
         bash -c "export LLM_MODELS_ROOT=${LLM_MODELS_ROOT}; ${test_cmd}" 2>&1 | tee ${LOG_FILE}

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 2, 2025

📝 Walkthrough

Walkthrough

Adds SLURM/MPI-aware multi-node disaggregated-serving tests and supporting test utilities: an integration trigger, a new unit test module coordinating context/gen/disagg servers across nodes, extensions to test server launchers (host/env/rank/extra-config and RemoteDisaggOpenAIServer), and low-level networking/SLURM nodelist helpers; plus a tiny formatting cleanup.

Changes

Cohort / File(s) Summary
Integration trigger
tests/integration/defs/test_e2e.py
Adds test_openai_disaggregated_serving_multi_nodes(...) which invokes the unit multi-node test _test_disagg_serving_multi_nodes.py::test_completion[{ctx_config}-{gen_config}] under parameterized ctx/gen TP-PP configs and is gated by device count/memory pytest marks.
Multi-node unit test module
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
New SLURM-aware test module declaring RANK/NODE_RANK/NODE_LIST, NIC discovery, role helpers and port constants (8001/8002/8000); provides fixtures for model/backend/TP-PP sizes; launches RemoteOpenAIServer instances (ctx/gen) and a rank-0 RemoteDisaggOpenAIServer; coordinates readiness via health checks and runs completion test on rank 0 while other ranks poll health.
Test server launchers
tests/unittest/llmapi/apps/openai_server.py
Expands RemoteOpenAIServer.__init__ to accept host, env, rank, and extra_config (writes YAML tempfile and appends --extra_llm_api_options); ensures env passed to subprocess; improves health-wait and process termination; adds RemoteDisaggOpenAIServer that assembles aggregation YAML and launches trtllm-serve disaggregated (optionally via trtllm-llmapi-launch).
Networking / SLURM utilities
tests/unittest/llmapi/apps/utils.py
Adds low-level network helpers and SLURM nodelist expansion: get_local_ip(test_ip), get_local_interfaces(), and expand_slurm_nodelist(nodelist_str) plus imports (array, fcntl, re, socket, struct).
Formatting cleanup
tensorrt_llm/serve/openai_disagg_server.py
Removes an extra blank line in OpenAIDisaggServer.__init__; no behavioral or API changes.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant PyTest as PyTest (rank 0 / pytest node)
  participant CtxSrv as Context Server (8001)
  participant GenSrv as Generation Server (8002)
  participant DisaggSrv as Disagg Server (8000)

  rect rgba(220,235,255,0.6)
    note right of PyTest: Launch & config assembly (rank 0)
    PyTest->>CtxSrv: start RemoteOpenAIServer(ctx, host/env/rank, extra_config)
    PyTest->>GenSrv: start RemoteOpenAIServer(gen, host/env/rank, extra_config)
    PyTest->>DisaggSrv: start RemoteDisaggOpenAIServer(config with ctx/gen URLs)
    DisaggSrv-->>PyTest: health check → ready
  end

  rect rgba(235,255,220,0.6)
    note right of DisaggSrv: Disaggregated request flow
    PyTest->>DisaggSrv: completions.create(model, prompt)
    DisaggSrv->>CtxSrv: request context / prefill
    CtxSrv-->>DisaggSrv: context response
    DisaggSrv->>GenSrv: generation request (with context)
    GenSrv-->>DisaggSrv: generated tokens/text
    DisaggSrv-->>PyTest: aggregated completion response
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The PR description largely contains the repository template and an example srun/test command but omits a substantive Description and concrete Test Coverage details for the actual code changes; it does not summarize the added test files, fixtures, or the rationale, which reviewers need to evaluate the change. Given the raw changes add multiple new test modules and fixtures, the current PR body is incomplete and insufficient for review. Please replace the placeholder text with a concise Description that lists the files added/modified and the motivation for the multi-node disagg-serving tests, expand Test Coverage to name the new test modules and provide exact commands/CI stages and environment requirements to run them (including srun options), and note any new dependencies, CODEOWNERS, or documentation updates so reviewers can reproduce and validate the changes.
Docstring Coverage ⚠️ Warning Docstring coverage is 13.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title "[TRTLLM-7182][test] add multi-nodes test for disagg-serving" is concise, follows the repo ticket/type convention, and accurately summarizes the main change in the changeset (adding multi-node disaggregated-serving tests and related test modules/fixtures). It is specific enough for a reviewer scanning history to understand the primary intent.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@reasonsolo reasonsolo requested a review from xinhe-nv September 2, 2025 07:15
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (6)
tests/unittest/llmapi/apps/openai_server.py (3)

52-72: Non-zero ranks sleep MAX_SERVER_START_WAIT_S seconds unnecessarily.

Else branch sleeps for the full timeout (600s) then breaks, slowing tests. Consider a short sleep (e.g., 1–2s) or immediate break for non-zero ranks.

-                else:
-                    time.sleep(timeout)
-                    break
+                else:
+                    time.sleep(1.0)
+                    break

25-27: Cast SLURM_PROCID to int to fix rank detection.

Comparing a string env var to int 0 makes rank-0 logic skip the health check.

-        self.rank = os.environ.get("SLURM_PROCID", 0)
+        self.rank = int(os.environ.get("SLURM_PROCID", "0"))

1-1: Missing NVIDIA header.

Per repository guidelines, prepend the NVIDIA copyright header.

tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (3)

9-14: SLURM_NODELIST parsing is simplistic and HOSTNAME debug prints are broken.

  • NODE_LIST from SLURM often uses bracket notation; naive split can mislead.
  • Debug print references HOSTNAME which doesn’t exist.

At minimum, fix the print:

-    print(f"start worker NODE_LIST: {NODE_LIST} {RANK} {HOSTNAME}")
+    print(f"start worker NODE_LIST: {NODE_LIST} RANK={RANK} HOST={socket.gethostname()}")

For robustness, consider expanding SLURM_NODELIST (e.g., via scontrol show hostnames) if available; otherwise document the expected format.


93-116: Overly strict output assertion; likely to flake.

With max_tokens=1, most models return "2" (or lowercase "two"). Assert a tolerant match.

-        assert message.content == 'Two'
+        assert message.content.strip().lower() in {'2', 'two'}

43-56: Avoid hard-coded ports to reduce flakes.

8000/8001/8002 may collide in CI. Consider using find_free_port() and pass ports through fixtures to compose URLs.

If you want, I can draft a small helper fixture to allocate and share ports.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ff2439f and fc0d763.

📒 Files selected for processing (3)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1 hunks)
  • tests/unittest/llmapi/apps/openai_server.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces, not tabs; indent 4 spaces

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/test_e2e.py
🧬 Code graph analysis (3)
tests/unittest/llmapi/apps/openai_server.py (1)
tensorrt_llm/llmapi/mpi_session.py (1)
  • find_free_port (451-454)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (3)
  • llm_root (180-181)
  • llm_venv (707-723)
  • unittest_path (90-91)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
tests/unittest/llmapi/apps/openai_server.py (4)
  • RemoteDisaggOpenAIServer (92-132)
  • RemoteOpenAIServer (15-89)
  • get_client (80-84)
  • get_async_client (86-89)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/apps/openai_server.py

112-112: Undefined name tempfile

(F821)


121-121: Undefined name yaml

(F821)

tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py

44-44: Undefined name HOSTNAME

(F821)


115-115: Undefined name time

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch from fc0d763 to 4d7259d Compare September 2, 2025 08:11
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (5)
tests/unittest/llmapi/apps/openai_server.py (1)

92-118: Fix broken extra-config handling: undefined helper, wrong temp file, missing imports.

current code can’t start the server; it references self.extra_config_file (undefined), writes YAML to a different temp file, and lacks yaml/tempfile imports.

Apply:

@@
-class RemoteDisaggOpenAIServer(RemoteOpenAIServer):
+class RemoteDisaggOpenAIServer(RemoteOpenAIServer):
@@
-        self.extra_config_file = self._get_extra_config_file()
+        # Write a durable temp config file and keep its path for the child process.
+        self.extra_config_path = self._write_extra_config_file()
@@
-        launch_cmd = [
+        launch_cmd = [
             "trtllm-serve", "disaggregated", "--host", f"{self.host}", "--port",
-            f"{self.port}", "--extra-config", self.extra_config_file.name
+            f"{self.port}", "--extra-config", self.extra_config_path
         ]
@@
-        with tempfile.NamedTemporaryFile() as f:
-            f.write(self._get_extra_config())
-            self.proc = subprocess.Popen(launch_cmd,
-                                         stdout=sys.stdout,
-                                         stderr=sys.stderr)
+        self.proc = subprocess.Popen(launch_cmd,
+                                     stdout=sys.stdout,
+                                     stderr=sys.stderr)
@@
     def _get_extra_config(self):
         return yaml.dump({
@@
             "hostname": self.host,
         })
+
+    def _write_extra_config_file(self) -> str:
+        # Create a durable temp file so the spawned process can read it.
+        f = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False)
+        try:
+            f.write(self._get_extra_config())
+            f.flush()
+            return f.name
+        finally:
+            f.close()
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        try:
+            os.remove(getattr(self, "extra_config_path", ""))
+        except Exception:
+            pass
+        return super().__exit__(exc_type, exc_value, traceback)

And add missing imports near the top:

 import os
 import subprocess
 import sys
 import time
+import tempfile
+import yaml
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (4)

1-4: Add NVIDIA header and missing import.

Header is required; time is used later.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
 import os
+ import time

42-56: Bind workers to 0.0.0.0 so other nodes can reach them.

Current workers inherit "localhost" from RemoteOpenAIServer, breaking cross-node connectivity.

-        args = ["--tp_size", str(tp_size), "--pp_size", str(pp_size)]
+        args = ["--host", "0.0.0.0", "--tp_size", str(tp_size), "--pp_size", str(pp_size)]
         with RemoteOpenAIServer(model_path, port=8001, cli_args=args) as server:
             yield server
@@
-        args = ["--tp_size", str(tp_size), "--pp_size", str(pp_size)]
+        args = ["--host", "0.0.0.0", "--tp_size", str(tp_size), "--pp_size", str(pp_size)]
         with RemoteOpenAIServer(model_path, port=8002, cli_args=args) as server:
             yield server

59-67: Fix NODE_LIST logic and selection of ctx/gen hosts.

Using worker.host ("localhost") and list.remove() (returns None) breaks host selection.

-    if NODE_RANK == 0 and RANK == 0:
-        ctx_host = worker.host
-        assert len(NODE_LIST) == 2
-        assert ctx_host in NODE_LIST
-        print(f"ctx_host: {ctx_host} {NODE_LIST}")
-        gen_host = NODE_LIST.remove(ctx_host)[0]
+    if NODE_RANK == 0 and RANK == 0:
+        nodes = [n for n in NODE_LIST if n]
+        assert len(nodes) >= 2, f"Expected at least 2 nodes, got: {nodes}"
+        ctx_host, gen_host = nodes[0], nodes[1]
+        print(f"ctx_host={ctx_host}, gen_host={gen_host}")
         ctx_url = f"http://{ctx_host}:8001"
         gen_url = f"http://{gen_host}:8002"

76-89: Gate client fixtures by both NODE_RANK and RANK; guard None.

Prevents AttributeError when disagg_server is None.

 def client(disagg_server: RemoteDisaggOpenAIServer):
-    if NODE_RANK == 0:
-        return disagg_server.get_client()
+    if NODE_RANK == 0 and RANK == 0 and disagg_server is not None:
+        return disagg_server.get_client()
     else:
         return None
@@
 def async_client(disagg_server: RemoteDisaggOpenAIServer):
-    if NODE_RANK == 0:
-        return disagg_server.get_async_client()
+    if NODE_RANK == 0 and RANK == 0 and disagg_server is not None:
+        return disagg_server.get_async_client()
     else:
         return None
🧹 Nitpick comments (2)
tests/unittest/llmapi/apps/openai_server.py (1)

101-107: Allow binding to 0.0.0.0 for cross-node access.

Binding to "localhost" limits reachability across nodes. Make host configurable or default to "0.0.0.0" in multi-node tests. Example:

-        self.host = "localhost"
+        self.host = os.environ.get("TLLM_BIND_HOST", "0.0.0.0")
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)

92-116: Make the assertion tolerant; model may return "2" vs "Two".

Avoid flakiness while still validating behavior.

-        assert message.content == 'Two'
+        out = (message.content or "").strip().lower()
+        assert out in ('two', '2')
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fc0d763 and 4d7259d.

📒 Files selected for processing (3)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1 hunks)
  • tests/unittest/llmapi/apps/openai_server.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/defs/test_e2e.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces, not tabs; indent 4 spaces

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
🧬 Code graph analysis (2)
tests/unittest/llmapi/apps/openai_server.py (1)
tensorrt_llm/llmapi/mpi_session.py (1)
  • find_free_port (451-454)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
tests/unittest/llmapi/apps/openai_server.py (4)
  • RemoteDisaggOpenAIServer (92-132)
  • RemoteOpenAIServer (15-89)
  • get_client (80-84)
  • get_async_client (86-89)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/apps/openai_server.py

112-112: Undefined name tempfile

(F821)


121-121: Undefined name yaml

(F821)

tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py

114-114: Undefined name time

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch from 4d7259d to 367a78c Compare September 2, 2025 08:23
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unittest/llmapi/apps/openai_server.py (1)

20-31: Rank handling bug and slow non-zero ranks; add request timeout.

Cast SLURM_PROCID to int; don’t sleep 600s on non-zero ranks; set a short HTTP timeout.

-        self.rank = os.environ.get("SLURM_PROCID", 0)
+        self.rank = int(os.environ.get("SLURM_PROCID", 0))
@@
-                if self.rank == 0:
-                    if requests.get(url).status_code == 200:
+                if self.rank == 0:
+                    if requests.get(url, timeout=1).status_code == 200:
                         break
                 else:
-                    time.sleep(timeout)
-                    break
+                    break

Also applies to: 54-65

♻️ Duplicate comments (7)
tests/unittest/llmapi/apps/openai_server.py (2)

1-2: Add required NVIDIA SPDX header.

Prepend the standard NVIDIA header at the very top.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
 # Adapted from
 # https://github.com/vllm-project/vllm/blob/baaedfdb2d3f1d70b7dbcde08b083abfe6017a92/tests/utils.py

96-121: Broken extra-config lifecycle and undefined helper (server won’t start).

_get_extra_config_file is undefined; writing to a different temp file; default binary mode write will raise; ensure durable temp file and cleanup.

+import yaml
@@
-        self.extra_config_file = self._get_extra_config_file()
+        self.extra_config_path = self._write_extra_config_file()
@@
-        launch_cmd = [
+        launch_cmd = [
             "trtllm-serve", "disaggregated", "--host", f"{self.host}", "--port",
-            f"{self.port}", "--extra-config", self.extra_config_file.name
+            f"{self.port}", "--extra-config", self.extra_config_path
         ]
@@
-        with tempfile.NamedTemporaryFile() as f:
-            f.write(self._get_extra_config())
-            self.proc = subprocess.Popen(launch_cmd,
-                                         stdout=sys.stdout,
-                                         stderr=sys.stderr)
+        self.proc = subprocess.Popen(launch_cmd,
+                                     stdout=sys.stdout,
+                                     stderr=sys.stderr)
@@
         self._wait_for_server(url=self.url_for("health"),
                               timeout=self.MAX_SERVER_START_WAIT_S)
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        try:
+            os.remove(getattr(self, "extra_config_path", ""))
+        except Exception:
+            pass
+        return super().__exit__(exc_type, exc_value, traceback)
+
+    def _write_extra_config_file(self) -> str:
+        f = tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False)
+        try:
+            f.write(self._get_extra_config())
+            f.flush()
+            return f.name
+        finally:
+            f.close()
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (5)

1-1: Add required NVIDIA SPDX header.

Per project guidelines, prepend the header.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
 import os
 import socket

1-6: Missing import for time.

time.sleep is used later.

 import os
 import socket
+import time

Also applies to: 122-122


43-63: Workers bind to localhost/host; cross-node access will fail.

Bind servers to 0.0.0.0 so the aggregator can reach them across nodes.

-        with RemoteOpenAIServer(model_path,
-                                port=8001,
-                                cli_args=args,
-                                host="localhost") as server:
+        with RemoteOpenAIServer(model_path,
+                                port=8001,
+                                cli_args=args,
+                                host="0.0.0.0") as server:
             yield server
@@
-        with RemoteOpenAIServer(model_path, port=8002, cli_args=args,
-                                host=host) as server:
+        with RemoteOpenAIServer(model_path,
+                                port=8002,
+                                cli_args=args,
+                                host="0.0.0.0") as server:
             yield server

66-79: Host selection and NODE_LIST misuse (logic errors).

remove() returns None; pick nodes explicitly and construct URLs.

-    if NODE_RANK == 0 and RANK == 0:
-        ctx_host = socket.gethostname()  # start ctx worker on NODE_RANK 0
-        assert len(NODE_LIST) == 2
-        assert ctx_host in NODE_LIST
-        print(f"ctx_host: {ctx_host} {NODE_LIST}")
-        gen_host = NODE_LIST.remove(ctx_host)[0]
+    if NODE_RANK == 0 and RANK == 0:
+        nodes = [n for n in NODE_LIST if n]
+        assert len(nodes) >= 2, f"Expected 2 nodes, got: {nodes}"
+        ctx_host, gen_host = nodes[0], nodes[1]
+        print(f"ctx_host={ctx_host}, gen_host={gen_host}")
         ctx_url = f"http://{ctx_host}:8001"
         gen_url = f"http://{gen_host}:8002"

84-97: Fixture gating: avoid AttributeError on non-zero ranks.

Only call get_client on rank 0 of node 0.

 def client(disagg_server: RemoteDisaggOpenAIServer):
-    if NODE_RANK == 0:
+    if NODE_RANK == 0 and RANK == 0 and disagg_server is not None:
         return disagg_server.get_client()
     else:
         return None
@@
 def async_client(disagg_server: RemoteDisaggOpenAIServer):
-    if NODE_RANK == 0:
+    if NODE_RANK == 0 and RANK == 0 and disagg_server is not None:
         return disagg_server.get_async_client()
     else:
         return None
🧹 Nitpick comments (2)
tests/unittest/llmapi/apps/openai_server.py (1)

96-103: Make host configurable for the disaggregated server.

Allow binding to 0.0.0.0 when needed.

-    def __init__(self,
+    def __init__(self,
                  ctx_servers: List[str],
                  gen_servers: List[str],
                  port: int = None,
-                 llmapi_launch: bool = False) -> None:
+                 host: str = "localhost",
+                 llmapi_launch: bool = False) -> None:
@@
-        self.host = "localhost"
+        self.host = host
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)

100-121: Reduce assertion flakiness for model output.

Accept common one-token answers case-insensitively.

-        assert message.content == 'Two'
+        assert message.content.strip().lower() in ('two', '2')
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4d7259d and 367a78c.

📒 Files selected for processing (3)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1 hunks)
  • tests/unittest/llmapi/apps/openai_server.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/defs/test_e2e.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
**/*.{h,hpp,hh,hxx,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces, not tabs; indent 4 spaces

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs (Python)
Maintain module namespace on import: prefer from package.subpackage import foo; use foo.Symbol()
Python filenames use snake_case
Python class names use PascalCase
Python functions and methods use snake_case
Python local variables use snake_case; if starting with a number concept, prefix with k (e.g., k_99th_percentile)
Python global variables use G_ prefix with UPPER_SNAKE_CASE
Python constants use UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes
Initialize all externally visible class members in init
For public interfaces, prefer docstrings over comments; comments should be for in-function or file-local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes and variables inline with docstrings immediately after assignment
Avoid reflection when a non-reflective approach suffices
Limit except clauses to specific exceptions where possible
When using try/except for duck-typing, keep try body minimal and move logic to else

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
**/*.{cpp,cc,cxx,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
🧬 Code graph analysis (2)
tests/unittest/llmapi/apps/openai_server.py (1)
tensorrt_llm/llmapi/mpi_session.py (1)
  • find_free_port (451-454)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
tests/unittest/llmapi/apps/openai_server.py (4)
  • RemoteDisaggOpenAIServer (94-134)
  • RemoteOpenAIServer (16-91)
  • get_client (82-86)
  • get_async_client (88-91)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/apps/openai_server.py

123-123: Undefined name yaml

(F821)

tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py

122-122: Undefined name time

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch 2 times, most recently from 05391bd to f5f90e0 Compare September 10, 2025 06:45
@reasonsolo reasonsolo marked this pull request as ready for review September 10, 2025 06:49
@reasonsolo reasonsolo requested a review from a team as a code owner September 10, 2025 06:49
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unittest/llmapi/apps/openai_server.py (1)

67-89: Bound health-check request to avoid hanging.

Add a small timeout to the GET to respect the overall wait window.

-                    if requests.get(url).status_code == 200:
+                    if requests.get(url, timeout=1.0).status_code == 200:
                         break
♻️ Duplicate comments (2)
tests/unittest/llmapi/apps/openai_server.py (2)

1-3: Add required NVIDIA SPDX header at top.

File is missing the mandated header.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+ 
 # Adapted from
 # https://github.com/vllm-project/vllm/blob/baaedfdb2d3f1d70b7dbcde08b083abfe6017a92/tests/utils.py

59-66: Clean up temporary config files on exit.

Delete any temp YAML created for extra_config or disagg config.

     def __exit__(self, exc_type, exc_value, traceback):
         self.proc.terminate()
         try:
             self.proc.wait(timeout=30)
         except subprocess.TimeoutExpired as e:
             self.proc.kill()
             self.proc.wait(timeout=30)
+        # Remove temp files if created
+        for tmp in (getattr(self, "extra_config_file", None),
+                    getattr(self, "config_file", None)):
+            if tmp:
+                try:
+                    os.remove(tmp)
+                except Exception:
+                    pass
🧹 Nitpick comments (9)
tests/unittest/llmapi/apps/openai_server.py (4)

21-29: Type hints: avoid implicit Optional.

Annotate cli_args and extra_config properly for readability and static linting.

-    def __init__(self,
-                 model: str,
-                 cli_args: List[str] = None,
+    def __init__(self,
+                 model: str,
+                 cli_args: Optional[List[str]] = None,
                  llmapi_launch: bool = False,
                  port: int = None,
                  host: str = "localhost",
-                 env: Optional[dict] = None,
-                 rank: int = -1,
-                 extra_config: Optional[dict] = None) -> None:
+                 env: Optional[dict] = None,
+                 rank: int = -1,
+                 extra_config: Optional[dict] = None) -> None:

47-51: Only default env when None.

if not env replaces an explicitly passed empty env. Use is None.

-        if not env:
-            env = os.environ.copy()
+        if env is None:
+            env = os.environ.copy()

43-47: Style nit: avoid list concatenation when building commands.

Minor readability tweak per Ruff (RUF005).

-        launch_cmd = ["trtllm-serve"] + [model] + args
+        launch_cmd = ["trtllm-serve", model, *args]
@@
-            launch_cmd = ["trtllm-llmapi-launch"] + launch_cmd
+            launch_cmd = ["trtllm-llmapi-launch", *launch_cmd]

128-132: Style nit: command build.

-        launch_cmd = ["trtllm-serve", "disaggregated", "-c", self.config_file]
+        launch_cmd = ["trtllm-serve", "disaggregated", "-c", self.config_file]
         if llmapi_launch:
-            # start server with llmapi-launch on multi nodes
-            launch_cmd = ["trtllm-llmapi-launch"] + launch_cmd
+            launch_cmd = ["trtllm-llmapi-launch", *launch_cmd]
tests/integration/defs/test_e2e.py (1)

1625-1641: Silence unused arg and make the invocation robust.

llm_root isn’t used; prefix with _ to satisfy linters. The test node path looks good.

-def test_openai_disaggregated_serving_multi_nodes(llm_root, llm_venv,
+def test_openai_disaggregated_serving_multi_nodes(_llm_root, llm_venv,
                                                   ctx_config, gen_config):
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (4)

1-7: Import missing modules used in fixes below.

You’ll need re for parsing and typing for hints in this test module.

Apply:

 import os
 import socket
 import subprocess
 import time
+import re
+from typing import Optional

167-174: Gate clients on ctx node’s local rank 0 to match disagg_server.

Prevents AttributeError and accidental client use on the gen node.

Apply:

 def client(disagg_server: RemoteDisaggOpenAIServer):
-    if RANK == 0:
+    if NODE_RANK == 0 and LOCAL_RANK == 0 and disagg_server is not None:
         return disagg_server.get_client()
     else:
         print(f"skipping client for rank {RANK} node rank {NODE_RANK}")
         return None

199-214: Relax overly strict assertion on completion text.

Servers often return “2” or “2.”; normalize whitespace and assert prefix “2”.

Apply:

-        message = completion.choices[0].text
-        assert message.startswith('2.')
+        message = completion.choices[0].text.strip()
+        assert message.startswith('2')

25-27: Optional: avoid hard-coded ports to reduce flakiness.

If multiple tests run on shared runners, 8000–8002 can collide. Consider dynamic ports and pass them through the config/URLs.

Example:

-CTX_SERVER_PORT = 8001
-GEN_SERVER_PORT = 8002
-DISAGG_SERVER_PORT = 8000
+CTX_SERVER_PORT = int(os.environ.get("CTX_SERVER_PORT", "8001"))
+GEN_SERVER_PORT = int(os.environ.get("GEN_SERVER_PORT", "8002"))
+DISAGG_SERVER_PORT = int(os.environ.get("DISAGG_SERVER_PORT", "8000"))

Then export these in CI or compute free ports on NODE_RANK==0 and broadcast via a simple file or env.

Also applies to: 149-165

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05391bd and f5f90e0.

📒 Files selected for processing (5)
  • tensorrt_llm/serve/openai_disagg_server.py (0 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1 hunks)
  • tests/unittest/llmapi/apps/openai_server.py (4 hunks)
  • tests/unittest/llmapi/apps/utils.py (2 hunks)
💤 Files with no reviewable changes (1)
  • tensorrt_llm/serve/openai_disagg_server.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/llmapi/apps/utils.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/openai_server.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/llmapi/apps/utils.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/openai_server.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/llmapi/apps/utils.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/openai_server.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/test_e2e.py
🧬 Code graph analysis (3)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (2)
tests/unittest/llmapi/apps/openai_server.py (3)
  • RemoteDisaggOpenAIServer (109-153)
  • RemoteOpenAIServer (17-106)
  • get_client (97-101)
tests/unittest/llmapi/apps/utils.py (3)
  • expand_slurm_nodelist (221-280)
  • get_local_interfaces (190-218)
  • get_local_ip (183-186)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (4)
  • llm_root (180-181)
  • llm_venv (707-723)
  • test_root (2189-2190)
  • unittest_path (90-91)
tests/unittest/llmapi/apps/openai_server.py (2)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
  • env (63-76)
tensorrt_llm/llmapi/mpi_session.py (1)
  • find_free_port (451-454)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/apps/utils.py

204-204: Local variable bytes_out is assigned to but never used

Remove assignment to unused variable bytes_out

(F841)

tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py

44-44: subprocess call: check for execution of untrusted input

(S603)


48-48: Consider moving this statement to an else block

(TRY300)


49-49: Do not catch blind exception: Exception

(BLE001)


128-128: Possible binding to all interfaces

(S104)


150-150: Unused function argument: worker

(ARG001)


153-153: f-string without any placeholders

Remove extraneous f prefix

(F541)


181-181: Probable use of requests call without timeout

(S113)


184-185: try-except-pass detected, consider logging the exception

(S110)


184-184: Do not catch blind exception: Exception

(BLE001)


192-192: Probable use of requests call without timeout

(S113)


194-194: Do not catch blind exception: Exception

(BLE001)

tests/integration/defs/test_e2e.py

1631-1631: Unused function argument: llm_root

(ARG001)

tests/unittest/llmapi/apps/openai_server.py

25-25: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


43-43: Consider ["trtllm-serve", model, *args] instead of concatenation

Replace with ["trtllm-serve", model, *args]

(RUF005)


46-46: Consider ["trtllm-llmapi-launch", *launch_cmd] instead of concatenation

Replace with ["trtllm-llmapi-launch", *launch_cmd]

(RUF005)


49-49: subprocess call: check for execution of untrusted input

(S603)


131-131: Consider ["trtllm-llmapi-launch", *launch_cmd] instead of concatenation

Replace with ["trtllm-llmapi-launch", *launch_cmd]

(RUF005)


134-134: subprocess call: check for execution of untrusted input

(S603)

🔇 Additional comments (3)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (3)

71-76: Set UCX interface only when discovered; document rationale.

Good call setting TRTLLM_UCX_INTERFACE. Consider also logging chosen NIC for debug.


149-165: Note: fixture arg ‘worker’ is intentionally unused (side-effect fixture).

Ruff ARG001 can be ignored here; the server lifecycle is the point.

If Ruff fails CI on ARG001, add “# noqa: ARG001” to the parameter.


1-1: Add required NVIDIA Apache-2.0 header.

Per repo guidelines, prepend the SPDX header.

Apply:

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
 import os
⛔ Skipped due to learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch 2 times, most recently from 816558d to 8d7b94b Compare September 10, 2025 07:35
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (10)
tests/unittest/llmapi/apps/utils.py (2)

183-186: LGTM!

The socket resource leak issue has been addressed by using a context manager.


191-219: Fix ioctl packing for 64-bit platforms and remove unused variable.

The function has platform compatibility issues and an unused variable that should be addressed.

The ioctl packing uses 'iL' which can break on 64-bit platforms where pointers are 8 bytes. Also, bytes_out is assigned but never used.

Apply this diff to fix the issues:

 def get_local_interfaces():
     """ Returns a dictionary of name:ip key value pairs. """
     MAX_BYTES = 4096
     FILL_CHAR = b'\0'
     SIOCGIFCONF = 0x8912
     sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
     names = array.array('B', MAX_BYTES * FILL_CHAR)
     names_address, names_length = names.buffer_info()
-    mutable_byte_buffer = struct.pack('iL', MAX_BYTES, names_address)
+    mutable_byte_buffer = struct.pack('iP', MAX_BYTES, names_address)
     mutated_byte_buffer = fcntl.ioctl(sock.fileno(), SIOCGIFCONF,
                                       mutable_byte_buffer)
-    max_bytes_out, names_address_out = struct.unpack('iL', mutated_byte_buffer)
+    max_bytes_out, names_address_out = struct.unpack('iP', mutated_byte_buffer)
     namestr = names.tobytes()
-    namestr[:max_bytes_out]
-    bytes_out = namestr[:max_bytes_out]
     ip_dict = {}

Also, the socket should be closed properly. Consider using a context manager:

-    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
-    names = array.array('B', MAX_BYTES * FILL_CHAR)
-    names_address, names_length = names.buffer_info()
-    mutable_byte_buffer = struct.pack('iP', MAX_BYTES, names_address)
-    mutated_byte_buffer = fcntl.ioctl(sock.fileno(), SIOCGIFCONF,
-                                      mutable_byte_buffer)
+    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
+        names = array.array('B', MAX_BYTES * FILL_CHAR)
+        names_address, names_length = names.buffer_info()
+        mutable_byte_buffer = struct.pack('iP', MAX_BYTES, names_address)
+        mutated_byte_buffer = fcntl.ioctl(sock.fileno(), SIOCGIFCONF,
+                                          mutable_byte_buffer)
tests/unittest/llmapi/apps/openai_server.py (4)

25-25: Fix type casting issue with SLURM_PROCID environment variable.

os.environ.get("SLURM_PROCID", 0) returns a string when the env var exists, but an integer (0) when it doesn't, which can cause type comparison issues later.

Apply this fix to ensure consistent typing:

-        self.rank = rank if rank != -1 else os.environ.get("SLURM_PROCID", 0)
+        self.rank = rank if rank != -1 else int(os.environ.get("SLURM_PROCID", "0"))

Also applies to: 32-32


38-43: Invalid NamedTemporaryFile parameter causes AttributeError.

delete_on_close is not a valid parameter for NamedTemporaryFile and will cause an AttributeError. This parameter doesn't exist in the Python tempfile module.

Remove the invalid parameter:

         if extra_config:
-            with tempfile.NamedTemporaryFile(mode="w+",
+            with tempfile.NamedTemporaryFile(mode="w",
                                              delete=False,
-                                             delete_on_close=False) as f:
+                                             suffix=".yaml") as f:
                 f.write(yaml.dump(extra_config))
                 self.extra_config_file = f.name

126-132: Handle negative port values properly to avoid binding to invalid ports.

When port=-1 is passed, the server will try to bind to port -1 which is invalid.

Apply this fix to treat negative values as unset:

-        self.port = port if port is not None else find_free_port()
+        self.port = find_free_port() if (port is None or port < 0) else port

134-139: Invalid NamedTemporaryFile parameter in RemoteDisaggOpenAIServer.

Same issue as in the parent class - delete_on_close is not a valid parameter.

Apply this fix:

         with tempfile.NamedTemporaryFile(mode="w+",
                                          delete=False,
-                                         delete_on_close=False) as f:
+                                         suffix=".yaml") as f:
             f.write(self._get_extra_config())
             f.flush()
             self.extra_config_file = f.name
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (4)

14-17: Consider using SLURM_LOCALID for per-node rank.

Using SLURM_NODEID for NODE_RANK is correct, but for per-node local rank operations, SLURM_LOCALID would be more appropriate than calculating RANK % SLURM_NTASKS_PER_NODE.

Consider adding:

LOCAL_RANK = int(os.environ.get("SLURM_LOCALID", "0"))

Then use LOCAL_RANK instead of RANK % SLURM_NTASKS_PER_NODE in lines 153 and 166.


30-35: Host matching logic is brittle with FQDN vs shortname mismatches.

The assertion socket.gethostname() in NODE_LIST can fail if SLURM uses FQDNs while socket.gethostname() returns a shortname or vice versa.

Make the host selection more robust:

 def get_the_other_host(idx=0):
-    assert len(NODE_LIST) >= 2
-    assert socket.gethostname() in NODE_LIST
-    node_list = NODE_LIST.copy()
-    node_list.remove(socket.gethostname())
-    return node_list[idx]
+    nodes = [n for n in NODE_LIST if n]
+    if len(nodes) < 2:
+        pytest.skip(f"disagg multi-node test requires at least 2 nodes; got {nodes}")
+    # Use NODE_RANK to identify our position instead of hostname matching
+    assert NODE_RANK in (0, 1), f"Unexpected NODE_RANK={NODE_RANK}"
+    # Return the other node
+    return nodes[1 - NODE_RANK] if idx == 0 else nodes[(NODE_RANK + 1 + idx) % len(nodes)]

178-178: Add timeouts to requests.get calls to prevent indefinite hangs.

Network requests without timeouts can hang indefinitely, causing CI/test failures.

Add timeouts to all requests:

-            if requests.get(url).status_code == 200:
+            if requests.get(url, timeout=5).status_code == 200:
-            if requests.get(url).status_code >= 100:
+            if requests.get(url, timeout=5).status_code >= 100:
                 print(
-                    f"endpoint {url} returned status code {requests.get(url).status_code}"
+                    f"endpoint {url} is still up"
                 )

Note: Line 191 makes a redundant second request just to get the status code for logging. Consider storing the response:

resp = requests.get(url, timeout=5)
if resp.status_code >= 100:
    print(f"endpoint {url} returned status code {resp.status_code}")

Also applies to: 189-191


1-13: Add missing NVIDIA copyright header.

Per the coding guidelines, all source files must include the NVIDIA Apache-2.0 copyright header.

Add the required header at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
 import socket
🧹 Nitpick comments (4)
tests/unittest/llmapi/apps/utils.py (1)

251-256: Redundant bracket parsing logic should be removed.

Lines 251-256 parse bracket notation but the results are unused. This appears to be leftover code from an earlier implementation.

Remove the redundant loop that doesn't contribute to the function:

     if buf:
         groups.append(''.join(buf).strip())

-    for group in groups:
-        bracket_match = re.match(r'^([^\[]+)\[(.+?)\]$', group)
-        if bracket_match:
-            prefix = bracket_match.group(1)
-            range_part = bracket_match.group(2)
-
     expanded_nodes = []
tests/unittest/llmapi/apps/openai_server.py (1)

69-71: Unused exception variable should be removed.

The variable e is assigned but never used in the exception handler.

Remove the unused variable:

-        except subprocess.TimeoutExpired as e:
+        except subprocess.TimeoutExpired:
             self.proc.kill()
             self.proc.wait(timeout=30)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (2)

98-98: Remove extraneous f-string prefix.

The string doesn't contain any placeholders, so the f-prefix is unnecessary.

-        print(f"Failed to find NIC, will use default UCX interface")
+        print("Failed to find NIC, will use default UCX interface")

244-244: Avoid calling terminate() on disagg_server in test.

The server's lifecycle is managed by the fixture's context manager. Calling terminate() manually can interfere with proper cleanup.

Remove the manual termination since the fixture handles cleanup:

         message = completion.choices[0].text
         assert message.startswith('2.')
-        disagg_server.terminate()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d7b94b and bfcb1e0.

📒 Files selected for processing (3)
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1 hunks)
  • tests/unittest/llmapi/apps/openai_server.py (5 hunks)
  • tests/unittest/llmapi/apps/utils.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
  • tests/unittest/llmapi/apps/utils.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
  • tests/unittest/llmapi/apps/utils.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
  • tests/unittest/llmapi/apps/utils.py
🧬 Code graph analysis (2)
tests/unittest/llmapi/apps/openai_server.py (2)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
  • env (83-99)
tensorrt_llm/llmapi/mpi_session.py (1)
  • find_free_port (451-454)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (2)
tests/unittest/llmapi/apps/openai_server.py (4)
  • RemoteDisaggOpenAIServer (121-167)
  • RemoteOpenAIServer (17-118)
  • get_client (109-113)
  • terminate (63-77)
tests/unittest/llmapi/apps/utils.py (3)
  • expand_slurm_nodelist (222-302)
  • get_local_interfaces (191-219)
  • get_local_ip (183-186)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/apps/openai_server.py

25-25: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


44-44: Consider ["trtllm-serve", model, *args] instead of concatenation

Replace with ["trtllm-serve", model, *args]

(RUF005)


47-47: Consider ["trtllm-llmapi-launch", *launch_cmd] instead of concatenation

Replace with ["trtllm-llmapi-launch", *launch_cmd]

(RUF005)


50-50: subprocess call: check for execution of untrusted input

(S603)


69-69: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


75-75: Do not catch blind exception: Exception

(BLE001)


145-145: Consider ["trtllm-llmapi-launch", *launch_cmd] instead of concatenation

Replace with ["trtllm-llmapi-launch", *launch_cmd]

(RUF005)


148-148: subprocess call: check for execution of untrusted input

(S603)

tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py

60-60: subprocess call with shell=True identified, security issue

(S602)


67-67: Consider moving this statement to an else block

(TRY300)


68-68: Do not catch blind exception: Exception

(BLE001)


98-98: f-string without any placeholders

Remove extraneous f prefix

(F541)


150-150: Possible binding to all interfaces

(S104)


164-164: Possible binding to all interfaces

(S104)


178-178: Probable use of requests call without timeout

(S113)


181-181: Do not catch blind exception: Exception

(BLE001)


189-189: Probable use of requests call without timeout

(S113)


191-191: Probable use of requests call without timeout

(S113)


194-194: Do not catch blind exception: Exception

(BLE001)


200-200: Unused function argument: worker

(ARG001)


203-203: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/unittest/llmapi/apps/utils.py

205-205: Local variable bytes_out is assigned to but never used

Remove assignment to unused variable bytes_out

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unittest/llmapi/apps/openai_server.py (1)

79-101: Harden health check: timeout and monotonic clock.

Prevent indefinite hangs and clock skew issues; only rank 0 polls with a short per-request timeout.

Apply:

-        start = time.time()
+        start = time.monotonic()
         while True:
             try:
                 if self.rank == 0:
-                    if requests.get(url).status_code == 200:
+                    if requests.get(url, timeout=5).status_code == 200:
                         break
                     else:
                         time.sleep(0.5)
                 else:
                     time.sleep(timeout)
                     break
             except Exception as err:
                 result = self.proc.poll()
                 if result is not None and result != 0:
                     raise RuntimeError("Server exited unexpectedly.") from err

                 time.sleep(0.5)
-                if time.time() - start > timeout:
+                if time.monotonic() - start > timeout:
                     raise RuntimeError(
                         "Server failed to start in time.") from err
tests/unittest/llmapi/apps/utils.py (1)

21-25: Add typing imports for 3.8 compatibility.

You use built-in generics (list[str]) later; target is Python 3.8+. Import and use typing.List/Dict.

Apply:

-from typing import Any, Callable
+from typing import Any, Callable, Dict, List
♻️ Duplicate comments (17)
tests/unittest/llmapi/apps/openai_server.py (4)

1-1: Add required NVIDIA SPDX header.

Per repo guidelines, prepend the NVIDIA Apache-2.0 header to all Python sources.

Apply:

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0

31-33: Rank type bug and port sentinel handling.

  • SLURM_PROCID is str; comparing to int 0 fails, breaking health-wait logic.
  • Treat negative ports as “unset” to avoid binding to -1.

Apply:

-        self.port = port if port is not None else find_free_port()
-        self.rank = rank if rank != -1 else os.environ.get("SLURM_PROCID", 0)
+        self.port = find_free_port() if (port is None or port < 0) else port
+        self.rank = rank if rank != -1 else int(os.environ.get("SLURM_PROCID", "0"))

37-43: Broken temp-file usage: invalid argument and persistence.

NamedTemporaryFile has no delete_on_close; this raises TypeError and prevents server start. Also use text mode and flush.

Apply:

-        if extra_config:
-            with tempfile.NamedTemporaryFile(mode="w+",
-                                             delete=False,
-                                             delete_on_close=False) as f:
-                f.write(yaml.dump(extra_config))
-                self.extra_config_file = f.name
-            args += ["--extra_llm_api_options", self.extra_config_file]
+        if extra_config:
+            with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
+                f.write(yaml.dump(extra_config))
+                f.flush()
+                self.extra_config_file = f.name
+            args += ["--extra_llm_api_options", self.extra_config_file]

134-141: Same temp-file bug in disagg server init.

delete_on_close is invalid. Use text mode, suffix, and flush.

Apply:

-        with tempfile.NamedTemporaryFile(mode="w+",
-                                         delete=False,
-                                         delete_on_close=False) as f:
-            f.write(self._get_extra_config())
-            f.flush()
-            self.extra_config_file = f.name
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
+            f.write(self._get_extra_config())
+            f.flush()
+            self.extra_config_file = f.name
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (12)

224-232: Guard client fixture on both node and local rank.

Prevents AttributeError when disagg_server is None on non-ctx ranks.

Apply:

 def client(disagg_server: RemoteDisaggOpenAIServer):
-    if is_pytest_node():
+    if is_pytest_node() and disagg_server is not None:
         return disagg_server.get_client()
     else:
         print(f"skipping client for rank {RANK} node rank {NODE_RANK}")
         return None

14-18: Use SLURM_LOCALID and avoid fragile parsing of SLURM_NTASKS_PER_NODE.

LOCAL_RANK is the canonical per-node rank; SLURM_NTASKS_PER_NODE may be "2(x2)" and int() will fail.

Apply:

 RANK = int(os.environ.get("SLURM_PROCID", 0))
 NODE_RANK = int(os.environ.get("SLURM_NODEID", 0))
 NODE_LIST = expand_slurm_nodelist(os.environ.get("SLURM_NODELIST", ""))
-SLURM_NTASKS_PER_NODE = int(os.environ.get("SLURM_NTASKS_PER_NODE", 1))
+LOCAL_RANK = int(os.environ.get("SLURM_LOCALID", "0"))

29-36: Brittle host resolution (can crash on FQDN vs shortname).

Do not assert membership; select peers by NODE_RANK and skip if <2 nodes.

Apply:

-def get_the_other_host(idx=0):
-    assert len(NODE_LIST) >= 2
-    assert socket.gethostname() in NODE_LIST
-    node_list = NODE_LIST.copy()
-    node_list.remove(socket.gethostname())
-    return node_list[idx]
+def get_the_other_host(idx: int = 0) -> str:
+    nodes = [n for n in NODE_LIST if n]
+    if len(nodes) < 2:
+        pytest.skip(f"disagg multi-node test requires 2 nodes; got {nodes}")
+    assert NODE_RANK in (0, 1), f"Unexpected NODE_RANK={NODE_RANK}"
+    # For now: two-node scenario; pick the non-local node.
+    return nodes[1 - NODE_RANK]

55-66: Shell misuse and potential injection; also TypeError with list+shell=True.

You pass a list with shell=True; this raises TypeError. Remove shell=True, add timeout, and parse robustly.

Apply:

-    try:
-        # iproute2 may not be installed
-        proc = subprocess.run(["ip", "route", "get", test_ip],
-                              capture_output=True,
-                              text=True,
-                              shell=True,
-                              check=True)
+    try:
+        # iproute2 may not be installed
+        proc = subprocess.run(["ip", "route", "get", test_ip],
+                              capture_output=True,
+                              text=True,
+                              check=True,
+                              timeout=5)

151-160: Use LOCAL_RANK for rank gating; bind to all interfaces already OK.

Avoid RANK % SLURM_NTASKS_PER_NODE; pass LOCAL_RANK.

Apply:

         with RemoteOpenAIServer(model_path,
                                 port=CTX_SERVER_PORT,
                                 cli_args=args,
                                 host="0.0.0.0",
                                 env=env(),
                                 llmapi_launch=True,
-                                rank=RANK % SLURM_NTASKS_PER_NODE,
+                                rank=LOCAL_RANK,
                                 extra_config=extra_config) as server:

166-173: Same: use LOCAL_RANK for gen server.

Mirror ctx fix.

Apply:

         with RemoteOpenAIServer(model_path,
                                 port=GEN_SERVER_PORT,
                                 cli_args=args,
                                 host="0.0.0.0",
                                 env=env(),
-                                rank=RANK % SLURM_NTASKS_PER_NODE,
+                                rank=LOCAL_RANK,
                                 extra_config=extra_config) as server:

178-188: Add request timeout and keep a single GET per loop.

Prevents CI hangs and reduces load.

Apply:

-    while time.monotonic() - start < timeout:
+    while time.monotonic() - start < timeout:
         try:
             time.sleep(1)
-            if requests.get(url).status_code == 200:
+            resp = requests.get(url, timeout=5)
+            if resp.status_code == 200:
                 print(f"endpoint {url} is ready")
                 return
-        except Exception as err:
+        except requests.exceptions.RequestException as err:
             print(f"endpoint {url} is not ready, with exception: {err}")

190-201: Likewise for “down” wait: timeout + narrower exception + single GET.

Apply:

-    while time.monotonic() - start < timeout:
+    while time.monotonic() - start < timeout:
         try:
-            if requests.get(url).status_code >= 100:
-                print(
-                    f"endpoint {url} returned status code {requests.get(url).status_code}"
-                )
+            resp = requests.get(url, timeout=5)
+            if resp.status_code >= 100:
+                print(f"endpoint {url} returned status code {resp.status_code}")
                 time.sleep(1)
-        except Exception as err:
+        except requests.exceptions.RequestException as err:
             print(f"endpoint {url} is down, with exception: {err}")
             return

205-217: Fix URLs to include scheme and silence unused fixture arg.

  • Disagg server expects full URLs (http://...).
  • The worker fixture is intentionally unused; rename to _worker to satisfy linters.

Apply:

-@pytest.fixture(scope="module")
-def disagg_server(worker: RemoteOpenAIServer):
+@pytest.fixture(scope="module")
+def disagg_server(_worker: RemoteOpenAIServer):
     if is_disagg_node():
         print(f"starting disagg_server for rank {RANK} node rank {NODE_RANK}")
-        ctx_url = f"localhost:8001"  # Use localhost since the ctx server is on the same node
+        ctx_url = f"http://localhost:{CTX_SERVER_PORT}"  # ctx on same node
         # TODO: Hopefully the NODE_LIST is ordered by NODE_RANK, this test is only expected to run with 2 nodes now
         # We need to test with 4 nodes or more in the future, which should be easier with service discovery
-        gen_url = f"{get_the_other_host(0)}:8002"
+        gen_url = f"http://{get_the_other_host(0)}:{GEN_SERVER_PORT}"
         with RemoteDisaggOpenAIServer(ctx_servers=[ctx_url],
                                       gen_servers=[gen_url],
                                       port=DISAGG_SERVER_PORT,
                                       llmapi_launch=True,
                                       env=env()) as server:
             yield server

135-137: Avoid fragile node membership assert; prefer skip.

Tests should skip gracefully if SLURM env is not as expected.

Apply:

-    host = socket.gethostname()
-    assert host in NODE_LIST
+    host = socket.gethostname()
+    if host not in NODE_LIST or len(NODE_LIST) < 2:
+        pytest.skip(f"disagg multi-node test requires 2 nodes; NODE_LIST={NODE_LIST}, host={host}")

1-1: Add required NVIDIA SPDX header.

Tests are Python sources too; add the NVIDIA Apache-2.0 header.

Apply:

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0

65-75: Narrow exception type and minor print fix.

Catch subprocess errors explicitly and avoid bare Exception; also remove unnecessary f-prefix elsewhere.

Apply:

-        else:
-            raise ValueError("Could not find 'dev' in ip route output")
+        else:
+            raise RuntimeError("NIC not found in `ip route get` output")
         print(f"get NIC name from ip route, result: {nic_name}")
         return nic_name
-    except Exception as e:
+    except (subprocess.SubprocessError, FileNotFoundError, RuntimeError) as e:
         print(f"Failed to find NIC from ip route: {e}")
@@
-        except OSError as e:
-            print(f"Failed to find NIC from local interfaces: {e}")
+        except OSError as e:
+            print(f"Failed to find NIC from local interfaces: {e}")

Also change line 103 to remove an f-string without placeholders:

-        print(f"Failed to find NIC, will use default UCX interface")
+        print("Failed to find NIC, will use default UCX interface")
tests/unittest/llmapi/apps/utils.py (1)

189-220: Fix resource leak and 64-bit ioctl packing; simplify IP parsing.

  • Close the socket via context manager.
  • Use 'iP' for pointer-sized packing/unpacking.
  • Remove dead code and build IP strings directly.

Apply:

-def get_local_interfaces():
-    """ Returns a dictionary of name:ip key value pairs. """
-    MAX_BYTES = 4096
-    FILL_CHAR = b'\0'
-    SIOCGIFCONF = 0x8912
-    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
-    names = array.array('B', MAX_BYTES * FILL_CHAR)
-    names_address, names_length = names.buffer_info()
-    mutable_byte_buffer = struct.pack('iL', MAX_BYTES, names_address)
-    mutated_byte_buffer = fcntl.ioctl(sock.fileno(), SIOCGIFCONF,
-                                      mutable_byte_buffer)
-    max_bytes_out, names_address_out = struct.unpack('iL', mutated_byte_buffer)
-    namestr = names.tobytes()
-    namestr[:max_bytes_out]
-    bytes_out = namestr[:max_bytes_out]
-    ip_dict = {}
+def get_local_interfaces() -> Dict[str, str]:
+    """Returns a dictionary of name: ip key-value pairs."""
+    MAX_BYTES = 4096
+    FILL_CHAR = b'\0'
+    SIOCGIFCONF = 0x8912
+    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
+        names = array.array('B', FILL_CHAR * MAX_BYTES)
+        names_address, _ = names.buffer_info()
+        mutable_byte_buffer = struct.pack('iP', MAX_BYTES, names_address)
+        mutated_byte_buffer = fcntl.ioctl(sock.fileno(), SIOCGIFCONF, mutable_byte_buffer)
+    max_bytes_out, _ = struct.unpack('iP', mutated_byte_buffer)
+    namestr = names.tobytes()
+    ip_dict: Dict[str, str] = {}
     for i in range(0, max_bytes_out, 40):
         name = namestr[i:i + 16].split(FILL_CHAR, 1)[0]
         name = name.decode('utf-8')
         ip_bytes = namestr[i + 20:i + 24]
-        full_addr = []
-        for netaddr in ip_bytes:
-            if isinstance(netaddr, int):
-                full_addr.append(str(netaddr))
-            elif isinstance(netaddr, str):
-                full_addr.append(str(ord(netaddr)))
-        ip_dict[name] = '.'.join(full_addr)
+        ip_dict[name] = '.'.join(str(b) for b in ip_bytes)
 
     return ip_dict
🧹 Nitpick comments (6)
tests/unittest/llmapi/apps/openai_server.py (4)

21-29: Fix type hints: make cli_args Optional.

PEP 484: a parameter with default None must be Optional[List[str]].

Apply:

-                 cli_args: List[str] = None,
+                 cli_args: Optional[List[str]] = None,

44-47: Minor: build launch_cmd with unpacking.

Cleaner and avoids unnecessary concatenations.

Apply:

-        launch_cmd = ["trtllm-serve"] + [model] + args
+        launch_cmd = ["trtllm-serve", model, *args]
         if llmapi_launch:
-            # start server with llmapi-launch on multi nodes
-            launch_cmd = ["trtllm-llmapi-launch"] + launch_cmd
+            # start server with llmapi-launch on multi nodes
+            launch_cmd = ["trtllm-llmapi-launch", *launch_cmd]

63-77: Tighten exception handling in terminate().

  • Remove unused variable e.
  • Avoid blanket Exception (BLE001). Narrow to OSError for cleanup.

Apply:

-        except subprocess.TimeoutExpired as e:
+        except subprocess.TimeoutExpired:
             self.proc.kill()
             self.proc.wait(timeout=30)
-        try:
+        try:
             if self.extra_config_file:
                 os.remove(self.extra_config_file)
-        except Exception as e:
-            print(f"Error removing extra config file: {e}")
+        except OSError as err:
+            print(f"Error removing extra config file: {err}")

141-147: Minor: consistent command building.

Use unpacking consistently.

Apply:

-        launch_cmd = [
-            "trtllm-serve", "disaggregated", "-c", self.extra_config_file
-        ]
+        launch_cmd = ["trtllm-serve", "disaggregated", "-c", self.extra_config_file]
         if llmapi_launch:
-            # start server with llmapi-launch on multi nodes
-            launch_cmd = ["trtllm-llmapi-launch"] + launch_cmd
+            # start server with llmapi-launch on multi nodes
+            launch_cmd = ["trtllm-llmapi-launch", *launch_cmd]
tests/unittest/llmapi/apps/utils.py (2)

183-187: Annotate get_local_ip signature (minor).

Add type hints for clarity.

Apply:

-def get_local_ip(test_ip):
+def get_local_ip(test_ip: str) -> str:

250-257: Remove dead code block.

This preliminary loop does nothing and can be deleted.

Apply:

-    for group in groups:
-        bracket_match = re.match(r'^([^\[]+)\[(.+?)\]$', group)
-        if bracket_match:
-            prefix = bracket_match.group(1)
-            range_part = bracket_match.group(2)
-
-    expanded_nodes = []
+    expanded_nodes: List[str] = []
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfcb1e0 and 6e3dc2c.

📒 Files selected for processing (3)
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1 hunks)
  • tests/unittest/llmapi/apps/openai_server.py (5 hunks)
  • tests/unittest/llmapi/apps/utils.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
  • tests/unittest/llmapi/apps/utils.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
  • tests/unittest/llmapi/apps/utils.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/llmapi/apps/openai_server.py
  • tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py
  • tests/unittest/llmapi/apps/utils.py
🧬 Code graph analysis (2)
tests/unittest/llmapi/apps/openai_server.py (2)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
  • env (88-104)
tensorrt_llm/llmapi/mpi_session.py (1)
  • find_free_port (451-454)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (2)
tests/unittest/llmapi/apps/openai_server.py (4)
  • RemoteDisaggOpenAIServer (121-167)
  • RemoteOpenAIServer (17-118)
  • get_client (109-113)
  • terminate (63-77)
tests/unittest/llmapi/apps/utils.py (3)
  • expand_slurm_nodelist (222-302)
  • get_local_interfaces (191-219)
  • get_local_ip (183-186)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/apps/openai_server.py

25-25: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


44-44: Consider ["trtllm-serve", model, *args] instead of concatenation

Replace with ["trtllm-serve", model, *args]

(RUF005)


47-47: Consider ["trtllm-llmapi-launch", *launch_cmd] instead of concatenation

Replace with ["trtllm-llmapi-launch", *launch_cmd]

(RUF005)


50-50: subprocess call: check for execution of untrusted input

(S603)


69-69: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


75-75: Do not catch blind exception: Exception

(BLE001)


145-145: Consider ["trtllm-llmapi-launch", *launch_cmd] instead of concatenation

Replace with ["trtllm-llmapi-launch", *launch_cmd]

(RUF005)


148-148: subprocess call: check for execution of untrusted input

(S603)

tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py

60-60: subprocess call with shell=True identified, security issue

(S602)


60-60: Starting a process with a partial executable path

(S607)


70-70: Abstract raise to an inner function

(TRY301)


70-70: Avoid specifying long messages outside the exception class

(TRY003)


72-72: Consider moving this statement to an else block

(TRY300)


73-73: Do not catch blind exception: Exception

(BLE001)


103-103: f-string without any placeholders

Remove extraneous f prefix

(F541)


155-155: Possible binding to all interfaces

(S104)


169-169: Possible binding to all interfaces

(S104)


183-183: Probable use of requests call without timeout

(S113)


186-186: Do not catch blind exception: Exception

(BLE001)


194-194: Probable use of requests call without timeout

(S113)


196-196: Probable use of requests call without timeout

(S113)


199-199: Do not catch blind exception: Exception

(BLE001)


205-205: Unused function argument: worker

(ARG001)


208-208: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/unittest/llmapi/apps/utils.py

205-205: Local variable bytes_out is assigned to but never used

Remove assignment to unused variable bytes_out

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/unittest/llmapi/apps/utils.py (1)

302-303: Return type consistency (minor).

Ensure return type matches annotation (already fine after prior change). No code change needed.

@reasonsolo
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #18573 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #18573 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13943 completed with status: 'FAILURE'

@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch from 6e3dc2c to a9b0b8f Compare September 15, 2025 07:49
@reasonsolo
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #18594 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #18594 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13957 completed with status: 'FAILURE'

@xinhe-nv xinhe-nv force-pushed the tllm7182_disaggmultinodetest branch from a9b0b8f to a71d12a Compare September 16, 2025 09:24
@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch from a71d12a to 68b5fd5 Compare September 19, 2025 01:36
@reasonsolo
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19237 [ run ] triggered by Bot

@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch from ceeeeb1 to bac699b Compare September 19, 2025 06:53
@reasonsolo
Copy link
Collaborator Author

/bot run

@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch from bac699b to 02229d8 Compare September 19, 2025 06:54
@reasonsolo
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19307 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19308 [ ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19307 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #14497 completed with status: 'FAILURE'

@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch 2 times, most recently from 53d869d to a51f2c4 Compare September 19, 2025 09:41
@reasonsolo
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19344 [ run ] triggered by Bot

@reasonsolo
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19462 [ run ] triggered by Bot

@reasonsolo reasonsolo force-pushed the tllm7182_disaggmultinodetest branch from a51f2c4 to ba2b86e Compare September 22, 2025 09:31
@reasonsolo
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19558 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19462 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #14631 completed with status: 'FAILURE'

@reasonsolo
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19633 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19558 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14705 (Blue Ocean) completed with status: ABORTED

@reasonsolo reasonsolo enabled auto-merge (squash) September 23, 2025 04:05
@tensorrt-cicd
Copy link
Collaborator

PR_Github #19633 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14768 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@reasonsolo reasonsolo merged commit 7550251 into NVIDIA:main Sep 24, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants