[Core][AMD] Propagate shutdown timeout to MultiprocExecutor #43154

Merged
njhill merged 16 commits into vllm-project:main from rjrock:rocprof-worker-shutdown
Jun 12, 2026

Conversation

@rjrock rjrock commented May 19, 2026

Contributor

Purpose

rocprofv3 requires a grace period during process shutdown in order to emit trace data. This PR adds the environment variable VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS, which sets the shutdown grace period for the worker processes of MultiprocExecutor. The same value is also passed to the engine manager shutdown.

Previously, a command like the following would fail.

rocprofv3 \
  --disable-signal-handlers \
  --output-format pftrace \
  -r -- \
    vllm \
      bench throughput \
      --shutdown-timeout 60 \
      --model Qwen/Qwen3-32B \
      --num-prompts=1 \
      --tensor-parallel-size 2

Similarly, any rocprofv3 trace command that needed more than the 4-second shutdown period in multiproc_executor.py::_ensure_worker_termination would fail.
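
To make the 4-second limit concrete, here is a minimal sketch of a wait-then-escalate termination routine with a configurable grace period. It is an illustration under assumptions, not the code in multiproc_executor.py: the names `wait_for_exit` and `ensure_termination`, the polling interval, and the returned status strings are all hypothetical, and the real `_ensure_worker_termination` may be structured differently.

```python
import time
from typing import Protocol, Sequence


class ProcessLike(Protocol):
    """Anything with the liveness/termination methods of multiprocessing.Process."""

    def is_alive(self) -> bool: ...
    def terminate(self) -> None: ...
    def kill(self) -> None: ...


def wait_for_exit(procs: Sequence[ProcessLike], timeout: float,
                  poll_interval: float = 0.05) -> bool:
    """Poll until every process has exited or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if not any(p.is_alive() for p in procs):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def ensure_termination(procs: Sequence[ProcessLike],
                       grace_seconds: float = 4.0) -> str:
    """Let workers exit on their own first, then escalate.

    Returns a hypothetical status string describing how far escalation went.
    """
    # Phase 1: the grace period. A profiler attached to the worker can only
    # flush its output if the process is still alive during this window.
    if wait_for_exit(procs, grace_seconds):
        return "exited"
    # Phase 2: ask the remaining processes to stop (SIGTERM on POSIX).
    for p in procs:
        if p.is_alive():
            p.terminate()
    if wait_for_exit(procs, grace_seconds):
        return "terminated"
    # Phase 3: force-kill whatever is left.
    for p in procs:
        if p.is_alive():
            p.kill()
    return "killed"
```

A `multiprocessing.Process` exposes `is_alive()`, `terminate()`, and `kill()`, so it satisfies `ProcessLike`. In this sketch, a larger `grace_seconds` simply lengthens phase 1, which is the window a trace writer needs before any signal is sent.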

With this change merged, a successful run looks like the following.

export VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS=120
rocprofv3 \
  --disable-signal-handlers \
  --output-format pftrace \
  -r -- \
    vllm \
      bench throughput \
      --shutdown-timeout 60 \
      --model Qwen/Qwen3-32B \
      --num-prompts=1 \
      --tensor-parallel-size 2

Test Plan

  • pytest tests/v1/executor/test_executor.py::test_multiproc_executor_worker_termination_timeout
  • pytest -s -v tests/v1/engine/test_core_engine_actor_manager.py::test_background_resources_passes_worker_shutdown_timeout

Test Result

Success


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@mergify mergify Bot added the rocm and v1 labels May 19, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 19, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements a configurable shutdown timeout for the V1 engine and multiprocess executor. It adds a shutdown_timeout attribute to BackgroundResources and updates the MultiprocExecutor to use this value, ensuring a minimum grace period during worker termination. A review comment correctly identified a potential TypeError in multiproc_executor.py that could occur if the timeout configuration is None, suggesting a default value to prevent the crash.

Comment thread vllm/v1/executor/multiproc_executor.py Outdated
@rjrock rjrock force-pushed the rocprof-worker-shutdown branch from a947bd5 to d895571 Compare May 20, 2026 17:50
rjrock commented May 20, 2026

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a configurable shutdown timeout for the MultiprocExecutor in the V1 engine. Changes include adding a shutdown_timeout field to BackgroundResources, passing this value to the engine manager during shutdown, and updating MultiprocExecutor to use the configured timeout with a 4-second minimum. Unit tests were added to verify worker termination behavior. Feedback points out a potential TypeError in MultiprocExecutor if the shutdown_timeout is None and provides a suggestion to handle this case safely.
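
The failure mode described in this review is easy to reproduce: in Python 3, `max(None, 4)` raises `TypeError` because `None` cannot be compared with an int. A minimal sketch of a None-safe version with a floor follows. The function name and the constant are hypothetical, and the 4-second value is taken from the review text above rather than from the final code.

```python
from typing import Optional

# Assumed floor, mirroring the 4-second minimum mentioned in the review.
MIN_WORKER_SHUTDOWN_SECONDS = 4.0


def effective_shutdown_timeout(configured: Optional[float]) -> float:
    """Return the timeout to use, treating None as "not configured".

    The result never drops below MIN_WORKER_SHUTDOWN_SECONDS, so leaving the
    option unset keeps the previous behavior.
    """
    if configured is None:
        return MIN_WORKER_SHUTDOWN_SECONDS
    return max(float(configured), MIN_WORKER_SHUTDOWN_SECONDS)
```

Checking for `None` explicitly, rather than writing `configured or 4`, keeps an explicit `0` distinguishable from "unset" if the two ever need different handling.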

Comment thread vllm/v1/executor/multiproc_executor.py Outdated
@rjrock rjrock force-pushed the rocprof-worker-shutdown branch from 1a048aa to dbb1bf8 Compare May 20, 2026 18:18
@rjrock rjrock marked this pull request as ready for review May 20, 2026 18:36
@rjrock rjrock requested a review from njhill as a code owner May 20, 2026 18:36
mergify Bot commented May 20, 2026

Contributor

Hi @rjrock, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@rjrock rjrock force-pushed the rocprof-worker-shutdown branch from dbb1bf8 to eaf54b2 Compare May 20, 2026 19:33
@AndreasKaratzas AndreasKaratzas added the ready label Jun 1, 2026
@AndreasKaratzas

Member

cc @njhill PTAL

@dllehr-amd dllehr-amd left a comment

Collaborator

Can you take a quick peek at my note? I'm trying to confirm that we won't negatively impact the default operation mode if we don't set the time ourselves.

Comment thread vllm/v1/engine/core_client.py Outdated
rjrock and others added 5 commits June 1, 2026 15:52
rocprofv3 requires a grace period during process shutdown in
order to emit trace data.

Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
This reverts commit c20b9a8.

Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@rjrock rjrock force-pushed the rocprof-worker-shutdown branch from eaf54b2 to 0a59310 Compare June 1, 2026 20:52
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
rjrock commented Jun 1, 2026

Contributor Author

Added a max call to BackgroundResources to maintain the previous behavior.

@rjrock rjrock requested a review from dllehr-amd June 1, 2026 21:47
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
njhill commented Jun 3, 2026

Member

Thanks @rjrock. The shutdown_timeout option in the config is for a global graceful shutdown where we wait for in-flight requests to complete rather than immediately aborting them.

So I'm not sure we should use that value here. By the time we are shutting down the executor we are in tear-down mode and the 4 second timeout is just to allow the resources to be released/process to exit cleanly. Perhaps for this purpose it would be better to just add a new VLLM_WORKER_SHUTDOWN_TIMEOUT env var in envs.py?
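
A dedicated environment variable along these lines could be read as sketched below. This is an illustration under assumptions, not the code that landed in envs.py: the helper name, the default of 4 seconds, and the handling of empty or negative values are all hypothetical. The variable name is the one given in the PR description, VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS.

```python
import os
from typing import Mapping

ENV_NAME = "VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS"
DEFAULT_TIMEOUT_SECONDS = 4  # assumed default, matching the existing 4-second wait


def read_worker_shutdown_timeout(environ: Mapping[str, str] = os.environ) -> int:
    """Parse the worker shutdown timeout from the environment.

    Unset or blank values fall back to the default. Non-integer values raise
    ValueError from int(); negative values are rejected explicitly.
    """
    raw = environ.get(ENV_NAME, "").strip()
    if not raw:
        return DEFAULT_TIMEOUT_SECONDS
    value = int(raw)
    if value < 0:
        raise ValueError(f"{ENV_NAME} must be non-negative, got {value}")
    return value
```

Keeping this separate from the config's shutdown_timeout matches the distinction drawn above: one setting governs how long in-flight requests may finish, the other only how long worker processes get to exit during tear-down.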

mergify Bot commented Jun 4, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rjrock.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 4, 2026
rjrock added 4 commits June 4, 2026 19:48
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@mergify mergify Bot removed the needs-rebase label Jun 5, 2026
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
rjrock commented Jun 6, 2026

Contributor Author

Thanks @rjrock. The shutdown_timeout option in the config is for a global graceful shutdown where we wait for in-flight requests to complete rather than immediately aborting them.

So I'm not sure we should use that value here. By the time we are shutting down the executor we are in tear-down mode and the 4 second timeout is just to allow the resources to be released/process to exit cleanly. Perhaps for this purpose it would be better to just add a new VLLM_WORKER_SHUTDOWN_TIMEOUT env var in envs.py?

That makes sense. I rewrote it to use the env var VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS. Please take another look when you get a chance, @njhill.

@njhill njhill left a comment

Member

Thanks @rjrock, just have a couple of minor comments.

Comment thread vllm/envs.py Outdated
Comment thread vllm/envs.py Outdated
rjrock and others added 2 commits June 11, 2026 16:52
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@rjrock rjrock requested a review from njhill June 11, 2026 22:03
@njhill njhill enabled auto-merge (squash) June 11, 2026 22:25
mergify Bot commented Jun 11, 2026

Contributor

Hi @rjrock, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@dllehr-amd dllehr-amd left a comment

Collaborator

Thanks @rjrock! approving as well!

@njhill njhill merged commit aab639c into vllm-project:main Jun 12, 2026
80 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Jun 12, 2026
@rjrock rjrock deleted the rocprof-worker-shutdown branch June 12, 2026 20:14
nv-nedelman-1 added a commit to nv-nedelman-1/vllm that referenced this pull request Jun 12, 2026
Co-authored-by: Claude
Signed-off-by: Nicholas Edelman <nedelman@nvidia.com>

[Core][AMD] Propagate shutdown timeout to MultiprocExecutor (vllm-project#43154)

Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

[Refactor] Deprecate ResponsesParser wrapper, inline parsing into ParsableContext (vllm-project#45431)

Signed-off-by: sfeng33 <4florafeng@gmail.com>

[ROCm] Bump Torch to 2.11 (vllm-project#45362)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

[Attention] Improve attention benchmarks: configs and profiling (vllm-project#39336)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
nv-nedelman-1 pushed a commit to nv-nedelman-1/vllm that referenced this pull request Jun 12, 2026
…ject#43154)

Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…ject#43154)

Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

ready, rocm, v1

Projects

Status: Done

4 participants