Skip to content

Remove unused RolloutServer.all_engines for encapsulations#939

Open
fzyzcjy wants to merge 17 commits intorollout_ft/21from
rollout_ft/22
Open

Remove unused RolloutServer.all_engines for encapsulations#939
fzyzcjy wants to merge 17 commits intorollout_ft/21from
rollout_ft/22

Conversation

@fzyzcjy
Copy link
Copy Markdown
Collaborator

@fzyzcjy fzyzcjy commented Apr 7, 2026

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the all_engines property from the RolloutServer class. Feedback indicates that this removal is premature, as the property is necessary for maintaining consistency across all ranks in multi-node configurations and provides better encapsulation for external callers compared to accessing internal server groups directly.

I am having trouble creating individual review comments. Click here to see my feedback.

miles/ray/rollout/rollout_server.py (161-163)

high

Removing all_engines from RolloutServer appears to be premature and potentially counter-productive to the goal of encapsulation for the following reasons:

  1. Correctness in Multi-node: Currently, methods like offload, onload, and check_weights in ServerGroup (and the weight update logic in RolloutManager) rely on the engines property, which only includes node-0 actors. In multi-node Tensor Parallel (TP) configurations where nodes_per_engine > 1, this results in operations only being applied to a subset of actors, leaving others (on different nodes) in an inconsistent state. all_engines is the correct collection to use for these operations to ensure all ranks are reached.
  2. Encapsulation: In RolloutManager._try_ci_fault_injection, the code currently reaches into self._server.server_groups[0].all_engines. If RolloutServer.all_engines were available, it would provide a cleaner, higher-level interface that hides the internal group structure. Removing it forces external callers to depend on the internal server_groups implementation details.

I recommend retaining this property and using it to address the existing multi-node inconsistencies in the rollout orchestration logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant