Commit ac14e96

torchrun defaults for concurrent distributed training jobs (#2015)
1 parent bca5899 commit ac14e96

1 file changed: +6 -0 lines changed

torchtune/_cli/run.py

Lines changed: 6 additions & 0 deletions
@@ -88,6 +88,12 @@ def _run_distributed(self, args: argparse.Namespace, is_builtin: bool):
         # Have to reset the argv so that the recipe can be run with the correct arguments
         args.training_script = args.recipe
         args.training_script_args = args.recipe_args
+
+        # If the user does not explicitly pass a rendezvous endpoint, run in standalone mode.
+        # This allows running multiple distributed training jobs simultaneously.
+        if not args.rdzv_endpoint:
+            args.standalone = True
+
         # torchtune built-in recipes are specified with an absolute posix path, but
         # custom recipes are specified as a relative module dot path and need to be
         # run with python -m
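
The added branch can be exercised in isolation. The following is a minimal sketch, assuming a bare `argparse.Namespace` stands in for the parsed launcher arguments; `apply_standalone_default` is a hypothetical helper name used only for illustration and is not part of torchtune.

```python
import argparse


def apply_standalone_default(args: argparse.Namespace) -> argparse.Namespace:
    """Hypothetical helper mirroring the six added lines of the diff."""
    # An empty or missing endpoint is falsy, so the job falls back to standalone mode.
    if not args.rdzv_endpoint:
        args.standalone = True
    return args


# No endpoint passed: standalone is switched on.
no_endpoint = apply_standalone_default(
    argparse.Namespace(rdzv_endpoint="", standalone=False)
)

# Explicit endpoint passed (illustrative value): the user's settings are left untouched.
with_endpoint = apply_standalone_default(
    argparse.Namespace(rdzv_endpoint="localhost:29500", standalone=False)
)

print(no_endpoint.standalone, with_endpoint.standalone)
```

Because the check only tests truthiness, an explicitly passed endpoint always takes precedence, and the standalone default applies only when the user supplied nothing.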
