From 6d45662a6337638714f9ea8d504bcde2faa605f6 Mon Sep 17 00:00:00 2001
From: Kiuk Chung
Date: Tue, 8 Mar 2022 12:20:10 -0800
Subject: [PATCH] (torchx/components) pass --tee=3 to dist.ddp to prefix local_rank on the worker's stdout and stderr streams

Summary:
Addresses the QOL issue around SLURM logs mentioned in https://github.com/pytorch/torchx/issues/405

TL;DR - since torchx launches nodes (not tasks) in SLURM, the stdout and stderr logs are combined for all 8 workers on the node (versus having separate ones for each worker when launched as task). This makes `dist.ddp` set `--tee=3` flag to torchelastic which prefixes each line of stderr and stdout of the workers with the local_rank of that worker so that the user can easily grep out the logs for a particular worker.

Differential Revision: D34726681

fbshipit-source-id: 24500a68981db7671f2f57961cc7dfa96c8a45e8
---
 torchx/components/dist.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/torchx/components/dist.py b/torchx/components/dist.py
index afc628a83..03036c6f3 100644
--- a/torchx/components/dist.py
+++ b/torchx/components/dist.py
@@ -219,6 +219,8 @@ def ddp(
         str(nnodes),
         "--nproc_per_node",
         str(nproc_per_node),
+        "--tee",
+        "3",
     ]
     if script is not None:
         cmd += [script]
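As a rough illustration of the "grep out the logs for a particular worker" step described in the summary, the sketch below filters a combined node log down to one worker. It is not part of this patch. It assumes each tee'd line starts with a prefix of the form `[<role_name><local_rank>]:` (for example `[trainer0]:`). The exact prefix format, the role name `trainer`, and the helper name `lines_for_local_rank` are assumptions made for illustration, so check them against the actual log output before relying on this.

```python
import re
from typing import Iterable, Iterator

# ASSUMPTION: tee'd lines look like "[<role_name><local_rank>]:<original line>".
# The role name is matched lazily so the trailing digits are read as the rank.
_PREFIX = re.compile(r"^\[(?P<role>.*?)(?P<rank>\d+)\]:(?P<body>.*)$")


def lines_for_local_rank(lines: Iterable[str], local_rank: int) -> Iterator[str]:
    """Yield the un-prefixed log lines written by a single local rank.

    Hypothetical helper: lines without a recognizable prefix are skipped.
    """
    for line in lines:
        match = _PREFIX.match(line.rstrip("\n"))
        if match and int(match.group("rank")) == local_rank:
            yield match.group("body")


if __name__ == "__main__":
    # Hand-written sample standing in for a combined per-node SLURM log.
    combined = [
        "[trainer0]:step 1\n",
        "[trainer1]:step 1\n",
        "[trainer0]:step 2\n",
        "unprefixed line\n",
    ]
    for body in lines_for_local_rank(combined, 0):
        print(body)
```

One limitation of this sketch: if the role name itself ends in a digit, the role and the rank cannot be told apart from the prefix alone, so the pattern would need to be anchored on the known role name instead.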