## 🐛 Bug

Running DDP on a devgpu with 4 GPUs with `--nprocs_per_node=2` and `--nnodes=2` does not work when the script uses `LOCAL_RANK` to set the CUDA device.

```
torchx run dist.ddp -j 2x2
```

Module (check all that apply):

* [ ] `torchx.spec`
* [ ] `torchx.component`
* [ ] `torchx.apps`
* [ ] `torchx.runtime`
* [ ] `torchx.cli`
* [x] `torchx.schedulers`
* [ ] `torchx.pipelines`
* [ ] `torchx.aws`
* [ ] `torchx.examples`
* [ ] `other`

## To Reproduce

See the description above; this reproduces easily with a training script:

```
import os

import torch

if __name__ == "__main__":
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```

Try running the above with:

```
torchx run dist.ddp -j 2x2 main.py
```

## Expected behavior

The TorchX local scheduler should set `CUDA_VISIBLE_DEVICES=0,1` on the first two workers and `CUDA_VISIBLE_DEVICES=2,3` on the next two workers.

## Environment

- torchx version (e.g. 0.1.0rc1):
- Python version:
- OS (e.g., Linux):
- How you installed torchx (`conda`, `pip`, source, `docker`):
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc.):
- Any other relevant information:

## Additional context
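For illustration, the device partitioning described under "Expected behavior" could be computed as follows. This is a minimal sketch of the expected assignment, not the TorchX implementation; the function name is hypothetical.

```
# Sketch only: illustrates the expected CUDA_VISIBLE_DEVICES assignment
# for -j 2x2 (2 node groups x 2 procs per node on a 4-GPU host).
# visible_devices() is a hypothetical helper, not part of the TorchX API.

def visible_devices(node_rank: int, nproc_per_node: int) -> str:
    """Return the CUDA_VISIBLE_DEVICES value for one node group's workers."""
    start = node_rank * nproc_per_node
    return ",".join(str(d) for d in range(start, start + nproc_per_node))

# Node group 0 sees GPUs 0,1; node group 1 sees GPUs 2,3. LOCAL_RANK (0 or 1)
# then indexes into the visible devices rather than the physical device ids,
# so torch.cuda.set_device(LOCAL_RANK) works on both node groups.
print(visible_devices(0, 2))  # -> "0,1"
print(visible_devices(1, 2))  # -> "2,3"
```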