Conversation

@dggaytan (Collaborator) commented Aug 8, 2025

No description provided.

@dggaytan requested review from eromomon and jafraustro on August 8, 2025 15:37
Comment on lines 28 to 29
device = torch.device("cpu")
print(f"Running on device {device}")
Owner:

Does the example have 'gloo' support?

I think it is only for GPU distributed workloads.

@dggaytan (Collaborator, author):

No, it does not work on CPU. Should I leave this part as it is? While testing on XPU, the "CPU not supported" error is printed. What do you suggest?

Owner:

Remove any CPU reference; the target is multi-GPU.
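
For reference, a minimal sketch of a GPU-only ddp_setup along those lines (torchrun variant), reusing the accelerator-API calls already in this diff; the LOCAL_RANK handling and the returned device are assumptions taken from the variant discussed further down in the thread:

```
import os
import torch
import torch.distributed


def ddp_setup():
    # Sketch only: GPU target, no CPU branch. torchrun is assumed to set
    # LOCAL_RANK (plus MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE).
    rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
    torch.accelerator.set_device_index(rank)
    print(f"Running on rank {rank} on device {device}")
    backend = torch.distributed.get_default_backend_for_device(device)
    torch.distributed.init_process_group(backend=backend, device_id=device)
    return device
```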

torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
device = torch.device("cpu")
Owner:

Same questions as above.

print(f"Running on device {device}")

backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
Owner:

Maybe you can use the rank variable instead of device, since you need the device id.


backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
return device
Owner:

Why are you returning device?

I do not see the relation between the function name ddp_setup and this return value.

@dggaytan (Collaborator, author):

There is a function called _load_snapshot on line 52 in which the device is used to load a snapshot of the model onto that device; previously it was hard-coded to cuda.

Owner:

The previous version does not have a return:

def ddp_setup():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    init_process_group(backend="nccl")

@dggaytan (Collaborator, author):

Yes. Can you check this code? There is an explicit "cuda" string here, and the device name is needed to load the snapshot. Unless I repeat the earlier setup code, I think it should be left this way (with the return).

    def _load_snapshot(self, snapshot_path):
        loc = f"cuda:{self.gpu_id}"
        snapshot = torch.load(snapshot_path, map_location=loc)
        self.model.load_state_dict(snapshot["MODEL_STATE"])
        self.epochs_run = snapshot["EPOCHS_RUN"]
        print(f"Resuming training from snapshot at Epoch {self.epochs_run}")

Owner:

  • Use the accelerator API (device = torch.accelerator.current_accelerator()) inside the function, or obtain the device outside the function and pass it in as an argument.

  • Use the rank variable instead of gpu_id if you are using torchrun.

Example of obtaining the rank when using torchrun:

env_dict = {
    key: os.environ[key]
    for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
}
rank = int(env_dict["RANK"])
world_size = int(env_dict["WORLD_SIZE"])

You can construct the local device string like this:

loc = f"{device.type}:{rank}"
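
Putting those two points together, a minimal sketch of how _load_snapshot could look; self.device and self.rank are assumed to be set in __init__ (e.g. from torch.accelerator.current_accelerator() and the torchrun RANK env var):

```
    def _load_snapshot(self, snapshot_path):
        # Sketch only: build map_location from the accelerator device and the
        # rank instead of hard-coding "cuda".
        loc = f"{self.device.type}:{self.rank}"
        snapshot = torch.load(snapshot_path, map_location=loc)
        self.model.load_state_dict(snapshot["MODEL_STATE"])
        self.epochs_run = snapshot["EPOCHS_RUN"]
        print(f"Resuming training from snapshot at Epoch {self.epochs_run}")
```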


def main(save_every: int, total_epochs: int, batch_size: int, snapshot_path: str = "snapshot.pt"):
ddp_setup()
device = ddp_setup()
Owner:

Same concern as above.

dataset, model, optimizer = load_train_objs()
train_data = prepare_dataloader(dataset, batch_size)
trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path, device)
Owner:

My recommendation is to remove the extra device argument and call the accelerator API inside the Trainer class to reduce the number of changes.

Let's see how it looks.
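
A rough sketch of that idea, with Trainer keeping its original signature and resolving the device itself in __init__; the LOCAL_RANK handling is an assumption for the torchrun case, and the rest of the constructor stays as in the existing example:

```
import os
import torch


class Trainer:
    def __init__(self, model, train_data, optimizer, save_every, snapshot_path):
        # Sketch only: resolve the device inside the class instead of adding a
        # device argument. LOCAL_RANK is set by torchrun.
        self.rank = int(os.environ["LOCAL_RANK"])
        self.device = torch.device(f"{torch.accelerator.current_accelerator()}:{self.rank}")
        self.model = model.to(self.device)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.epochs_run = 0
        # Snapshot loading and DDP wrapping would follow as in the existing example.
```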

torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
device = torch.device("cpu")
Owner:

Same concern as above.

print(f"Running on device {device}")

backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
Owner:

I think you can use rank instead of device.


backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
return device
Owner:

Same concern as above.

@dggaytan (Collaborator, author):

There is a function called _load_snapshot on line 52 in which the device is used to load a snapshot of the model onto that device; previously it was hard-coded to cuda.

@dggaytan force-pushed the dggaytan/distributed_DDP_backup branch from ce47620 to 642060f on August 26, 2025 22:12
@dggaytan requested a review from jafraustro on August 26, 2025 22:12
@dggaytan (Collaborator, author):

Sorry for the late response, @jafraustro. Can you take a look again? If additional changes are needed, feel free to ping me and we can review this through a call.

# num_gpus = num local gpus to use (must be at least 2). Default = 2

# samples to run include:
# example.py
Owner:

example.py does not exist; it should be:

  • multigpu.py
  • multigpu_torchrun.py
  • multinode.py
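
So the comment block in the README could read, for example:

```
# samples to run include:
#   multigpu.py
#   multigpu_torchrun.py
#   multinode.py
```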

@jafraustro (Owner) left a comment:

You also need to update the single_gpu.py file:

  • the comment containing the word cuda
  • add it to CI
  • review the args; currently they are set as required

CI will fail because of how the args are set.
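
A minimal sketch of what giving the args defaults might look like; the argument names follow the existing examples, and the default numbers are only illustrative:

```
import argparse

parser = argparse.ArgumentParser(description="simple distributed training job")
# nargs="?" keeps the positional style of the examples while making the
# arguments optional; the defaults below are placeholders.
parser.add_argument("total_epochs", nargs="?", default=50, type=int,
                    help="Total epochs to train the model")
parser.add_argument("save_every", nargs="?", default=10, type=int,
                    help="How often to save a snapshot")
parser.add_argument("--batch_size", default=32, type=int,
                    help="Input batch size on each device (default: 32)")
args = parser.parse_args()
```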

uv run bash run_example.sh multigpu_torchrun.py || error "ddp tutorial series multigpu torchrun example failed"
uv run bash run_example.sh multinode.py || error "ddp tutorial series multinode example failed"
}


device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
Owner:

This line is not necessary.

device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
Owner:

This line is not necessary.

@dggaytan force-pushed the dggaytan/distributed_DDP_backup branch from 7dc5ed5 to 67b4a05 on August 28, 2025 21:50
torch.cuda.set_device(rank)
init_process_group(backend="nccl", rank=rank, world_size=world_size)

if torch.accelerator.is_available():
Owner:

You also need to remove the if statement.
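
A sketch of the setup without the availability check, assuming the (rank, world_size) signature from the spawn-based example and that MASTER_ADDR/MASTER_PORT are set as before:

```
import torch
import torch.distributed


def ddp_setup(rank: int, world_size: int):
    # Sketch only: a GPU is assumed, so no torch.accelerator.is_available() guard.
    device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
    torch.accelerator.set_device_index(rank)
    backend = torch.distributed.get_default_backend_for_device(device)
    torch.distributed.init_process_group(backend=backend, rank=rank,
                                         world_size=world_size, device_id=device)
    return device
```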

@jafraustro (Owner) left a comment:

LGTM

There is a suggested minor change

### Multi-Node

torchrun --nnodes=2 --nproc_per_node=4 multinode.py 20 5
Owner:

LGTM, minor change.

Suggested change. Replace:

torchrun --nnodes=2 --nproc_per_node=4 multinode.py 20 5

with:

Node 0:
torchrun --node-rank=0 --nnodes=2 --nproc_per_node=2 multinode.py 20 5

Node 1:
torchrun --node-rank=1 --nnodes=2 --nproc_per_node=2 multinode.py 20 5

Owner:

Also, I think it would be better if all args had a default value.

@dggaytan (Collaborator, author):

I added default values for epochs and save_every 👍

@dggaytan force-pushed the dggaytan/distributed_DDP_backup branch 3 times, most recently from 67b4a05 to eb6b10f on September 10, 2025 20:06
@dggaytan force-pushed the dggaytan/distributed_DDP_backup branch from 2c49ec5 to e4ead3a on September 10, 2025 20:19