Conversation

@dggaytan (Collaborator) commented Aug 8, 2025

No description provided.

@dggaytan requested review from eromomon and jafraustro on August 8, 2025 15:37
Comment on lines 28 to 29
device = torch.device("cpu")
print(f"Running on device {device}")
Owner:

Does the example have 'gloo' support?

I think it is only for GPU distributed workloads.

@dggaytan (Collaborator, author):

No, it does not work on CPU. Should I leave this part as it is? While testing on XPU, the "CPU not supported" error is printed. What do you suggest?

Owner:

Remove any CPU reference; the target is multi-GPU.
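
For reference, a minimal sketch of a GPU-only ddp_setup along those lines (torchrun variant), reusing the accelerator-API calls already in this diff; the LOCAL_RANK handling and the returned device are assumptions taken from the variant discussed further down in the thread:

```
import os
import torch
import torch.distributed


def ddp_setup():
    # Sketch only: GPU target, no CPU branch. torchrun is assumed to set
    # LOCAL_RANK (plus MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE).
    rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
    torch.accelerator.set_device_index(rank)
    print(f"Running on rank {rank} on device {device}")
    backend = torch.distributed.get_default_backend_for_device(device)
    torch.distributed.init_process_group(backend=backend, device_id=device)
    return device
```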

torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
device = torch.device("cpu")
Owner:

Same questions as above.

print(f"Running on device {device}")

backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
Owner:

Maybe you can use the rank variable instead of device, since you need the device id.


backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
return device
Owner:

Why are you returning device?

I do not see the relation between the function name ddp_setup and this return value.

@dggaytan (Collaborator, author):

There is a function called _load_snapshot on line 52 in which the device is used to load a snapshot of the model onto that device; previously it was hard-coded to cuda.

Owner:

The previous version does not have a return:

def ddp_setup():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    init_process_group(backend="nccl")

@dggaytan (Collaborator, author):

Yes. Can you check this code? There is an explicit "cuda" string here, and the device name is needed to load the snapshot. Unless I repeat the earlier setup code, I think it should be left this way (with the return).

    def _load_snapshot(self, snapshot_path):
        loc = f"cuda:{self.gpu_id}"
        snapshot = torch.load(snapshot_path, map_location=loc)
        self.model.load_state_dict(snapshot["MODEL_STATE"])
        self.epochs_run = snapshot["EPOCHS_RUN"]
        print(f"Resuming training from snapshot at Epoch {self.epochs_run}")

Owner:

  • Use the accelerator API (device = torch.accelerator.current_accelerator()) inside the function, or obtain the device outside the function and pass it in as an argument.

  • Use the rank variable instead of gpu_id if you are using torchrun.

Example of obtaining the rank when using torchrun:

env_dict = {
    key: os.environ[key]
    for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
}
rank = int(env_dict["RANK"])
world_size = int(env_dict["WORLD_SIZE"])

You can construct the local device string like this:

loc = f"{device.type}:{rank}"
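
Putting those two points together, a minimal sketch of how _load_snapshot could look; self.device and self.rank are assumed to be set in __init__ (e.g. from torch.accelerator.current_accelerator() and the torchrun RANK env var):

```
    def _load_snapshot(self, snapshot_path):
        # Sketch only: build map_location from the accelerator device and the
        # rank instead of hard-coding "cuda".
        loc = f"{self.device.type}:{self.rank}"
        snapshot = torch.load(snapshot_path, map_location=loc)
        self.model.load_state_dict(snapshot["MODEL_STATE"])
        self.epochs_run = snapshot["EPOCHS_RUN"]
        print(f"Resuming training from snapshot at Epoch {self.epochs_run}")
```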


def main(save_every: int, total_epochs: int, batch_size: int, snapshot_path: str = "snapshot.pt"):
ddp_setup()
device = ddp_setup()
Owner:

Same concern as above.

dataset, model, optimizer = load_train_objs()
train_data = prepare_dataloader(dataset, batch_size)
trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path, device)
Owner:

My recommendation is to remove the extra device argument and call the accelerator API inside the Trainer class to reduce the number of changes.

Let's see how it looks.
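
A rough sketch of that idea, with Trainer keeping its original signature and resolving the device itself in __init__; the LOCAL_RANK handling is an assumption for the torchrun case, and the rest of the constructor stays as in the existing example:

```
import os
import torch


class Trainer:
    def __init__(self, model, train_data, optimizer, save_every, snapshot_path):
        # Sketch only: resolve the device inside the class instead of adding a
        # device argument. LOCAL_RANK is set by torchrun.
        self.rank = int(os.environ["LOCAL_RANK"])
        self.device = torch.device(f"{torch.accelerator.current_accelerator()}:{self.rank}")
        self.model = model.to(self.device)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.epochs_run = 0
        # Snapshot loading and DDP wrapping would follow as in the existing example.
```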

torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
device = torch.device("cpu")
Owner:

Same concern as above.

print(f"Running on device {device}")

backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
Owner:

I think you can use rank instead of device.


backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
return device
Owner:

Same concern as above.

@dggaytan (Collaborator, author):

There is a function called _load_snapshot on line 52 in which the device is used to load a snapshot of the model onto that device; previously it was hard-coded to cuda.

@dggaytan force-pushed the dggaytan/distributed_DDP_backup branch from ce47620 to 642060f on August 26, 2025 22:12
@dggaytan requested a review from jafraustro on August 26, 2025 22:12
@dggaytan (Collaborator, author):

Sorry for the late response, @jafraustro. Can you take a look again? If additional changes are needed, feel free to ping me and we can review this through a call.

# num_gpus = num local gpus to use (must be at least 2). Default = 2

# samples to run include:
# example.py
Owner:

example.py does not exist; it should be:

  • multigpu.py
  • multigpu_torchrun.py
  • multinode.py
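
So the comment block in the README could read, for example:

```
# samples to run include:
#   multigpu.py
#   multigpu_torchrun.py
#   multinode.py
```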

@jafraustro (Owner) left a comment:

You also need to update the single_gpu.py file:

  • the comment containing the word cuda
  • add it to CI
  • review the args; currently they are set as required

CI will fail because of how the args are set.
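
A minimal sketch of what giving the args defaults might look like; the argument names follow the existing examples, and the default numbers are only illustrative:

```
import argparse

parser = argparse.ArgumentParser(description="simple distributed training job")
# nargs="?" keeps the positional style of the examples while making the
# arguments optional; the defaults below are placeholders.
parser.add_argument("total_epochs", nargs="?", default=50, type=int,
                    help="Total epochs to train the model")
parser.add_argument("save_every", nargs="?", default=10, type=int,
                    help="How often to save a snapshot")
parser.add_argument("--batch_size", default=32, type=int,
                    help="Input batch size on each device (default: 32)")
args = parser.parse_args()
```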

uv run bash run_example.sh multigpu_torchrun.py || error "ddp tutorial series multigpu torchrun example failed"
uv run bash run_example.sh multinode.py || error "ddp tutorial series multinode example failed"
}


device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
Owner:

This line is not necessary.

device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
Owner:

This line is not necessary.

@dggaytan force-pushed the dggaytan/distributed_DDP_backup branch from 7dc5ed5 to 67b4a05 on August 28, 2025 21:50
torch.cuda.set_device(rank)
init_process_group(backend="nccl", rank=rank, world_size=world_size)

if torch.accelerator.is_available():
Owner:

You also need to remove the if statement.
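
A sketch of the setup without the availability check, assuming the (rank, world_size) signature from the spawn-based example and that MASTER_ADDR/MASTER_PORT are set as before:

```
import torch
import torch.distributed


def ddp_setup(rank: int, world_size: int):
    # Sketch only: a GPU is assumed, so no torch.accelerator.is_available() guard.
    device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
    torch.accelerator.set_device_index(rank)
    backend = torch.distributed.get_default_backend_for_device(device)
    torch.distributed.init_process_group(backend=backend, rank=rank,
                                         world_size=world_size, device_id=device)
    return device
```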

@jafraustro (Owner) left a comment:

LGTM

There is a suggested minor change

### Multi-Node

torchrun --nnodes=2 --nproc_per_node=4 multinode.py 20 5
Owner:

LGTM, minor change.

Suggested change. Replace:

torchrun --nnodes=2 --nproc_per_node=4 multinode.py 20 5

with:

Node 0:
torchrun --node-rank=0 --nnodes=2 --nproc_per_node=2 multinode.py 20 5

Node 1:
torchrun --node-rank=1 --nnodes=2 --nproc_per_node=2 multinode.py 20 5

Owner:

Also, I think it would be better if all args had a default value.

@dggaytan (Collaborator, author):

I added default values for epochs and save_every 👍

@dggaytan force-pushed the dggaytan/distributed_DDP_backup branch 3 times, most recently from 67b4a05 to eb6b10f on September 10, 2025 20:06
@dggaytan force-pushed the dggaytan/distributed_DDP_backup branch from 2c49ec5 to e4ead3a on September 10, 2025 20:19