Adding torch accelerator to ddp-tutorial-series example #38
base: main
Conversation
device = torch.device("cpu")
print(f"Running on device {device}")
Does the example have 'gloo' support? I think it is only for GPU distributed workloads.
No, it does not work on CPU. Should I leave this part as it is? While testing on XPU, the "CPU not supported" error is printed. What do you suggest?
Remove any CPU reference; the target is multi-GPU.
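For reference, a minimal sketch of what the setup could look like with the CPU branch removed (an illustration built from the torch.accelerator calls already used in this PR, not the PR's final code):

import os
import torch

def ddp_setup():
    # Local rank of this process, set by torchrun.
    rank = int(os.environ["LOCAL_RANK"])
    # Assumes a GPU-like accelerator (cuda, xpu, ...) is present;
    # CPU-only runs are intentionally unsupported in this example.
    device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
    torch.accelerator.set_device_index(rank)
    backend = torch.distributed.get_default_backend_for_device(device)
    torch.distributed.init_process_group(backend=backend, device_id=device)
    return device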
    torch.accelerator.set_device_index(rank)
    print(f"Running on rank {rank} on device {device}")
else:
    device = torch.device("cpu")
Same questions as above.
| print(f"Running on device {device}") | ||
|
|
||
| backend = torch.distributed.get_default_backend_for_device(device) | ||
| torch.distributed.init_process_group(backend=backend, device_id=device) |
Maybe you can use the rank variable instead of device, since you need the device id.
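That would look something like torch.distributed.init_process_group(backend=backend, device_id=rank). Note that whether device_id accepts a bare int rather than a torch.device depends on the PyTorch version (older releases document it as torch.device only), so treat this as an assumption to verify against the targeted release.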
backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
return device
Why are you returning device? I do not see a relation between the function name ddp_setup and the returned value.
There is a function called _load_snapshot on line 52, in which the device is used to load a snapshot of the model onto the device; previously it was set to cuda by default.
The previous version does not have a return:

def ddp_setup():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    init_process_group(backend="nccl")
Yes. Can you check this code? There is an explicit "cuda" reference where the device name is needed to load the snapshot. Unless I repeat the earlier setup code, I think it should be left this way (with the return).
def _load_snapshot(self, snapshot_path):
    loc = f"cuda:{self.gpu_id}"
    snapshot = torch.load(snapshot_path, map_location=loc)
    self.model.load_state_dict(snapshot["MODEL_STATE"])
    self.epochs_run = snapshot["EPOCHS_RUN"]
    print(f"Resuming training from snapshot at Epoch {self.epochs_run}")
- Use the accelerator API (device = torch.accelerator.current_accelerator()) inside the function, or obtain the device outside the function and pass it in as an argument.
- Use the rank variable instead of gpu_id if you are using torchrun.

Example of obtaining the rank when using torchrun:

env_dict = {
    key: os.environ[key]
    for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
}
rank = int(env_dict["RANK"])
world_size = int(env_dict["WORLD_SIZE"])

You can construct the local device string like this:

loc = f"{device.type}:{rank}"
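Putting these suggestions together, a device-agnostic _load_snapshot might look like the sketch below (self.rank is an assumed attribute holding the rank obtained as above, not a name from the PR):

def _load_snapshot(self, snapshot_path):
    # Build a device-agnostic location string such as "cuda:0" or "xpu:1".
    device = torch.accelerator.current_accelerator()
    loc = f"{device.type}:{self.rank}"
    snapshot = torch.load(snapshot_path, map_location=loc)
    self.model.load_state_dict(snapshot["MODEL_STATE"])
    self.epochs_run = snapshot["EPOCHS_RUN"]
    print(f"Resuming training from snapshot at Epoch {self.epochs_run}")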
def main(save_every: int, total_epochs: int, batch_size: int, snapshot_path: str = "snapshot.pt"):
-    ddp_setup()
+    device = ddp_setup()
Same concern as above.
dataset, model, optimizer = load_train_objs()
train_data = prepare_dataloader(dataset, batch_size)
-    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
+    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path, device)
My recommendation is to remove the extra device argument and call the accelerator API inside the Trainer to reduce the number of changes. Let's see how it looks.
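As an illustration of that recommendation, the Trainer could derive the device itself (a sketch; self.local_rank and the exact attribute names are assumptions, not the PR's code):

import os
import torch

class Trainer:
    def __init__(self, model, train_data, optimizer, save_every, snapshot_path):
        # Derive the device inside the Trainer so main() need not pass it in;
        # LOCAL_RANK is set by torchrun.
        self.local_rank = int(os.environ["LOCAL_RANK"])
        self.device = torch.device(
            f"{torch.accelerator.current_accelerator()}:{self.local_rank}"
        )
        self.model = model.to(self.device)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.snapshot_path = snapshot_path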
    torch.accelerator.set_device_index(rank)
    print(f"Running on rank {rank} on device {device}")
else:
    device = torch.device("cpu")
Same concern as above.
| print(f"Running on device {device}") | ||
|
|
||
| backend = torch.distributed.get_default_backend_for_device(device) | ||
| torch.distributed.init_process_group(backend=backend, device_id=device) |
I think you can use rank instead of device.
backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend, device_id=device)
return device
Same concern as above.
There is a function called _load_snapshot on line 52, in which the device is used to load a snapshot of the model onto the device; previously it was set to cuda by default.
Force-pushed from ce47620 to 642060f.
Sorry for the late response @jafraustro. Can you take a look again? If additional changes are needed, feel free to ping me and we can review this through a call.
# num_gpus = num local gpus to use (must be at least 2). Default = 2

# samples to run include:
# example.py
example.py does not exist; it should be:
- multigpu.py
- multigpu_torchrun.py
- multinode.py
jafraustro left a comment:
You also need to update the single_gpu.py file:
- the comment with the word cuda
- add it to CI
- review the args; currently they are set as required (see the sketch below)

CI will fail because of how the args are set.
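For example, the positional args could be given defaults instead of being required (a sketch; the argument names match main() above, but the default values here are assumptions):

import argparse

parser = argparse.ArgumentParser(description="simple distributed training job")
# nargs="?" makes the positionals optional so CI can run without arguments.
parser.add_argument("total_epochs", nargs="?", default=50, type=int,
                    help="Total epochs to train the model")
parser.add_argument("save_every", nargs="?", default=10, type=int,
                    help="How often to save a snapshot")
parser.add_argument("--batch_size", default=32, type=int,
                    help="Input batch size on each device")
args = parser.parse_args()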
uv run bash run_example.sh multigpu_torchrun.py || error "ddp tutorial series multigpu torchrun example failed"
uv run bash run_example.sh multinode.py || error "ddp tutorial series multinode example failed"
}
Update the README with the instructions to run the examples.
device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
This line is not necessary.
device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
torch.accelerator.set_device_index(rank)
print(f"Running on rank {rank} on device {device}")
else:
This line is not necessary.
Force-pushed from 7dc5ed5 to 67b4a05.
-    torch.cuda.set_device(rank)
-    init_process_group(backend="nccl", rank=rank, world_size=world_size)
+    if torch.accelerator.is_available():
You also need to remove the if statement.
jafraustro left a comment:
LGTM. There is a suggested minor change.
```
### Multi-Node
```
torchrun --nnodes=2 --nproc_per_node=4 multinode.py 20 5
LGTM, minor change:
Suggested change:
-    torchrun --nnodes=2 --nproc_per_node=4 multinode.py 20 5
+    Node 0
+    torchrun --node-rank=0 --nnodes=2 --nproc_per_node=2 multinode.py 20 5
+    Node 1
+    torchrun --node-rank=1 --nnodes=2 --nproc_per_node=2 multinode.py 20 5
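Note that for the two per-node commands to join the same job, both nodes typically also need matching rendezvous settings (for example --rdzv-id, --rdzv-backend=c10d, and --rdzv-endpoint pointing at node 0); that detail is an assumption about how the example is launched, not part of the suggestion itself.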
Also, I think it would be better if all args had default values.
I added the default values for epochs and save_every 👍
Force-pushed from 67b4a05 to eb6b10f.
Signed-off-by: dggaytan <[email protected]>
Force-pushed from 2c49ec5 to e4ead3a.