Commit 7dc5ed5

Adding torch accelerator to ddp-tutorial-series example
Signed-off-by: dggaytan <[email protected]>
1 parent 642060f commit 7dc5ed5

File tree

5 files changed: +30 -11 lines changed

distributed/ddp-tutorial-series/README.md
distributed/ddp-tutorial-series/multigpu.py
distributed/ddp-tutorial-series/run_example.sh
distributed/ddp-tutorial-series/single_gpu.py
run_distributed_examples.sh

distributed/ddp-tutorial-series/README.md

Lines changed: 24 additions & 4 deletions
@@ -15,7 +15,27 @@ Each code file extends upon the previous one. The series starts with a non-distr
 * [slurm/setup_pcluster_slurm.md](slurm/setup_pcluster_slurm.md): instructions to set up an AWS cluster
 * [slurm/config.yaml.template](slurm/config.yaml.template): configuration to set up an AWS cluster
 * [slurm/sbatch_run.sh](slurm/sbatch_run.sh): slurm script to launch the training job
-
-
-
-
+## Installation
+```
+pip install -r requirements.txt
+```
+## Running Examples
+For running the examples to run for 20 Epochs and save checkpoints every 5 Epochs, you can use the following command:
+### Single GPU
+```
+python single_gpu.py 20 5
+```
+### Multi-GPU
+```
+python multigpu.py 20 5
+```
+### Multi-GPU Torchrun
+```
+torchrun --nnodes=1 --nproc_per_node=4 multigpu_torchrun.py 20 5
+```
+### Multi-Node
+```
+torchrun --nnodes=2 --nproc_per_node=4 multinode.py 20 5
+```
+
+For more details, check the [run_examples.sh](distributed/ddp-tutorial-series/run_examples.sh) script.
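Note that the Multi-Node command added to the README is the per-node form; when launching across two hosts it is typically combined with rendezvous options such as --rdzv_id, --rdzv_backend=c10d, and --rdzv_endpoint=<host>:<port> on every participating node, mirroring the flags that run_example.sh already passes for the single-node case.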

distributed/ddp-tutorial-series/multigpu.py

Lines changed: 0 additions & 3 deletions
@@ -19,13 +19,10 @@ def ddp_setup(rank, world_size):
     os.environ["MASTER_ADDR"] = "localhost"
     os.environ["MASTER_PORT"] = "12355"

-    rank = int(os.environ["LOCAL_RANK"])
     if torch.accelerator.is_available():
         device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
         torch.accelerator.set_device_index(rank)
         print(f"Running on rank {rank} on device {device}")
-    else:
-        print(f"Multi-GPU environment not detected")

     backend = torch.distributed.get_default_backend_for_device(device)
     init_process_group(backend=backend, rank=rank, world_size=world_size)
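For readers skimming the hunk above, here is a sketch of ddp_setup in multigpu.py as it stands after this commit, assembled from the context lines; the surrounding imports are assumed rather than shown in the diff:

```
import os
import torch
from torch.distributed import init_process_group  # assumed import; only the call appears in the diff


def ddp_setup(rank, world_size):
    # rank is supplied by the caller; this commit drops the LOCAL_RANK lookup here.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"

    # Use the generic accelerator API instead of CUDA-specific calls.
    if torch.accelerator.is_available():
        device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
        torch.accelerator.set_device_index(rank)
        print(f"Running on rank {rank} on device {device}")

    # Pick the default distributed backend for the selected device (e.g. nccl for CUDA).
    backend = torch.distributed.get_default_backend_for_device(device)
    init_process_group(backend=backend, rank=rank, world_size=world_size)
```

With the else branch removed, the function now assumes an accelerator is present: if torch.accelerator.is_available() returns False, device is never assigned and the backend lookup below it will fail.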

distributed/ddp-tutorial-series/run_example.sh

Lines changed: 3 additions & 2 deletions
@@ -4,7 +4,8 @@
 # num_gpus = num local gpus to use (must be at least 2). Default = 2

 # samples to run include:
-# example.py
+# multigpu_torchrun.py
+# multinode.py

 echo "Launching ${1:-example.py} with ${2:-2} gpus"
-torchrun --nnodes=1 --nproc_per_node=${2:-2} --rdzv_id=101 --rdzv_endpoint="localhost:5972" ${1:-example.py}
+torchrun --nnodes=1 --nproc_per_node=${2:-2} --rdzv_id=101 --rdzv_endpoint="localhost:5972" ${1:-example.py} 10 1
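With the new trailing arguments, a local run such as bash run_example.sh multigpu_torchrun.py 4 launches multigpu_torchrun.py on 4 local GPUs via torchrun for 10 epochs, saving a checkpoint every epoch; the script name and GPU count fall back to example.py and 2 when omitted.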

distributed/ddp-tutorial-series/single_gpu.py

Lines changed: 1 addition & 1 deletion
@@ -78,5 +78,5 @@ def main(device, total_epochs, save_every, batch_size):
     parser.add_argument('--batch_size', default=32, type=int, help='Input batch size on each device (default: 32)')
     args = parser.parse_args()

-    device = 0 # shorthand for cuda:0
+    device = 0
     main(device, args.total_epochs, args.save_every, args.batch_size)
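As a rough sketch of the resulting entry point in single_gpu.py: the positional total_epochs and save_every arguments are assumed to be defined just above this hunk (the call references them, but the diff does not show their definitions), and main is stubbed out here:

```
import argparse


def main(device, total_epochs, save_every, batch_size):
    ...  # training loop lives in single_gpu.py


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # total_epochs and save_every are assumed positional arguments; only
    # --batch_size appears in the diff context above.
    parser.add_argument('total_epochs', type=int, help='Total epochs to train the model')
    parser.add_argument('save_every', type=int, help='How often to save a checkpoint (in epochs)')
    parser.add_argument('--batch_size', default=32, type=int, help='Input batch size on each device (default: 32)')
    args = parser.parse_args()

    device = 0  # device index on the available accelerator (previously annotated as cuda:0)
    main(device, args.total_epochs, args.save_every, args.batch_size)
```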

run_distributed_examples.sh

Lines changed: 2 additions & 1 deletion
@@ -51,9 +51,10 @@ function distributed_tensor_parallelism() {
 }

 function distributed_ddp-tutorial-series() {
-    uv run bash run_example.sh multigpu.py || error "ddp tutorial series multigpu example failed"
+    uv python multigpu.py 10 1 || error "ddp tutorial series multigpu example failed"
     uv run bash run_example.sh multigpu_torchrun.py || error "ddp tutorial series multigpu torchrun example failed"
     uv run bash run_example.sh multinode.py || error "ddp tutorial series multinode example failed"
+    uv python single_gpu.py 10 1 || error "ddp tutorial series single gpu example failed"
 }

 function distributed_ddp() {
