Bug description
Lightning PyTorch DDP Bug Report: Selective Rank Deadlock with Combined Dataset
Bug Description
When using DDP (Distributed Data Parallel) with 4 GPUs and a custom combined dataset that wraps two separate datasets, only ranks 1 and 3 (odd ranks) successfully reach `train_dataloader()`, while ranks 0 and 2 (even ranks) hang indefinitely after `setup()` completes. This causes training to deadlock during dataloader initialization.
Environment
- PyTorch Lightning Version: 2.5.6
- PyTorch Version: '2.7.0+cu128'
- Python Version: 3.12
- CUDA Version: 12.8
- Operating System: Ubuntu (EC2 instance)
- Hardware: 4x NVIDIA GPUs
- DDP Strategy: `ddp` with NCCL backend
- Number of Workers: 0 (`num_workers=0`, single-process data loading)
Minimal Reproducible Example
Dataset Structure
```python
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from lightning.pytorch import LightningDataModule


class CombinedDataset(Dataset):
    """Interleaves two datasets: even indices map to dataset_a, odd indices to dataset_b."""

    def __init__(self, dataset_a, dataset_b):
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b
        self.len_a = len(dataset_a)
        self.len_b = len(dataset_b)
        self._length = self.len_a + self.len_b

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # Even/odd interleaving; assumes len_a == len_b so index // 2 stays in range.
        if index % 2 == 0:
            return self.dataset_a[index // 2]
        else:
            return self.dataset_b[index // 2]


class CombinedDataModule(LightningDataModule):
    def setup(self, stage=None):
        if stage in (None, 'fit'):
            dataset_a = DatasetA(...)  # project-specific dataset, works alone under DDP
            dataset_b = DatasetB(...)  # project-specific dataset, works alone under DDP
            self.train_dataset = CombinedDataset(dataset_a, dataset_b)

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            sampler=DistributedSampler(self.train_dataset),
            batch_size=1,
            num_workers=0,
        )
```
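For reference, the sampler wiring above can be exercised outside Lightning. The sketch below is an assumption-laden stand-in: the two `TensorDataset` placeholders take the place of `DatasetA`/`DatasetB`, and the 32560-item sizes are chosen only so that the combined length matches the 65120 items seen in the debug output further down.

```python
# Standalone check of CombinedDataset + DistributedSampler (no process group needed).
# The TensorDatasets are placeholders for DatasetA/DatasetB; 32560 items each is an
# assumption so that 32560 + 32560 = 65120, the length reported in the logs.
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset_a = TensorDataset(torch.arange(32560))
dataset_b = TensorDataset(torch.arange(32560))
combined = CombinedDataset(dataset_a, dataset_b)

for rank in range(4):
    # Passing num_replicas/rank explicitly avoids needing torch.distributed to be initialized.
    sampler = DistributedSampler(combined, num_replicas=4, rank=rank)
    print(rank, len(sampler))  # 65120 / 4 = 16280 per rank, matching the "[SAMPLER INIT]" lines
```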
Training Configuration

```python
from lightning.pytorch import Trainer

trainer = Trainer(
    accelerator='gpu',
    devices=4,
    strategy='ddp',
    # ... other config
)
```
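For context only (this is not the configuration that was run above): since the DataModule builds its own `DistributedSampler`, a variant that disables Lightning's automatic distributed-sampler handling via the 2.x `use_distributed_sampler` flag would look like this.

```python
# Context-only variant: keep the manually constructed DistributedSampler authoritative
# by turning off Lightning's automatic sampler replacement (Lightning 2.x Trainer flag).
trainer = Trainer(
    accelerator='gpu',
    devices=4,
    strategy='ddp',
    use_distributed_sampler=False,
)
```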
Steps to Reproduce

- Create a `CombinedDataset` that wraps two independent datasets (each works fine individually under DDP)
- Use `CombinedDataModule` with DDP training on 4 GPUs
- Call `trainer.fit(model, datamodule)` (launch wiring sketched below)
- Observe that only 2 of the 4 ranks (specifically the odd-numbered ranks) successfully reach the `train_dataloader()` method
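The launch wiring assumed by these steps is roughly the following; `MyLightningModule` and the constructor arguments are placeholders for the actual model and settings.

```python
# Rough launch sketch; MyLightningModule stands in for the actual LightningModule.
from lightning.pytorch import Trainer

datamodule = CombinedDataModule()
model = MyLightningModule()

trainer = Trainer(accelerator='gpu', devices=4, strategy='ddp')
trainer.fit(model, datamodule=datamodule)
```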
Expected Behavior
All 4 DDP ranks should:
- Complete `setup()` successfully
- Call `train_dataloader()` to create the training DataLoader
- Begin the training iteration
Actual Behavior
- ✅ All 4 ranks complete `setup()` successfully
- ✅ All 4 ranks pass any DDP barriers in `setup()`
- ❌ Lightning internally calls `train_dataloader()` ONLY on ranks 1 and 3
- ❌ Ranks 0 and 2 hang indefinitely in Lightning's internal code
- ❌ Training never starts due to the deadlock
Debug Output
```
[COMBINED SETUP] Barrier passed, train setup complete   # All 4 ranks print this
[rank: 0] Seed set to 42
[rank: 1] Seed set to 42
[rank: 2] Seed set to 42
[rank: 3] Seed set to 42

# Only ranks 1 and 3 proceed:
[COMBINED] Creating train_dataloader, num_workers=0     # Rank 1
[COMBINED] Building DataLoader...                       # Rank 1
[COMBINED __len__] Rank 1, returning length=65120
[SAMPLER INIT] Rank 1, dataset_len=65120
[SAMPLER INIT] Rank 1 completed, num_samples=16280
[COMBINED] Creating train_dataloader, num_workers=0     # Rank 3
[COMBINED __len__] Rank 3, returning length=65120
[SAMPLER INIT] Rank 3, dataset_len=65120
[SAMPLER INIT] Rank 3 completed, num_samples=16280

# Ranks 0 and 2 NEVER print these messages - they're stuck in Lightning internal code
# Training hangs here permanently
```
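The tagged lines above come from per-rank print instrumentation added on top of the minimal example (omitted from the MRE for brevity). A simplified, illustrative sketch of that instrumentation, not the exact code:

```python
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler


def _rank() -> int:
    # Rank used only for log prefixes; falls back to 0 before the process group exists.
    return dist.get_rank() if dist.is_available() and dist.is_initialized() else 0


class LoggingDistributedSampler(DistributedSampler):
    # Simplified stand-in for the instrumented sampler that prints the "[SAMPLER INIT]" lines.
    def __init__(self, dataset, **kwargs):
        print(f"[SAMPLER INIT] Rank {_rank()}, dataset_len={len(dataset)}", flush=True)
        super().__init__(dataset, **kwargs)
        print(f"[SAMPLER INIT] Rank {_rank()} completed, num_samples={self.num_samples}", flush=True)


# train_dataloader() and CombinedDataset.__len__ carry similar prints behind the
# "[COMBINED]" / "[COMBINED __len__]" tags.
```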
Additional Context
What Works:
- ✅ `DatasetA` alone with DDP on 4 GPUs
- ✅ `DatasetB` alone with DDP on 4 GPUs
- ✅ `CombinedDataset` on a single GPU (no DDP)
- ✅ Sanity check with the validation dataloader (all ranks work)
What Fails:
- ❌ `CombinedDataset` with DDP on 4 GPUs (only during training dataloader creation)
Key Observations:
- **Rank-Selective Failure:** Only even-numbered ranks (0, 2) fail to reach `train_dataloader()`, while odd-numbered ranks (1, 3) succeed
- **After Setup Success:** All ranks complete `setup()` and pass the DDP barriers, but Lightning's internal code path diverges before calling `train_dataloader()` (see the diagnostic sketch after this list)
- **num_workers=0:** The issue occurs even with single-process data loading (no multiprocessing workers), ruling out worker initialization issues
- **Consistent Pattern:** The behavior is deterministic and reproducible across multiple runs
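Stack traces for the hung ranks 0 and 2 have not been captured yet. If useful, a standard `faulthandler` hook like the sketch below can be added to the training script so each process dumps its Python thread stacks on demand:

```python
# Diagnostic only: make every rank dump its Python thread stacks when it receives
# SIGUSR1, so a hung rank can be inspected with `kill -USR1 <pid>`.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```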
What version are you seeing the problem on?
v2.5
Reproduced in studio
No response
How to reproduce the bug
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
More info
No response