Lightning PyTorch DDP Bug Report: Selective Rank Deadlock with Combined Dataset #21367

@alexmil2019

Bug description

When using DDP (Distributed Data Parallel) with 4 GPUs and a custom combined dataset that wraps two separate datasets, only ranks 1 and 3 (odd ranks) successfully reach train_dataloader(), while ranks 0 and 2 (even ranks) hang indefinitely after setup() completes. This causes training to deadlock during dataloader initialization.
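For context on why this manifests as a hang rather than an error: NCCL/Gloo collective operations block until every rank in the process group participates, so once the ranks' code paths diverge, the ranks that do reach a collective wait forever for the ranks that do not. Below is a minimal standalone sketch of that mechanism, independent of Lightning; the odd/even split and the gloo backend are illustrative assumptions only.

import os
import time
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank % 2 == 1:
        # Odd ranks reach the collective and block forever,
        # because the even ranks never join it.
        dist.barrier()
    else:
        # Stand-in for "stuck on some other code path".
        time.sleep(3600)

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)  # deadlocks by construction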

Environment

  • PyTorch Lightning Version: 2.5.6
  • PyTorch Version: 2.7.0+cu128
  • Python Version: 3.12
  • CUDA Version: 12.8
  • Operating System: Ubuntu (EC2 instance)
  • Hardware: 4x NVIDIA GPUs
  • DDP Strategy: ddp with NCCL backend
  • Number of Workers: 0 (num_workers=0, single-process data loading)

Minimal Reproducible Example

Dataset Structure

from torch.utils.data import DataLoader, Dataset, DistributedSampler
from lightning.pytorch import LightningDataModule

class CombinedDataset(Dataset):
    """Interleaves two datasets: even indices come from dataset_a, odd indices from dataset_b."""

    def __init__(self, dataset_a, dataset_b):
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b
        self.len_a = len(dataset_a)
        self.len_b = len(dataset_b)
        self._length = self.len_a + self.len_b

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # Even global indices map to dataset_a, odd ones to dataset_b.
        if index % 2 == 0:
            return self.dataset_a[index // 2]
        else:
            return self.dataset_b[index // 2]

class CombinedDataModule(LightningDataModule):
    def setup(self, stage=None):
        # Runs on every rank; builds the combined training dataset.
        if stage in (None, 'fit'):
            dataset_a = DatasetA(...)
            dataset_b = DatasetB(...)
            self.train_dataset = CombinedDataset(dataset_a, dataset_b)

    def train_dataloader(self):
        # Explicit DistributedSampler; num_workers=0 keeps loading in the main process.
        return DataLoader(
            self.train_dataset,
            sampler=DistributedSampler(self.train_dataset),
            batch_size=1,
            num_workers=0
        )
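For reference, the interleaving in CombinedDataset.__getitem__ maps global indices as shown in the tiny illustration below (plain lists, not part of the training code). Note that, as written, the mapping only stays in range when the two wrapped datasets have the same length (or dataset_a is longer by exactly one); a larger length difference makes some indices raise IndexError.

# index:      0     1     2     3     4     5
# comes from: a[0]  b[0]  a[1]  b[1]  a[2]  b[2]
combined = CombinedDataset(["a0", "a1", "a2"], ["b0", "b1", "b2"])
assert [combined[i] for i in range(6)] == ["a0", "b0", "a1", "b1", "a2", "b2"]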

Training Configuration

trainer = Trainer(
    accelerator='gpu',
    devices=4,
    strategy='ddp',
    # ... other config
)
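A possibly relevant detail about sampler handling: Lightning 2.x attaches a DistributedSampler to the training DataLoader itself by default (the use_distributed_sampler Trainer flag, True by default), while the DataModule above constructs one manually. Two variants that may help narrow down where the divergence comes from are sketched below; they are illustrative only and have not been verified against this bug.

# Variant 1: inside CombinedDataModule, return a plain DataLoader and let
# Lightning attach the DistributedSampler itself.
def train_dataloader(self):
    return DataLoader(self.train_dataset, batch_size=1, num_workers=0)

# Variant 2: keep the manual DistributedSampler and disable Lightning's
# automatic sampler handling on the Trainer.
trainer = Trainer(
    accelerator='gpu',
    devices=4,
    strategy='ddp',
    use_distributed_sampler=False,
)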

Steps to Reproduce

  1. Create a CombinedDataset that wraps two independent datasets (each working fine individually in DDP)
  2. Use CombinedDataModule with DDP training on 4 GPUs
  3. Call trainer.fit(model, datamodule) (a self-contained sketch follows this list)
  4. Observe that only 2 of the 4 ranks (specifically the odd-numbered ranks 1 and 3) reach the train_dataloader() method
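Putting the pieces above together, here is a self-contained sketch of the reproduction. DatasetA/DatasetB are replaced by synthetic TensorDatasets (sized so the combined length matches the 65120 seen in the logs) and the model by a minimal LightningModule; ToyModel and ToyDataModule are illustrative names only, and whether these synthetic stand-ins actually reproduce the hang has not been verified.

import torch
from torch import nn
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset
from lightning.pytorch import LightningDataModule, LightningModule, Trainer, seed_everything

# CombinedDataset exactly as defined above.

class ToyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

class ToyDataModule(LightningDataModule):
    def setup(self, stage=None):
        if stage in (None, "fit"):
            dataset_a = TensorDataset(torch.randn(32560, 8), torch.randn(32560, 1))
            dataset_b = TensorDataset(torch.randn(32560, 8), torch.randn(32560, 1))
            self.train_dataset = CombinedDataset(dataset_a, dataset_b)

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            sampler=DistributedSampler(self.train_dataset),
            batch_size=1,
            num_workers=0,
        )

if __name__ == "__main__":
    seed_everything(42)
    trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
    trainer.fit(ToyModel(), datamodule=ToyDataModule())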

Expected Behavior

All 4 DDP ranks should:

  1. Complete setup() successfully
  2. Call train_dataloader() to create the training DataLoader
  3. Begin training iteration

Actual Behavior

  • ✅ All 4 ranks complete setup() successfully
  • ✅ All 4 ranks pass any DDP barriers in setup()
  • ❌ Lightning internally calls train_dataloader() ONLY on ranks 1 and 3
  • ❌ Ranks 0 and 2 hang indefinitely in Lightning's internal code
  • ❌ Training never starts due to deadlock

Debug Output

[COMBINED SETUP] Barrier passed, train setup complete  # All 4 ranks print this
[rank: 0] Seed set to 42
[rank: 1] Seed set to 42
[rank: 2] Seed set to 42
[rank: 3] Seed set to 42

# Only ranks 1 and 3 proceed:
[COMBINED] Creating train_dataloader, num_workers=0  # Rank 1
[COMBINED] Building DataLoader...                     # Rank 1
[COMBINED __len__] Rank 1, returning length=65120
[SAMPLER INIT] Rank 1, dataset_len=65120
[SAMPLER INIT] Rank 1 completed, num_samples=16280

[COMBINED] Creating train_dataloader, num_workers=0  # Rank 3
[COMBINED __len__] Rank 3, returning length=65120
[SAMPLER INIT] Rank 3, dataset_len=65120
[SAMPLER INIT] Rank 3 completed, num_samples=16280

# Ranks 0 and 2 NEVER print these messages - they're stuck in Lightning internal code
# Training hangs here permanently
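To pin down where ranks 0 and 2 are blocked, one low-overhead option is Python's faulthandler module, which periodically dumps every thread's stack without attaching a debugger; on the hanging ranks the dump shows the exact frame they are stuck in. A sketch of how it could be wired into the top of the entry script (the 120-second interval is an arbitrary choice, and the two environment variables are optional extra logging from PyTorch/NCCL):

import faulthandler
import os
import sys

# Optional: more verbose collective / NCCL logging; best set before any
# distributed initialization happens.
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Dump all thread stacks of this process to stderr every 120 s for as long
# as the process stays alive.
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)

# ... build the Trainer / DataModule and call trainer.fit(...) as above ...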

Additional Context

What Works:

  • DatasetA alone with DDP on 4 GPUs
  • DatasetB alone with DDP on 4 GPUs
  • CombinedDataset on single GPU (no DDP)
  • Sanity check with the validation dataloader (all ranks work)

What Fails:

  • CombinedDataset with DDP on 4 GPUs (only during training dataloader creation)

Key Observations:

  1. Rank-Selective Failure: Only even-numbered ranks (0, 2) fail to reach train_dataloader(), while odd-numbered ranks (1, 3) succeed

  2. After Setup Success: All ranks complete setup() and pass DDP barriers, but Lightning's internal code path diverges before calling train_dataloader() (a cross-rank consistency check that could help localize this is sketched after this list)

  3. num_workers=0: Issue occurs even with single-process data loading (no multiprocessing workers), ruling out worker initialization issues

  4. Consistent Pattern: Behavior is deterministic and reproducible across multiple runs
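One cross-rank sanity check that could help rule out divergent dataset state (for example, the two wrapped datasets resolving to different lengths on different ranks, which is a classic way for DDP ranks to drift apart): gather len(self.train_dataset) from every rank at the end of setup() and compare. A sketch, assuming torch.distributed is already initialized at that point:

import torch.distributed as dist

class CombinedDataModule(LightningDataModule):
    def setup(self, stage=None):
        if stage in (None, 'fit'):
            dataset_a = DatasetA(...)
            dataset_b = DatasetB(...)
            self.train_dataset = CombinedDataset(dataset_a, dataset_b)

            if dist.is_available() and dist.is_initialized():
                # Gather the combined dataset length from every rank and
                # fail loudly if any rank disagrees.
                lengths = [None] * dist.get_world_size()
                dist.all_gather_object(lengths, len(self.train_dataset))
                if len(set(lengths)) != 1:
                    raise RuntimeError(f"Dataset length differs across ranks: {lengths}")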

What version are you seeing the problem on?

v2.5


cc @ethanwharris @justusschock @lantiga

Labels

bug (Something isn't working), distributed (Generic distributed-related topics), strategy: ddp (DistributedDataParallel), ver: 2.5.x
