Lightning PyTorch DDP Bug Report: Selective Rank Deadlock with Combined Dataset #21367

@alexmil2019

Bug description

When using DDP (Distributed Data Parallel) with 4 GPUs and a custom combined dataset that wraps two separate datasets, only ranks 1 and 3 (odd ranks) successfully reach train_dataloader(), while ranks 0 and 2 (even ranks) hang indefinitely after setup() completes. This causes training to deadlock during dataloader initialization.
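For context on why this manifests as a hang rather than an error: NCCL/Gloo collective operations block until every rank in the process group participates, so once the ranks' code paths diverge, the ranks that do reach a collective wait forever for the ranks that do not. Below is a minimal standalone sketch of that mechanism, independent of Lightning; the odd/even split and the gloo backend are illustrative assumptions only.

import os
import time
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank % 2 == 1:
        # Odd ranks reach the collective and block forever,
        # because the even ranks never join it.
        dist.barrier()
    else:
        # Stand-in for "stuck on some other code path".
        time.sleep(3600)

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)  # deadlocks by construction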

Environment

  • PyTorch Lightning Version: 2.5.6
  • PyTorch Version: 2.7.0+cu128
  • Python Version: 3.12
  • CUDA Version: 12.8
  • Operating System: Ubuntu (EC2 instance)
  • Hardware: 4x NVIDIA GPUs
  • DDP Strategy: ddp with NCCL backend
  • Number of Workers: 0 (num_workers=0, single-process data loading)

Minimal Reproducible Example

Dataset Structure

from torch.utils.data import DataLoader, Dataset, DistributedSampler
from lightning.pytorch import LightningDataModule

class CombinedDataset(Dataset):
    """Interleaves two datasets: even indices come from dataset_a, odd indices from dataset_b."""

    def __init__(self, dataset_a, dataset_b):
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b
        self.len_a = len(dataset_a)
        self.len_b = len(dataset_b)
        self._length = self.len_a + self.len_b

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # Even global indices map to dataset_a, odd ones to dataset_b.
        if index % 2 == 0:
            return self.dataset_a[index // 2]
        else:
            return self.dataset_b[index // 2]

class CombinedDataModule(LightningDataModule):
    def setup(self, stage=None):
        # Runs on every rank; builds the combined training dataset.
        if stage in (None, 'fit'):
            dataset_a = DatasetA(...)
            dataset_b = DatasetB(...)
            self.train_dataset = CombinedDataset(dataset_a, dataset_b)

    def train_dataloader(self):
        # Explicit DistributedSampler; num_workers=0 keeps loading in the main process.
        return DataLoader(
            self.train_dataset,
            sampler=DistributedSampler(self.train_dataset),
            batch_size=1,
            num_workers=0
        )
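For reference, the interleaving in CombinedDataset.__getitem__ maps global indices as shown in the tiny illustration below (plain lists, not part of the training code). Note that, as written, the mapping only stays in range when the two wrapped datasets have the same length (or dataset_a is longer by exactly one); a larger length difference makes some indices raise IndexError.

# index:      0     1     2     3     4     5
# comes from: a[0]  b[0]  a[1]  b[1]  a[2]  b[2]
combined = CombinedDataset(["a0", "a1", "a2"], ["b0", "b1", "b2"])
assert [combined[i] for i in range(6)] == ["a0", "b0", "a1", "b1", "a2", "b2"]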

Training Configuration

trainer = Trainer(
    accelerator='gpu',
    devices=4,
    strategy='ddp',
    # ... other config
)
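A possibly relevant detail about sampler handling: Lightning 2.x attaches a DistributedSampler to the training DataLoader itself by default (the use_distributed_sampler Trainer flag, True by default), while the DataModule above constructs one manually. Two variants that may help narrow down where the divergence comes from are sketched below; they are illustrative only and have not been verified against this bug.

# Variant 1: inside CombinedDataModule, return a plain DataLoader and let
# Lightning attach the DistributedSampler itself.
def train_dataloader(self):
    return DataLoader(self.train_dataset, batch_size=1, num_workers=0)

# Variant 2: keep the manual DistributedSampler and disable Lightning's
# automatic sampler handling on the Trainer.
trainer = Trainer(
    accelerator='gpu',
    devices=4,
    strategy='ddp',
    use_distributed_sampler=False,
)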

Steps to Reproduce

  1. Create a CombinedDataset that wraps two independent datasets (each working fine individually in DDP)
  2. Use CombinedDataModule with DDP training on 4 GPUs
  3. Call trainer.fit(model, datamodule) (a self-contained sketch follows this list)
  4. Observe that only 2 of the 4 ranks (specifically the odd-numbered ranks 1 and 3) reach the train_dataloader() method
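Putting the pieces above together, here is a self-contained sketch of the reproduction. DatasetA/DatasetB are replaced by synthetic TensorDatasets (sized so the combined length matches the 65120 seen in the logs) and the model by a minimal LightningModule; ToyModel and ToyDataModule are illustrative names only, and whether these synthetic stand-ins actually reproduce the hang has not been verified.

import torch
from torch import nn
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset
from lightning.pytorch import LightningDataModule, LightningModule, Trainer, seed_everything

# CombinedDataset exactly as defined above.

class ToyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

class ToyDataModule(LightningDataModule):
    def setup(self, stage=None):
        if stage in (None, "fit"):
            dataset_a = TensorDataset(torch.randn(32560, 8), torch.randn(32560, 1))
            dataset_b = TensorDataset(torch.randn(32560, 8), torch.randn(32560, 1))
            self.train_dataset = CombinedDataset(dataset_a, dataset_b)

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            sampler=DistributedSampler(self.train_dataset),
            batch_size=1,
            num_workers=0,
        )

if __name__ == "__main__":
    seed_everything(42)
    trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
    trainer.fit(ToyModel(), datamodule=ToyDataModule())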

Expected Behavior

All 4 DDP ranks should:

  1. Complete setup() successfully
  2. Call train_dataloader() to create the training DataLoader
  3. Begin training iteration

Actual Behavior

  • ✅ All 4 ranks complete setup() successfully
  • ✅ All 4 ranks pass any DDP barriers in setup()
  • ❌ Lightning internally calls train_dataloader() ONLY on ranks 1 and 3
  • ❌ Ranks 0 and 2 hang indefinitely in Lightning's internal code
  • ❌ Training never starts due to deadlock

Debug Output

[COMBINED SETUP] Barrier passed, train setup complete  # All 4 ranks print this
[rank: 0] Seed set to 42
[rank: 1] Seed set to 42
[rank: 2] Seed set to 42
[rank: 3] Seed set to 42

# Only ranks 1 and 3 proceed:
[COMBINED] Creating train_dataloader, num_workers=0  # Rank 1
[COMBINED] Building DataLoader...                     # Rank 1
[COMBINED __len__] Rank 1, returning length=65120
[SAMPLER INIT] Rank 1, dataset_len=65120
[SAMPLER INIT] Rank 1 completed, num_samples=16280

[COMBINED] Creating train_dataloader, num_workers=0  # Rank 3
[COMBINED __len__] Rank 3, returning length=65120
[SAMPLER INIT] Rank 3, dataset_len=65120
[SAMPLER INIT] Rank 3 completed, num_samples=16280

# Ranks 0 and 2 NEVER print these messages - they're stuck in Lightning internal code
# Training hangs here permanently
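To pin down where ranks 0 and 2 are blocked, one low-overhead option is Python's faulthandler module, which periodically dumps every thread's stack without attaching a debugger; on the hanging ranks the dump shows the exact frame they are stuck in. A sketch of how it could be wired into the top of the entry script (the 120-second interval is an arbitrary choice, and the two environment variables are optional extra logging from PyTorch/NCCL):

import faulthandler
import os
import sys

# Optional: more verbose collective / NCCL logging; best set before any
# distributed initialization happens.
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Dump all thread stacks of this process to stderr every 120 s for as long
# as the process stays alive.
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)

# ... build the Trainer / DataModule and call trainer.fit(...) as above ...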

Additional Context

What Works:

  • DatasetA alone with DDP on 4 GPUs
  • DatasetB alone with DDP on 4 GPUs
  • CombinedDataset on single GPU (no DDP)
  • Sanity check with the validation dataloader (all ranks work)

What Fails:

  • CombinedDataset with DDP on 4 GPUs (only during training dataloader creation)

Key Observations:

  1. Rank-Selective Failure: Only even-numbered ranks (0, 2) fail to reach train_dataloader(), while odd-numbered ranks (1, 3) succeed

  2. After Setup Success: All ranks complete setup() and pass DDP barriers, but Lightning's internal code path diverges before calling train_dataloader() (a cross-rank consistency check that could help localize this is sketched after this list)

  3. num_workers=0: Issue occurs even with single-process data loading (no multiprocessing workers), ruling out worker initialization issues

  4. Consistent Pattern: Behavior is deterministic and reproducible across multiple runs
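One cross-rank sanity check that could help rule out divergent dataset state (for example, the two wrapped datasets resolving to different lengths on different ranks, which is a classic way for DDP ranks to drift apart): gather len(self.train_dataset) from every rank at the end of setup() and compare. A sketch, assuming torch.distributed is already initialized at that point:

import torch.distributed as dist

class CombinedDataModule(LightningDataModule):
    def setup(self, stage=None):
        if stage in (None, 'fit'):
            dataset_a = DatasetA(...)
            dataset_b = DatasetB(...)
            self.train_dataset = CombinedDataset(dataset_a, dataset_b)

            if dist.is_available() and dist.is_initialized():
                # Gather the combined dataset length from every rank and
                # fail loudly if any rank disagrees.
                lengths = [None] * dist.get_world_size()
                dist.all_gather_object(lengths, len(self.train_dataset))
                if len(set(lengths)) != 1:
                    raise RuntimeError(f"Dataset length differs across ranks: {lengths}")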

What version are you seeing the problem on?

v2.5


cc @ethanwharris @justusschock @lantiga

Labels

bug (Something isn't working), distributed (Generic distributed-related topics), strategy: ddp (DistributedDataParallel), ver: 2.5.x
