
Bringing together results on ddp on a single machine #702

@brucemuller

Description

I'd like to understand how to use DDP properly with multiple GPUs on a single machine, as I'm unsure how to bring results together when using this method.

I'm using TensorBoard for logging.

The problem seems to be that my code (below) runs on each of the three GPUs (with a third of the data each), but variables like "overall_correct" exist separately in each of the three processes, so only a third of the data gets logged. For example, my overall performance on a single GPU is 82%, but with this setup on 3 GPUs it is roughly a third of that. I know this is kind of a silly thing, but can someone explain how I should bring together the required validation/training statistics from the sub-processes using PyTorch Lightning?

My process is roughly:

import torch
from torch import optim
from torch.utils import data

import pytorch_lightning as pl
from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.logging import TestTubeLogger  # location varies by PL version; newer releases use pytorch_lightning.loggers

# ResNetEncoder, averageMeter and PairLoader are defined elsewhere in my project

model = MyModel(hparams)
tt_logger = TestTubeLogger(save_dir="path", name=expname)
trainer = Trainer(logger=tt_logger, gpus=3, distributed_backend='ddp')
trainer.fit(model)

class MyModel(LightningModule):

    def __init__(self, hparams):
        super(MyModel, self).__init__()
        self.hparams = hparams
        self.resnet = ResNetEncoder(self.hparams)
        self.loss_meter_training = averageMeter()
        self.loss_meter_validation = averageMeter()
        self.overall_correct = 0.0

    def training_step(self, batch, batch_i):
        ...
        self.loss_meter_training.update(float(total_loss))
        return {'loss': total_loss}

    def validation_step(self, batch, batch_nb):
        ...
        if something:
            self.overall_correct += 1.0  # count correct predictions
        return {'val_loss': total_loss}

    def validation_end(self, outputs):
        self.logger.experiment.add_scalar('epoch losses/training/total', self.loss_meter_training.avg, self.epoch_nb)
        self.logger.experiment.add_scalar('metrics/validation performance', self.overall_correct / 20000, self.epoch_nb)
        self.loss_meter_validation.reset()
        self.overall_correct = 0.0

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.hparams.lr)
        return optimizer

    @pl.data_loader
    def tng_dataloader(self):
        t_loader = PairLoader('dummy', self.hparams, split='training')
        dist_sampler = torch.utils.data.distributed.DistributedSampler(t_loader)
        trainloader = data.DataLoader(t_loader, batch_size=self.hparams.batch_size, sampler=dist_sampler, num_workers=12)
        return trainloader

    @pl.data_loader
    def val_dataloader(self):
        v_loader = PairLoader('dummy', self.hparams, split='validation')
        dist_sampler = torch.utils.data.distributed.DistributedSampler(v_loader)
        valloader = data.DataLoader(v_loader, batch_size=self.hparams.batch_size, sampler=dist_sampler, num_workers=12)
        return valloader
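
To make the question concrete: is the expected fix something like the sketch below, where each process's counter is summed across the DDP group before logging, and only rank 0 writes to TensorBoard? The torch.distributed.all_reduce call is just my guess at one possible way to do the reduction, not something I've confirmed is the recommended Lightning pattern; overall_correct, epoch_nb and the 20000 divisor are from the code above.

import torch
import torch.distributed as dist

# inside MyModel
def validation_end(self, outputs):
    # ... existing loss logging as above ...

    # Sum the per-process counter over all DDP processes.
    correct = torch.tensor(self.overall_correct, device=next(self.parameters()).device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    total_correct = correct.item()

    # Write the scalar from rank 0 only, so TensorBoard doesn't get three copies.
    if not (dist.is_available() and dist.is_initialized()) or dist.get_rank() == 0:
        self.logger.experiment.add_scalar(
            'metrics/validation performance', total_correct / 20000, self.epoch_nb)

    self.overall_correct = 0.0
    return {}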
