
Bringing together results on ddp on a single machine #702

@brucemuller

Description

I'd like to understand how to use DDP properly with multiple GPUs on a single machine, as I'm unsure how to bring results together when using this method.

I'm using TensorBoard for logging.

The problem seems to be that my code (below) runs on each of the three GPUs (with a third of the data each), but variables like "overall_correct" exist separately in each of the three processes, so only a third of the data gets logged. For example, my overall performance on a single GPU is 82%, but with this setup on 3 GPUs it is roughly a third of that. I know this is kind of a silly thing, but can someone explain how I should bring together the required validation/training statistics from the sub-processes using PyTorch Lightning?

My process is roughly:

import torch
from torch import optim
from torch.utils import data

import pytorch_lightning as pl
from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.logging import TestTubeLogger  # location varies by PL version; newer releases use pytorch_lightning.loggers

# ResNetEncoder, averageMeter and PairLoader are defined elsewhere in my project

model = MyModel(hparams)
tt_logger = TestTubeLogger(save_dir="path", name=expname)
trainer = Trainer(logger=tt_logger, gpus=3, distributed_backend='ddp')
trainer.fit(model)

class MyModel(LightningModule):

    def __init__(self, hparams):
        super(MyModel, self).__init__()
        self.hparams = hparams
        self.resnet = ResNetEncoder(self.hparams)
        self.loss_meter_training = averageMeter()
        self.loss_meter_validation = averageMeter()
        self.overall_correct = 0.0

    def training_step(self, batch, batch_i):
        ...
        self.loss_meter_training.update(float(total_loss))
        return {'loss': total_loss}

    def validation_step(self, batch, batch_nb):
        ...
        if something:
            self.overall_correct += 1.0  # count correct predictions
        return {'val_loss': total_loss}

    def validation_end(self, outputs):
        self.logger.experiment.add_scalar('epoch losses/training/total', self.loss_meter_training.avg, self.epoch_nb)
        self.logger.experiment.add_scalar('metrics/validation performance', self.overall_correct / 20000, self.epoch_nb)
        self.loss_meter_validation.reset()
        self.overall_correct = 0.0

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.hparams.lr)
        return optimizer

    @pl.data_loader
    def tng_dataloader(self):
        t_loader = PairLoader('dummy', self.hparams, split='training')
        dist_sampler = torch.utils.data.distributed.DistributedSampler(t_loader)
        trainloader = data.DataLoader(t_loader, batch_size=self.hparams.batch_size, sampler=dist_sampler, num_workers=12)
        return trainloader

    @pl.data_loader
    def val_dataloader(self):
        v_loader = PairLoader('dummy', self.hparams, split='validation')
        dist_sampler = torch.utils.data.distributed.DistributedSampler(v_loader)
        valloader = data.DataLoader(v_loader, batch_size=self.hparams.batch_size, sampler=dist_sampler, num_workers=12)
        return valloader
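
To make the question concrete: is the expected fix something like the sketch below, where each process's counter is summed across the DDP group before logging, and only rank 0 writes to TensorBoard? The torch.distributed.all_reduce call is just my guess at one possible way to do the reduction, not something I've confirmed is the recommended Lightning pattern; overall_correct, epoch_nb and the 20000 divisor are from the code above.

import torch
import torch.distributed as dist

# inside MyModel
def validation_end(self, outputs):
    # ... existing loss logging as above ...

    # Sum the per-process counter over all DDP processes.
    correct = torch.tensor(self.overall_correct, device=next(self.parameters()).device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    total_correct = correct.item()

    # Write the scalar from rank 0 only, so TensorBoard doesn't get three copies.
    if not (dist.is_available() and dist.is_initialized()) or dist.get_rank() == 0:
        self.logger.experiment.add_scalar(
            'metrics/validation performance', total_correct / 20000, self.epoch_nb)

    self.overall_correct = 0.0
    return {}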
