Description
I'd like to understand how to use ddp properly with multiple GPUs on a single machine, as I'm unsure how to bring the results from each process back together with this method.
I'm using TensorBoard for logging
The problem seems to be that my code (below) runs on each of the three GPUs (with a third of the data each), but variables like overall_correct exist separately in each of the three processes, so only that process's third of the data ends up in what gets logged. For example, my overall performance on a single GPU is 82%, but with this setup on 3 GPUs it comes out at roughly a third of that. This is probably a simple thing, but can someone explain how I should combine the required validation/training statistics from the sub-processes when using pytorch lightning?
My process is roughly:
model = MyModel(hparams)
tt_logger = TestTubeLogger(save_dir="path", name=expname)
trainer = Trainer(logger=tt_logger, gpus=3, distributed_backend='ddp')
trainer.fit(model)
class MyModel(LightningModule):
    def __init__(self, hparams):
        super(MyModel, self).__init__()
        self.hparams = hparams
        self.resnet = ResNetEncoder(self.hparams)
        self.loss_meter_training = averageMeter()
        self.loss_meter_validation = averageMeter()  # validation loss meter (reset each epoch in validation_end)
        self.overall_correct = 0.0

    def training_step(self, batch, batch_i):
        ...
        self.loss_meter_training.update(float(total_loss))
        return {'loss': total_loss}

    def validation_step(self, batch, batch_nb):
        ...
        if something:
            self.overall_correct += 1.0  # count correct predictions
        return {'val_loss': total_loss}

    def validation_end(self, outputs):
        self.logger.experiment.add_scalar('epoch losses/training/total',
                                          self.loss_meter_training.avg, self.epoch_nb)
        self.logger.experiment.add_scalar('metrics/validation performance',
                                          self.overall_correct / 20000, self.epoch_nb)
        self.loss_meter_validation.reset()
        self.overall_correct = 0.0

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.hparams.lr)

    @pl.data_loader
    def tng_dataloader(self):
        t_loader = PairLoader('dummy', self.hparams, split='training')
        dist_sampler = torch.utils.data.distributed.DistributedSampler(t_loader)
        return data.DataLoader(t_loader, batch_size=self.hparams.batch_size,
                               sampler=dist_sampler, num_workers=12)

    @pl.data_loader
    def val_dataloader(self):
        v_loader = PairLoader('dummy', self.hparams, split='validation')
        dist_sampler = torch.utils.data.distributed.DistributedSampler(v_loader)
        return data.DataLoader(v_loader, batch_size=self.hparams.batch_size,
                               sampler=dist_sampler, num_workers=12)
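Is something like the following the right way to combine the counters? This is only a rough sketch of what I have in mind: it assumes the default process group is already initialized under ddp, and that 20000 is the size of the full validation set rather than the per-GPU shard.

import torch
import torch.distributed as dist

def validation_end(self, outputs):
    # Sum the per-process counter over all DDP workers before logging.
    device = next(self.parameters()).device
    correct = torch.tensor(self.overall_correct, device=device)
    dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    self.logger.experiment.add_scalar('metrics/validation performance',
                                      correct.item() / 20000, self.epoch_nb)
    self.overall_correct = 0.0
    return {}

If Lightning already has a recommended hook for reducing metrics across processes, I'd rather use that than call torch.distributed directly.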