Could validation loss has both step and epoch logs? #18337

Galaxy-Husky · 2023-08-18T03:32:34Z

Galaxy-Husky
Aug 18, 2023

Hello, I'd like to log valiadation loss both every 1000 steps and at the end of each epoch.

So I set val_check_interval: 1000 and check_val_every_n_epoch: 1 in the config yaml for my lightningcli. Besides, I add following code in the validation_step:

self.log("val_loss", loss, on_epoch=True, prog_bar=True, sync_dist=True)

Let's take an example. Since the number of iterations in each epoch is 5182 in my model, the val loss was logged at step 1000, 2000, ..., 5000, 6182 (first epoch), 7182, ..., 10182, 10364 (second epoch), 11364, ... in the same log of tensorboard.

questions:

Are the val_check_interval, check_val_every_n_epoch and on_epoch=True conflicting choices? If so, what is the priority order?
Could validation loss has both step and epoch logs like training?

changspencer · 2023-08-19T14:10:30Z

changspencer
Aug 19, 2023

Traditionally, validating models occurs after each training epoch. It's unclear to me why you would wish to do this before the end of a training epoch unless your training data is extremely large or you're looking at online learning (ie. active, reinforcement, etc).

Instead of trying to set a flag/parameter for validation to be computed and logged after 1000 steps, why not try making your training epoch fit a given number of steps instead? For instance, set up the dataloader to only have a configurable number of steps that would determine the number of batches allowed per training epoch. I'm pretty sure dataset/dataloader objects allow you to specify a percentage of the total batches to use.

2 replies

Galaxy-Husky Aug 21, 2023
Author

Thank you for your reply. Yes, my training dataset is large and it will overfit before the third epoch end.

I'm sorry I didn't fully understand your advice. Could you give me some sample code or link?

changspencer Aug 21, 2023

It seems you may have at least one option using Pytorch Lightning's Trainer class. I think limit_train_batches is what you're looking for. I'd just check that you set your training dataloader to shuffle and to do sanity checks to ensure that you get a different part of the training dataset each time (based on an old discussion in 2020).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Could validation loss has both step and epoch logs? #18337

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Could validation loss has both step and epoch logs? #18337

Uh oh!

Galaxy-Husky Aug 18, 2023

Replies: 1 comment · 2 replies

Uh oh!

changspencer Aug 19, 2023

Uh oh!

Galaxy-Husky Aug 21, 2023 Author

Uh oh!

Uh oh!

changspencer Aug 21, 2023

Galaxy-Husky
Aug 18, 2023

Replies: 1 comment 2 replies

changspencer
Aug 19, 2023

Galaxy-Husky Aug 21, 2023
Author