🐛 Bug description
Because of the division here, and in similar functions, the loss is returned at the wrong scale when using gradient_accumulation_steps: the logged training loss ends up roughly gradient_accumulation_steps times smaller than the true per-batch loss, which makes it confusingly low in comparison to the valid loss.
One option is to apply the division only in the backward call, i.e.:
scaler.scale(loss / gradient_accumulation_steps).backward()
or one could multiply the loss by gradient_accumulation_steps again before returning it, so the reported value keeps its original scale.
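
For concreteness, a minimal self-contained sketch of the first option, assuming a standard torch.cuda.amp.GradScaler loop; the dummy model, data, and hyperparameters below are placeholders for illustration, not code from this repository:

```python
import torch
from torch import nn

# Placeholder model/optimizer/data, only for illustration.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
gradient_accumulation_steps = 4

data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
        outputs = model(inputs)
        # Unscaled loss: this is the value to log, directly comparable to the valid loss.
        loss = criterion(outputs, targets)

    # Divide only for the backward pass, so the accumulated gradients are averaged
    # while the logged loss keeps its original scale.
    scaler.scale(loss / gradient_accumulation_steps).backward()

    if (step + 1) % gradient_accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

    # No longer ~1/gradient_accumulation_steps of the true per-batch loss.
    print(f"step {step}: train loss {loss.item():.4f}")
```

The second option is equivalent for logging purposes: keep dividing where the loss is computed, but multiply the returned value by gradient_accumulation_steps before it is recorded, so the gradients are unchanged and only the reported number is rescaled.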