fix bug in training model by amp #4874
Conversation
💊 CI failures summary and remediations as of commit 25842f6 (more details on the Dr. CI page):
1 failure not recognized by patterns:
@xiaohu2015 Thanks for the PR. If there is no related issue describing the bug, please add the relevant information to the PR description so that it's clear what the previous issue was. @prabhat00155 Could you please have a look, as you've recently worked on this in #4547?
@prabhat00155 Can you review the PR? I found that AMP was not working, so I updated the training code to fix the bugs.
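For context, a minimal sketch of a standard PyTorch AMP training step is shown below. The tiny model, optimizer, and random batch are placeholders rather than the torchvision reference script, but the `GradScaler`/`autocast` pattern is the documented one:

```python
import torch
import torch.nn as nn

# Minimal sketch of a PyTorch AMP training step (assumes a CUDA device).
# The tiny model and random batch are placeholders, not the torchvision reference code.
model = nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(8, 10, device="cuda")
targets = torch.randint(0, 2, (8,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():   # forward pass and loss run in mixed precision
    loss = criterion(model(images), targets)
scaler.scale(loss).backward()     # backward on the scaled loss
scaler.step(optimizer)            # unscales gradients internally, then steps
scaler.update()                   # adjusts the scale factor for the next iteration
```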
Thanks @xiaohu2015! Could you please upload the logs before and after the changes, since this is not covered by our unit tests?
Marking this as "Request changes" to avoid accidental merges before we gather enough information about the nature of the bug and the training logs, and do a proper investigation on our side.
@xiaohu2015 You can unblock this by adding context info on this PR as discussed above. Thanks!
After changing the code, I trained ResNet50; the result is 75.7 (with AMP) and 75.5 (without AMP).
@xiaohu2015 Your PR contains some good corrections but it's still very thin on information on the bug itself.
After changing the code, I trained ResNet50; the result is 75.7 (with AMP) and 75.5 (without AMP).
This is not necessarily an indication of an improvement. Doing multiple runs with different seeds can lead to slightly different results every time due to the random initialization and random transforms applied to the data.
optimizer.step()
if args.clip_grad_norm is not None:
    nn.utils.clip_grad_norm_(utils.get_optimizer_params(optimizer), args.clip_grad_norm)
Thanks for bringing this up - I was just referring to ClassyVision's implementation before. Given that the official documentation is using model.parameters(), I think we can switch to it.
Classy might have had it like this to support learnable params on the loss (we don't have this in Vision). Another reason might be that it was convenient in terms of code structure.
@mannatsingh Do you have any idea why it was used like that in Classy?
Yes, so the only important reason that I can think of is that Apex's AMP works on its own (different) parameters which are disconnected from the model in certain settings (like O2). If you used the other approach, you would not actually be clipping the gradients. I'm not sure if torchvision even supports Apex AMP though!
Other situations are manageable, for instance, if you optimize the model and the loss, you just need to make sure to use both everywhere (it's slightly risky but not a blocker).
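For illustration only (torchvision may not support Apex AMP, as noted above), the Apex pattern being referred to clips the optimizer's master parameters rather than the model's, because at opt levels like O2 Apex keeps separate FP32 master weights. This sketch simply follows Apex's documented usage with a placeholder model and batch:

```python
import torch
import torch.nn as nn
from apex import amp  # NVIDIA Apex; illustration only, not used by torchvision's script

model = nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# At O2, Apex maintains FP32 master weights that are separate from model.parameters().
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

loss = model(torch.randn(8, 10, device="cuda")).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
# Clip the master parameters the optimizer actually updates, not model.parameters().
nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
optimizer.step()
```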
Thanks a lot @mannatsingh, this was very helpful. I think this means we can go with model.parameters().
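As a side note, in the usual setup where the optimizer is constructed from model.parameters(), clipping the tensors gathered from the optimizer's param_groups and clipping model.parameters() touch exactly the same tensors. The helper below is a hypothetical illustration of that equivalence, not torchvision's actual utils.get_optimizer_params:

```python
import torch
import torch.nn as nn

def params_from_optimizer(optimizer):
    # Hypothetical helper (illustration only): gather every tensor the optimizer updates.
    return [p for group in optimizer.param_groups for p in group["params"]]

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Same underlying tensors either way when the optimizer was built from model.parameters().
assert {id(p) for p in params_from_optimizer(optimizer)} == {id(p) for p in model.parameters()}

model(torch.randn(3, 4)).sum().backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # equivalent to clipping via the helper here
```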
@xiaohu2015 I've marked as resolved all the "FYI" comments above and left only those that need to be addressed to merge the PR.
Effectively the only thing required is to add support for gradient clipping when amp is active. I provided a reference from the documentation on how to do it. It's also worth slightly refactoring the code to simplify it according to the comments.
Please let me know if you plan to continue working on the PR. Thanks!
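For reference, the gradient-clipping-under-AMP pattern from the PyTorch documentation mentioned above looks roughly like the sketch below (placeholder model, optimizer, and batch; assumes a CUDA device): gradients are unscaled before clipping, and the optimizer step still goes through the scaler.

```python
import torch
import torch.nn as nn

# Sketch of gradient clipping under AMP, following the pattern in the PyTorch docs.
# The tiny model, optimizer, and random batch are placeholders, not the reference script.
model = nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()
clip_grad_norm = 1.0

images = torch.randn(8, 10, device="cuda")
targets = torch.randint(0, 2, (8,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = nn.CrossEntropyLoss()(model(images), targets)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)                                    # unscale before clipping
nn.utils.clip_grad_norm_(model.parameters(), clip_grad_norm)  # clip the true gradients
scaler.step(optimizer)                                        # skips the step if grads are inf/NaN
scaler.update()
```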
Yes, the experiment is to check that AMP is working, not to prove that the model trained with AMP is better.
Thanks, I made some modifications with the help of the documentation.
@xiaohu2015 Thanks for the PR, LGTM!
Given that the code was modified heavily and it's not covered with tests, it would be good to do a run on our side to confirm that everything works as expected.
@prabhat00155 Let me know if you have the bandwidth for this.
@sallysyw Concerning the simplification discussed at https://github.com/pytorch/vision/pull/4874/files#r744539820, is this something you would be interested in doing or shall we create an issue about it?
Yeah sure, let me kick off a training run.
This runs fine. Here is the output log:
Summary:
* fix bug in amp
* fix bug in training by amp
* support use gradient clipping when amp is enabled

Reviewed By: datumbox
Differential Revision: D32298968
fbshipit-source-id: 4366674522dc0faf5688207faa7e3cd33be2a6ea
Co-authored-by: Vasilis Vryniotis <[email protected]>
Co-authored-by: Prabhat Roy <[email protected]>
This PR fixes the bug in AMP training (the use of the autocast context). cc @datumbox