refactor the total norm computation in grad clipping in APS #3243

Open
wants to merge 1 commit into main
Conversation

jialun-zhang

Summary: Refactored the previous code for applying gradient clipping across DDP and FSDP parameters. Added a new function _compute_total_norm() that takes the FSDP and DDP params provided in the GradientClippingOptimizer class and computes the total gradient norm of the given parameters.

Differential Revision: D79128843
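
For readers unfamiliar with the pattern this PR describes, here is a minimal sketch of what such a helper could look like. This is an illustration only, not the torchrec implementation: the function name `_compute_total_norm_sketch`, its signature, and the all-reduce over the sharded contribution are assumptions.

```python
from typing import Iterable, Optional

import torch
import torch.distributed as dist


def _compute_total_norm_sketch(
    replicated_params: Iterable[torch.nn.Parameter],
    sharded_params: Iterable[torch.nn.Parameter],
    norm_type: float = 2.0,
    process_group: Optional[dist.ProcessGroup] = None,
) -> torch.Tensor:
    """Hypothetical helper: combine the gradient norm of replicated (DDP)
    params with that of sharded (FSDP) params into one total norm."""

    def _local_norm(params: Iterable[torch.nn.Parameter]) -> torch.Tensor:
        grads = [p.grad for p in params if p.grad is not None]
        if not grads:
            return torch.tensor(0.0)
        # Norm of each gradient, then a norm over those per-tensor norms.
        return torch.linalg.vector_norm(
            torch.stack([torch.linalg.vector_norm(g, norm_type) for g in grads]),
            norm_type,
        )

    # Replicated (DDP) gradients are identical on every rank, so their norm
    # needs no communication.
    replicated_norm = _local_norm(replicated_params)

    # Each rank only holds a shard of the FSDP gradients, so the per-rank
    # contributions (norm ** norm_type) are summed across ranks.
    sharded_contrib = _local_norm(sharded_params) ** norm_type
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(sharded_contrib, op=dist.ReduceOp.SUM, group=process_group)
    sharded_norm = sharded_contrib ** (1.0 / norm_type)

    # Combine both contributions: total = (ddp_norm^p + fsdp_norm^p)^(1/p).
    return (replicated_norm ** norm_type + sharded_norm ** norm_type) ** (1.0 / norm_type)
```

The key point of such a refactor is that the replicated and sharded norm computations differ only in whether a cross-rank reduction is required, so isolating them in one helper keeps the clipping step itself straightforward.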

@meta-cla meta-cla bot added the CLA Signed label Jul 29, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79128843

@jialun-zhang jialun-zhang force-pushed the export-D79128843 branch 2 times, most recently from 29f3764 to 5199ed0 Compare July 31, 2025 18:37
@jialun-zhang jialun-zhang force-pushed the export-D79128843 branch 2 times, most recently from 4ea53e1 to 43c4f99 Compare August 4, 2025 19:57
@jialun-zhang jialun-zhang force-pushed the export-D79128843 branch 2 times, most recently from 969fda9 to d7adb15 Compare August 5, 2025 05:49

Labels: CLA Signed, fb-exported
2 participants