Update distributed example tests in run_python_examples.sh #1250

Merged (13 commits) on May 3, 2024

@sirutBuasai sirutBuasai commented Apr 29, 2024

Update distributed example tests in run_python_examples.sh to use run_examples.sh.

The current script fails the distributed tests when they read torchrun environment variables such as WORLD_SIZE, unless the tests are launched from run_examples.sh.
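
To illustrate the failure mode described above, here is a minimal sketch that is not taken from the PR. It assumes only that torchrun exports WORLD_SIZE, RANK and LOCAL_RANK to each worker process, while a plain `python` launch leaves them unset. The helper name `missing_torchrun_vars` is hypothetical.

```shell
# Hypothetical helper, not from the PR: print the torchrun-provided
# variables that are unset or empty in the current environment.
missing_torchrun_vars() {
  missing=""
  for name in WORLD_SIZE RANK LOCAL_RANK; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      # Append the name, separated by a space after the first entry.
      missing="${missing:+$missing }$name"
    fi
  done
  printf '%s\n' "$missing"
}

# Usage: warn before running an example that expects a torchrun launch.
if [ -n "$(missing_torchrun_vars)" ]; then
  echo "not launched by torchrun; missing: $(missing_torchrun_vars)"
fi
```

A check like this would only make the error message clearer; the examples would still need to be launched through torchrun for the variables to be populated.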

@sirutBuasai sirutBuasai requested a review from msaroufim as a code owner April 29, 2024 21:14
@facebook-github-bot
Contributor

Hi @sirutBuasai!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

netlify bot commented Apr 29, 2024

Deploy Preview for pytorch-examples-preview canceled.

Name Link
🔨 Latest commit 6c8bb39
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-examples-preview/deploys/6632ee6e0bc58200081a175b

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@msaroufim
Member

@sirutBuasai
Contributor Author

Why not add these lines to the distributed test script instead?
Any reason it is only running fsdp_tp_example?

@msaroufim
Member

Please go for it; it'd be an extremely useful contribution.

@sirutBuasai
Contributor Author

sirutBuasai commented May 1, 2024

I've added the script to the distributed test script.

I considered just calling bash run_distributed_examples.sh from the distributed() function in run_python_examples.sh, so that each script entrypoint is declared only once and redundancy is avoided. However, that could complicate error reporting, since the error reports would be printed by both run_distributed_examples.sh and run_python_examples.sh.

Let me know whether you think we should consolidate the distributed tests into run_distributed_examples.sh.
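
One way to picture the trade-off described above is the following sketch, which is not the code from this PR. It assumes a parent script that delegates to a child suite and records only the child's exit status, so the detailed report is printed once, by the child. The function name `run_child_suite` and the message format are hypothetical; `ERRORS` is used here simply as an accumulator variable.

```shell
# Hypothetical sketch: the parent keeps a one-line note per failed suite
# and leaves the detailed error report to the child script itself.
ERRORS=""

run_child_suite() {
  # "$@" is the command that runs the child suite,
  # for example: bash run_distributed_examples.sh
  if ! "$@"; then
    # Append a short note, separated by "; " after the first entry.
    ERRORS="${ERRORS:+$ERRORS; }suite failed: $*"
  fi
}

# Usage with stand-in commands (true succeeds, false fails):
run_child_suite true
run_child_suite false
if [ -n "$ERRORS" ]; then
  echo "$ERRORS"
fi
```

With this shape the parent never reprints the child's output, which avoids the duplicated reports at the cost of a less detailed summary in the parent.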

@msaroufim
Member

Indeed, I'm suggesting that all distributed tests should live only in run_distributed_examples.sh.

@sirutBuasai
Contributor Author

sirutBuasai commented May 1, 2024

DRY'd the code a little, but it might be heavier than necessary. Let me know if there are any other changes I should make.

@sirutBuasai sirutBuasai requested a review from msaroufim May 2, 2024 20:33
Member

@msaroufim msaroufim left a comment

Cool thank you!

@msaroufim msaroufim merged commit c49554c into pytorch:main May 3, 2024
@sirutBuasai sirutBuasai deleted the release/2.3 branch September 17, 2024 23:31
@sirutBuasai sirutBuasai restored the release/2.3 branch September 17, 2024 23:31
YinZhengxun pushed a commit to YinZhengxun/mt-exercise-02 that referenced this pull request Mar 30, 2025
…#1250)

* Fix distributed test

* fix parallel scripts

* install dill

* remove dill

* run 2 gpu

* remove gpucount, use default

* Add examples to distributed examples

* refactor distributed test

* fix ERRORS overwriting

* run with base dir

* remove distributed from run_python_examples.sh

* move basedir to source

* separate init

---------

Co-authored-by: Sirut Buasai <[email protected]>