Multi-gpus training with accelerate #1246
base: main
Conversation
Please merge? |
This PR is not working: it crashes at `return self._untyped_storage.data_ptr()`, and `make_policy` will fail. |
Could you give me some details about this error? |
Hi, torch 2.6+ introduces the DTensor feature. Accelerate won't be able to load the model properly, or prepare it for distributed training, when DTensor is not disabled. From what I can see, train_accelerate handles this error nowhere. On my side, torch 2.7.1 with the latest transformers and accelerate hits the error when training on multiple GPUs. And once you resolve the state-dict loading error, there is still another one: `[rank0]: AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!` |
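(For reference, a minimal sketch of one possible workaround, assuming torch >= 2.6 where `DTensor` is exposed under `torch.distributed.tensor`; the helper name is hypothetical: gather DTensor shards back into plain tensors before loading the state dict.)

```python
from torch.distributed.tensor import DTensor  # public API in recent torch releases

def materialize_state_dict(state_dict: dict) -> dict:
    """Hypothetical helper: replace DTensor entries with fully gathered plain tensors."""
    out = {}
    for name, value in state_dict.items():
        if isinstance(value, DTensor):
            # full_tensor() all-gathers the shards so every rank holds the complete tensor.
            out[name] = value.full_tensor()
        else:
            out[name] = value
    return out
```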
Can you please tell me what modifications need to be made to these saved weights for inference? I find that the multi-GPU weights perform very poorly compared to single-card training. |
I used |
My checkpoint file is the same as the single-card one, but at inference time (pi0) it has almost no effect. Is there a problem with the checkpoint-saving logic, or do I need to make other changes? Please advise! |
It is an interesting question. At line 288 the script does:

```python
if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
    logging.info(f"Checkpoint policy after step {step}")
    checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
    # Wait for all processes before saving
    accelerator.wait_for_everyone()
    # Unwrap model for saving
    unwrapped_policy = accelerator.unwrap_model(policy)
    save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
    update_last_checkpoint(checkpoint_dir)
```

You can modify it to:

```python
if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
    logging.info(f"Checkpoint policy after step {step}")
    checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
    # Wait for all processes before saving
    accelerator.wait_for_everyone()
    accelerator.save_model(model, save_directory)
```

Then try again? |
I am training with multiple GPUs now. |
I'm curious why you don't run into a deadlock here: `accelerator.wait_for_everyone()` is a collective barrier, but the snippet calls it inside the `accelerator.is_main_process` branch, so only rank 0 ever reaches it while the other ranks skip past. |
I've actually encountered this problem and look forward to the author's answer |
You mentioned above that you encountered the deadlock problem. |
You are right. I think the correct code is:

```python
if cfg.save_checkpoint and is_saving_step:
    # Barrier reached by every rank, not just the main process
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        logging.info(f"Checkpoint policy after step {step}")
        checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
        logging.info(colored("Saving model to", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")
        accelerator.save_model(model, save_directory)
    # Keep all ranks in sync after the save as well
    accelerator.wait_for_everyone()
```

I will test it and commit it again. Thanks! |
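(Side note: `accelerator.save_model()` writes safetensors files by default, so loading for single-GPU inference could look like the sketch below. The checkpoint path is hypothetical, and this assumes the weights fit in a single `model.safetensors` shard.)

```python
from safetensors.torch import load_file

# Hypothetical path; accelerator.save_model() names the file model.safetensors by default.
state_dict = load_file("outputs/train/checkpoints/model.safetensors")
policy.load_state_dict(state_dict)  # `policy` is a freshly constructed, un-prepared policy module
```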
When I saved the model using the approach above, it failed with `[rank0]: Traceback (most recent call last): ...` Do you know how to solve it? |
I have seen a lot of issues and PRs (#1176, #558, #317, #956, #876, #778) about data-parallel training with multiple GPUs. This shows that multi-GPU training is something this community, as well as I, need. So I wrote `lerobot/scripts/train_accelerate.py` to address it.

What this does
This pull request introduces a new training script leveraging the `accelerate` library for distributed and mixed-precision training. It also adds support for gradient accumulation and updates dependencies accordingly. Below are the most significant changes, grouped by theme:

New Training Script

- Added `lerobot/scripts/train_accelerate.py`, which integrates the `accelerate` library for distributed training, mixed-precision support, and gradient accumulation. The script includes features such as policy updates, checkpointing, evaluation, and integration with Weights & Biases for logging; a minimal sketch of the underlying pattern follows below.
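A minimal sketch of the `accelerate` pattern the script builds on (illustrative only, with toy stand-ins for the policy and dataset; this is not the PR's exact code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy stand-ins; the real script uses the LeRobot policy and dataset.
policy = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)
dataloader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 2)), batch_size=8)

accelerator = Accelerator(gradient_accumulation_steps=4)  # mixed_precision="bf16" also works
policy, optimizer, dataloader = accelerator.prepare(policy, optimizer, dataloader)

for x, y in dataloader:
    # accumulate() skips gradient synchronization on non-boundary steps.
    with accelerator.accumulate(policy):
        loss = nn.functional.mse_loss(policy(x), y)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```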
Configuration Updates

- Added `gradient_accumulation_steps` to the `PreTrainedConfig` class to support gradient accumulation during training.

Dependency Updates
- Added `accelerate>=1.7.0` to the `pyproject.toml` file, making the `accelerate` library a dependency for distributed and mixed-precision training.

How to checkout & try? (for the reviewer)
Examples:
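For instance (a hypothetical invocation; the exact flags depend on the script's config parser and are not verified here):

```bash
accelerate launch --num_processes=2 lerobot/scripts/train_accelerate.py \
    --policy.type=pi0 \
    --dataset.repo_id=lerobot/pusht
```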