
Conversation

mshukor (Contributor) commented Dec 8, 2024

What this does

Based on this PR. It includes:

  • The ability to keep training without accelerate (see the single-process sketch after the note below)
  • An update to the recent main branch
  • Some minor fixes

Note: we still need to merge with the vla branch before merging this PR.
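
A minimal sketch of the non-accelerate path, assuming the same train.py entry point and Hydra overrides as the tested command below (the exact flag set is an assumption, not confirmed by this PR):

# Hypothetical single-process run: no accelerate launcher, same train.py script.
# training.batch_size=8 assumes one process carries the full global batch.
python lerobot/scripts/train.py \
 dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
 policy=act \
 env=aloha env.task=AlohaTransferCube-v0 \
 training.batch_size=8 \
 wandb.enable=true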

How it was tested

ENV=aloha
ENV_TASK=AlohaTransferCube-v0
dataset_repo_id=lerobot/aloha_sim_transfer_cube_human
policy=act
LR=1e-5
LR_SCHEDULER=
USE_AMP=false
ASYNC_ENV=false

GPUS=2
EVAL_FREQ=10000
OFFLINE_STEPS=100000
TRAIN_BATCH_SIZE=4 # per-GPU batch size = global batch size / number of GPUs
EVAL_BATCH_SIZE=50

TASK_NAME=lerobot_${ENV}_transfer_cube_${policy}_2gpus

python -m accelerate.commands.launch --num_processes=$GPUS --mixed_precision=fp16 lerobot/scripts/train.py \
 hydra.job.name=base_distributed_aloha_transfer_cube \
 hydra.run.dir=/data/mshukor/logs/lerobot/${TASK_NAME} \
 dataset_repo_id=$dataset_repo_id \
 policy=$policy \
 env=$ENV env.task=$ENV_TASK \
 training.offline_steps=$OFFLINE_STEPS training.batch_size=$TRAIN_BATCH_SIZE \
 training.eval_freq=$EVAL_FREQ eval.n_episodes=50 eval.use_async_envs=$ASYNC_ENV eval.batch_size=$EVAL_BATCH_SIZE \
 training.lr_scheduler=$LR_SCHEDULER training.lr=$LR \
 wandb.enable=true 
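
With these values the effective global batch size is TRAIN_BATCH_SIZE × GPUS = 4 × 2 = 8 (each of the 2 processes gets a per-GPU batch of 4), which is what training.batch_size=8 in the single-process sketch above assumes.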
