Description
Describe the bug
When using the DeepSpeed backend, training runs fine but gets stuck in accelerator.save_state(save_path). With the MULTI_GPU backend, the whole process works without issues.
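For context, the checkpointing path in the training loop looks roughly like this (a paraphrased sketch of the relevant branch in train_text_to_image_lora.py; variable names follow the script but may not match it exactly):
# Sketch of the checkpointing branch inside the training loop (paraphrased,
# not copied verbatim from train_text_to_image_lora.py).
if global_step % args.checkpointing_steps == 0:
    if accelerator.is_main_process:
        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
        # This is the call that hangs under the DeepSpeed backend.
        accelerator.save_state(save_path)
        logger.info(f"Saved state to {save_path}")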
The training command is:
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="pretrain_models/stable-diffusion-v1-4/" \
--dataset_name="lambdalabs/pokemon-blip-captions" \
--output_dir="sd-pokemon-model-lora" \
--resolution=512 \
--gradient_accumulation_steps=1 \
--checkpointing_steps=100 \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_epochs=50 \
--seed="0" \
--checkpointing_steps 50 \
--train_batch_size=1 \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention
Reproduction
MULTI_GPU backend (xx/accelerate/default_config.yaml):
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 1,2,3
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
Logs:
03/08/2023 21:57:44 - INFO - __main__ - ***** Running training *****
03/08/2023 21:57:44 - INFO - __main__ - Num examples = 833
03/08/2023 21:57:44 - INFO - __main__ - Num Epochs = 2
03/08/2023 21:57:44 - INFO - __main__ - Instantaneous batch size per device = 1
03/08/2023 21:57:44 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 21:57:44 - INFO - __main__ - Gradient Accumulation steps = 1
03/08/2023 21:57:44 - INFO - __main__ - Total optimization steps = 500
Steps: 10%|████████▎ | 50/500 [00:11<01:31, 4.94it/s, lr=0.0001, step_loss=0.00245]03/08/2023 21:57:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-50/pytorch_model.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-50/optimizer.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-50/scheduler.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-50/scaler.pt
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-50/random_states_0.pkl
03/08/2023 21:57:55 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-50
Steps: 20%|████████████████▌ | 100/500 [00:22<01:21, 4.92it/s, lr=0.0001, step_loss=0.0787]03/08/2023 21:58:06 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-100
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-100/pytorch_model.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-100/optimizer.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-100/scheduler.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-100/scaler.pt
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-100/random_states_0.pkl
03/08/2023 21:58:06 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-100
DeepSpeed backend (xx/accelerate/default_config.yaml):
compute_environment: LOCAL_MACHINE
deepspeed_config:
gradient_accumulation_steps: 1
gradient_clipping: 1.0
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: false
zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
Note that I have commented out the call to self._checkpoint_tag_validation(tag) in runtime/engine.py; otherwise it already gets stuck at that point.
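As an aside, instead of editing engine.py, DeepSpeed also exposes a checkpoint tag-validation setting in its JSON config, which should have the same effect (this is based on the DeepSpeed config schema and is not something I verified with this script; it would have to be supplied through a DeepSpeed config file referenced from the accelerate config):
{
  "checkpoint": {
    "tag_validation": "Ignore"
  }
}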
With that call commented out, the logs are:
03/08/2023 22:06:10 - INFO - __main__ - ***** Running training *****
03/08/2023 22:06:10 - INFO - __main__ - Num examples = 833
03/08/2023 22:06:10 - INFO - __main__ - Num Epochs = 2
03/08/2023 22:06:10 - INFO - __main__ - Instantaneous batch size per device = 1
03/08/2023 22:06:10 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 22:06:10 - INFO - __main__ - Gradient Accumulation steps = 1
03/08/2023 22:06:10 - INFO - __main__ - Total optimization steps = 500
Steps: 10%|████████▎ | 50/500 [00:11<01:36, 4.68it/s, lr=0.0001, step_loss=0.00255]03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2023-03-08 22:06:22,219] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is begin to save!
/home/deepwisdom/anaconda3/envs/wjl/lib/python3.10/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2023-03-08 22:06:22,222] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt
[2023-03-08 22:06:22,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt...
[2023-03-08 22:06:22,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt.
...
It then gets stuck inside save_checkpoint in deepspeed/runtime/engine.py, at these lines:
# https://github.com/microsoft/DeepSpeed/blob/v0.8.1/deepspeed/runtime/engine.py#LL3123C12-L3123C12
if self.save_zero_checkpoint:
self._create_zero_checkpoint_files(save_dir, tag)
self._save_zero_checkpoint(save_dir, tag)
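For what it's worth, my guess (purely an assumption, not a confirmed diagnosis) is that some collective operation inside the zero-checkpoint path is only reached by one rank, so it can never complete. The toy script below is a hypothetical, standalone illustration of that failure mode with plain torch.distributed; it does not use DeepSpeed internals at all:
# toy_hang.py -- hypothetical illustration of a collective reached by only one rank.
# Launch with: torchrun --nproc_per_node=2 toy_hang.py
import time
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")
    if dist.get_rank() == 0:
        # Only rank 0 reaches this collective, so it cannot complete:
        # the matching call on the other rank never happens.
        dist.barrier()
        print("rank 0 passed the barrier")  # never reached
    else:
        # The other rank keeps "training" and never enters the branch above.
        time.sleep(600)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()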
Logs
No response
System Info
Ubuntu 20.04
NVIDIA RTX 3090
CUDA Version: 11.7
Torch: 1.13.1
Diffusers: 0.15.0.dev0
deepspeed: 0.8.1
xformers: 0.0.17.dev466
accelerate: 0.16.0