save_state gets stuck with the DeepSpeed backend while training train_text_to_image_lora #2606

Description

@better629

Describe the bug

When using the DeepSpeed backend, training runs fine but the process gets stuck in accelerator.save_state(save_path). With the MULTI_GPU backend, the same run completes normally.
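
For context, the checkpointing block in train_text_to_image_lora.py looks roughly like this (a paraphrased sketch from memory, not the exact diffusers code):

import os

def maybe_save_checkpoint(accelerator, args, global_step, logger):
    # Paraphrased sketch of the checkpointing block in train_text_to_image_lora.py.
    if global_step % args.checkpointing_steps == 0:
        # In the script, the save is only triggered from the main process.
        if accelerator.is_main_process:
            save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
            accelerator.save_state(save_path)  # the call that hangs under DeepSpeed
            logger.info(f"Saved state to {save_path}")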

The training is launched with:

accelerate launch train_text_to_image_lora.py \
    --pretrained_model_name_or_path="pretrain_models/stable-diffusion-v1-4/"  \
    --dataset_name="lambdalabs/pokemon-blip-captions"  \
    --output_dir="sd-pokemon-model-lora" \
    --resolution=512 \
    --gradient_accumulation_steps=1 \
    --checkpointing_steps=100 \
    --learning_rate=1e-4 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --max_train_steps=500 \
    --validation_epochs=50 \
    --seed="0" \
    --checkpointing_steps 50 \
    --train_batch_size=1 \
    --use_8bit_adam \
    --enable_xformers_memory_efficient_attention

Reproduction

MULTI_GPU backend config (xx/accelerate/default_config.yaml):

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 1,2,3
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false

Logs:

03/08/2023 21:57:44 - INFO - __main__ - ***** Running training *****
03/08/2023 21:57:44 - INFO - __main__ -   Num examples = 833
03/08/2023 21:57:44 - INFO - __main__ -   Num Epochs = 2
03/08/2023 21:57:44 - INFO - __main__ -   Instantaneous batch size per device = 1
03/08/2023 21:57:44 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 21:57:44 - INFO - __main__ -   Gradient Accumulation steps = 1
03/08/2023 21:57:44 - INFO - __main__ -   Total optimization steps = 500
Steps:  10%|████████▎                                                                          | 50/500 [00:11<01:31,  4.94it/s, lr=0.0001, step_loss=0.00245]03/08/2023 21:57:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-50/pytorch_model.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-50/optimizer.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-50/scheduler.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-50/scaler.pt
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-50/random_states_0.pkl
03/08/2023 21:57:55 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-50
Steps:  20%|████████████████▌                                                                  | 100/500 [00:22<01:21,  4.92it/s, lr=0.0001, step_loss=0.0787]03/08/2023 21:58:06 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-100
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-100/pytorch_model.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-100/optimizer.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-100/scheduler.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-100/scaler.pt
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-100/random_states_0.pkl
03/08/2023 21:58:06 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-100

DeepSpeed backend config (xx/accelerate/default_config.yaml):

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
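
To confirm which backend this config actually activates at runtime, a quick sanity check inside the script can be used (minimal sketch; with the config above, accelerator.distributed_type reports DEEPSPEED):

from accelerate import Accelerator
from accelerate.utils import DistributedType

accelerator = Accelerator()

# With the DeepSpeed config, save_state() goes through DeepSpeed's
# save_checkpoint() ("Saving DeepSpeed Model and Optimizer" in the logs)
# instead of the plain checkpointing path used for MULTI_GPU.
if accelerator.distributed_type == DistributedType.DEEPSPEED:
    print("DeepSpeed backend is active")
else:
    print(f"Distributed type: {accelerator.distributed_type}")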

Note that I had to comment out self._checkpoint_tag_validation(tag) in deepspeed/runtime/engine.py, otherwise it already got stuck at that call. With that line commented out, the logs are:

03/08/2023 22:06:10 - INFO - __main__ - ***** Running training *****
03/08/2023 22:06:10 - INFO - __main__ -   Num examples = 833
03/08/2023 22:06:10 - INFO - __main__ -   Num Epochs = 2
03/08/2023 22:06:10 - INFO - __main__ -   Instantaneous batch size per device = 1
03/08/2023 22:06:10 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 22:06:10 - INFO - __main__ -   Gradient Accumulation steps = 1
03/08/2023 22:06:10 - INFO - __main__ -   Total optimization steps = 500
Steps:  10%|████████▎                                                                          | 50/500 [00:11<01:36,  4.68it/s, lr=0.0001, step_loss=0.00255]03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2023-03-08 22:06:22,219] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is begin to save!
/home/deepwisdom/anaconda3/envs/wjl/lib/python3.10/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-03-08 22:06:22,222] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt
[2023-03-08 22:06:22,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt...
[2023-03-08 22:06:22,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt.
...

It then gets stuck in deepspeed/runtime/engine.py at:

# save_checkpoint
# https://github.com/microsoft/DeepSpeed/blob/v0.8.1/deepspeed/runtime/engine.py#LL3123C12-L3123C12

        if self.save_zero_checkpoint:
            self._create_zero_checkpoint_files(save_dir, tag)
            self._save_zero_checkpoint(save_dir, tag)
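
As far as I can tell, both places involve cross-rank communication: the tag validation does an all_reduce across ranks and _create_zero_checkpoint_files uses a barrier, so either one blocks forever if not every rank reaches it. A minimal standalone illustration of that kind of hang (just the pattern, not the DeepSpeed code), run with e.g. torchrun --nproc_per_node=3:

import torch
import torch.distributed as dist

def save_like_rank0_only(rank):
    # A collective op reached by only one rank blocks forever, which matches
    # the symptom of save_checkpoint() never returning when not every process
    # enters it.
    if rank == 0:
        dist.barrier()          # rank 0 waits here for peers that never arrive
        torch.save({"step": 0}, "checkpoint.pt")

def main():
    dist.init_process_group("gloo")  # env:// rendezvous provided by torchrun
    save_like_rank0_only(dist.get_rank())  # hangs whenever world size > 1
    # Non-hanging pattern: every rank makes the collective call, and only
    # rank 0 writes afterwards:
    #   dist.barrier()
    #   if dist.get_rank() == 0:
    #       torch.save({"step": 0}, "checkpoint.pt")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()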

System Info

Ubuntu 20.04
NVIDIA RTX 3090
CUDA Version: 11.7
Torch: 1.13.1
Diffusers: 0.15.0.dev0
deepspeed: 0.8.1
xformers: 0.0.17.dev466
accelerate: 0.16.0

Labels

bug, good first issue
