MultiGPU finetuning: AttributeError: 'Params4bit' object has no attribute 'absmax' #904

@smartinezai

System Info

module load python/3.11.6--gcc--8.5.0
module load openmpi/4.1.6--gcc--12.2.0
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load cuda/12.1

3x NVIDIA A100 GPUs

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

I am running finetuning.py with FSDP, PEFT (LoRA), and 4-bit quantization, launched with the following SLURM script:


#!/bin/bash

# sbatch settings
#SBATCH --job-name=1n3gpu
#SBATCH --time=03:30:01
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:3
#SBATCH --partition=boost_usr_prod
#SBATCH --cpus-per-task=32
#SBATCH --output=logs/output_%j.log
#SBATCH --error=logs/error_%j.log


# Load required modules for Leonardo HPC
module purge
module load profile/deeplrn 

module load python/3.11.6--gcc--8.5.0
module load openmpi/4.1.6--gcc--12.2.0
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load cuda/12.1

source /leonardo_work/EUHPC_A04_062/vllm_env/bin/activate

###
# check whether InfiniBand (IB) is used by NCCL
export NCCL_NET=IB
export NCCL_DEBUG=INFO

# training setup
GPUS_PER_NODE=3
MASTER_NAME=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=29500
NNODES=$SLURM_NNODES
MACHINE_RANK=$SLURM_PROCID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
###
#srun bash -c '$CMD'
echo
srun python -u -m torch.distributed.run --nproc_per_node 3 --nnodes $SLURM_JOB_NUM_NODES --rdzv_backend c10d --rdzv_endpoint $MASTER_NAME:$MASTER_PORT \
 src/llama_cookbook/finetuning.py \
 --enable_fsdp \
 --use_peft \
 --peft_method lora \
 --quantization 4bit \
 --model_name /leonardo_scratch/fast/EUHPC_A04_062/Llama-3.3-70B-Instruct \
 --dataset monke_dataset \
 --save_model \
 --num_epochs 2 \
 --context_length 4096 \
 --output_dir /leonardo_work/EUHPC_A04_062/AI_Model/output_models/peft/Llama-3.3-70B-Instruct \
 --dist_checkpoint_root_folder model_checkpoints \
 --dist_checkpoint_folder fine-tuned \
 --use_wandb
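
For reference, my understanding is that the --quantization 4bit flag loads the base model through bitsandbytes 4-bit quantization, roughly along the lines of the sketch below. This is only my approximation of what finetuning.py does; the quant_type and compute_dtype values here are assumptions, not copied from the cookbook code.

# Rough sketch of what --quantization 4bit corresponds to (my assumption,
# not the exact code path in finetuning.py).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # base weights stored as bnb Params4bit
    bnb_4bit_quant_type="nf4",              # assumed quantization type
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "/leonardo_scratch/fast/EUHPC_A04_062/Llama-3.3-70B-Instruct",
    quantization_config=bnb_config,
)

The relevant point is that every quantized linear layer then holds its weight as a bitsandbytes Params4bit tensor, which is the object named in the traceback below.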

Error logs


*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py:18: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
  from torch.distributed._shard.checkpoint import (
/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py:18: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
  from torch.distributed._shard.checkpoint import (
/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py:18: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
  from torch.distributed._shard.checkpoint import (
`low_cpu_mem_usage` was None, now default to True since model is quantized.
`low_cpu_mem_usage` was None, now default to True since model is quantized.

Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.8
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.

Loading checkpoint shards:   3%|▎         | 1/30 [00:06<02:55,  6.05s/it]
Loading checkpoint shards:   3%|▎         | 1/30 [00:06<02:59,  6.20s/it]`low_cpu_mem_usage` was None, now default to True since model is quantized.

Loading checkpoint shards: 100%|██████████| 30/30 [02:53<00:00,  5.77s/it]
/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/cuda/memory.py:391: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

Training Epoch: 1:   0%|          | 0/8 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/cuda/memory.py:391: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(

Training Epoch: 1/2, step 7/8 completed (loss: 0.9345082640647888): 100%|██████████| 8/8 [00:48<00:00,  6.03s/it]

Training Epoch: 1/2, step 7/8 completed (loss: 1.1425457000732422): 100%|██████████| 8/8 [00:57<00:00,  7.17s/it]


evaluating Epoch:   0%|          | 0/33 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
evaluating Epoch: 100%|██████████| 33/33 [00:26<00:00,  1.25it/s]

evaluating Epoch: 100%|██████████| 33/33 [00:26<00:00,  1.25it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 429, in <module>
[rank1]:     fire.Fire(main)
[rank1]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 407, in main
[rank1]:     results = train(
[rank1]:               ^^^^^^
[rank1]:   File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/utils/train_utils.py", line 242, in train
[rank1]:     save_peft_checkpoint(model, train_config.output_dir)
[rank1]:   File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py", line 284, in save_peft_checkpoint
[rank1]:     state_dict = get_model_state_dict(model, options=options)
[rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 999, in get_model_state_dict
[rank1]:     model_state_dict = _get_model_state_dict(model, info)
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 475, in _get_model_state_dict
[rank1]:     fqns = _get_fqns(model, key)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 224, in _get_fqns
[rank1]:     curr_obj = getattr(curr_obj, curr_obj_name)
[rank1]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AttributeError: 'Params4bit' object has no attribute 'absmax'
[rank2]: Traceback (most recent call last):
[rank2]:   File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 429, in <module>
[rank2]:     fire.Fire(main)
[rank2]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank2]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank2]:     component, remaining_args = _CallAndUpdateTrace(
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank2]:     component = fn(*varargs, **kwargs)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 407, in main
[rank2]:     results = train(
[rank2]:               ^^^^^^
[rank2]:   File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/utils/train_utils.py", line 242, in train
[rank2]:     save_peft_checkpoint(model, train_config.output_dir)
[rank2]:   File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py", line 284, in save_peft_checkpoint
[rank2]:     state_dict = get_model_state_dict(model, options=options)
[rank2]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 999, in get_model_state_dict
[rank2]:     model_state_dict = _get_model_state_dict(model, info)
[rank2]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]:     return func(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 475, in _get_model_state_dict
[rank2]:     fqns = _get_fqns(model, key)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 224, in _get_fqns
[rank2]:     curr_obj = getattr(curr_obj, curr_obj_name)
[rank2]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: AttributeError: 'Params4bit' object has no attribute 'absmax'
W0324 16:14:25.342000 1346190 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1346238 closing signal SIGTERM
W0324 16:14:25.344000 1346190 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1346239 closing signal SIGTERM
E0324 16:14:26.659000 1346190 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 1346240) of binary: /leonardo_work/EUHPC_A04_062/vllm_env/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/run.py", line 922, in <module>
    main()
  File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
src/llama_cookbook/finetuning.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-24_16:14:25
  host      : lrdn0021-net1-3.leonardo.local
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1346240)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: lrdn0021: task 0: Exited with exit code 1
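
As far as I can tell, the failing lookup comes from how bitsandbytes serializes 4-bit layers: the state dict of a Linear4bit module contains extra dotted keys such as weight.absmax for the quantization state, but absmax is not an attribute of the Params4bit tensor itself, so the getattr-based FQN resolution in torch.distributed.checkpoint.state_dict._get_fqns raises exactly the error above. A small illustration of the mismatch (my own reproduction sketch, not cookbook code; needs a CUDA device and bitsandbytes installed):

# Shows the mismatch between state-dict keys and tensor attributes for a
# bitsandbytes 4-bit layer (reproduction sketch only).
import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(16, 16, bias=False).cuda()  # weights are quantized on .cuda()

# The state dict exposes the quantization state as extra dotted keys ...
print([k for k in layer.state_dict() if "absmax" in k])  # e.g. ['weight.absmax', ...]

# ... but those key components are not attributes of the Params4bit tensor,
# which is what _get_fqns tries to resolve with getattr:
print(hasattr(layer.weight, "absmax"))  # False -> AttributeError as in the log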

Expected behavior

I expected the finetuning run to complete and the PEFT checkpoint to be saved to the output directory without errors.
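
As a possible workaround I am considering saving only the LoRA adapter weights instead of the full model state dict, roughly as sketched below with PEFT's get_peft_model_state_dict helper. I have not verified that this behaves correctly with FSDP-sharded parameters, so please treat it as an untested idea rather than a fix.

# Untested workaround sketch: collect only the LoRA adapter weights so the
# 4-bit base weights (and their quant-state keys) never go through
# torch.distributed.checkpoint's getattr-based FQN resolution.
import torch.distributed as dist
from peft import get_peft_model_state_dict

def save_adapters_only(model, output_dir):
    adapter_state = get_peft_model_state_dict(model)  # adapter parameters only
    if (not dist.is_initialized()) or dist.get_rank() == 0:
        model.save_pretrained(output_dir, state_dict=adapter_state)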
