System Info
module load python/3.11.6--gcc--8.5.0
module load openmpi/4.1.6--gcc--12.2.0
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load cuda/12.1
3x NVIDIA A100 GPUs on a single node
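The Python-side package versions in the venv are not listed above; for completeness, a snippet like the one below (hypothetical, not run as part of this report) would print the ones most relevant to this error:

```python
# Hypothetical helper (not part of the original report): print the versions of
# the Python packages involved in the failing code path.
import torch, transformers, peft, bitsandbytes

print("torch        ", torch.__version__)
print("transformers ", transformers.__version__)
print("peft         ", peft.__version__)
print("bitsandbytes ", bitsandbytes.__version__)
```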
Information
- The official example scripts
- My own modified scripts
🐛 Describe the bug
I am running finetuning.py with FSDP, PEFT (LoRA) and 4-bit quantization, launched with the following SLURM script; a sketch of what the 4-bit option roughly corresponds to is included after the script.
```bash
#!/bin/bash
# sbatch settings
#SBATCH --job-name=1n3gpu
#SBATCH --time=03:30:01
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:3
#SBATCH --partition=boost_usr_prod
#SBATCH --cpus-per-task=32
#SBATCH --output=logs/output_%j.log
#SBATCH --error=logs/error_%j.log
# Load required modules for Leonardo HPC
module purge
module load profile/deeplrn
module load python/3.11.6--gcc--8.5.0
module load openmpi/4.1.6--gcc--12.2.0
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load cuda/12.1
source /leonardo_work/EUHPC_A04_062/vllm_env/bin/activate
###
#check if IB is used by NCCL
export NCCL_NET=IB
export NCCL_DEBUG=INFO
# training setup
GPUS_PER_NODE=3
MASTER_NAME=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=29500
NNODES=$SLURM_NNODES
MACHINE_RANK=$SLURM_PROCID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
###
#srun bash -c '$CMD'
echo
srun python -u -m torch.distributed.run --nproc_per_node 3 --nnodes $SLURM_JOB_NUM_NODES --rdzv_backend c10d --rdzv_endpoint $MASTER_NAME:$MASTER_PORT \
src/llama_cookbook/finetuning.py \
--enable_fsdp \
--use_peft \
--peft_method lora \
--quantization 4bit \
--model_name /leonardo_scratch/fast/EUHPC_A04_062/Llama-3.3-70B-Instruct \
--dataset monke_dataset \
--save_model \
--num_epochs 2 \
--context_length 4096 \
--output_dir /leonardo_work/EUHPC_A04_062/AI_Model/output_models/peft/Llama-3.3-70B-Instruct \
--dist_checkpoint_root_folder model_checkpoints \
--dist_checkpoint_folder fine-tuned \
--use_wandb
```
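For context: my understanding is that the `--quantization 4bit` flag loads the base model through transformers with a bitsandbytes 4-bit configuration, roughly like the sketch below. This is my approximation, not the cookbook's actual code; the quant type (nf4) and compute dtype (bfloat16) are assumptions.

```python
# Rough equivalent of what --quantization 4bit implies (my assumption, not the
# cookbook's exact configuration): the base model is loaded with a bitsandbytes
# 4-bit config, so its linear weights become bnb Params4bit parameters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed; could also be "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed
)
model = AutoModelForCausalLM.from_pretrained(
    "/leonardo_scratch/fast/EUHPC_A04_062/Llama-3.3-70B-Instruct",
    quantization_config=bnb_config,
)
```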
Error logs
```
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py:18: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
from torch.distributed._shard.checkpoint import (
/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py:18: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
from torch.distributed._shard.checkpoint import (
/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py:18: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
from torch.distributed._shard.checkpoint import (
`low_cpu_mem_usage` was None, now default to True since model is quantized.
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 0%| | 0/30 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/30 [00:00<?, ?it/s]wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.8
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Loading checkpoint shards: 3%|▎ | 1/30 [00:06<02:55, 6.05s/it]
Loading checkpoint shards: 3%|▎ | 1/30 [00:06<02:59, 6.20s/it]`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 30/30 [02:53<00:00, 5.77s/it]
/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/cuda/memory.py:391: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/8 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/cuda/memory.py:391: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1/2, step 7/8 completed (loss: 0.9345082640647888): 100%|██████████| 8/8 [00:48<00:00, 6.03s/it]
Training Epoch: 1/2, step 7/8 completed (loss: 1.1425457000732422): 100%|██████████| 8/8 [00:57<00:00, 7.17s/it]
evaluating Epoch: 0%| | 0/33 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
evaluating Epoch: 100%|██████████| 33/33 [00:26<00:00, 1.25it/s]
evaluating Epoch: 100%|██████████| 33/33 [00:26<00:00, 1.25it/s]
[rank1]: Traceback (most recent call last):
[rank1]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 429, in <module>
[rank1]: fire.Fire(main)
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank1]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank1]: component, remaining_args = _CallAndUpdateTrace(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]: component = fn(*varargs, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 407, in main
[rank1]: results = train(
[rank1]: ^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/utils/train_utils.py", line 242, in train
[rank1]: save_peft_checkpoint(model, train_config.output_dir)
[rank1]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py", line 284, in save_peft_checkpoint
[rank1]: state_dict = get_model_state_dict(model, options=options)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 999, in get_model_state_dict
[rank1]: model_state_dict = _get_model_state_dict(model, info)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 475, in _get_model_state_dict
[rank1]: fqns = _get_fqns(model, key)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 224, in _get_fqns
[rank1]: curr_obj = getattr(curr_obj, curr_obj_name)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AttributeError: 'Params4bit' object has no attribute 'absmax'
[rank2]: Traceback (most recent call last):
[rank2]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 429, in <module>
[rank2]: fire.Fire(main)
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank2]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank2]: component, remaining_args = _CallAndUpdateTrace(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank2]: component = fn(*varargs, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 407, in main
[rank2]: results = train(
[rank2]: ^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/utils/train_utils.py", line 242, in train
[rank2]: save_peft_checkpoint(model, train_config.output_dir)
[rank2]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py", line 284, in save_peft_checkpoint
[rank2]: state_dict = get_model_state_dict(model, options=options)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 999, in get_model_state_dict
[rank2]: model_state_dict = _get_model_state_dict(model, info)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 475, in _get_model_state_dict
[rank2]: fqns = _get_fqns(model, key)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 224, in _get_fqns
[rank2]: curr_obj = getattr(curr_obj, curr_obj_name)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: AttributeError: 'Params4bit' object has no attribute 'absmax'
W0324 16:14:25.342000 1346190 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1346238 closing signal SIGTERM
W0324 16:14:25.344000 1346190 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1346239 closing signal SIGTERM
E0324 16:14:26.659000 1346190 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 1346240) of binary: /leonardo_work/EUHPC_A04_062/vllm_env/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/llama_cookbook/finetuning.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-03-24_16:14:25
host : lrdn0021-net1-3.leonardo.local
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1346240)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: lrdn0021: task 0: Exited with exit code 1
```
Expected behavior
I expected the fine-tuning run to complete both epochs and save the PEFT checkpoint without errors.
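My best guess at the mechanism: bitsandbytes `Linear4bit` serializes its quantization state into extra state_dict entries under dotted keys such as `weight.absmax`, but `absmax` is not an attribute of the `Params4bit` parameter object, so the `getattr` walk in `torch.distributed.checkpoint.state_dict._get_fqns` fails when `save_peft_checkpoint` calls `get_model_state_dict` on the quantized model. A minimal sketch of that suspicion (assumes bitsandbytes is installed and a CUDA device is available; this is not code from the cookbook):

```python
# Minimal sketch of the suspected failure mode (assumptions: bitsandbytes
# installed, CUDA available). Linear4bit stores quantization metadata in its
# state_dict under dotted keys like "weight.absmax"; resolving such a key with
# getattr() fails because Params4bit has no attribute named "absmax".
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(64, 64, compute_dtype=torch.bfloat16)
layer = layer.cuda()  # quantization happens when the module is moved to the GPU

print(list(layer.state_dict().keys()))
# expect entries such as 'weight', 'weight.absmax', 'weight.quant_map', ...

key = "weight.absmax"
obj = layer
for name in key.split("."):
    obj = getattr(obj, name)  # AttributeError: 'Params4bit' object has no attribute 'absmax'
```

If that reading is correct, saving only the LoRA adapter weights instead of walking the full quantized model state dict would avoid these keys, but I have not verified a workaround.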