System Info
module load python/3.11.6--gcc--8.5.0
module load openmpi/4.1.6--gcc--12.2.0
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load cuda/12.1
3x NVIDIA A100 GPUs on a single node
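The Python-side package versions in the venv are not listed above; for completeness, a snippet like the one below (hypothetical, not run as part of this report) would print the ones most relevant to this error:

```python
# Hypothetical helper (not part of the original report): print the versions of
# the Python packages involved in the failing code path.
import torch, transformers, peft, bitsandbytes

print("torch        ", torch.__version__)
print("transformers ", transformers.__version__)
print("peft         ", peft.__version__)
print("bitsandbytes ", bitsandbytes.__version__)
```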
Information
- The official example scripts
- My own modified scripts
🐛 Describe the bug
I am running finetuning.py with FSDP, PEFT (LoRA) and 4-bit quantization, launched with the following SLURM script; a sketch of what the 4-bit option roughly corresponds to is included after the script.
```bash
#!/bin/bash
# sbatch settings
#SBATCH --job-name=1n3gpu
#SBATCH --time=03:30:01
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:3
#SBATCH --partition=boost_usr_prod
#SBATCH --cpus-per-task=32
#SBATCH --output=logs/output_%j.log
#SBATCH --error=logs/error_%j.log
# Load required modules for Leonardo HPC
module purge
module load profile/deeplrn
module load python/3.11.6--gcc--8.5.0
module load openmpi/4.1.6--gcc--12.2.0
module load nccl/2.19.1-1--gcc--12.2.0-cuda-12.1
module load cuda/12.1
source /leonardo_work/EUHPC_A04_062/vllm_env/bin/activate
###
#check if IB is used by NCCL
export NCCL_NET=IB
export NCCL_DEBUG=INFO
# training setup
GPUS_PER_NODE=3
MASTER_NAME=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=29500
NNODES=$SLURM_NNODES
MACHINE_RANK=$SLURM_PROCID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
###
#srun bash -c '$CMD'
echo
srun python -u -m torch.distributed.run --nproc_per_node 3 --nnodes $SLURM_JOB_NUM_NODES --rdzv_backend c10d --rdzv_endpoint $MASTER_NAME:$MASTER_PORT \
src/llama_cookbook/finetuning.py \
--enable_fsdp \
--use_peft \
--peft_method lora \
--quantization 4bit \
--model_name /leonardo_scratch/fast/EUHPC_A04_062/Llama-3.3-70B-Instruct \
--dataset monke_dataset \
--save_model \
--num_epochs 2 \
--context_length 4096 \
--output_dir /leonardo_work/EUHPC_A04_062/AI_Model/output_models/peft/Llama-3.3-70B-Instruct \
--dist_checkpoint_root_folder model_checkpoints \
--dist_checkpoint_folder fine-tuned \
--use_wandb
```
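For context: my understanding is that the `--quantization 4bit` flag loads the base model through transformers with a bitsandbytes 4-bit configuration, roughly like the sketch below. This is my approximation, not the cookbook's actual code; the quant type (nf4) and compute dtype (bfloat16) are assumptions.

```python
# Rough equivalent of what --quantization 4bit implies (my assumption, not the
# cookbook's exact configuration): the base model is loaded with a bitsandbytes
# 4-bit config, so its linear weights become bnb Params4bit parameters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed; could also be "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed
)
model = AutoModelForCausalLM.from_pretrained(
    "/leonardo_scratch/fast/EUHPC_A04_062/Llama-3.3-70B-Instruct",
    quantization_config=bnb_config,
)
```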
Error logs
```
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py:18: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
from torch.distributed._shard.checkpoint import (
/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py:18: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
from torch.distributed._shard.checkpoint import (
/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py:18: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
from torch.distributed._shard.checkpoint import (
`low_cpu_mem_usage` was None, now default to True since model is quantized.
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 0%| | 0/30 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/30 [00:00<?, ?it/s]wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.8
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Loading checkpoint shards: 3%|▎ | 1/30 [00:06<02:55, 6.05s/it]
Loading checkpoint shards: 3%|▎ | 1/30 [00:06<02:59, 6.20s/it]`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 30/30 [02:53<00:00, 5.77s/it]
/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/cuda/memory.py:391: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/8 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/cuda/memory.py:391: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1/2, step 7/8 completed (loss: 0.9345082640647888): 100%|██████████| 8/8 [00:48<00:00, 6.03s/it]
Training Epoch: 1/2, step 7/8 completed (loss: 1.1425457000732422): 100%|██████████| 8/8 [00:57<00:00, 7.17s/it]
evaluating Epoch: 0%| | 0/33 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
evaluating Epoch: 100%|██████████| 33/33 [00:26<00:00, 1.25it/s]
evaluating Epoch: 100%|██████████| 33/33 [00:26<00:00, 1.25it/s]
[rank1]: Traceback (most recent call last):
[rank1]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 429, in <module>
[rank1]: fire.Fire(main)
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank1]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank1]: component, remaining_args = _CallAndUpdateTrace(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]: component = fn(*varargs, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 407, in main
[rank1]: results = train(
[rank1]: ^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/utils/train_utils.py", line 242, in train
[rank1]: save_peft_checkpoint(model, train_config.output_dir)
[rank1]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py", line 284, in save_peft_checkpoint
[rank1]: state_dict = get_model_state_dict(model, options=options)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 999, in get_model_state_dict
[rank1]: model_state_dict = _get_model_state_dict(model, info)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 475, in _get_model_state_dict
[rank1]: fqns = _get_fqns(model, key)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 224, in _get_fqns
[rank1]: curr_obj = getattr(curr_obj, curr_obj_name)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AttributeError: 'Params4bit' object has no attribute 'absmax'
[rank2]: Traceback (most recent call last):
[rank2]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 429, in <module>
[rank2]: fire.Fire(main)
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank2]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank2]: component, remaining_args = _CallAndUpdateTrace(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank2]: component = fn(*varargs, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/finetuning.py", line 407, in main
[rank2]: results = train(
[rank2]: ^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/utils/train_utils.py", line 242, in train
[rank2]: save_peft_checkpoint(model, train_config.output_dir)
[rank2]: File "/leonardo_work/EUHPC_A04_062/AI_Model/src/llama_cookbook/model_checkpointing/checkpoint_handler.py", line 284, in save_peft_checkpoint
[rank2]: state_dict = get_model_state_dict(model, options=options)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 999, in get_model_state_dict
[rank2]: model_state_dict = _get_model_state_dict(model, info)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 475, in _get_model_state_dict
[rank2]: fqns = _get_fqns(model, key)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 224, in _get_fqns
[rank2]: curr_obj = getattr(curr_obj, curr_obj_name)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: AttributeError: 'Params4bit' object has no attribute 'absmax'
W0324 16:14:25.342000 1346190 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1346238 closing signal SIGTERM
W0324 16:14:25.344000 1346190 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1346239 closing signal SIGTERM
E0324 16:14:26.659000 1346190 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 1346240) of binary: /leonardo_work/EUHPC_A04_062/vllm_env/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/leonardo_work/EUHPC_A04_062/vllm_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/llama_cookbook/finetuning.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-03-24_16:14:25
host : lrdn0021-net1-3.leonardo.local
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1346240)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: lrdn0021: task 0: Exited with exit code 1
```
Expected behavior
I expected the fine-tuning run to complete both epochs and save the PEFT checkpoint without errors.
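My best guess at the mechanism: bitsandbytes `Linear4bit` serializes its quantization state into extra state_dict entries under dotted keys such as `weight.absmax`, but `absmax` is not an attribute of the `Params4bit` parameter object, so the `getattr` walk in `torch.distributed.checkpoint.state_dict._get_fqns` fails when `save_peft_checkpoint` calls `get_model_state_dict` on the quantized model. A minimal sketch of that suspicion (assumes bitsandbytes is installed and a CUDA device is available; this is not code from the cookbook):

```python
# Minimal sketch of the suspected failure mode (assumptions: bitsandbytes
# installed, CUDA available). Linear4bit stores quantization metadata in its
# state_dict under dotted keys like "weight.absmax"; resolving such a key with
# getattr() fails because Params4bit has no attribute named "absmax".
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(64, 64, compute_dtype=torch.bfloat16)
layer = layer.cuda()  # quantization happens when the module is moved to the GPU

print(list(layer.state_dict().keys()))
# expect entries such as 'weight', 'weight.absmax', 'weight.quant_map', ...

key = "weight.absmax"
obj = layer
for name in key.split("."):
    obj = getattr(obj, name)  # AttributeError: 'Params4bit' object has no attribute 'absmax'
```

If that reading is correct, saving only the LoRA adapter weights instead of walking the full quantized model state dict would avoid these keys, but I have not verified a workaround.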