Description
When I use QLoRA, a c10::Error is thrown during training.
- 4-bit quantization is used
- quantization_config=BitsAndBytesConfig(
      load_in_4bit=True,
      load_in_8bit=False,
      llm_int8_threshold=6.0,
      llm_int8_has_fp16_weight=False,
      bnb_4bit_compute_dtype=torch.bfloat16,
      bnb_4bit_use_double_quant=True,
      bnb_4bit_quant_type="nf4"
  )
- optim is paged_adamw_32bit
- versions: bitsandbytes==0.39.1, transformers==4.30.2
Traceback (most recent call last):
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 586, in <module>
main()
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 558, in main
train_result = trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2007, in _inner_training_loop
self.optimizer.step()
File "/root/.local/lib/python3.8/site-packages/accelerate/optimizer.py", line 140, in step
self.optimizer.step(closure)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 270, in step
torch.cuda.synchronize()
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/cuda/__init__.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8958528457 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f89584f23ec in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f8983bd0c64 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e0dc (0x7f8983ba80dc in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f8983bab054 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d6e23 (0x7f89ae2a8e23 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f89585089e0 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8958508af9 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x734c68 (0x7f89ae506c68 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f89ae506f85 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x110632 (0x5617c44ba632 in ./python_bin)
frame #11: <unknown function> + 0x110059 (0x5617c44ba059 in ./python_bin)
frame #12: <unknown function> + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #13: <unknown function> + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #14: <unknown function> + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #15: <unknown function> + 0x177ce7 (0x5617c4521ce7 in ./python_bin)
frame #16: PyDict_SetItemString + 0x4c (0x5617c4524d8c in ./python_bin)
frame #17: PyImport_Cleanup + 0xaa (0x5617c4597a2a in ./python_bin)
frame #18: Py_FinalizeEx + 0x79 (0x5617c45fd4c9 in ./python_bin)
frame #19: Py_RunMain + 0x1bc (0x5617c460083c in ./python_bin)
frame #20: Py_BytesMain + 0x39 (0x5617c4600c29 in ./python_bin)
frame #21: __libc_start_main + 0xf2 (0x7f89d0c5d192 in /lib64/libc.so.6)
frame #22: <unknown function> + 0x1f9ad7 (0x5617c45a3ad7 in ./python_bin)
Fatal Python error: Aborted
Thread 0x00007f86b1640640 (most recent call first):
Current thread 0x00007f89d02c8cc0 (most recent call first):
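
As the error message itself suggests, the Python traceback above may not point at the kernel that actually faulted, because CUDA errors are reported asynchronously. A possible next step is to rerun with CUDA_LAUNCH_BLOCKING=1. The following is a minimal sketch, assuming the variable is set at the very top of the training script before torch is imported; the helper name `launch_blocking_enabled` is hypothetical and only there to confirm the setting.

```python
import os

# Assumption: this runs before `import torch`, so the setting is in place
# when CUDA is initialized. With it, kernel launches run synchronously and
# the traceback should point at the call that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled() -> bool:
    """Hypothetical helper: report whether synchronous launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING enabled:", launch_blocking_enabled())
```

The same effect can be had by exporting the variable in the shell before launching the script. Training will run slower in this mode, so it is only meant for reproducing the crash.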