terminate called after throwing an instance of 'c10::Error' #597

@CRyan2016

When I use QLoRA, a c10::Error is thrown during training.

  • 4-bit quantization is used
  • quantization_config=BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
    ),
  • optim is paged_adamw_32bit

Versions used: bitsandbytes==0.39.1, transformers==4.30.2
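
For anyone trying to reproduce this, the following is a minimal sketch for printing the installed versions of the packages involved. The helper name `installed_version` is hypothetical, and adding `accelerate` and `torch` to the list is an assumption based on the paths in the traceback, since their versions are not stated in the report.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# bitsandbytes and transformers are named in the report; accelerate and
# torch are added here only because they appear in the traceback paths.
for name in ("bitsandbytes", "transformers", "accelerate", "torch"):
    print(f"{name}=={installed_version(name)}")
```

Posting this output together with the GPU model and CUDA version would probably make the report easier to compare against other environments.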

Traceback (most recent call last):
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 586, in <module>
main()
File "/checkpoint/binary/train_package/chat/sft/train_qlora.py", line 558, in main
train_result = trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2007, in _inner_training_loop
self.optimizer.step()
File "/root/.local/lib/python3.8/site-packages/accelerate/optimizer.py", line 140, in step
self.optimizer.step(closure)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 270, in step
torch.cuda.synchronize()
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/cuda/__init__.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8958528457 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f89584f23ec in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f8983bd0c64 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e0dc (0x7f8983ba80dc in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f8983bab054 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f89ae2a8e23 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f89585089e0 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8958508af9 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f89ae506c68 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f89ae506f85 in /opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x110632 (0x5617c44ba632 in ./python_bin)
frame #11: + 0x110059 (0x5617c44ba059 in ./python_bin)
frame #12: + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #13: + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #14: + 0x110043 (0x5617c44ba043 in ./python_bin)
frame #15: + 0x177ce7 (0x5617c4521ce7 in ./python_bin)
frame #16: PyDict_SetItemString + 0x4c (0x5617c4524d8c in ./python_bin)
frame #17: PyImport_Cleanup + 0xaa (0x5617c4597a2a in ./python_bin)
frame #18: Py_FinalizeEx + 0x79 (0x5617c45fd4c9 in ./python_bin)
frame #19: Py_RunMain + 0x1bc (0x5617c460083c in ./python_bin)
frame #20: Py_BytesMain + 0x39 (0x5617c4600c29 in ./python_bin)
frame #21: __libc_start_main + 0xf2 (0x7f89d0c5d192 in /lib64/libc.so.6)
frame #22: + 0x1f9ad7 (0x5617c45a3ad7 in ./python_bin)

Fatal Python error: Aborted

Thread 0x00007f86b1640640 (most recent call first):

Current thread 0x00007f89d02c8cc0 (most recent call first):
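
The error message notes that CUDA kernel errors can be reported asynchronously, so the `torch.cuda.synchronize()` frame above may not be where the illegal access actually happened. As a rough sketch of the debugging step the message itself suggests, the variable can be set at the very top of the training script. The helper name `enable_blocking_cuda_launches` is hypothetical, and this assumes nothing has initialized CUDA before it runs.

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 in `env` and return it.

    With this set, kernel launches run synchronously, so an error is
    reported at the call that launched the failing kernel.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Must run before the first CUDA call; the safest place is before
# `import torch` at the top of the script.
enable_blocking_cuda_launches()

# import torch  # import torch and start training only after this point
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python train_qlora.py`. Training will be slower in this mode, but the resulting traceback should point closer to the kernel that triggers the illegal memory access.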

Labels: Bug (something isn't working), High Priority (first issues that will be worked on)
