Description
NVIDIA Open GPU Kernel Modules Version
575.64.03
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Ubuntu 24.04.3 LTS
Kernel Release
6.14.0-27-generic
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
NVIDIA GeForce RTX 4090
Describe the bug
I run a CUDA program and within 1 hour the driver crashes. The GPUs have enough power and the system is not overheating (it is water-cooled).
I've tried different kernel versions, different driver versions, a CMOS reset, and both default and modified BIOS settings; the issue is the same every time.
It happens with both the proprietary and the open kernel driver.
When the driver crashes, the display drops to a blank text screen with the message: "nvidia-modeset: ERROR: GPU: Error while waiting for GPU progress".
Here are some stack traces.
Call Trace 1
NVRM: VM: nv_free_pages: 0x1
NVRM: VM: nv_free_pages:3890: 0x00000000a5cab6f8, 1 page(s), count = 1, page_table = 0x000000002de50b39
NVRM: VM: nv_free_system_pages: 1 pages
NVRM: VM: nvidia_vma_release:101: 0x775fb8c90000 - 0x775fb8ca0000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x00000000a14af54d
NVRM: VM: nvidia_vma_release:101: 0x775fb8d82000 - 0x775fb8d92000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x00000000274e7f81
NVRM: VM: nvidia_vma_release:101: 0x775fb8d92000 - 0x775fb8da2000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x000000009357fad0
NVRM: VM: nvidia_vma_release:101: 0x775fbd00d000 - 0x775fbd00e000, 0x00001000 bytes @ 0x0000000000000000, 0x00000000727b7b44, 0x0000000034deee6b
NVRM: VM: nv_alloc_release:1766: 0x00000000727b7b44, 1 page(s), count = 13, page_table = 0x0000000036527f65
NVRM: VM: nvidia_vma_release:101: 0x775fbd00e000 - 0x775fbd01e000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x0000000060415cca
NVRM: VM: nvidia_vma_release:101: 0x775fbf2fb000 - 0x775fbf2fc000, 0x00001000 bytes @ 0x0000000000000000, 0x00000000855cd83a, 0x00000000b5dddb94
NVRM: VM: nv_alloc_release:1766: 0x00000000855cd83a, 1 page(s), count = 13, page_table = 0x000000005a97ce7a
NVRM: VM: nvidia_vma_release:101: 0x775fbf2fc000 - 0x775fbf2fd000, 0x00001000 bytes @ 0x0000000000000000, 0x00000000c4abedf2, 0x000000001cf6f6b6
NVRM: VM: nv_alloc_release:1766: 0x00000000c4abedf2, 1 page(s), count = 13, page_table = 0x000000003ebe2652
NVRM: VM: nvidia_vma_release:101: 0x775fbf2fd000 - 0x775fbf2fe000, 0x00001000 bytes @ 0x0000000000000000, 0x00000000e88005e7, 0x00000000df9a72b8
NVRM: VM: nv_alloc_release:1766: 0x00000000e88005e7, 1 page(s), count = 13, page_table = 0x000000008757d474
NVRM: VM: nvidia_vma_release:101: 0x775fbf2fe000 - 0x775fbf2ff000, 0x00001000 bytes @ 0x0000000000000000, 0x0000000081132f33, 0x000000008790cb97
NVRM: VM: nv_alloc_release:1766: 0x0000000081132f33, 1 page(s), count = 13, page_table = 0x00000000ae5fdcec
NVRM: VM: nvidia_vma_release:101: 0x775fbf2ff000 - 0x775fbf30f000, 0x00010000 bytes @ 0x0000000000000000, 0x0000000000000000, 0x00000000d0779534
WARNING: CPU: 12 PID: 200421 at nvidia/nv.c:5039 nvidia_dev_put+0xb1/0xc0 [nvidia]
Modules linked in: iptable_filter xt_comment iptable_nat nf_conntrack_netlink veth xt_MASQUERADE bridge stp llc xt_set ip_set xfrm_user xfrm_algo snd_seq_dummy snd_hrtimer nvidia_uvm(OE) overlay qrtr ip6t_REJECT xt_hl ip6t_rt ipt_REJECT xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nft_compat binfmt_misc nls_iso8859_1 ipmi_ssif nvidia_drm(POE) nvidia_modeset(OE) snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi amd_atl intel_rapl_msr intel_rapl_common snd_hda_intel amd64_edac edac_mce_amd snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio kvm_amd snd_hda_codec nvidia(OE) snd_usbmidi_lib snd_hda_core snd_ump snd_hwdep kvm snd_pcm spd5118 irqbypass snd_seq_midi snd_seq_midi_event polyval_clmulni polyval_generic ghash_clmulni_intel snd_rawmidi sha256_ssse3 sha1_ssse3 snd_seq aesni_intel crypto_simd mfd_aaeon eeepc_wmi cryptd asus_wmi snd_seq_device sparse_keymap snd_timer wmi_bmof platform_profile rapl drm_ttm_helper mc snd acpi_ipmi ttm
i2c_piix4 ipmi_si video soundcore ccp k10temp i2c_smbus ipmi_devintf joydev input_leds ipmi_msghandler gpio_amdpt mac_hid nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_masq nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sch_fq_codel nf_tables msr parport_pc ppdev lp parport efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq dm_mirror dm_region_hash dm_log cdc_ether usbnet uas usb_storage mii hid_generic nvme i40e ahci thunderbolt libahci nvme_core nvme_auth libie wmi ucsi_acpi typec_ucsi typec usbhid hid
CPU: 12 UID: 1000 PID: 200421 Comm: pool-gnome-cale Tainted: P OE 6.14.0-27-generic #27~24.04.1-Ubuntu
Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: ASUS System Product Name/Pro WS WRX90E-SAGE SE, BIOS 1203 07/18/2025
RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
NVRM: nvidia_close on GPU with minor number 255
NVRM: nvidia_ctl_close
Code: 31 d2 31 f6 31 ff e9 29 0d 08 dd 48 c7 c7 f0 35 6b c1 e8 f2 1e 46 de 5b 41 5c 41 5d 5d 31 c0 31 d2 31 f6 31 ff e9 0a 0d 08 dd <0f> 0b eb c2 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90
RSP: 0018:ff53a0042213f910 EFLAGS: 00010202
RAX: 0000000000000026 RBX: ff391ece5bb18000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff53a0042213f860
RBP: ff53a0042213f928 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ff391ece5bb186a8
R13: 0000000000000000 R14: ff391ece55d8d6a0 R15: ffffffffc16b3740
FS: 0000000000000000(0000) GS:ff391f4bbca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00006204224ec800 CR3: 00000002af4a9005 CR4: 0000000000f71ef0
PKRU: 55555554
NVRM: VM: nv_free_pages: 0x1
Call Trace:
NVRM: VM: nv_free_pages:3890: 0x000000009818b312, 1 page(s), count = 1, page_table = 0x00000000b24f713
NVRM: VM: nv_free_system_pages: 1 pages
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? aa_file_perm+0x13b/0x2d0
? srso_alias_return_thunk+0x5/0xfbef5
? eventfd_read+0xdc/0x200
? security_file_permission+0x36/0x60
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? vfs_read+0x2a8/0x390
? srso_alias_return_thunk+0x5/0xfbef5
? ksys_read+0x9d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x79004a91b4cd
Code: Unable to access opcode bytes at 0x79004a91b4a3.
RSP: 002b:000079000dbfb7a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
RAX: fffffffffffffdfc RBX: 00007900040029d0 RCX: 000079004a91b4cd
RDX: 00000000ffffffff RSI: 0000000000000001 RDI: 000078fffc000bb0
RBP: 000079000dbfb7c0 R08: 0000000000000000 R09: 000000007fffffff
R10: 00007900040029d0 R11: 0000000000000293 R12: 000000007fffffff
R13: 000079004af52c10 R14: 0000000000000001 R15: 000078fffc000bb0
Call Trace 2
Another one happened in the same second as the first one.
------------[ cut here ]------------
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_close on GPU with minor number 0
NVRM: nvidia_close on GPU with minor number 2
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_close on GPU with minor number 0
NVRM: nvidia_close on GPU with minor number 2
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_close on GPU with minor number 0
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 255
NVRM: nvidia_close on GPU with minor number 4
NVRM: nvidia_ctl_close
WARNING: CPU: 28 PID: 6705 at nvidia/nv.c:5039 nvidia_dev_put+0xb1/0xc0 [nvidia]
NVRM: nvidia_close on GPU with minor number 1
NVRM: nvidia_close on GPU with minor number 4
NVRM: VM: nv_free_pages: 0x1
NVRM: VM: nv_free_pages:3890: 0x00000000c2b1c48f, 1 page(s), count = 1, page_table = 0x00000000bba9b7e5
NVRM: VM: nv_free_system_pages: 1 pages
Code: 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17 f0 c3 62 e1 fe 28 6f 06 62 e1 fe 28 6f 4c 16 ff 62 e1 fe 28 7f 07 62 e1 fe 28 7f 4c 17 ff c3 <48> 8b 4c 16 f8 48 8b 36 48 89 37 48 89 4c 17 f8 c3 62 e1 fe 48 6f
CPU: 28 UID: 1000 PID: 6705 Comm: xdg-desktop-por Tainted: P W OE 6.14.0-27-generic #27~24.04.1-Ubuntu
Tainted: [P]=PROPRIETARY_MODULE, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: ASUS System Product Name/Pro WS WRX90E-SAGE SE, BIOS 1203 07/18/2025
RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
Code: 31 d2 31 f6 31 ff e9 29 0d 08 dd 48 c7 c7 f0 35 6b c1 e8 f2 1e 46 de 5b 41 5c 41 5d 5d 31 c0 31 d2 31 f6 31 ff e9 0a 0d 08 dd <0f> 0b eb c2 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90
RSP: 0018:ff53a004226b79a0 EFLAGS: 00010202
RAX: 0000000000000026 RBX: ff391ece5bb18000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff53a004226b78f0
RBP: ff53a004226b79b8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ff391ece5bb186a8
R13: 0000000000000000 R14: ff391ece55d8d6a0 R15: ffffffffc16b3740
FS: 0000000000000000(0000) GS:ff391f4bbd200000(0000) knlGS:0000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007dee4d1166e0 CR3: 000000014b673003 CR4: 0000000000f71ef0
PKRU: 55555554
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? fput+0x157/0x190
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? do_filp_open+0xd4/0x1a0
? srso_alias_return_thunk+0x5/0xfbef5
? putname+0x60/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? do_sys_openat2+0x9f/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7c37fa92725d
Code: Unable to access opcode bytes at 0x7c37fa927233.
RSP: 002b:00007c37d61fe808 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00006077beef6450 RCX: 00007c37fa92725d
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00006077beef6460
RBP: 00007c37d61fe840 R08: 0000000000000007 R09: 00007c37d00047e0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007c37d61ff648
R13: 0000000000000000 R14: 00006077beef6460 R15: 0000000000000002
</TASK>
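A side note on the user-space register dumps that close these traces: `ORIG_RAX` holds the syscall number and `RAX` the pending return value. Below is a minimal sketch for decoding them. It assumes the x86_64 syscall numbering (7 is `poll`, 0xca is `futex`) and the kernel-internal restart codes 512 (`ERESTARTSYS`) and 516 (`ERESTART_RESTARTBLOCK`). The helper name `decode_user_regs` is hypothetical and the tables cover only the values seen in this report.

```python
# Subset of the x86_64 syscall table, limited to the numbers seen in these traces.
SYSCALLS = {0x07: "poll", 0xCA: "futex"}
# Kernel-internal restart codes (never visible to user space as errno values).
RESTART_CODES = {512: "ERESTARTSYS", 516: "ERESTART_RESTARTBLOCK"}


def decode_user_regs(orig_rax_hex, rax_hex):
    """Return (syscall name, return code name) for an ORIG_RAX/RAX pair."""
    nr = int(orig_rax_hex, 16)
    raw = int(rax_hex, 16)
    # RAX is printed as an unsigned 64-bit value; reinterpret it as signed.
    ret = raw - (1 << 64) if raw >= (1 << 63) else raw
    name = SYSCALLS.get(nr, f"syscall {nr}")
    code = RESTART_CODES.get(-ret, str(ret))
    return name, code


# Values copied from the traces above.
print(decode_user_regs("0000000000000007", "fffffffffffffdfc"))
print(decode_user_regs("00000000000000ca", "fffffffffffffe00"))
```

If this reading is right, the affected threads were sleeping in `poll` or `futex` when a signal arrived, which would fit the `get_signal` and `do_group_exit` frames at the top of each trace.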
Call Trace 3
Another one happened in the same second.
RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
PKRU: 55555554
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? __pfx_pollwake+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_get_rseq_cs+0x22/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_ip_fixup+0x8f/0x1f0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x775fc631b4cd
Code: Unable to access opcode bytes at 0x775fc631b4a3.
RSP: 002b:0000775fa9e6ab60 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
RAX: fffffffffffffdfc RBX: 000064dd00ef2670 RCX: 0000775fc631b4cd
RDX: 00000000ffffffff RSI: 0000000000000001 RDI: 0000775f98000da0
RBP: 0000775fa9e6ab80 R08: 0000000000000000 R09: 000000007fffffff
R10: 000064dd00ef2670 R11: 0000000000000293 R12: 000000007fffffff
R13: 0000775fc6752c10 R14: 0000000000000001 R15: 0000775f98000da0
</TASK>
Call Trace 4
And another one, same time.
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? sysvec_apic_timer_interrupt+0x57/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7539ec298d71
Code: Unable to access opcode bytes at 0x7539ec298d47.
Call Trace 5
Then this one a second later.
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? aa_file_perm+0x13b/0x2d0
? srso_alias_return_thunk+0x5/0xfbef5
? eventfd_read+0xdc/0x200
? security_file_permission+0x36/0x60
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? vfs_read+0x2a8/0x390
? srso_alias_return_thunk+0x5/0xfbef5
? ksys_read+0x9d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x79004a91b4cd
Code: Unable to access opcode bytes at 0x79004a91b4a3.
RSP: 002b:000079000dbfb7a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
RAX: fffffffffffffdfc RBX: 00007900040029d0 RCX: 000079004a91b4cd
RDX: 00000000ffffffff RSI: 0000000000000001 RDI: 000078fffc000bb0
RBP: 000079000dbfb7c0 R08: 0000000000000000 R09: 000000007fffffff
R10: 00007900040029d0 R11: 0000000000000293 R12: 000000007fffffff
R13: 000079004af52c10 R14: 0000000000000001 R15: 000078fffc000bb0
</TASK>
Call Trace 6
And another one:
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? switch_fpu_return+0x50/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? __futex_wait+0x160/0x1d0
? __pfx_futex_wake_mark+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? hrtimer_cancel+0x15/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? futex_wait+0x85/0x130
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_get_rseq_cs+0x22/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_ip_fixup+0x8f/0x1f0
? do_futex+0x105/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? restore_fpregs_from_fpstate+0x3d/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? switch_fpu_return+0x50/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x78480392725d
Code: Unable to access opcode bytes at 0x784803927233.
RSP: 002b:000078470dbfe948 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007847e4f2c000 RCX: 000078480392725d
RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007847e4f2c040
RBP: 00007847e4f2c040 R08: 0000000000000000 R09: ffffffffffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007847e4f2c150 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Call Trace 7
And another one:
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
? srso_alias_return_thunk+0x5/0xfbef5
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? schedule+0x3f/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? futex_wait_queue+0x69/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? __futex_wait+0x160/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? timerqueue_del+0x31/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? __remove_hrtimer+0x52/0xb0
? srso_alias_return_thunk+0x5/0xfbef5
? hrtimer_try_to_cancel.part.0+0x55/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
? hrtimer_cancel+0x21/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? futex_wait+0x85/0x130
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_get_rseq_cs+0x22/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? futex_wake+0x89/0x190
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? restore_fpregs_from_fpstate+0x3d/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? switch_fpu_return+0x50/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x2d/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit+0x43/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? sysvec_call_function_single+0x57/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7158b9e98d71
Code: Unable to access opcode bytes at 0x7158b9e98d47
Call Trace 8
And another one:
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
get_signal+0x8cb/0x8d0
arch_do_signal_or_restart+0x39/0x110
syscall_exit_to_user_mode+0x146/0x1d0
do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_futex+0x18e/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_futex+0x12a/0x200
? srso_alias_return_thunk+0x5/0xfbef5
? fput+0x157/0x190
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? do_filp_open+0xd4/0x1a0
? srso_alias_return_thunk+0x5/0xfbef5
? putname+0x60/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? do_sys_openat2+0x9f/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x96/0x1e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7c37fa92725d
Code: Unable to access opcode bytes at 0x7c37fa927233
Call Trace 9
And another one:
Call Trace:
<TASK>
nvidia_close+0x1a2/0x270 [nvidia]
__fput+0xea/0x2d0
____fput+0x15/0x20
task_work_run+0x5d/0xa0
do_exit+0x26c/0x4e0
do_group_exit+0x34/0x90
__x64_sys_exit_group+0x18/0x20
x64_sys_call+0x1666/0x2650
do_syscall_64+0x7e/0x170
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? unix_seqpacket_recvmsg+0x43/0x70
? srso_alias_return_thunk+0x5/0xfbef5
? sock_recvmsg+0xde/0xf0
? filp_flush+0x8d/0xb0
? srso_alias_return_thunk+0x5/0xfbef5
? ____sys_recvmsg+0x111/0x230
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? ___sys_recvmsg+0x9c/0xf0
? do_sys_openat2+0x9f/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_get_rseq_cs+0x22/0x260
? srso_alias_return_thunk+0x5/0xfbef5
? rseq_ip_fixup+0x8f/0x1f0
? srso_alias_return_thunk+0x5/0xfbef5
? restore_fpregs_from_fpstate+0x3d/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? switch_fpu_return+0x50/0xe0
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0xc8/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
? syscall_exit_to_user_mode+0x38/0x1d0
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0x8a/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x70d49aeee21d
Code: Unable to access opcode bytes at 0x70d49aeee1f3.
There are 6 more stack traces, which I can post if relevant.
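Since the traces are so repetitive, a summary may be more useful than posting all of them. The following is a minimal sketch, assuming the kernel log has been saved as plain text and that the `WARNING:` and `Comm:` lines keep the format shown above. The function name `summarize_warnings` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   WARNING: CPU: 12 PID: 200421 at nvidia/nv.c:5039 nvidia_dev_put+0xb1/0xc0 [nvidia]
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at (?P<site>\S+) (?P<symbol>\S+)"
)
# Matches lines such as:
#   CPU: 12 UID: 1000 PID: 200421 Comm: pool-gnome-cale Tainted: P OE ...
COMM_RE = re.compile(r"PID: (?P<pid>\d+) Comm: (?P<comm>\S+)")


def summarize_warnings(log_text):
    """Count kernel WARNINGs per source site and per process name."""
    sites = Counter()
    comm_by_pid = {}
    warned_pids = []
    for line in log_text.splitlines():
        m = WARNING_RE.search(line)
        if m:
            sites[(m["site"], m["symbol"])] += 1
            warned_pids.append(m["pid"])
            continue
        m = COMM_RE.search(line)
        if m:
            comm_by_pid[m["pid"]] = m["comm"]
    # Resolve names after the loop, because the Comm line follows its WARNING line.
    processes = Counter(comm_by_pid.get(pid, "unknown") for pid in warned_pids)
    return sites, processes
```

Calling `summarize_warnings(open("kernel.log").read())` (the file name is a placeholder) returns two counters: one keyed by `(source location, symbol)` and one keyed by process name.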
To Reproduce
I run a CUDA program and within 1 hour it crashes; that is the only way I have found to reproduce it. Alternatively, I can run a burn test and it can happen as well.
Bug Incidence
Always
nvidia-bug-report.log.gz
I can't send nvidia-bug-report.log.gz: when the driver crashes, Xorg segfaults immediately afterwards and I cannot do anything; rebooting is the only option (e.g. via the Magic SysRq key).
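One possible workaround is to pull the kernel messages of the crashed boot out of the systemd journal after rebooting. The sketch below assumes a persistent journal (so that earlier boots are retained) and permission to read it. The function names and the output path are hypothetical; `-k`, `-b` and `--no-pager` are standard `journalctl` options.

```python
import subprocess


def previous_boot_kernel_log_cmd(boot_offset=-1):
    """Build the journalctl command that prints kernel messages from an earlier boot."""
    # -k: kernel messages only; -b N: boot offset (-1 is the boot before the current one).
    return ["journalctl", "-k", "-b", str(boot_offset), "--no-pager"]


def save_previous_boot_kernel_log(path, boot_offset=-1):
    """Write the earlier boot's kernel log to `path`; raises if journalctl fails."""
    result = subprocess.run(
        previous_boot_kernel_log_cmd(boot_offset),
        capture_output=True,
        text=True,
        check=True,
    )
    with open(path, "w") as fh:
        fh.write(result.stdout)
    return path
```

After a reboot, `save_previous_boot_kernel_log("kernel-prev-boot.log")` would save whatever the journal managed to flush before the reset. Messages from the last moments before the crash may still be missing.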
More Info
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 575.64.03 Debug Build (kenorb@3XS) Sun 10 Aug 12:36:54 BST 2025
GCC version: gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
$ uname -r
6.14.0-27-generic
$ lsmod | rg nvidia
nvidia_uvm 2150400 0
nvidia_drm 135168 58
nvidia_modeset 2101248 18 nvidia_drm
nvidia 14741504 535 nvidia_uvm,nvidia_modeset
drm_ttm_helper 16384 2 nvidia_drm
video 77824 2 asus_wmi,nvidia_modeset
$ modinfo nvidia | head
filename: /lib/modules/6.14.0-27-generic/updates/dkms/nvidia.ko.zst
import_ns: DMA_BUF
alias: char-major-195-*
version: 575.64.03
supported: external
license: Dual MIT/GPL
firmware: nvidia/575.64.03/gsp_tu10x.bin
firmware: nvidia/575.64.03/gsp_ga10x.bin
srcversion: 8DBF4ED3568DB8FEA5B7834
alias: pci:v000010DEd*sv*sd*bc06sc80i00*
$ modinfo nvidia_modeset | head
filename: /lib/modules/6.14.0-27-generic/updates/dkms/nvidia-modeset.ko.zst
version: 575.64.03
supported: external
license: Dual MIT/GPL
srcversion: 4E29AA9F8BB75D880663278
depends: video,nvidia
name: nvidia_modeset
retpoline: Y
vermagic: 6.14.0-27-generic SMP preempt mod_unload modversions
parm: output_rounding_fix:bool
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS
Release: 24.04
Codename: noble
$ nvidia-smi pci -i 0,1
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-329ede61-7982-f4f6-...-...)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-0535a00b-ecd6-8908-...-...)
$ cat /etc/modprobe.d/nvidia.conf
# /etc/modprobe.d/nvidia.conf
options nvidia NVreg_DynamicPowerManagement=0
options nvidia NVreg_EnableGpuFirmwareLogs=1 # Increased verbosity for debugging (set 2 for more)
options nvidia NVreg_EnablePCIeGen3=0 # Allow auto-negotiation to avoid PCIe issues
options nvidia NVreg_EnableResizableBar=1 # Keep if BIOS/GPU supports; test with 0 if issues
options nvidia NVreg_EnableStreamMemOPs=1 # Optional, keep for CUDA workloads
options nvidia NVreg_InitializeSystemMemoryAllocations=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_ResmanDebugLevel=1 # Increased verbosity for debugging (set 2 for more)
options nouveau modeset=0
blacklist nouveau
$ sudo nvflash --index 0 --version
NVIDIA Firmware Update Utility (Version 5.867.0)
Reading EEPROM (this operation may take up to 30 seconds)
Redundant Firmware : Instance 0 (Identical)
Sign-On Message : PG139 SKU 332 VGA BIOS
Build GUID : 23B22FE99D20451B9D3059226648D230
Build Number : 32193434
IFR Subsystem ID : 19DA-1675
Subsystem Vendor ID : 0x19DA
Subsystem ID : 0x1675
Version : 95.02.3C.40.1B
Image Hash : N/A
Product Name : GPU Board
Device Name(s) : Graphics Device
Board ID : 0x0475
Vendor ID : 0x10DE
Device ID : 0x2684
Hierarchy ID : Normal Board
Chip SKU : 301-0
Project : G139-0332
Build Date : 12/13/22
Modification Date : 05/05/23
UEFI Version : 0x7000B ( x64 )
UEFI Variant ID : 0x000000000000000B ( Unknown )
UEFI Signer(s) : Microsoft Corporation UEFI CA 2011
XUSB-FW Version ID : N/A
XUSB-FW Build Time : N/A
InfoROM Version : G002.0000.00.03
InfoROM Backup : Present
License Placeholder : Present
GPU Mode : N/A
CEC OTA-signed Blob : Not Present
# Note: The VBIOS is the same on all GPUs and is the latest from Zotac (dated 13/12/22).