Nvidia 560.28.03-1 throwing kernel stack trace with linux kernels from 6.10.3 up to 6.10.9 or newer #705

Open
mashu opened this issue Sep 16, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@mashu

mashu commented Sep 16, 2024

NVIDIA Open GPU Kernel Modules Version

560.35.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Debian GNU/Linux trixie/sid

Kernel Release

6.10.9

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4090 Laptop GPU

Describe the bug

My dmesg is flooded with kernel warnings and stack traces (the kernel is
marked as tainted) with the latest NVIDIA driver 560.28.03-1 and Linux kernel
6.10.3 on a Debian GNU/Linux setup. The full log is in the
nvidia-bug-report.log.gz attached to this report.

Short summary:

  1. The error messages are consistently related to the function
    follow_pte+0x1de/0x200.
  2. In the call traces, we can see NVIDIA-related functions being called:
    nv_revoke_gpu_mappings+0x67/0xb0 [nvidia]
    RmHandleIdleSustained+0x39/0x130 [nvidia]
    rm_execute_work_item+0xe0/0x150 [nvidia]
  3. The module list shows NVIDIA modules loaded:
    nvidia_uvm(OE)
    nvidia_drm(OE)
    nvidia_modeset(OE)
    nvidia(OE)
    The (OE) suffix likely indicates these are out-of-tree (externally
    built) modules and NVIDIA is the only OE module I have.
  4. The error is occurring in a kernel thread named "nv_queue", which
    is likely an NVIDIA driver thread.
  5. The warnings are being triggered at include/linux/rwsem.h:80, which
    suggests there might be an issue with how the NVIDIA driver is
    handling read-write semaphores in the kernel.
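
The observations in this summary can be checked mechanically. Below is a minimal Python sketch, not part of the original report, that splits call-trace lines into frames and picks out the ones tagged with a module name such as [nvidia]. The names `Frame`, `parse_frames`, and both regular expressions are hypothetical helpers introduced only for illustration; the sample lines are copied from the trace pasted in this report. Frames prefixed with "?" are addresses the unwinder found on the stack but could not verify, so the sketch marks them as unreliable.

```python
import re
from typing import List, NamedTuple, Optional

# Leading dmesg timestamp, e.g. "[   50.485532]".
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]")

# A frame such as " ? gpumgrGetGpu+0x69/0xa0 [nvidia]" (timestamp already removed).
FRAME_RE = re.compile(
    r"^\s*(?P<guess>\?\s+)?(?P<symbol>[\w.]+)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


class Frame(NamedTuple):
    symbol: str
    offset: int            # offset into the function, in bytes
    module: Optional[str]  # None for frames in the core kernel
    reliable: bool         # False for frames printed with a leading "?"


def parse_frames(log_text: str) -> List[Frame]:
    """Return every call-trace frame found in log_text, in log order."""
    frames = []
    for line in log_text.splitlines():
        body = TIMESTAMP_RE.sub("", line, count=1)
        match = FRAME_RE.match(body)
        if match:
            frames.append(
                Frame(
                    symbol=match["symbol"],
                    offset=int(match["offset"], 16),
                    module=match["module"],
                    reliable=match["guess"] is None,
                )
            )
    return frames


# Sample lines copied from the trace in this report.
SAMPLE = """\
[   50.485517] Call Trace:
[   50.485524]  ? follow_pte+0x20b/0x220
[   50.485525]  follow_phys+0x4b/0x110
[   50.485532]  unmap_mapping_range+0x111/0x140
[   50.485532]  nv_revoke_gpu_mappings+0x67/0xb0 [nvidia]
[   50.485584]  RmHandleIdleSustained+0x39/0x130 [nvidia]
[   50.485678]  ? gpumgrGetGpu+0x69/0xa0 [nvidia]
[   50.485781]  rm_execute_work_item+0xe0/0x150 [nvidia]
[   50.486046]  kthread+0xcf/0x100
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        if frame.module == "nvidia" and frame.reliable:
            print(frame.symbol)
```

The sketch only understands the `symbol+0xOFF/0xSIZE [module]` frame format seen in this thread; other log lines (register dumps, the RIP line, the Code line) are ignored.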

To Reproduce

Boot a 6.10.9 kernel with the latest official NVIDIA driver and check the dmesg log.
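
For anyone trying to confirm the reproduction, here is a small Python sketch (an illustration, not something from the original report) that counts which kernel function each warning in a saved log was raised in. It assumes x86-64 logs where kernel-mode RIP lines carry the code segment 0010, as in the traces in this thread; `count_warning_sites` and `KERNEL_RIP_RE` are hypothetical names.

```python
import re
from collections import Counter

# Kernel-mode RIP lines look like "RIP: 0010:follow_pte+0x20b/0x220".
# User-mode RIP lines ("RIP: 0033:0x7d42ccb26274") carry no symbol and are skipped.
KERNEL_RIP_RE = re.compile(r"RIP: 0010:([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def count_warning_sites(log_text: str) -> Counter:
    """Count how often each kernel function appears as the faulting RIP."""
    return Counter(KERNEL_RIP_RE.findall(log_text))


# RIP lines copied from the two traces posted in this thread.
SAMPLE_LOG = """\
[   50.485511] RIP: 0010:follow_pte+0x20b/0x220
[   50.485517] PKRU: 55555554
2024-09-20T22:59:50.819113+02:00 bernard-desktop kernel: RIP: 0010:follow_pte+0x1d7/0x200
2024-09-20T22:59:50.819171+02:00 bernard-desktop kernel: RIP: 0033:0x7d42ccb26274
"""

if __name__ == "__main__":
    for symbol, hits in count_warning_sites(SAMPLE_LOG).most_common():
        print(f"{symbol}: {hits}")
```

On a live system the input text could come from the output of `dmesg` or `journalctl -k`; reading the kernel log may require elevated privileges.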

Bug Incidence

Always

nvidia-bug-report.log.gz


The nvidia-bug-report.log.gz above includes this trace, but I am also pasting it here for convenience:

[   50.485511] CPU: 14 PID: 1229 Comm: nv_queue Tainted: G        W  OE      6.10.9-amd64 #1  Debian 6.10.9-1
[   50.485511] Hardware name: LENOVO 83AG/LNVNB161216, BIOS MHCN42WW 03/25/2024
[   50.485511] RIP: 0010:follow_pte+0x20b/0x220
[   50.485512] Code: 00 00 00 c0 eb 8b 49 8b 3c 24 e8 00 bf 91 00 e8 bb 5e e1 ff bd ea ff ff ff 5b 89 e8 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc <0f> 0b e9 1e fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90
[   50.485513] RSP: 0018:ffffab47c10afb60 EFLAGS: 00010246
[   50.485513] RAX: 0000000000000000 RBX: 00007fcbe8b8e000 RCX: ffffab47c10afba0
[   50.485514] RDX: ffffab47c10afb98 RSI: 00007fcbe8b8e000 RDI: ffff9c8870728e70
[   50.485514] RBP: ffffab47c10afbe0 R08: ffffab47c10afd38 R09: 0000000000000000
[   50.485515] R10: 000000008040003c R11: 0000000000000000 R12: ffffab47c10afba0
[   50.485515] R13: ffffab47c10afb98 R14: ffff9c8874afb180 R15: 0000000000000000
[   50.485516] FS:  0000000000000000(0000) GS:ffff9c97b3300000(0000) knlGS:0000000000000000
[   50.485516] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.485517] CR2: 00007f80c9c7b6b4 CR3: 00000001a2ce6000 CR4: 0000000000f50ef0
[   50.485517] PKRU: 55555554
[   50.485517] Call Trace:
[   50.485518]  <TASK>
[   50.485518]  ? __warn+0x80/0x120
[   50.485519]  ? follow_pte+0x20b/0x220
[   50.485520]  ? report_bug+0x164/0x190
[   50.485521]  ? handle_bug+0x3c/0x80
[   50.485522]  ? exc_invalid_op+0x17/0x70
[   50.485523]  ? asm_exc_invalid_op+0x1a/0x20
[   50.485524]  ? follow_pte+0x20b/0x220
[   50.485525]  follow_phys+0x4b/0x110
[   50.485526]  untrack_pfn+0x57/0x120
[   50.485528]  unmap_single_vma+0xa6/0xe0
[   50.485529]  zap_page_range_single+0x122/0x1d0
[   50.485530]  unmap_mapping_range+0x111/0x140
[   50.485532]  nv_revoke_gpu_mappings+0x67/0xb0 [nvidia]
[   50.485584]  RmHandleIdleSustained+0x39/0x130 [nvidia]
[   50.485678]  ? gpumgrGetGpu+0x69/0xa0 [nvidia]
[   50.485781]  rm_execute_work_item+0xe0/0x150 [nvidia]
[   50.485882]  ? os_execute_work_item+0x19/0x80 [nvidia]
[   50.485934]  _main_loop+0x8f/0x150 [nvidia]
[   50.485991]  ? __pfx__main_loop+0x10/0x10 [nvidia]
[   50.486046]  kthread+0xcf/0x100
[   50.486048]  ? __pfx_kthread+0x10/0x10
[   50.486049]  ret_from_fork+0x31/0x50
[   50.486049]  ? __pfx_kthread+0x10/0x10
[   50.486050]  ret_from_fork_asm+0x1a/0x30
[   50.486051]  </TASK>
[   50.486052] ---[ end trace 0000000000000000 ]---
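
The "Tainted: G W OE" field in the header line of this trace can be decoded letter by letter. The sketch below is an illustration added for readers, not part of the report; it covers only the flags that appear in this thread plus P (the counterpart of G), and `TAINT_FLAGS` and `decode_taint` are hypothetical names. The meanings of O and E match the "Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE" line that a 6.11 kernel prints later in this thread.

```python
# Subset of kernel taint flags; only the letters relevant to this thread.
TAINT_FLAGS = {
    "G": "no proprietary module loaded (counterpart of P)",
    "P": "a proprietary module was loaded",
    "W": "the kernel has issued a warning",
    "O": "an out-of-tree (externally built) module was loaded",
    "E": "an unsigned module was loaded",
}


def decode_taint(field: str):
    """Return (flag, meaning) pairs for a taint field such as 'G        W  OE'."""
    return [
        (flag, TAINT_FLAGS.get(flag, "not covered by this sketch"))
        for flag in field
        if not flag.isspace()
    ]


if __name__ == "__main__":
    # Field copied from the "CPU: 14 PID: 1229 Comm: nv_queue" header line.
    for flag, meaning in decode_taint("G        W  OE"):
        print(f"{flag}: {meaning}")
```

Read this way, W only records that a warning fired, while O and E come from loading the externally built, unsigned NVIDIA modules.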

More Info

No response

@mashu mashu added the bug Something isn't working label Sep 16, 2024
@Tarballwalf

Due to this, X11/Xwayland seems to have stopped working on my end. I cannot even launch any Proton games on Xwayland.

@mashu mashu changed the title Nvidia 560.28.03-1 throwing kernel tained errors with linux kernels from 6.10.3 up to 6.10.9 or newer Nvidia 560.28.03-1 throwing kernel stack trace with linux kernels from 6.10.3 up to 6.10.9 or newer Sep 20, 2024
@veldenb

veldenb commented Sep 20, 2024

Same here on a 6.11 kernel when I try to enter sleep mode on Wayland (Ubuntu 24.10 beta):

2024-09-20T22:59:50.819111+02:00 bernard-desktop kernel: CPU: 27 UID: 0 PID: 15484 Comm: nvidia-sleep.sh Kdump: loaded Tainted: G           OE      6.11.0-7-generic #7-Ubuntu
2024-09-20T22:59:50.819112+02:00 bernard-desktop kernel: Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
2024-09-20T22:59:50.819112+02:00 bernard-desktop kernel: Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII DARK HERO, BIOS 3801 07/30/2021
2024-09-20T22:59:50.819113+02:00 bernard-desktop kernel: RIP: 0010:follow_pte+0x1d7/0x200
2024-09-20T22:59:50.819113+02:00 bernard-desktop kernel: Code: 48 81 e2 00 00 00 c0 48 09 c2 48 f7 d2 48 85 fa 75 30 e8 1c e4 ff ff 48 8b 15 d5 28 92 01 48 81 e2 00 00 00 c0 e9 7b ff ff ff <0f> 0b e9 56 fe ff ff 48 8b 45 d0 48 8b 38 e8 46 03 e9 00 e8 31 be
2024-09-20T22:59:50.819113+02:00 bernard-desktop kernel: RSP: 0018:ffffb0bb0708f770 EFLAGS: 00010246
2024-09-20T22:59:50.819114+02:00 bernard-desktop kernel: RAX: 0000000000000000 RBX: 0000713de4a06000 RCX: ffffb0bb0708f7c0
2024-09-20T22:59:50.819114+02:00 bernard-desktop kernel: RDX: ffffb0bb0708f7b8 RSI: 0000713de4a06000 RDI: ffff9077da98a398
2024-09-20T22:59:50.819115+02:00 bernard-desktop kernel: RBP: ffffb0bb0708f7a8 R08: ffffb0bb0708f978 R09: 0000000000000000
2024-09-20T22:59:50.819115+02:00 bernard-desktop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffb0bb0708f808
2024-09-20T22:59:50.819115+02:00 bernard-desktop kernel: R13: 0000000000000000 R14: ffffb0bb0708f7b8 R15: ffff9077d24c9080
2024-09-20T22:59:50.819116+02:00 bernard-desktop kernel: FS:  00007d42cce13740(0000) GS:ffff907ecef80000(0000) knlGS:0000000000000000
2024-09-20T22:59:50.819116+02:00 bernard-desktop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-09-20T22:59:50.819116+02:00 bernard-desktop kernel: CR2: 0000000086d79000 CR3: 0000000110136000 CR4: 0000000000f50ef0
2024-09-20T22:59:50.819117+02:00 bernard-desktop kernel: PKRU: 55555554
2024-09-20T22:59:50.819117+02:00 bernard-desktop kernel: Call Trace:
2024-09-20T22:59:50.819125+02:00 bernard-desktop kernel:  <TASK>
2024-09-20T22:59:50.819125+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819126+02:00 bernard-desktop kernel:  ? show_trace_log_lvl+0x273/0x310
2024-09-20T22:59:50.819126+02:00 bernard-desktop kernel:  ? show_trace_log_lvl+0x273/0x310
2024-09-20T22:59:50.819128+02:00 bernard-desktop kernel:  ? follow_phys+0x4c/0x110
2024-09-20T22:59:50.819129+02:00 bernard-desktop kernel:  ? show_regs.part.0+0x22/0x30
2024-09-20T22:59:50.819129+02:00 bernard-desktop kernel:  ? show_regs.cold+0x8/0x10
2024-09-20T22:59:50.819129+02:00 bernard-desktop kernel:  ? follow_pte+0x1d7/0x200
2024-09-20T22:59:50.819130+02:00 bernard-desktop kernel:  ? __warn.cold+0xa7/0x101
2024-09-20T22:59:50.819130+02:00 bernard-desktop kernel:  ? follow_pte+0x1d7/0x200
2024-09-20T22:59:50.819130+02:00 bernard-desktop kernel:  ? report_bug+0x114/0x160
2024-09-20T22:59:50.819131+02:00 bernard-desktop kernel:  ? handle_bug+0x51/0xa0
2024-09-20T22:59:50.819131+02:00 bernard-desktop kernel:  ? exc_invalid_op+0x18/0x80
2024-09-20T22:59:50.819131+02:00 bernard-desktop kernel:  ? asm_exc_invalid_op+0x1b/0x20
2024-09-20T22:59:50.819132+02:00 bernard-desktop kernel:  ? follow_pte+0x1d7/0x200
2024-09-20T22:59:50.819132+02:00 bernard-desktop kernel:  follow_phys+0x4c/0x110
2024-09-20T22:59:50.819132+02:00 bernard-desktop kernel:  untrack_pfn+0x55/0x130
2024-09-20T22:59:50.819132+02:00 bernard-desktop kernel:  unmap_single_vma+0xbc/0xf0
2024-09-20T22:59:50.819133+02:00 bernard-desktop kernel:  zap_page_range_single+0x138/0x210
2024-09-20T22:59:50.819133+02:00 bernard-desktop kernel:  unmap_mapping_range+0x119/0x140
2024-09-20T22:59:50.819133+02:00 bernard-desktop kernel:  nv_revoke_gpu_mappings_locked+0x46/0x80 [nvidia]
2024-09-20T22:59:50.819134+02:00 bernard-desktop kernel:  nv_set_system_power_state+0x1d6/0x480 [nvidia]
2024-09-20T22:59:50.819134+02:00 bernard-desktop kernel:  nv_procfs_write_suspend+0x102/0x1b0 [nvidia]
2024-09-20T22:59:50.819134+02:00 bernard-desktop kernel:  proc_reg_write+0x6c/0xb0
2024-09-20T22:59:50.819135+02:00 bernard-desktop kernel:  vfs_write+0x107/0x490
2024-09-20T22:59:50.819135+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819135+02:00 bernard-desktop kernel:  ksys_write+0x71/0x100
2024-09-20T22:59:50.819136+02:00 bernard-desktop kernel:  __x64_sys_write+0x19/0x30
2024-09-20T22:59:50.819136+02:00 bernard-desktop kernel:  x64_sys_call+0x7e/0x22b0
2024-09-20T22:59:50.819136+02:00 bernard-desktop kernel:  do_syscall_64+0x7e/0x170
2024-09-20T22:59:50.819136+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819137+02:00 bernard-desktop kernel:  ? __do_sys_newfstat+0x76/0x80
2024-09-20T22:59:50.819159+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819160+02:00 bernard-desktop kernel:  ? syscall_exit_to_user_mode+0x4e/0x250
2024-09-20T22:59:50.819160+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819160+02:00 bernard-desktop kernel:  ? do_syscall_64+0x8a/0x170
2024-09-20T22:59:50.819160+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819161+02:00 bernard-desktop kernel:  ? filp_flush+0x57/0x90
2024-09-20T22:59:50.819161+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819162+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819162+02:00 bernard-desktop kernel:  ? syscall_exit_to_user_mode+0x4e/0x250
2024-09-20T22:59:50.819163+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819163+02:00 bernard-desktop kernel:  ? do_syscall_64+0x8a/0x170
2024-09-20T22:59:50.819163+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819169+02:00 bernard-desktop kernel:  ? irqentry_exit_to_user_mode+0x43/0x250
2024-09-20T22:59:50.819170+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819170+02:00 bernard-desktop kernel:  ? irqentry_exit+0x43/0x50
2024-09-20T22:59:50.819170+02:00 bernard-desktop kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
2024-09-20T22:59:50.819171+02:00 bernard-desktop kernel:  ? exc_page_fault+0x96/0x1c0
2024-09-20T22:59:50.819171+02:00 bernard-desktop kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
2024-09-20T22:59:50.819171+02:00 bernard-desktop kernel: RIP: 0033:0x7d42ccb26274
2024-09-20T22:59:50.819172+02:00 bernard-desktop kernel: Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d f5 2d 0f 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
2024-09-20T22:59:50.819172+02:00 bernard-desktop kernel: RSP: 002b:00007ffef22725d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
2024-09-20T22:59:50.819172+02:00 bernard-desktop kernel: RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007d42ccb26274
2024-09-20T22:59:50.819173+02:00 bernard-desktop kernel: RDX: 0000000000000008 RSI: 00005f6c7f01e520 RDI: 0000000000000001
2024-09-20T22:59:50.819173+02:00 bernard-desktop kernel: RBP: 00007ffef2272600 R08: 0000000000000000 R09: 0000000000000001
2024-09-20T22:59:50.819174+02:00 bernard-desktop kernel: R10: 00005f6c7f01e510 R11: 0000000000000202 R12: 0000000000000008
2024-09-20T22:59:50.819174+02:00 bernard-desktop kernel: R13: 00005f6c7f01e520 R14: 00007d42ccc125c0 R15: 00007d42ccc0fea0
2024-09-20T22:59:50.819179+02:00 bernard-desktop kernel:  </TASK>
2024-09-20T22:59:50.819179+02:00 bernard-desktop kernel: ---[ end trace 0000000000000000 ]---
2024-09-20T22:59:50.819179+02:00 bernard-desktop kernel: ------------[ cut here ]------------

nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:0B:00.0  On |                  N/A |
|  0%   37C    P8             18W /  320W |     535MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

@birdie-github

birdie-github commented Sep 25, 2024

This has been known for months: see #662

No need to create dupes.

Though actually it may help NVIDIA prioritize fixing this bug because it's annoying as hell.

I got 40KB worth of back traces on every suspend, and now I simply power off the PC entirely, since I got fed up with this.

@MaxKh

MaxKh commented Sep 26, 2024

> now I simply power off the PC entirely, since I got fed up with this.

Same here. 560.35.03 and earlier.
Arch Linux, GeForce RTX 3050 Ti Laptop

@veldenb

veldenb commented Sep 26, 2024

> This has been known for months: see #662
>
> No need to create dupes.
>
> Though actually it may help NVIDIA prioritize fixing this bug because it's annoying as hell.
>
> I got 40KB worth of back traces on every suspend, and now I simply power off the PC entirely, since I got fed up with this.

Good to know, following #662 :)

@mashu

mashu commented Sep 26, 2024

> This has been known for months: see #662
>
> No need to create dupes.
>
> Though actually it may help NVIDIA prioritize fixing this bug because it's annoying as hell.
>
> I got 40KB worth of back traces on every suspend, and now I simply power off the PC entirely, since I got fed up with this.

It's important to keep issues distinct and avoid mislabeling them as duplicates without clear evidence.
If you're experiencing suspend-related problems, it would be best to discuss those in a thread specifically addressing that issue, rather than here.

This bug report has a totally different stack trace signature than #662, and the original report didn't mention any suspend-related issues.
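
One way to make "stack trace signature" concrete is to compare the verified frames of two traces from the faulting end outward and see where they part ways. The Python sketch below is an added illustration, not the author's method. It compares the two traces posted in this thread (the trace from #662 is not reproduced here, so it is not part of the comparison), and `split_at_divergence` is a hypothetical helper name.

```python
def split_at_divergence(trace_a, trace_b):
    """Return (shared leading frames, rest of trace_a, rest of trace_b)."""
    shared = 0
    for frame_a, frame_b in zip(trace_a, trace_b):
        if frame_a != frame_b:
            break
        shared += 1
    return trace_a[:shared], trace_a[shared:], trace_b[shared:]


# Verified frames (no leading "?") from the original report, innermost first.
ORIGINAL_REPORT = [
    "follow_phys", "untrack_pfn", "unmap_single_vma",
    "zap_page_range_single", "unmap_mapping_range",
    "nv_revoke_gpu_mappings", "RmHandleIdleSustained",
    "rm_execute_work_item", "_main_loop", "kthread",
]

# Verified frames from the suspend-time trace posted on Sep 20, innermost first.
SUSPEND_REPORT = [
    "follow_phys", "untrack_pfn", "unmap_single_vma",
    "zap_page_range_single", "unmap_mapping_range",
    "nv_revoke_gpu_mappings_locked", "nv_set_system_power_state",
    "nv_procfs_write_suspend", "proc_reg_write", "vfs_write",
]

if __name__ == "__main__":
    common, rest_a, rest_b = split_at_divergence(ORIGINAL_REPORT, SUSPEND_REPORT)
    print("shared:", " <- ".join(common))
    print("original report continues with:", rest_a[0])
    print("suspend report continues with:", rest_b[0])
```

For these two traces the kernel-side frames up to unmap_mapping_range are identical, while the driver-side callers differ: an idle handler run from the nv_queue worker thread in the original report, and a write to the suspend procfs entry in the later one.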

@Theluga

Theluga commented Sep 26, 2024

I don't know whether #662 is related, but on the closed-source side this error is already known to Nvidia (see the Nvidia forum).

I have the same problem on Arch Linux, and my workaround was to temporarily use the linux-lts kernel (6.6.52-1-lts).

The problem has been reported on the Arch forums since July (see the Arch Forum).

I hope they fix this soon, because on 6.10 the Nvidia drivers, open or not, basically cannot run without errors.

@tekstryder

I can confirm this issue is indeed resolved with the API changes in kernel 6.12.

  • Arch Linux | Kernel 6.12.1
  • nVidia 565.57.01
