6.12: drm_open_helper RIP #712
Comments
What happens if you revert that upstream kernel change? That is what I did before, and it made the drivers compile without additional patches: https://gitlab.manjaro.org/packages/core/linux612/-/blob/ec1f53f77fd3f92f7cd4eeed444a341d8ded3291/revert-nvidia-446d0f48.patch |
Thanks! Tracked internally as NV bug 4888621. |
This may be related to commit 641bb4394f40 ("fs: move FMODE_UNSIGNED_OFFSET to fop_flags"). At least for nvidia-470xx it's fixed by adding the |
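To make the commit named above easier to picture, here is a small userspace model in C. It assumes, as I understand that commit, that the DRM open path stopped setting a per-file mode bit and instead checks a flag declared on the driver's file_operations table, rejecting the open when the flag is absent. Every `mock_` name and the flag's bit value are hypothetical stand-ins, not kernel definitions, and the fix that is cut off in the comment above is not reconstructed here.

```c
#include <errno.h>
#include <stdio.h>

/* Userspace model only: none of these types are the kernel's real ones. */
typedef unsigned int mock_fop_flags_t;

/* Illustrative bit; the real flag's value is defined by the kernel headers. */
#define MOCK_FOP_UNSIGNED_OFFSET (1u << 0)

struct mock_file_operations {
    const char *owner_name;     /* label used only for the warning text */
    mock_fop_flags_t fop_flags; /* static, per-driver capability flags */
};

struct mock_file {
    const struct mock_file_operations *f_op;
};

/* Models the assumed post-change open path: refuse files whose
 * file_operations table does not declare the unsigned-offset flag. */
static int mock_drm_open_helper(const struct mock_file *filp)
{
    if (!(filp->f_op->fop_flags & MOCK_FOP_UNSIGNED_OFFSET)) {
        fprintf(stderr, "WARN: %s: fop_flags lacks the unsigned-offset flag\n",
                filp->f_op->owner_name);
        return -EINVAL;
    }
    return 0;
}

/* A driver written against the old convention never declares the flag. */
static const struct mock_file_operations old_style_fops = {
    .owner_name = "old-style-driver",
    .fop_flags = 0,
};

/* A driver updated for the new convention declares it statically. */
static const struct mock_file_operations new_style_fops = {
    .owner_name = "new-style-driver",
    .fop_flags = MOCK_FOP_UNSIGNED_OFFSET,
};
```

Calling `mock_drm_open_helper` on a `mock_file` that points at `old_style_fops` fails with `-EINVAL`, while one that points at `new_style_fops` succeeds. In this model that is the difference between a driver that opens its device node and one that cannot.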
@joanbm
diff --git a/kernel-open/nvidia-uvm/uvm_hmm.c b/kernel-open/nvidia-uvm/uvm_hmm.c
index 93e64424..dc64184e 100644
--- a/kernel-open/nvidia-uvm/uvm_hmm.c
+++ b/kernel-open/nvidia-uvm/uvm_hmm.c
@@ -2694,7 +2694,7 @@ static NV_STATUS dmamap_src_sysmem_pages(uvm_va_block_t *va_block,
                 continue;
             }
 
-            if (PageSwapCache(src_page)) {
+            if (folio_test_swapcache(page_folio(src_page))) {
                 // TODO: Bug 4050579: Remove this when swap cached pages can be
                 // migrated.
                 status = NV_WARN_MISMATCHED_TARGET; |
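The diff above swaps the page-based helper for the folio-based one unconditionally. For readers who need one source tree to build against kernels on both sides of the change, the following is a hedged sketch of a version-gated wrapper. It is a userspace mock: the `MOCK_` macros, the `mock_` types and `compat_page_in_swapcache` are hypothetical names for illustration, and the 6.12 cutoff is taken from this thread rather than verified against every kernel tree. It is not how the NVIDIA sources detect kernel features.

```c
#include <stdbool.h>

/* Mock of the kernel's version macros so this compiles in userspace. */
#define MOCK_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#ifndef MOCK_LINUX_VERSION_CODE
#define MOCK_LINUX_VERSION_CODE MOCK_KERNEL_VERSION(6, 12, 0)
#endif

/* Minimal stand-ins for struct folio and struct page. */
struct mock_folio { bool swapcache; };
struct mock_page  { struct mock_folio *folio; };

/* Stand-in for page_folio(): the folio that contains the page. */
static struct mock_folio *mock_page_folio(struct mock_page *page)
{
    return page->folio;
}

/* Stand-in for folio_test_swapcache(). */
static bool mock_folio_test_swapcache(struct mock_folio *folio)
{
    return folio->swapcache;
}

#if MOCK_LINUX_VERSION_CODE < MOCK_KERNEL_VERSION(6, 12, 0)
/* Stand-in for the removed PageSwapCache(); only exists on "old" kernels. */
static bool mock_PageSwapCache(struct mock_page *page)
{
    return page->folio->swapcache;
}
#endif

/* One call site for the driver, whichever helper the kernel provides. */
static bool compat_page_in_swapcache(struct mock_page *page)
{
#if MOCK_LINUX_VERSION_CODE >= MOCK_KERNEL_VERSION(6, 12, 0)
    return mock_folio_test_swapcache(mock_page_folio(page));
#else
    return mock_PageSwapCache(page);
#endif
}
```

Defining `MOCK_LINUX_VERSION_CODE` to an older value at compile time selects the other branch. A feature test at build time is usually more robust than a version check, because distributions backport changes.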
The bug hit production for me: the internal laptop display suddenly stopped working. If I detach the computer from HDMI, I have no screen at all. That makes the bug critical, as it renders the system unusable. More importantly, it exposes a big flaw in how critical bugs are handled and prevented, project-management-wise. Critical bugs like these need to be visually separated from the rest, for example with a tag, and while they are open, all effort should go into fixing them before coding anything else. Adding two lines of code that someone else already wrote for you, for a critical bug, should not take two months. |
If you are referring to that patch, it is already included (in a slightly different form) in the production branch of the drivers, i.e. 550.135, released November 20. Beta drivers like 565.57.01 are, well, betas and may not see immediate fixes. I would also generally avoid brand-new kernel branches and give them some time, unless you are OK with being the tester; ideally, use long-term-support branches. |
@ionenwks Okay, this is what happened: I visited the downloads repo and opened latest.txt. That document said "550.135", yet the repo held three major versions after that: 555, 560 and 565. Hence I assumed that latest.txt was outdated or something.
A small improvement would probably be to name that file more explicitly, like "stable.txt". I also see that the versions are explained on a different page, here. Maybe this info needs to be in the repo itself, in the form of text files like "1-stable.txt", "2-new-features.txt" and "3-experimental.txt", or as a reference to that page, or both.
But if you ask me, I wouldn't have multiple "flavors" of the driver. Instead I would adopt the model "stable base with some beta on it": release often, and only modify code further once known bugs have been fixed or decided on, i.e. once the small new beta code is no longer beta.
Next, I don't see this patch applied in the 550 branch code, so it seems I would have hit the critical bug anyway.
Finally, using older code, like an older kernel, counterintuitively tends to lead to less stability. When you wait too long to update, bugs accumulate and you have to deal with all of them at once. When you update more often, you can handle the bugs "drop by drop". |
@es20490446e For now, only the 550 series more or less supports 6.12. Older Nvidia drivers, 560 (which no longer gets updates) and the latest 565 have not been updated yet; they are patched by the user community instead, especially around Arch-based distros.
Adding patches is probably not the problem here. The harder part is verifying that they don't create regressions and testing them against a wide range of Nvidia hardware. QA normally takes time, especially with out-of-tree kernel drivers. Nvidia is also known not to accept issues from unreleased kernels, such as RC kernels. It is normally critical to look at least at the RC1 releases and to give users patches for testing as early as possible during the development cycle.
When AMD switched to FOSS and in-tree drivers, support got better. Their kernel developers are also very active on Discord and other public channels, working with the user and developer community to fix issues quickly. Maybe Nvidia could change its policies and adopt one or two things other companies do. The latest example of code changes needed in the open drivers for 6.13 can be seen here: #746
It also seems that modesetting needs to be active with 6.12+ kernels. Not all drivers default to that setting yet. |
@philmmanjaro Oki, thanks for the info 👍 |
OpenSUSE just rolled out kernel 6.12 and this bug is still in effect when using the drivers from the cuda repository maintained by nvidia. I'll update if I find out how/if it's intended to work |
@pallaswept I have the same issue. If you find a solution, please share it. Thank you! |
Same issue for me on Tumbleweed with 6.12 kernel using the cuda 565 drivers. |
It seems this also affects closed source drivers? I zypper dup'ed yesterday to tumbleweed 20241226 using linux kernel 6.12 and now have a black screen too.
How would I downgrade "the client"? All packages I see regarding v550 is
I did a I'm willing to change to the open source driver once I find my 1650 SUPER is supported, but it seems it won't help? |
The bad news is, your PC is schizophrenic 😆 The good news is, that's not this bug, and you're happy using 550 drivers which are supported, so you can probably fix it. |
Thanks, let me summarize the current status before I go: found that besides the 37489 build warnings of the nvidia module I somehow missed the one error that dracut also didn't bother to complain about but silently used the old 550.90 module:
ugly workaround to inject the fix (member is not used) in the build until tumbleweed releases a proper fix:
|
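A stale module like the 550.90 one mentioned above can go unnoticed. One guard is to compare the version of the loaded module against the version that was just built. The sketch below, in C, shows only the comparison step: `compare_dotted_versions` is a hypothetical helper, and how the two version strings are obtained is deliberately left out. It compares each dot-separated component as a number, because a plain string comparison would rank "550.90" above "550.135".

```c
#include <stdlib.h>

/* Compare dotted numeric versions such as "550.90" and "550.135".
 * Returns <0, 0 or >0 like strcmp, but compares each dot-separated
 * component as a number. Missing components count as 0. */
static int compare_dotted_versions(const char *a, const char *b)
{
    while (*a != '\0' || *b != '\0') {
        char *end_a;
        char *end_b;
        long part_a = strtol(a, &end_a, 10);
        long part_b = strtol(b, &end_b, 10);

        if (part_a != part_b)
            return part_a < part_b ? -1 : 1;

        /* Neither side consumed a digit: non-numeric input, stop here. */
        if (end_a == a && end_b == b)
            break;

        /* Step past the separator, if there is one. */
        a = (*end_a == '.') ? end_a + 1 : end_a;
        b = (*end_b == '.') ? end_b + 1 : end_b;
    }
    return 0;
}
```

A negative result for `compare_dotted_versions(loaded, built)` would mean the running module is older than the one on disk, which is the situation described in the comment above.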
Is there any workaround to be able to use internal laptop monitor with full nvidia graphics? No nvidia drivers will populate on my laptop screen any longer. |
@username227 would you mind sharing the logs collected by the nvidia-bug-report.sh script? |
Attached. Generated while running in hybrid mode since nvidia mode gives only black screen. |
…ge_folio())) Commit "mm: remove PageSwapCache" on Wed, 21 Aug 2024: torvalds/linux@32f51ea Thanks to / found on NVIDIA/open-gpu-kernel-modules#712
NVIDIA Open GPU Kernel Modules Version
ed4be64
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
CachyOS (ArchLinux)
Kernel Release
6.12.0rc1
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 4070 SUPER (UUID: GPU-8c5baf85-cb1f-fe26-95d5-ff3fd51249bb)
Describe the bug
Since the 6.12.0rc1 release, the kernel's drm_open_helper has been crashing with the 560.35.03 drivers.
The following patches, extracted from the 550.120 release, were pulled in to make the driver compatible with 6.12:
drm_fbdev fixup for 6.11+: https://github.com/CachyOS/kernel-patches/blob/master/6.12/misc/nvidia/0004-6.11-Add-fix-for-fbdev.patch
drm_output_poll_changed for 6.12: https://github.com/CachyOS/kernel-patches/blob/master/6.12/misc/nvidia/0005-6.12-drm_outpull_pill-changed-check.patch
An additional patch to keep the module compiling (needed because of commit torvalds/linux@32f51ea):
With these patches the DKMS compilation succeeds and the driver works fine with the 6.11.x kernel.
Booting into 6.12.0rc1 makes the driver crash at drm_open_helper, and no graphical interface is available anymore. The tty works fine.
The following is visible in the dmesg log:
To Reproduce
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response