nv_module_resources_init tried to execute NX-protected page #765

Open
2 tasks done
SyntheticBird45 opened this issue Jan 9, 2025 · 5 comments
Labels
bug Something isn't working

SyntheticBird45 commented Jan 9, 2025

NVIDIA Open GPU Kernel Modules Version

565.77

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Artix Linux

Kernel Release

Linux unknown 6.11.10-hardened1-1-hardened

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4070 (nvidia-smi doesn't work because the driver can't load)

Describe the bug

On Linux kernel 6.11.10 (and 6.12.6) with the linux-hardened patches applied, built with Clang CFI and ThinLTO, the open driver tries to execute an NX-protected page.

I was able to reproduce this on 6.12.6, but another, unrelated bug prevents me from using any GPU on that kernel.
I downgraded to 6.11.10, but the issue stays the same.
This issue does NOT happen on the official linux-hardened Arch Linux package, which is compiled with GCC.

Kernel stack trace (truncated for privacy concerns, can add additional details if required):
dmesg_nvidia_open.log

To Reproduce

  • Download a release from https://github.com/anthraxx/linux-hardened/releases.
  • Configure the kernel to build with Clang, and enable CFI and ThinLTO.
  • Compile and install it.
  • Install the NVIDIA open driver through DKMS or manually.
  • Run sudo modprobe nvidia
  • The command should print Killed
  • dmesg will show the stack trace attached above.
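
Before loading the module, it may help to confirm that the kernel was really built with the options named in the steps above. The following is a minimal sketch, not part of any official tooling: check_cfi_lto_config is a hypothetical helper, and the option names (CONFIG_CC_IS_CLANG, CONFIG_LTO_CLANG_THIN, CONFIG_CFI_CLANG) are assumed to be the mainline Kconfig symbols used by the 6.11 and 6.12 kernels discussed here.

```shell
# Hypothetical helper: report whether a kernel .config enables Clang,
# ThinLTO and CFI. Pass the path to the .config file as the first argument.
# Returns 0 only if all three options are set to "y".
check_cfi_lto_config() {
    config_file="$1"
    status=0
    # Option names are assumptions based on mainline Kconfig for 6.11/6.12.
    for opt in CONFIG_CC_IS_CLANG CONFIG_LTO_CLANG_THIN CONFIG_CFI_CLANG; do
        if grep -q "^${opt}=y" "$config_file"; then
            echo "$opt: enabled"
        else
            echo "$opt: not enabled"
            status=1
        fi
    done
    return "$status"
}
```

If the distribution ships the kernel configuration alongside the build tree, it could be called as check_cfi_lto_config "/usr/lib/modules/$(uname -r)/build/.config"; that location is an assumption and varies between distributions.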

Bug Incidence

Always

nvidia-bug-report.log.gz

No point, since the NVIDIA driver does not load at all and the issue is not related to a runtime feature. Also, I don't like sharing my whole PCI tree on public GitHub.

More Info

No response

@SyntheticBird45 SyntheticBird45 added the bug Something isn't working label Jan 9, 2025

TheBetterSolution commented Jan 11, 2025

Hi @SyntheticBird45, I just found that nv_module_resources_init calls the ktime_get_raw_ts64 function. I think ktime_get_raw_ts64 lives in another module, because this exception is raised when executing code at an invalid address (which may be in the address space of another module). I suggest removing the NV_KTIME_GET_RAW_TS64_PRESENT macro and building again; please refer to:

#if !defined(NV_KTIME_GET_RAW_TS64_PRESENT)

Just a suggestion for your reference.

SyntheticBird45 (Author) commented

Thanks @TheBetterSolution, I'll try soon

SyntheticBird45 (Author) commented

I tried commenting out the macro, but the build then failed with a flood of compile errors (see the log below):

compilerror.txt

I also tried to #define NV_KTIME_GET_RAW_TS64_PRESENT but, as one could have guessed, the kernel just hung at boot.


TheBetterSolution commented Jan 12, 2025

Sorry about that, @SyntheticBird45.
I looked into the calls in nv_module_resources_init that execute code outside the module (using the offset from the trace as a reference). My guess is that the exception is raised at ktime_get_raw_ts64, reached through NV_KMEM_CACHE_CREATE:

nvidia_stack_t_cache = NV_KMEM_CACHE_CREATE(nvidia_stack_cache_name,

But I can't be sure that's the root cause.

I think the following code would need to change in order to remove NV_KTIME_GET_RAW_TS64_PRESENT:
https://github.com/NVIDIA/open-gpu-kernel-modules/blob/9d0b0414a5304c3679c5db9d44d2afba8e58cc1b/kernel-open/conftest.sh#L4663C1-L4668C90

But please don't try it yet; we can wait for the official response.

aritger (Collaborator) commented Jan 15, 2025

Hi. I don't know if it is related, but the mention of CFI here reminded me of this issue:
#439

In that issue, the problem is that the core of nvidia.ko isn't being built by kbuild, and thus isn't getting kbuild's extra CFLAGS for CFI. In the current issue, I don't think we're even getting far enough to execute any of the "core" of nvidia.ko, so there is probably something else going on here. But, you may trip on issue/439 once we resolve the current problem. Or, something about the core of nvidia.ko not being built with kbuild's CFLAGS could be confusing things and triggering the current problem.

I see:

[ 1543.298921] Tainted: [O]=OOT_MODULE, [T]=RANDSTRUCT

It might be a useful experiment to test without RANDSTRUCT.

Looking at your call trace:

[ 1543.298943]  ? asm_exc_page_fault+0x26/0x30
[ 1543.298945]  ? nv_module_resources_init+0x80/0x160 [nvidia c7e7bb6b8f2be675d60a29e17e81cd37292395a3]
[ 1543.299005]  ? nv_module_init+0xf/0x150 [nvidia c7e7bb6b8f2be675d60a29e17e81cd37292395a3]
[ 1543.299047]  ? init_module+0x102/0x2cb [nvidia c7e7bb6b8f2be675d60a29e17e81cd37292395a3]

And @TheBetterSolution's speculation, I guess you are concerned about this path?

    nv_module_resources_init()
     NV_KMEM_CACHE_CREATE()
      nv_kmem_cache_create()
       nv_ktime_get_raw_ns()
        ktime_get_raw_ts64()

I suppose you could check the Module.symvers for your kernel (typically /usr/lib/modules/$(uname -r)/build/Module.symvers) to find what module provides ktime_get_raw_ts64. E.g., for me:

$ grep ktime_get_raw_ts64 ./build/Module.symvers
0x00000000      ktime_get_raw_ts64      vmlinux EXPORT_SYMBOL

(i.e., part of vmlinux, not a separate kernel module)
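
The same lookup can be wrapped in a small function so it is easy to repeat for other symbols on the suspected path. This is only a sketch: symbol_provider is a hypothetical name, and it assumes the usual Module.symvers layout in which the second whitespace-separated column is the symbol and the third is the providing module, as in the grep output above.

```shell
# Hypothetical helper: print the module that exports a given symbol.
# $1 = symbol name
# $2 = path to Module.symvers (optional; defaults to the running kernel's
#      build directory, which is an assumption about the install layout)
symbol_provider() {
    symbol="$1"
    symvers="${2:-/usr/lib/modules/$(uname -r)/build/Module.symvers}"
    # Column 2 is the symbol, column 3 the module that exports it.
    awk -v sym="$symbol" '$2 == sym { print $3 }' "$symvers"
}
```

For example, symbol_provider ktime_get_raw_ts64 prints vmlinux when the symbol is built into the kernel image, and prints nothing when the symbol is not exported at all.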

Was there specific reason to suspect ktime_get_raw_ts64(), or was that just speculation?

It might be easiest to sprinkle some printks in nv_module_resources_init() and its callees to determine where exactly the asm_exc_page_fault is happening.

printk(KERN_ERR "file:%s, line:%d\n", __FILE__, __LINE__);
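
As a complement to printk instrumentation, the function+offset entries in the call trace could be mapped back to source lines. The sketch below only extracts those entries for one module from a trace; module_frames is a hypothetical helper, and it assumes trace lines end with a bracketed module name followed by a space and a build ID, as in the trace quoted in this comment. It relies on grep -o, which GNU and BSD grep provide.

```shell
# Hypothetical helper: read a kernel call trace on stdin and print the
# "function+offset/size" token of every frame that belongs to one module.
# $1 = module name as it appears in the trace (e.g. nvidia)
module_frames() {
    module="$1"
    # Keep only frames tagged "[<module> <build id>]", then pull out the
    # symbol+offset/size token from each of them.
    grep -F "[$module " |
        grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+'
}

# Possible usage (paths are placeholders, and nvidia.ko must have been
# built with debug info for this to resolve line numbers):
#   dmesg | module_frames nvidia | xargs /path/to/linux/scripts/faddr2line /path/to/nvidia.ko
```

The kernel source tree ships scripts/faddr2line, which accepts an object file followed by function+offset/size arguments in exactly the form that appears in the trace.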
