RTD3 doesn't allow the GPU to sleep after a monitor has been plugged and unplugged on PRIME reverse sync #759

Open
1 of 2 tasks
Aetherall opened this issue Dec 29, 2024 · 3 comments
Labels
bug Something isn't working

Comments

Aetherall commented Dec 29, 2024

NVIDIA Open GPU Kernel Modules Version

NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 565.77 srcversion: 0BDAE46B2642DAFAAF16C9C

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

NixOS unstable

Kernel Release

Linux 6.12.6 NixOS SMP PREEMPT_DYNAMIC Thu Dec 19 17:13:24 UTC 2024 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU

Describe the bug

The GPU power state stays locked in D0 after a monitor has been plugged into and then unplugged from the GPU.

I am using reverse sync on an Advanced Optimus laptop to allow hotplugging monitors on the GPU's HDMI/DP ports.

After a clean boot, and before plugging in an external monitor, the NVIDIA GPU is in the D3cold state.

> cat /sys/class/drm/card*/device/power_state
D0
D3cold
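
The bare cat output above does not say which line belongs to which GPU. A minimal sketch that labels each state with the card and its kernel driver; the function name and the optional root argument are my own (hypothetical) additions, only the sysfs attribute comes from the command above:

```shell
# Print "<card> <driver> <power_state>" for every DRM card.
# The optional first argument (an assumption, useful for testing) overrides
# the sysfs root; it defaults to /sys/class/drm.
gpu_power_states() {
    root="${1:-/sys/class/drm}"
    for dev in "$root"/card*/device; do
        # Skips unexpanded globs and entries without a power_state attribute.
        [ -r "$dev/power_state" ] || continue
        card=$(basename "$(dirname "$dev")")
        driver=unknown
        if [ -L "$dev/driver" ]; then
            # The driver symlink points at the bound kernel driver directory.
            driver=$(basename "$(readlink "$dev/driver")")
        fi
        printf '%s %s %s\n' "$card" "$driver" "$(cat "$dev/power_state")"
    done
}

# Usage: gpu_power_states
```

On this setup I would expect one line for the integrated GPU and one for the nvidia card, which makes it easier to watch the right one.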

I can power on the NVIDIA card using nvidia-smi or by running glxinfo, and the power state switches to D0 before going back to sleep as intended.
The issue arises when a monitor is plugged into the GPU. This wakes the GPU so that reverse PRIME can render to the external display.
However, after unplugging the monitor, the power state never leaves D0, even though no process is running on the GPU.

I found related posts on the nvidia forums,
https://forums.developer.nvidia.com/t/nvidia-dgpu-in-hybrid-optimus-laptop-not-powering-down-after-unplugging-external-monitor/318196
https://forums.developer.nvidia.com/t/565-release-feedback-discussion/310777/154
https://forums.developer.nvidia.com/t/565-release-feedback-discussion/310777/41
and most importantly https://forums.developer.nvidia.com/t/bug-linux-driver-fails-to-remove-framebuffer-device-when-hdmi-cable-plugged-out/316645

In the last one, gm151 noticed that the framebuffer created when plugging the monitor is never cleaned up.

When the cable is plugged in a new framebuffer device is created as it should, however when the cable is plugged out, the device is NOT removed even with no clients using it. This has several negative consequences:

  • If the virtual console is remapped to the new framebuffer, then after plugging out, the console is NOT remapped back to the integrated GPU. (This can be inhibited by passing fbcon=map:0, however this does not help the framebuffer to get removed)
  • The DGPU device fails to enter D3cold state and consumes power.
    Here are some facts from the kernel’s sysfs. Note this is WITHOUT any graphical environment running, pure text console, ruling out the graphical env as a culprit.
    Also, a drm journal log was emitted on plug [drm] fb1: nvidia-drmdrmfb frame buffer device but nothing on unplug.

I am facing the same situation and did some more testing.

I can indeed see that the frame buffer at /dev/fb1 ( /sys/class/graphics/fb1 ) is created when the monitor is plugged in and not removed on unplug.
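
To see which driver owns each framebuffer device, something like the following can help. It is only a sketch: the function name and the optional root argument are hypothetical, and it assumes each fbN entry under /sys/class/graphics exposes a readable name attribute.

```shell
# Print "<fbN> <name>" for every fbdev device.
# The optional first argument (an assumption, useful for testing) overrides
# the sysfs root; it defaults to /sys/class/graphics.
list_fbdevs() {
    root="${1:-/sys/class/graphics}"
    # fb[0-9]* avoids matching the fbcon entry in the same directory.
    for fb in "$root"/fb[0-9]*; do
        [ -r "$fb/name" ] || continue
        printf '%s %s\n' "$(basename "$fb")" "$(cat "$fb/name")"
    done
}

# Usage: list_fbdevs
```

Running it before plugging, after plugging, and after unplugging shows whether the extra entry ever goes away.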

Reloading the nvidia-drm module allows the GPU to go back to sleep:
modprobe -r nvidia-drm && modprobe nvidia-drm
I noticed that the ghost framebuffer is removed afterwards; maybe that is what allows RTD3 to kick in.

Dumping the framebuffers using cat /sys/kernel/debug/dri/12{8,9}/framebuffer shows that:

  • on fresh boot with d3cold and no monitor, the nvidia related dri framebuffer is empty (empty file)
    whereas the other contains a framebuffer allocated by fbcon

  • after plugging in a monitor, the nvidia dri framebuffer now contains a framebuffer allocated by fbcon too, with a layer size corresponding to the monitor.

  • after unplugging the monitor, the framebuffer does not go back to an empty file, and stays allocated by fbcon
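
The three observations above can be checked without reading the dumps by eye. This is a rough sketch under two assumptions: each entry in the debugfs file starts with a "framebuffer[ID]:" line, and the owner is on an indented "allocated by = ..." line. Both function names are hypothetical, and reading debugfs needs root.

```shell
# Count the framebuffers listed in a DRM debugfs "framebuffer" file.
count_drm_framebuffers() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c '^framebuffer\[' "$1" || true
}

# Print the owner of each listed framebuffer, one per line.
framebuffer_owners() {
    sed -n 's/^[[:space:]]*allocated by = //p' "$1"
}

# Usage (as root):
#   count_drm_framebuffers /sys/kernel/debug/dri/128/framebuffer
#   framebuffer_owners /sys/kernel/debug/dri/128/framebuffer
```

A count that stays above 0 for the nvidia device after unplugging would match what I describe here.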

I tried every combination of kernel parameters / module options to no avail. The ones I tried most are:

options nvidia NVreg_EnableGpuFirmware=0 NVreg_DynamicPowerManagementVideoMemoryThreshold=0 NVreg_DynamicPowerManagement=0x02 NVreg_UsePageAttributeTable=1 NVreg_InitializeSystemMemoryAllocations=0 NVreg_PreserveVideoMemoryAllocations=1
options nvidia-drm modeset=1 fbdev=1

I also tried several Linux kernel versions.

It seems like the code responsible for fbdev moved recently in the Linux kernel and in open-gpu-kernel-modules.
I saw changes related to hotplug events, so maybe we are missing a handler to clean up the framebuffers?

Thanks !

To Reproduce

  • on an Advanced Optimus laptop in hybrid mode
  • unplug external monitors
  • boot to tty with reverse prime and fine-grained power control enabled
  • cat /sys/class/drm/card*/device/power_state to check and wait until gpu is D3cold
  • ls /dev/fb* -> should show only the fb0 ( integrated graphics -> internal monitor )
  • nvidia-smi -> wake the GPU
  • cat /sys/class/drm/card*/device/power_state to check and wait until gpu is D3cold
  • here we know that RTD3 works
  • plug an external monitor
  • cat /sys/class/drm/card*/device/power_state to check and wait until gpu is D0
  • here we know the gpu can wake for a monitor
  • unplug the external monitor
  • cat /sys/class/drm/card*/device/power_state to check and wait until gpu is D3cold <- never happens
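
The "check and wait" steps above can be scripted so that the last one fails with a clear timeout instead of waiting forever. A minimal sketch; the function name, argument order, and the 60 second default are my own assumptions, and the card path in the usage line has to be adapted to the machine:

```shell
# Poll a power_state file once per second until it reads the wanted state.
# Arguments: <power_state file> <wanted state> [timeout in seconds, default 60]
# Returns 0 when the state is reached, 1 on timeout.
wait_for_power_state() {
    file=$1
    want=$2
    timeout=${3:-60}
    elapsed=0
    while :; do
        state=$(cat "$file" 2>/dev/null)
        if [ "$state" = "$want" ]; then
            echo "reached $want after ${elapsed}s"
            return 0
        fi
        # Stop before sleeping again once the timeout has elapsed.
        [ "$elapsed" -ge "$timeout" ] && break
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "timed out after ${timeout}s, last state: $state" >&2
    return 1
}

# Usage (card1 is an assumption; pick the nvidia card on your machine):
#   wait_for_power_state /sys/class/drm/card1/device/power_state D3cold 120
```

With the bug present, the call after unplugging the monitor is the one expected to time out.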

Bug Incidence

Always

nvidia-bug-report.log.gz

I have tested 20+ module option combinations. Which options would be the most interesting to generate the report with?

More Info

No response

@Aetherall Aetherall added the bug Something isn't working label Dec 29, 2024

Binary-Eater commented Dec 30, 2024

Hi @Aetherall,

We are tracking this in bug 5034343 internally. I am very interested in this issue myself.
My original opinion on this issue is that it should have persisted across various kernel versions, independent of the recent fbdev API refactors.

It seems like the code responsible for fbdev moved recently in the Linux kernel and in open-gpu-kernel-modules.
I saw changes related to hotplug events, so maybe we are missing a handler to clean up the framebuffers?

My expectation is that the hotplug event handler is enumerated by the fbdev core API. I have an explanation of this here.

If you use a LTS kernel like kernel 6.6, are you saying that you do not have this issue? I would be surprised if that was the case.

If so, could you provide two bug collection reports? One with the 6.6 kernel and one with 6.12 kernel using the same driver version.

If you would not mind following up with abchauhan on the NVIDIA forum post, that would be appreciated as well. That way, I can be provided with a repro setup. In theory, I should be able to reproduce this with my work laptop, but it helps us with overall process.

@Aetherall
Author

Hi @Binary-Eater, thanks for the follow-up!

I initially had the issue on the LTS kernel, and later upgraded to 6.12 to see if it would fix the issue. I have not reverted, as this kernel version contains other unrelated improvements I want to keep.

I will follow up with abchauhan ASAP and provide the gz log file and a reproducible environment. However, I won't be available on New Year's Eve, so it might take a few days.

Meanwhile, here is my NVIDIA NixOS configuration if you want to reproduce it on the same OS as well.

{
  config,
  pkgs,
  lib,
  ...
}: {
  boot.kernelPackages = pkgs.linuxPackages_latest;
  hardware.graphics.enable = true;
  powerManagement.enable = true;

  services.auto-cpufreq.settings = {
    battery = {
      governor = "powersave";
      turbo = "auto";
    };
    charger = {
      governor = "performance";
      turbo = "auto";
    };
  };

  hardware.nvidia = {
    modesetting.enable = true;

    powerManagement.enable = true;

    dynamicBoost.enable = true;
    nvidiaPersistenced = true;

    open = true;
    nvidiaSettings = true;
    package = config.boot.kernelPackages.nvidiaPackages.beta;
  };

  services.udev.extraRules = ''
    # Create consistent gpu devices symlinks
    ACTION=="bind", SUBSYSTEM=="pci", ATTRS{vendor}=="0x8086", ATTR{class}=="0x030000", RUN+="${pkgs.coreutils-full}/bin/ln -s /dev/dri/by-path/pci-0000:00:02.0-card /dev/gpu_intel"
    ACTION=="bind", SUBSYSTEM=="pci", ATTRS{vendor}=="0x10de", ATTR{class}=="0x030000", RUN+="${pkgs.coreutils-full}/bin/ln -s /dev/dri/by-path/pci-0000:01:00.0-card /dev/gpu_nvidia"
  '';

  services.xserver.videoDrivers = ["nvidia"];

  environment.sessionVariables.AQ_DRM_DEVICES = "/dev/gpu_nvidia";
  environment.sessionVariables.VK_ICD_FILENAMES = "/run/opengl-driver/share/vulkan/icd.d/nvidia_icd.x86_64.json";
  environment.sessionVariables.GBM_BACKEND = "nvidia-drm";
  environment.sessionVariables.LIBVA_DRIVER_NAME = "nvidia";
  environment.sessionVariables.__GLX_VENDOR_LIBRARY_NAME = "nvidia";

  specialisation = {
    powersave.configuration = {
      system.nixos.tags = ["powersave"]; # this specialisation has the RTD3 issue
      hardware.nvidia = {
        powerManagement.enable = true;
        powerManagement.finegrained = true;
        prime = {
          offload.enable = true;
          offload.enableOffloadCmd = true;
          reverseSync.enable = true;
          intelBusId = "PCI:0:2:0";
          nvidiaBusId = "PCI:1:0:0";
        };
      };
      environment.sessionVariables.AQ_DRM_DEVICES = lib.mkForce "/dev/gpu_intel:/dev/gpu_nvidia";
      environment.sessionVariables.VK_ICD_FILENAMES = lib.mkForce "";
      environment.sessionVariables.GBM_BACKEND = lib.mkForce "";
      environment.sessionVariables.LIBVA_DRIVER_NAME = lib.mkForce "";
      environment.sessionVariables.__GLX_VENDOR_LIBRARY_NAME = lib.mkForce "";
    };
  };
}

Happy new year !

@Kimiblock

Yeah I'm seeing this on an RTX 4060 mobile. Maybe nvidia-smi --gpu-reset can sort of mitigate this problem?
