Poor multi-GPU performance due to CPU-side stalls in gbm_surface_lock_front_buffer #743

Open
1 of 2 tasks
Gert-dev opened this issue Nov 18, 2024 · 4 comments
Labels
bug Something isn't working

Comments

Gert-dev commented Nov 18, 2024

NVIDIA Open GPU Kernel Modules Version

565.57.01

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

6.11.8

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4090 Laptop GPU

Describe the bug

The NVIDIA driver blocks inside gbm_surface_lock_front_buffer on the CPU until its rendering has fully finished. Because the wait happens on the CPU, in multi-GPU scenarios where the NVIDIA GPU copies from a primary GPU it holds up mutter's rendering thread. Mutter then cannot do further rendering on the primary GPU until the NVIDIA driver is fully done, which in turn delays the next frame for the primary and secondary GPU alike.

Essentially, rendering on the primary and secondary GPUs proceeds in lockstep. This reduces performance on displays attached to the primary GPU (it can't continue rendering until the NVIDIA driver finishes rendering/copying) as well as on the NVIDIA GPU's own displays (the NVIDIA GPU can't render/copy until the primary GPU is done rendering).
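
As a rough illustration of the lockstep effect described above, the following toy model compares the frame rate when every frame has to wait for both GPUs one after the other with the frame rate when the two GPUs can overlap their work. The function names and per-frame costs are hypothetical and were not measured on real hardware; this only sketches the arithmetic and says nothing about how mutter or the NVIDIA driver are implemented.

```python
def lockstep_fps(primary_ms: float, secondary_ms: float) -> float:
    """FPS when each frame waits for both GPUs one after the other."""
    return 1000.0 / (primary_ms + secondary_ms)


def pipelined_fps(primary_ms: float, secondary_ms: float) -> float:
    """FPS when the GPUs overlap their work, so the slower one sets the pace."""
    return 1000.0 / max(primary_ms, secondary_ms)


if __name__ == "__main__":
    primary_ms = 5.0    # assumed render time per frame on the primary GPU
    secondary_ms = 5.0  # assumed render/copy time per frame on the secondary GPU
    print(f"lockstep:  {lockstep_fps(primary_ms, secondary_ms):.0f} FPS")
    print(f"pipelined: {pipelined_fps(primary_ms, secondary_ms):.0f} FPS")
```

With these assumed costs the lockstep case tops out at 100 FPS, which is below a 120 Hz refresh rate, while the overlapped case would leave headroom at 200 FPS.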

The reasoning for why this happens is described by Austin Shafer in this mutter thread. Potential solutions that may be compatible with other graphics drivers are also described here by Michel Dänzer.

To Reproduce

  1. Have a device with a high-refresh rate display (e.g. >= 120 Hz) attached to the iGPU and a display port attached to the NVIDIA dGPU directly.
  2. Attach a display to the NVIDIA dGPU, ideally also a high-refresh rate monitor.
  3. Observe various graphical tasks on the screen attached to the primary GPU dropping in FPS, such as mutter's FPS using CLUTTER_SHOW_FPS=1, or tools such as glxgears, especially when moving its window around (the former is the most accurate).
  4. Observe stutter and non-smoothness on the screen attached to the secondary GPU as well, similar to #693 ("Animations after idling are noticeably choppy until GPU ramps up with GSP firmware enabled"), but in this case the stutter happens all the time.

Ideally, use high-refresh rate monitors (>= 120 Hz) in all cases, since these make it much more visible when full performance is not attained. These are fairly common nowadays in high-end laptops.
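
For anyone who wants to put numbers on the FPS drops from the steps above, here is a minimal sketch that summarizes a list of frame presentation timestamps. It assumes you already have such timestamps in milliseconds from your own test application; the helper name `summarize_frames` and the 1.5x threshold for a "late" frame are hypothetical choices and are not something mutter or the NVIDIA driver provide.

```python
def summarize_frames(timestamps_ms, refresh_hz):
    """Return (average FPS, number of late frames) for frame timestamps in ms."""
    if len(timestamps_ms) < 2:
        raise ValueError("need at least two timestamps")
    interval = 1000.0 / refresh_hz
    # Time between each pair of consecutive frames.
    deltas = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    average_fps = 1000.0 * len(deltas) / sum(deltas)
    # Assumption: a frame is "late" if it took more than 1.5 refresh intervals.
    late = sum(1 for d in deltas if d > 1.5 * interval)
    return average_fps, late


if __name__ == "__main__":
    # Made-up timestamps for a 100 Hz display with one frame that took 20 ms.
    fps, late = summarize_frames([0, 10, 20, 40, 50], refresh_hz=100)
    print(f"average: {fps:.1f} FPS, late frames: {late}")
```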

There is a lot of debugging and profiling information to be found in this mutter ticket (see also above and below the linked comment).

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

I'm not sure how this affects single-GPU NVIDIA systems; the block likely also happens there but may have a less adverse effect.

I wanted to create this ticket here to keep track of this problem and create more visibility, as it is unique to the NVIDIA driver on Linux and is a performance problem that may be the root cause of other open problems that might otherwise be hard to track down. GNOME/mutter is discussed here, but this may also impact KDE and other desktop environments.

Note that mutter stable 47.1 has several performance bottlenecks of its own with multiple GPUs, so to properly test the full potential of fixes you may also need (these are also touched upon in the linked mutter issue):

Gert-dev added the bug label Nov 18, 2024
Author

Gert-dev commented Jan 4, 2025

Small update: both listed MRs are now part of the mutter main branch and don't require patching any more.

To be more precise, !4015 is part of mutter 47.2 stable and !4027 has recently been merged to the main branch and should be part of GNOME 48.


gilvbp commented Jan 4, 2025

> Small update: both listed MRs are now part of the mutter main branch and don't require patching any more.
>
> To be more precise, !4015 is part of mutter 47.2 stable and !4027 has recently been merged to the main branch and should be part of GNOME 48.

Awesome work! Congrats and thank you!


iox commented Jan 6, 2025

I am experiencing very similar behaviour in KDE Plasma 6.1 (see https://forums.developer.nvidia.com/t/frequent-lags-and-loss-of-smoothness-in-kde-plasma-with-dual-monitors-on-nvidia-3070-ti-wayland-x11/311888/13). My current workaround is to force the Intel GPU frequency to stay higher with intel_gpu_frequency -c min=1000. Would this workaround help in your setup, @Gert-dev?

Author

Gert-dev commented Jan 7, 2025

Thanks for the tip, but in my case the equivalent for an AMD iGPU unfortunately doesn't fix it. Since the original issue looks like a lockstep issue between iGPU and dGPU, it doesn't sound strange to me that upping the frequency of one or both might alleviate the problem somewhat, to the point where the experience may be better.

Austin Shafer also posted here today that there is a similar issue related to the GSP firmware on the NVIDIA GPU. As mentioned there, though, it sounds like those are additional issues that, whilst they do affect the problem as a whole, aren't the main culprit. There appears to be a hard CPU block in a GBM function (reasons described here) that does not happen with nouveau or other graphics drivers, and that seems to be the biggest hit. Without that being solved first, it will be very hard to reach full performance on these setups, which becomes especially noticeable with high refresh rate monitors.


3 participants