
HMM support in UVM #338

Open
VivekPanyam opened this issue Jul 29, 2022 · 32 comments
Labels
Feature Pending (The function or feature is not available in this driver version), NV-Triaged (An NVBug has been created for dev to investigate)

Comments

@VivekPanyam

NVIDIA Open GPU Kernel Modules Version

515.57 Release

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

Ubuntu 20.04.1 LTS

Kernel Release

5.13.0-1029-aws

Hardware: GPU

NVIDIA T4

Describe the bug

Hello!

HMM support has been mentioned in several NVIDIA docs and presentations since 2017 (including the announcement of the open-source kernel modules), but it seems to be disabled here (and doesn't work when using the proprietary driver).

typedef struct
{
    // This stores pointers to uvm_va_block_t for HMM blocks.
    uvm_range_tree_t blocks;
    uvm_mutex_t blocks_lock;

    // TODO: Bug 3351822: [UVM-HMM] Remove temporary testing changes.
    // This flag is set true by default for each va_space so most processes
    // don't see partially implemented UVM-HMM behavior but can be enabled by
    // test code for a given va_space so the test process can do some interim
    // testing. It needs to be a separate flag instead of modifying
    // uvm_disable_hmm or va_space->flags since those are user inputs and are
    // visible/checked by test code.
    // Remove this when UVM-HMM is fully integrated into chips_a.
    bool disable;
} uvm_hmm_va_space_t;

I assume the referenced bug/task is internal. Is there any information you can share on what additional work needs to happen to enable UVM-HMM (or potentially a timeline)?

See references and a repro below.

To Reproduce

I'm testing HMM using the following code (from one of the presentations linked below):

#include <stdio.h>

#define LEN sizeof(int)

__global__ void
compute_this(int *pDataFromCpu)
{
    atomicAdd(pDataFromCpu, 1);
}

int main(void)
{
    int *pData = (int*)malloc(LEN);
    *pData = 0;

    // Run on GPU:
    compute_this<<<512,1000>>>(pData);

    cudaDeviceSynchronize();

    printf("Results: %d\n", *pData);

    free(pData);
    return 0;
}

It currently just prints Results: 0, and the following message appears in the dmesg output:

[ 3896.223804] NVRM: Xid (PCI:0000:00:1e): 31, pid=42009, name=a.out, Ch 00000007, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x5558_a1468000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC
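
For contrast, here is a minimal sketch (my addition, not from the original report) of the same test using cudaMallocManaged instead of plain malloc. This path does not rely on HMM, so it should print Results: 512000 even on this driver. Both samples build with, e.g., nvcc test.cu -o a.out.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void
compute_this(int *pDataFromCpu)
{
    atomicAdd(pDataFromCpu, 1);
}

int main(void)
{
    int *pData = NULL;

    // cudaMallocManaged allocations are GPU-accessible even without HMM,
    // unlike the plain malloc() allocation in the repro above.
    cudaMallocManaged(&pData, sizeof(int));
    *pData = 0;

    // Run on GPU: 512 blocks x 1000 threads => 512000 atomic increments
    compute_this<<<512, 1000>>>(pData);
    cudaDeviceSynchronize();

    printf("Results: %d\n", *pData);

    cudaFree(pData);
    return 0;
}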

Bug Incidence

Always

nvidia-bug-report.log.gz

N/A

More Info

References:

@VivekPanyam VivekPanyam added the bug Something isn't working label Jul 29, 2022
@johnhubbard

Hi,

HMM functionality is not available because our kernel driver does not yet support it. We are working on that, but we routinely do not provide timelines or ETAs for things like that, sorry for the vagueness there.

@VivekPanyam
Author

Hi John!

Didn't expect you to respond directly :) When you say "our kernel driver," what are you referring to specifically? The code in this repo or a binary blob somewhere else?

Is there anything that can be done in the OSS codebase to accelerate HMM support or is it blocked by NVIDIA internal dependencies?

Thanks!

@johnhubbard

I'm referring to the code in this repo. In fact, maybe I should have written "kernel drivers", because both nvidia.ko and nvidia-uvm.ko are involved in supporting HMM.

As for accelerating development, nvidia-uvm.ko in particular is built from some very complex source code, due to the need to handle every aspect of the CUDA programming model. Adding production-quality HMM support to that simply takes time.

We realize that this is in demand and are working on it.

@VivekPanyam
Author

That makes sense. Is there a public issue tracking HMM support? If not, would you mind commenting on this issue when there's something publicly available to test?

Thanks again

@johnhubbard

I think that this issue might be the public issue that tracks HMM support. :)

Sure, I'll make a note to update this as part of the "release HMM support" steps.

@mtijanic mtijanic added the Feature Pending and NV-Triaged labels and removed the bug (Something isn't working) label Aug 4, 2022
@VivekPanyam
Author

Hi @johnhubbard!

There was a post on the NVIDIA tech blog yesterday (November 10th) that talks about HMM:

For PCIe-based platforms such as x86 or Arm, you can use the same Unified Memory programming model as the NVIDIA Grace Hopper model. That is possible through the Heterogeneous Memory Management (HMM) feature, which is a combination of Linux kernel features and NVIDIA driver features that use software to emulate memory coherence between CPUs and GPUs.

- https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/

It seems like HMM is still disabled in the 525.53 code drop from 15 hours ago:

NV_STATUS uvm_hmm_va_space_initialize(uvm_va_space_t *va_space)
{
    struct mm_struct *mm = va_space->va_space_mm.mm;

    if (!uvm_hmm_is_enabled(va_space))
        return NV_OK;

    uvm_assert_mmap_lock_locked_write(mm);
    uvm_assert_rwsem_locked_write(&va_space->lock);

    // TODO: Bug 3351822: [UVM-HMM] Remove temporary testing changes.
    // Disable HMM by default for each va_space until enough functionality is
    // implemented that this can be enabled by default.
    // Note that it can be enabled for testing under controlled circumstances.
    va_space->hmm.disable = true;

    return NV_OK;
}

Am I missing something?

Thanks!

@johnhubbard

That blog post was in error. After receiving your question here, we have corrected the blog to reflect that HMM is not yet supported in our driver. Thanks for alerting us, and sorry for the incorrect information that went out.

@NVIDIA NVIDIA deleted 4 comments from vanesa500 Nov 23, 2022
@sdake

sdake commented Jan 24, 2023

cc @sdake

@woachk

woachk commented Mar 12, 2023

Worth noting that the public alpha of this was (silently) pushed as part of r530. Seems to work ok w/ some testing so far.

@johnhubbard

Yes, an early version of HMM support is included in the r530 release. However, as I wrote here:

https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/kernel-open/nvidia-uvm/uvm_hmm.c#L42

it is not ready for production use. That's why it was "silently" included. Once it is ready for production use, we will formally announce that (as per my Aug 3, 2022 comment here).

@oscarbg

oscarbg commented Mar 19, 2023

@johnhubbard Can you say whether HMM, once it is ready for production use, will also be enabled in the closed-source kernel driver? I'm asking for pre-Turing cards like the Titan V (Volta). Also, can you say whether Windows HMM support is planned eventually (even if only in TCC mode, or will it come to WDDM mode as well)? Thanks.

@sdake

sdake commented Mar 19, 2023

NVIDIA proprietary driver (530.30.02) using A30:

[12929.936750] Call Trace:
[12929.937530]  __schedule+0x282/0x870
[12929.938611]  ? kvm_sched_clock_read+0xd/0x20
[12929.939703]  schedule+0x46/0xb0
[12929.940439]  rwsem_down_write_slowpath+0x257/0x4d0
[12929.941548]  ? __free_slab+0xcf/0x1d0
[12929.942426]  uvm_perf_thrashing_stop+0x3d/0xa0 [nvidia_uvm]
[12929.943756]  uvm_va_space_destroy+0xa4/0x480 [nvidia_uvm]
[12929.945018]  uvm_release.constprop.0+0x93/0xc0 [nvidia_uvm]
[12929.946309]  uvm_release_entry.part.0.isra.0+0x7a/0xb0 [nvidia_uvm]
[12929.947770]  ? up+0x12/0x60
[12929.948428]  ? __fput+0x100/0x240
[12929.949166]  ? kmem_cache_free+0xff/0x420
[12929.949871]  ? mntput_no_expire+0x47/0x270
[12929.950591]  __fput+0x92/0x240
[12929.951134]  task_work_run+0x62/0xa0
[12929.951792]  do_exit+0x34b/0xa90
[12929.952367]  ? __schedule+0x28a/0x870
[12929.953019]  ? timerqueue_del+0x1e/0x50
[12929.953691]  do_group_exit+0x33/0xa0
[12929.954325]  get_signal+0x170/0x890
[12929.954942]  arch_do_signal_or_restart+0xf1/0x7e0
[12929.955789]  ? do_epoll_wait+0xd8/0x670
[12929.956468]  ? hrtimer_interrupt+0x15d/0x2c0
[12929.957221]  ? handle_irq_event+0x73/0xb0
[12929.957932]  exit_to_user_mode_prepare+0xff/0x160
[12929.958586]  syscall_exit_to_user_mode+0x28/0x150
[12929.959187]  entry_SYSCALL_64_after_hwframe+0x61/0xc6

@oscarbg

oscarbg commented Mar 19, 2023

@sdake It was said to work with the open kernel module only, but thanks for testing and confirming it doesn't work on the proprietary driver (at least right now). As said earlier, I hope the proprietary kernel driver also gets HMM eventually, since it's the only option for pre-Turing cards (though I think HMM is Pascal+ only, so it would only matter for the Pascal and Volta generations).

@johnhubbard

HMM depends upon the open source version of the driver. The open source version of the driver, in turn, only works on Turing and later GPUs.

As it says in the r530_00 release notes, Ch. 44, "The open flavor of kernel modules supports Turing, Ampere, and forward. The open kernel modules cannot support GPUs before Turing, because the open kernel modules depend on the GPU System Processor (GSP) first introduced in Turing."

Therefore, HMM is only available on Turing and later GPUs.

@oscarbg

oscarbg commented Mar 21, 2023

@johnhubbard Thanks! So it seems no Windows support is planned either (even for WSL2), right? In any driver mode, whether TCC or WDDM.

@johnhubbard

Right, no Windows support exists.

The HMM feature required OS kernel changes, in addition to changes in our driver stack here. The open source Linux kernel, and the kernel community, made it possible to make such changes.

On Windows, however, Microsoft has not made any such corresponding changes, so HMM is unavailable there.

@sdake

sdake commented Mar 21, 2023

@oscarbg All good. I reported the stack trace from the production driver available in NVIDIA's deb repos, installed with apt install cuda (CUDA 12.1). I can say HMM "functions" on the proprietary driver, although its reliability is very poor, generating kernel stack traces often.

I will try the open-source kernel driver, and report kernel traces or other bad behavior here.

Does the GRID driver function with HMM?

TY!
-steve

@oscarbg

oscarbg commented Apr 17, 2023

Thanks @sdake and @johnhubbard.
It seems I'm having bad luck testing HMM with a new Ada 4070 GPU and the NVIDIA 530.41.03 prebuilt open kernel module,
installed with sh ./NVIDIA-Linux-[...].run -m=kernel-open.
I compiled the sample and I'm getting results=0 and:

[    4.618595] NVRM cpuidInfoAMD: Unrecognized AMD processor in cpuidInfoAMD
[    4.663159] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  530.41.03  Release Build  (dvs-builder@U16-T02-35-3)  Thu Mar 16 19:33:35 UTC 2023
[    5.358387] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  530.41.03  Release Build  (dvs-builder@U16-T02-35-3)  Thu Mar 16 19:23:50 UTC 2023
[  329.642495] NVRM: GPU at PCI:0000:01:00: GPU-b53502d1-facc-b12d-2a6c-cbb01b5beae4
[  329.642499] NVRM: Xid (PCI:0000:01:00): 31, pid=7087, name=sm80, Ch 00000014, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x5560_dc4f4000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC
[  330.484339] NVRM: Xid (PCI:0000:01:00): 31, pid=7105, name=sm80.out, Ch 00000014, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x55d5_7e958000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC

So it's not working.
Is it because the prebuilt open kernel module doesn't enable HMM by default (i.e., would building the kernel module myself change that), or because it has been disabled in 530.41.03 since the early 530.xx beta?
Thanks.

@johnhubbard

HMM support is disabled by default in the r530 driver. That's why your sample is failing as shown above.

@sdake

sdake commented Apr 18, 2023

There is no need to recompile the kernel. You can set a driver load parameter to enable HMM. Look at modinfo and /etc/modules; I don't have the exact commands handy, as I am typing this on a phone.

Cheers
Steve
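
For anyone looking for concrete commands to go with this, here is a minimal sketch. It assumes the relevant nvidia-uvm parameter is uvm_disable_hmm (the name that appears in the driver comment quoted at the top of this issue); modinfo is the authoritative source for the parameter names on your driver version.

# List the parameters the installed nvidia-uvm module actually accepts
modinfo nvidia-uvm | grep -i parm

# Assumed example: persist the setting, then reload the module (with no CUDA processes running)
echo "options nvidia-uvm uvm_disable_hmm=0" | sudo tee /etc/modprobe.d/nvidia-uvm-hmm.conf
sudo modprobe -r nvidia-uvm && sudo modprobe nvidia-uvm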

@oscarbg

oscarbg commented Apr 20, 2023

@johnhubbard @sdake Thanks, both. I tested with the required uvm module parameter and it works! Only the open-source kernel module works, as shared in this thread. I'm curious why the closed-source module also accepts the same parameter but doesn't work.
The concluding situation is a little sad, since the closed-source modules are currently needed to use Gsync, for example. So I'm hoping for open-source Gsync support soon (535.xx?) or closed-source HMM support, ideally both.

@aritger
Collaborator

aritger commented Apr 20, 2023

Yes, we're working hard to close the remaining feature gaps (such as Gsync) in the open kernel modules. I can't promise particular releases here, but yes: everything should ultimately converge in the open kernel modules.

@sdake

sdake commented Apr 20, 2023 via email

@oscarbg

oscarbg commented Apr 20, 2023

@aritger Thanks for the information.
@sdake Really just experimenting for fun; no real use case right now.

@sdake

sdake commented May 11, 2023

Has anyone benchmarked this approach for ML workloads versus, say, Microsoft's awesome work with DeepSpeed? This feels like a solution seeking a problem.

Thank you,
Steve

@johnhubbard

HMM is now supported with CUDA 12.2. Please see https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/ for more information.
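
The linked blog describes, among other things, kernels directly dereferencing ordinary system-allocated memory, including memory-mapped files. Below is a rough sketch along those lines (an illustration under the blog's stated requirements, not code taken from the blog): it mmaps a file read-only and sums its bytes on the GPU, which only works when HMM (or hardware coherence) makes CPU pages reachable from device code.

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cuda_runtime.h>

__global__ void
sum_bytes(const unsigned char *data, size_t len, unsigned long long *total)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
        atomicAdd(total, (unsigned long long)data[i]);
}

int main(int argc, char **argv)
{
    // Hypothetical default path, just for illustration.
    const char *path = (argc > 1) ? argv[1] : "/etc/hostname";

    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Plain mmap'ed CPU memory: reachable from the GPU only with HMM
    // (or hardware coherence such as Grace Hopper).
    unsigned char *data = (unsigned char *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    unsigned long long *total;
    cudaMallocManaged(&total, sizeof(*total));
    *total = 0;

    size_t threads = 256;
    size_t blocks = (st.st_size + threads - 1) / threads;
    sum_bytes<<<blocks, threads>>>(data, st.st_size, total);
    cudaDeviceSynchronize();

    printf("byte sum of %s: %llu\n", path, *total);

    munmap(data, st.st_size);
    cudaFree(total);
    close(fd);
    return 0;
}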

@bhaveshdavda

bhaveshdavda commented Aug 1, 2024

@johnhubbard I'll just ask this simply:

What does it take to have HMM enabled and nvidia-smi -q to show Addressing Mode : HMM on an HGX-H100 x86 system running Ubuntu 22.04 with the 5.15.0-105-generic kernel?

Driver Version: 550.90.07      CUDA Version: 12.4

Are there any BIOS settings or kernel params to be passed via GRUB that are not documented anywhere?

Edit: Going to try upgrading the kernel to linux-image-6.8.0-39-generic, based on the blog stating:

A sufficiently recent Linux kernel: 6.1.24+, 6.2.11+, or 6.3+.

@sdake

sdake commented Aug 4, 2024

@bhaveshdavda there is not a simple answer, unfortunately.

A working kernel.org kernel is in this repository as a Dockerfile you can build locally. After running build.sh, the target directory will contain four .deb files you can co-install.

https://github.com/artificialwisdomai/origin

Please let me know how it goes.

Thanks
-steve

@bhaveshdavda

Update: I finally got this working, in a Kubernetes environment no less, with the NVIDIA GPU Operator. Notes (see the verification sketch after this list):

  1. Upgraded Ubuntu 22.04 LTS kernel to 6.8.0-39-generic
  2. Passed module parameter uvm_ats_mode=0 to the nvidia-uvm module
  3. [Main takeaway] Use the open kernel module instead of the default NVIDIA proprietary modules via a Helm chart variable passed to the GPU operator
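
A quick way to verify the result of these steps, using the same nvidia-smi query mentioned earlier in this thread:

# Should report "Addressing Mode : HMM" once the open kernel module is active
nvidia-smi -q | grep -i "addressing mode"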

@sdake

sdake commented Aug 12, 2024

It isn't necessary to upgrade the kernel; instead, it is necessary to configure the one you have properly. There is one additional config option required. I didn't need uvm_ats_mode=0, and I am not sure why you would want to turn off Address Translation Services (ATS), as it protects the platform's DMA operations from third-party corruption.

There is a Dockerfile to build a Debian upstream kernel (it should work fine with Ubuntu as well) here:

https://github.com/artificialwisdomai/origin/tree/main/platform/Dockerfiles/linux-kernel
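
For reference, one kernel config option the UVM HMM path is known to depend on is CONFIG_DEVICE_PRIVATE; treat that as my assumption about the option meant here, not a statement from the comment above. A quick check against the running kernel:

# HMM-related options in the running kernel's config
grep -E "CONFIG_DEVICE_PRIVATE|CONFIG_HMM_MIRROR" /boot/config-$(uname -r)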

@bhaveshdavda

bhaveshdavda commented Aug 12, 2024

@sdake I agree with your statement that disabling ATS is not required, and I too assumed ATS is an important security feature for PCIe. I also feel like the stock Ubuntu 22.04 LTS kernel 5.15.0-105-generic would probably work, since it has the right Kconfig. I think the main variable is using the open kernel module instead of the proprietary module, because in the latter the nvidia-uvm module has the uvm_hmm.c implementation #ifdef'ed out.

Edit:

  1. uvm_ats_mode=0 is not required
  2. 5.15.0-105-generic kernel definitely doesn't work, but stock Ubuntu 22.04 6.8.0-39-generic does

@DevenBL

DevenBL commented Dec 28, 2024

Hello,
This seems to be a related memory-management issue. Will this address the GTT/shared-memory problem for NVIDIA drivers on Linux? Enabling GTT-like support for NVIDIA, as the amdgpu driver does, would greatly simplify certain CUDA, rendering, AI, and gaming workloads where NVIDIA's specific memory-management interfaces are not explicitly used. All of NVIDIA's competitors support this on both Windows and Linux, whereas NVIDIA supports it only on Windows and not Linux.

I have the nvidia-uvm module loaded and Addressing Mode : HMM enabled, but it did not seem to alleviate these issues. Testing was done with the NVIDIA 565.77 drivers and the 6.12.6 kernel with the open GPU kernel modules.

As seen here:
#618
#663
https://forums.developer.nvidia.com/t/vram-allocation-issues/239678
https://forums.developer.nvidia.com/t/non-existent-shared-vram-on-nvidia-linux-drivers/260304

And most recently:
#758
