
HMM support in UVM #338

Open
VivekPanyam opened this issue Jul 29, 2022 · 32 comments
Labels
Feature Pending (The function or feature is not available in this driver version), NV-Triaged (An NVBug has been created for dev to investigate)

Comments

@VivekPanyam

NVIDIA Open GPU Kernel Modules Version

515.57 Release

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

Ubuntu 20.04.1 LTS

Kernel Release

5.13.0-1029-aws

Hardware: GPU

NVIDIA T4

Describe the bug

Hello!

HMM support has been mentioned in several NVIDIA docs and presentations since 2017 (including the announcement of the open-source kernel modules), but it seems to be disabled here (and doesn't work when using the proprietary driver).

typedef struct
{
    // This stores pointers to uvm_va_block_t for HMM blocks.
    uvm_range_tree_t blocks;
    uvm_mutex_t blocks_lock;

    // TODO: Bug 3351822: [UVM-HMM] Remove temporary testing changes.
    // This flag is set true by default for each va_space so most processes
    // don't see partially implemented UVM-HMM behavior but can be enabled by
    // test code for a given va_space so the test process can do some interim
    // testing. It needs to be a separate flag instead of modifying
    // uvm_disable_hmm or va_space->flags since those are user inputs and are
    // visible/checked by test code.
    // Remove this when UVM-HMM is fully integrated into chips_a.
    bool disable;
} uvm_hmm_va_space_t;

I assume the referenced bug/task is internal. Is there any information you can share on what additional work needs to happen to enable UVM-HMM (or potentially a timeline)?

See references and a repro below.

To Reproduce

I'm testing HMM using the following code (from one of the presentations linked below):

#include <stdio.h>

#define LEN sizeof(int)

__global__ void
compute_this(int *pDataFromCpu)
{
    atomicAdd(pDataFromCpu, 1);
}

int main(void)
{
    int *pData = (int*)malloc(LEN);
    *pData = 0;

    // Run on GPU:
    compute_this<<<512,1000>>>(pData);

    cudaDeviceSynchronize();

    printf("Results: %d\n", *pData);

    free(pData);
    return 0;
}

It currently just prints Results: 0, and the following message appears in the dmesg output:

[ 3896.223804] NVRM: Xid (PCI:0000:00:1e): 31, pid=42009, name=a.out, Ch 00000007, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x5558_a1468000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC
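
For contrast, here is a minimal sketch (my addition, not from the original report) of the same test using cudaMallocManaged instead of plain malloc. This path does not rely on HMM, so it should print Results: 512000 even on this driver. Both samples build with, e.g., nvcc test.cu -o a.out.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void
compute_this(int *pDataFromCpu)
{
    atomicAdd(pDataFromCpu, 1);
}

int main(void)
{
    int *pData = NULL;

    // cudaMallocManaged allocations are GPU-accessible even without HMM,
    // unlike the plain malloc() allocation in the repro above.
    cudaMallocManaged(&pData, sizeof(int));
    *pData = 0;

    // Run on GPU: 512 blocks x 1000 threads => 512000 atomic increments
    compute_this<<<512, 1000>>>(pData);
    cudaDeviceSynchronize();

    printf("Results: %d\n", *pData);

    cudaFree(pData);
    return 0;
}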

Bug Incidence

Always

nvidia-bug-report.log.gz

N/A

More Info

References:

@VivekPanyam VivekPanyam added the bug Something isn't working label Jul 29, 2022
@johnhubbard

Hi,

HMM functionality is not available because our kernel driver does not yet support it. We are working on that, but we routinely do not provide timelines or ETAs for things like that, sorry for the vagueness there.

@VivekPanyam
Author

Hi John!

Didn't expect you to respond directly :) When you say "our kernel driver," what are you referring to specifically? The code in this repo or a binary blob somewhere else?

Is there anything that can be done in the OSS codebase to accelerate HMM support or is it blocked by NVIDIA internal dependencies?

Thanks!

@johnhubbard

I'm referring to the code in this repo. In fact, maybe I should have written "kernel drivers", because both nvidia.ko and nvidia-uvm.ko are involved in supporting HMM.

As for accelerating development, nvidia-uvm.ko in particular is built from some very complex source code, due to the need to handle every aspect of the CUDA programming model. Adding production-quality HMM support to that simply takes time.

We realize that this is in demand and are working on it.

@VivekPanyam
Author

That makes sense. Is there a public issue tracking HMM support? If not, would you mind commenting on this issue when there's something publicly available to test?

Thanks again

@johnhubbard

I think that this issue might be the public issue that tracks HMM support. :)

Sure, I'll make a note to update this as part of the "release HMM support" steps.

@mtijanic mtijanic added the Feature Pending and NV-Triaged labels and removed the bug (Something isn't working) label Aug 4, 2022
@VivekPanyam
Author

Hi @johnhubbard!

There was a post on the NVIDIA tech blog yesterday (November 10th) that talks about HMM:

For PCIe-based platforms such as x86 or Arm, you can use the same Unified Memory programming model as the NVIDIA Grace Hopper model. That is possible through the Heterogeneous Memory Management (HMM) feature, which is a combination of Linux kernel features and NVIDIA driver features that use software to emulate memory coherence between CPUs and GPUs.

- https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/

It seems like HMM is still disabled in the 525.53 code drop from 15 hours ago:

NV_STATUS uvm_hmm_va_space_initialize(uvm_va_space_t *va_space)
{
    struct mm_struct *mm = va_space->va_space_mm.mm;

    if (!uvm_hmm_is_enabled(va_space))
        return NV_OK;

    uvm_assert_mmap_lock_locked_write(mm);
    uvm_assert_rwsem_locked_write(&va_space->lock);

    // TODO: Bug 3351822: [UVM-HMM] Remove temporary testing changes.
    // Disable HMM by default for each va_space until enough functionality is
    // implemented that this can be enabled by default.
    // Note that it can be enabled for testing under controlled circumstances.
    va_space->hmm.disable = true;

    return NV_OK;
}

Am I missing something?

Thanks!

@johnhubbard

That blog post was in error. After receiving your question here, we have corrected the blog to reflect that HMM is not yet supported in our driver. Thanks for alerting us, and sorry for the incorrect information that went out.

@NVIDIA NVIDIA deleted 4 comments from vanesa500 Nov 23, 2022
@sdake

sdake commented Jan 24, 2023

cc @sdake

@woachk

woachk commented Mar 12, 2023

Worth noting that the public alpha of this was (silently) pushed as part of r530. Seems to work ok w/ some testing so far.

@johnhubbard

Yes, an early version of HMM support is included in the r530 release. However, as I wrote here:

https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/kernel-open/nvidia-uvm/uvm_hmm.c#L42

it is not ready for production use. That's why it was "silently" included. Once it is ready for production use, we will formally announce that (as per my Aug 3, 2022 comment here).

@oscarbg

oscarbg commented Mar 19, 2023

@johnhubbard Can you say whether HMM, once it is ready for production use, will also be enabled in the closed-source kernel driver? I'm asking for pre-Turing cards like the Titan V (Volta). Also, can you say whether Windows HMM support is planned eventually (even if only in TCC mode, or will it come to WDDM mode as well)? Thanks.

@sdake

sdake commented Mar 19, 2023

NVIDIA proprietary driver (530.30.02) using A30:

[12929.936750] Call Trace:
[12929.937530]  __schedule+0x282/0x870
[12929.938611]  ? kvm_sched_clock_read+0xd/0x20
[12929.939703]  schedule+0x46/0xb0
[12929.940439]  rwsem_down_write_slowpath+0x257/0x4d0
[12929.941548]  ? __free_slab+0xcf/0x1d0
[12929.942426]  uvm_perf_thrashing_stop+0x3d/0xa0 [nvidia_uvm]
[12929.943756]  uvm_va_space_destroy+0xa4/0x480 [nvidia_uvm]
[12929.945018]  uvm_release.constprop.0+0x93/0xc0 [nvidia_uvm]
[12929.946309]  uvm_release_entry.part.0.isra.0+0x7a/0xb0 [nvidia_uvm]
[12929.947770]  ? up+0x12/0x60
[12929.948428]  ? __fput+0x100/0x240
[12929.949166]  ? kmem_cache_free+0xff/0x420
[12929.949871]  ? mntput_no_expire+0x47/0x270
[12929.950591]  __fput+0x92/0x240
[12929.951134]  task_work_run+0x62/0xa0
[12929.951792]  do_exit+0x34b/0xa90
[12929.952367]  ? __schedule+0x28a/0x870
[12929.953019]  ? timerqueue_del+0x1e/0x50
[12929.953691]  do_group_exit+0x33/0xa0
[12929.954325]  get_signal+0x170/0x890
[12929.954942]  arch_do_signal_or_restart+0xf1/0x7e0
[12929.955789]  ? do_epoll_wait+0xd8/0x670
[12929.956468]  ? hrtimer_interrupt+0x15d/0x2c0
[12929.957221]  ? handle_irq_event+0x73/0xb0
[12929.957932]  exit_to_user_mode_prepare+0xff/0x160
[12929.958586]  syscall_exit_to_user_mode+0x28/0x150
[12929.959187]  entry_SYSCALL_64_after_hwframe+0x61/0xc6

@oscarbg

oscarbg commented Mar 19, 2023

@sdake It was said to work with the open kernel module only, but thanks for testing and confirming it doesn't work on the proprietary driver (at least right now). As said earlier, I hope the proprietary kernel driver also gets HMM eventually, since it's the only option for pre-Turing cards (though I think HMM is Pascal+ only, so it would only matter for the Pascal and Volta generations).

@johnhubbard

HMM depends upon the open source version of the driver. The open source version of the driver, in turn, only works on Turing and later GPUs.

As it says in the r530_00 release notes, Ch. 44, "The open flavor of kernel modules supports Turing, Ampere, and forward. The open kernel modules cannot support GPUs before Turing, because the open kernel modules depend on the GPU System Processor (GSP) first introduced in Turing."

Therefore, HMM is only available on Turing and later GPUs.

@oscarbg

oscarbg commented Mar 21, 2023

@johnhubbard Thanks! So it seems no Windows support is planned either (even for WSL2), right? In any driver mode, whether TCC or WDDM.

@johnhubbard

Right, no Windows support exists.

The HMM feature required OS kernel changes, in addition to changes in our driver stack here. The open source Linux kernel, and the kernel community, made it possible to make such changes.

On Windows, however, Microsoft has not made any such corresponding changes, so HMM is unavailable there.

@sdake

sdake commented Mar 21, 2023

@oscarbg All good. I reported the stack trace from the production driver available in NVIDIA's deb repos, installed with apt install cuda (CUDA 12.1). I can say HMM "functions" on the proprietary driver, although its reliability is very poor, generating kernel stack traces often.

I will try the open-source kernel driver, and report kernel traces or other bad behavior here.

Does the GRID driver function with HMM?

TY!
-steve

@oscarbg

oscarbg commented Apr 17, 2023

Thanks @sdake and @johnhubbard.
It seems I'm having bad luck testing HMM with a new Ada 4070 GPU and the NVIDIA 530.41.03 prebuilt open kernel module,
installed with sh ./NVIDIA-Linux-[...].run -m=kernel-open.
I compiled the sample and I'm getting results=0 and:

[    4.618595] NVRM cpuidInfoAMD: Unrecognized AMD processor in cpuidInfoAMD
[    4.663159] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  530.41.03  Release Build  (dvs-builder@U16-T02-35-3)  Thu Mar 16 19:33:35 UTC 2023
[    5.358387] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  530.41.03  Release Build  (dvs-builder@U16-T02-35-3)  Thu Mar 16 19:23:50 UTC 2023
[  329.642495] NVRM: GPU at PCI:0000:01:00: GPU-b53502d1-facc-b12d-2a6c-cbb01b5beae4
[  329.642499] NVRM: Xid (PCI:0000:01:00): 31, pid=7087, name=sm80, Ch 00000014, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x5560_dc4f4000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC
[  330.484339] NVRM: Xid (PCI:0000:01:00): 31, pid=7105, name=sm80.out, Ch 00000014, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x55d5_7e958000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC

So it's not working.
Is it because the prebuilt open kernel module doesn't enable HMM by default (i.e., would building the kernel module myself change that), or because it has been disabled in 530.41.03 since the early 530.xx beta?
Thanks.

@johnhubbard

HMM support is disabled by default in the r530 driver. That's why your sample is failing as shown above.

@sdake

sdake commented Apr 18, 2023

There is no need to recompile the kernel. You can set a driver load parameter to enable HMM. Look at modinfo and /etc/modules; I don't have the exact commands handy, as I am typing this on a phone.

Cheers
Steve
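
For anyone looking for concrete commands to go with this, here is a minimal sketch. It assumes the relevant nvidia-uvm parameter is uvm_disable_hmm (the name that appears in the driver comment quoted at the top of this issue); modinfo is the authoritative source for the parameter names on your driver version.

# List the parameters the installed nvidia-uvm module actually accepts
modinfo nvidia-uvm | grep -i parm

# Assumed example: persist the setting, then reload the module (with no CUDA processes running)
echo "options nvidia-uvm uvm_disable_hmm=0" | sudo tee /etc/modprobe.d/nvidia-uvm-hmm.conf
sudo modprobe -r nvidia-uvm && sudo modprobe nvidia-uvm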

@oscarbg

oscarbg commented Apr 20, 2023

@johnhubbard @sdake Thanks, both. I tested with the required uvm module parameter and it works! Only the open-source kernel module works, as shared in this thread. I'm curious why the closed-source module also accepts the same parameter but doesn't work.
The concluding situation is a little sad, since the closed-source modules are currently needed to use Gsync, for example. So I'm hoping for open-source Gsync support soon (535.xx?) or closed-source HMM support, ideally both.

@aritger
Collaborator

aritger commented Apr 20, 2023

Yes, we're working hard to close the remaining feature gaps (such as Gsync) in the open kernel modules. I can't promise particular releases here, but yes: everything should ultimately converge in the open kernel modules.

@sdake

sdake commented Apr 20, 2023 via email

@oscarbg

oscarbg commented Apr 20, 2023

@aritger Thanks for the information.
@sdake Really just experimenting for fun; no real use case right now.

@sdake

sdake commented May 11, 2023

Has anyone benchmarked this approach for ML workloads versus, say, Microsoft's awesome work with DeepSpeed? This feels like a solution seeking a problem.

Thank you,
Steve

@johnhubbard

HMM is now supported with CUDA 12.2. Please see https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/ for more information.
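
The linked blog describes, among other things, kernels directly dereferencing ordinary system-allocated memory, including memory-mapped files. Below is a rough sketch along those lines (an illustration under the blog's stated requirements, not code taken from the blog): it mmaps a file read-only and sums its bytes on the GPU, which only works when HMM (or hardware coherence) makes CPU pages reachable from device code.

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cuda_runtime.h>

__global__ void
sum_bytes(const unsigned char *data, size_t len, unsigned long long *total)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
        atomicAdd(total, (unsigned long long)data[i]);
}

int main(int argc, char **argv)
{
    // Hypothetical default path, just for illustration.
    const char *path = (argc > 1) ? argv[1] : "/etc/hostname";

    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Plain mmap'ed CPU memory: reachable from the GPU only with HMM
    // (or hardware coherence such as Grace Hopper).
    unsigned char *data = (unsigned char *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    unsigned long long *total;
    cudaMallocManaged(&total, sizeof(*total));
    *total = 0;

    size_t threads = 256;
    size_t blocks = (st.st_size + threads - 1) / threads;
    sum_bytes<<<blocks, threads>>>(data, st.st_size, total);
    cudaDeviceSynchronize();

    printf("byte sum of %s: %llu\n", path, *total);

    munmap(data, st.st_size);
    cudaFree(total);
    close(fd);
    return 0;
}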

@bhaveshdavda

bhaveshdavda commented Aug 1, 2024

@johnhubbard I'll just ask this simply:

What does it take to have HMM enabled and nvidia-smi -q to show Addressing Mode : HMM on an HGX-H100 x86 system running Ubuntu 22.04 with the 5.15.0-105-generic kernel?

Driver Version: 550.90.07      CUDA Version: 12.4

Are there any BIOS settings or kernel params to be passed via GRUB that are not documented anywhere?

Edit: Going to try upgrading the kernel to linux-image-6.8.0-39-generic, based on the blog stating:

A sufficiently recent Linux kernel: 6.1.24+, 6.2.11+, or 6.3+.

@sdake

sdake commented Aug 4, 2024

@bhaveshdavda there is not a simple answer, unfortunately.

A working kernel.org kernel is in this repository as a Dockerfile you can build locally. After running build.sh, the target directory will contain four .deb files you can co-install.

https://github.com/artificialwisdomai/origin

Please let me know how it goes.

Thanks
-steve

@bhaveshdavda

Update: I finally got this working, in a Kubernetes environment no less, with the NVIDIA GPU Operator. Notes (see the verification sketch after this list):

  1. Upgraded Ubuntu 22.04 LTS kernel to 6.8.0-39-generic
  2. Passed module parameter uvm_ats_mode=0 to the nvidia-uvm module
  3. [Main takeaway] Use the open kernel module instead of the default NVIDIA proprietary modules via a Helm chart variable passed to the GPU operator
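
A quick way to verify the result of these steps, using the same nvidia-smi query mentioned earlier in this thread:

# Should report "Addressing Mode : HMM" once the open kernel module is active
nvidia-smi -q | grep -i "addressing mode"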

@sdake

sdake commented Aug 12, 2024

It isn't necessary to upgrade the kernel; instead, it is necessary to configure the one you have properly. There is one additional config option required. I didn't need uvm_ats_mode=0, and I am not sure why you would want to turn off Address Translation Services (ATS), as it protects the platform's DMA operations from third-party corruption.

There is a Dockerfile to build a Debian upstream kernel (it should work fine with Ubuntu as well) here:

https://github.com/artificialwisdomai/origin/tree/main/platform/Dockerfiles/linux-kernel
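
For reference, one kernel config option the UVM HMM path is known to depend on is CONFIG_DEVICE_PRIVATE; treat that as my assumption about the option meant here, not a statement from the comment above. A quick check against the running kernel:

# HMM-related options in the running kernel's config
grep -E "CONFIG_DEVICE_PRIVATE|CONFIG_HMM_MIRROR" /boot/config-$(uname -r)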

@bhaveshdavda

bhaveshdavda commented Aug 12, 2024

@sdake I agree with your statement that disabling ATS is not required, and I too assumed ATS is an important security feature for PCIe. I also feel like the stock Ubuntu 22.04 LTS kernel 5.15.0-105-generic would probably work, since it has the right Kconfig. I think the main variable is using the open kernel module instead of the proprietary module, because in the latter the nvidia-uvm module has the uvm_hmm.c implementation #ifdef'ed out.

Edit:

  1. uvm_ats_mode=0 is not required
  2. 5.15.0-105-generic kernel definitely doesn't work, but stock Ubuntu 22.04 6.8.0-39-generic does

@DevenBL

DevenBL commented Dec 28, 2024

Hello,
This seems to be a related memory-management issue. Will this address the GTT/shared-memory problem for NVIDIA drivers on Linux? Enabling GTT-like support for NVIDIA, as the amdgpu driver does, would greatly simplify certain CUDA, rendering, AI, and gaming workloads where NVIDIA's specific memory-management interfaces are not explicitly used. All of NVIDIA's competitors support this on both Windows and Linux, whereas NVIDIA supports it only on Windows and not Linux.

I have the nvidia-uvm module loaded and Addressing Mode : HMM enabled, but it did not seem to alleviate these issues. Testing was done with the NVIDIA 565.77 drivers and the 6.12.6 kernel with the open GPU kernel modules.

As seen here:
#618
#663
https://forums.developer.nvidia.com/t/vram-allocation-issues/239678
https://forums.developer.nvidia.com/t/non-existent-shared-vram-on-nvidia-linux-drivers/260304

And most recently:
#758
