HMM support in UVM #338
Comments
Hi, HMM functionality is not available because our kernel driver does not yet support it. We are working on that, but we routinely do not provide timelines or ETAs for things like that, sorry for the vagueness there. |
Hi John! Didn't expect you to respond directly :) When you say "our kernel driver," what are you referring to specifically? The code in this repo or a binary blob somewhere else? Is there anything that can be done in the OSS codebase to accelerate HMM support or is it blocked by NVIDIA internal dependencies? Thanks! |
I'm referring to the code in this repo. In fact, maybe I should have written "kernel drivers", because both nvidia.ko and nvidia-uvm.ko are involved in supporting HMM. As for accelerating development, nvidia-uvm.ko in particular is built from some very complex source code, due to the need to handle every aspect of the CUDA programming model. Adding production-quality HMM support to that simply takes time. We realize that this is in demand and are working on it. |
That makes sense. Is there a public issue tracking HMM support? If not, would you mind commenting on this issue when there's something publicly available to test? Thanks again |
I think that this issue might be the public issue that tracks HMM support. :) Sure, I'll make a note to update this as part of the "release HMM support" steps. |
Hi @johnhubbard! There was a post on the NVIDIA tech blog yesterday (November 10th) that talks about HMM:
- https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/
It seems like HMM is still disabled in open-gpu-kernel-modules/kernel-open/nvidia-uvm/uvm_hmm.c (lines 72 to 88 in 758b4ee).
Am I missing something? Thanks! |
That blog post was in error. After receiving your question here, we have corrected the blog to reflect that HMM is not yet supported in our driver. Thanks for alerting us, and sorry for the incorrect information that went out. |
cc @sdake |
Worth noting that the public alpha of this was (silently) pushed as part of r530. Seems to work ok w/ some testing so far. |
Yes, an early version of HMM support is included in the r530 release. However, as I wrote here:
, it is not ready for production use. That's why it was "silently" included. Once it is ready for production use, we will formally announce that (as per my Aug 3, 2022 comment here). |
@johnhubbard can you say whether HMM support, once it is ready for production use, will also be enabled in the closed-source kernel driver? I mean for pre-Turing cards like the Titan V (Volta). Can you also say whether Windows HMM support is planned eventually (even if only in TCC mode, or will it come to WDDM mode as well)? Thanks. |
NVIDIA proprietary driver (530.30.02) using A30:
|
@sdake it was said to work with the open kernel module only, but thanks for testing and confirming that it doesn't work on the proprietary driver (right now?). As said earlier, I hope the proprietary kernel driver gets it enabled too, as the only way for pre-Turing cards (but I think HMM is Pascal+ only, so it would only be needed for the Pascal and Volta generations). |
HMM depends upon the open source version of the driver. The open source version of the driver, in turn, only works on Turing and later GPUs. As it says in the r530_00 release notes, Ch. 44, "The open flavor of kernel modules supports Turing, Ampere, and forward. The open kernel modules cannot support GPUs before Turing, because the open kernel modules depend on the GPU System Processor (GSP) first introduced in Turing." Therefore, HMM is only available on Turing and later GPUs. |
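The dependency chain in the comment above (HMM needs the open kernel modules, and the open kernel modules need Turing or later) can be written down as a small check. This is only an illustrative sketch: the function names and the architecture list are hypothetical helpers, not part of any NVIDIA API, and the list covers only the generations discussed in this thread and a few later ones.

```python
# Illustrative only: GPU generations in release order. This list is an
# assumption made for the sketch, not something read from the driver.
GPU_ARCHS = ["Pascal", "Volta", "Turing", "Ampere", "Ada", "Hopper"]


def open_modules_supported(arch):
    """Open kernel modules depend on the GSP, first introduced in Turing."""
    # Raises ValueError for an architecture name that is not in GPU_ARCHS.
    return GPU_ARCHS.index(arch) >= GPU_ARCHS.index("Turing")


def hmm_available(arch, uses_open_modules):
    """HMM requires the open kernel modules, which in turn require Turing+."""
    return uses_open_modules and open_modules_supported(arch)


if __name__ == "__main__":
    for arch in GPU_ARCHS:
        print(arch, hmm_available(arch, uses_open_modules=True))
```

Under these assumptions a Volta card such as the Titan V fails the check regardless of which driver flavor is installed, which matches the answer given above.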
@johnhubbard thanks! So it seems there is also no Windows support planned (even for WSL2), right? In any driver mode, be it TCC or WDDM. |
Right, no Windows support exists. The HMM feature required OS kernel changes, in addition to changes in our driver stack here. The open source Linux kernel, and the kernel community, made it possible to make such changes. On Windows, however, Microsoft has not made any such corresponding changes, so HMM is unavailable there. |
@oscarbg all good. I reported the stack trace from the production driver available from NVIDIA's deb repos using: I will try the open-source kernel driver, and report kernel traces or other bad behavior here. Does the GRID driver function with TY! |
thanks @sdake..
so not working.. |
HMM support is disabled by default in the r530 driver. That's why your sample is failing as shown above. |
There is no need to recompile the kernel. You can set a driver load parameter to enable HMM. Look at modinfo, and /etc/modules. I don't have the exact commands as I am typing this on a phone. Cheers |
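The comment above does not name the load parameter, so the following is a minimal sketch for discovering it rather than a statement of what it is called. It assumes the standard Linux sysfs layout, in which a loaded module exposes its readable parameters under /sys/module/<module>/parameters/; the function name `find_module_params` and the search keyword are hypothetical choices made for this example.

```python
from pathlib import Path


def find_module_params(module, keyword, sysfs_root="/sys/module"):
    """Return {name: value} for a loaded module's parameters whose name
    contains `keyword` (case-insensitive). Empty dict if none are found."""
    params_dir = Path(sysfs_root) / module / "parameters"
    if not params_dir.is_dir():
        # Module not loaded, or it exposes no parameters in sysfs.
        return {}
    found = {}
    for entry in sorted(params_dir.iterdir()):
        if keyword.lower() in entry.name.lower():
            try:
                found[entry.name] = entry.read_text().strip()
            except OSError:
                # Parameter exists but is not readable by this user.
                found[entry.name] = None
    return found


if __name__ == "__main__":
    # sysfs uses underscores in module names, hence "nvidia_uvm".
    for name, value in find_module_params("nvidia_uvm", "hmm").items():
        print(f"{name} = {value}")
```

Running `modinfo` on the module, as suggested above, lists the same parameters along with their descriptions, including for a module that is not currently loaded.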
@johnhubbard @sdake thanks both. Tested with the needed uvm module parameter and it works! Only the open source kernel module works, as shared in this thread. Curious why the closed source module also accepts the same parameter but doesn't work. |
Yes, we're working hard to close the remaining feature gaps (such as Gsync) in the open kernel modules. I can't promise particular releases, here, but yes: everything should ultimately converge in the open kernel modules. |
May I ask your use case for HMM? For ML I am unconvinced there are not
superior methods to manage memory. I was originally interested in ML use
case. Also thanks for pointing out this is only within the open source
drivers.
I am running A30s, A40s, A2s, and A100s. So when I tested with commercial
drivers I saw no measurable benefit - because commercial drivers are not
yet enabled!
Thanks for your use case information.
Cheers
Steve
|
Has anyone benchmarked this approach for ML workloads versus, say, Microsoft's awesome work with DeepSpeed? This feels like a solution seeking a problem. Thank you, |
HMM is now supported with CUDA 12.2. Please see https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/ for more information. |
@johnhubbard I'll just ask this simply: What does it take to have HMM enabled and
Are there any BIOS settings or kernel params to be passed via GRUB that are not documented anywhere? edit: Going to try and upgrade the kernel to
|
@bhaveshdavda there is not a simple answer, unfortunately. A working kernel.org is in this repository as a dockerfile which you can build locally. After running https://github.com/artificialwisdomai/origin Please let me know how it goes. Thanks |
Update. I finally got this working in a Kubernetes environment no less with the NVIDIA GPU Operator. Notes:
|
It isn't necessary to upgrade the kernel, but instead, it is necessary to configure the one you have properly. There is one additional config option required. I didn't need There is a Docker to build Debian upstream kernel (should work fine with Ubuntu as well) here: https://github.com/artificialwisdomai/origin/tree/main/platform/Dockerfiles/linux-kernel |
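Checking whether a given kernel was built with a required option can be scripted. The sketch below parses the usual kernel config text format (`CONFIG_X=y`, `# CONFIG_X is not set`) and reports which of a list of options are not enabled. The comment above does not say which option it means, so `CONFIG_HMM_MIRROR` and `CONFIG_DEVICE_PRIVATE` appear here purely as illustrative inputs, and the /boot/config-<release> path is an assumption that holds on Debian and Ubuntu but not on every distribution.

```python
import platform
import re
from pathlib import Path

# Illustrative inputs only; the thread does not confirm which option is needed.
EXAMPLE_OPTIONS = ["CONFIG_HMM_MIRROR", "CONFIG_DEVICE_PRIVATE"]


def parse_kernel_config(text):
    """Map CONFIG_* names to their values; unset options map to 'n'."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.match(r"# (CONFIG_\w+) is not set$", line)
        if unset:
            config[unset.group(1)] = "n"
            continue
        assigned = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if assigned:
            config[assigned.group(1)] = assigned.group(2)
    return config


def missing_options(text, required):
    """Return the options from `required` that are neither built in (y)
    nor built as a module (m); absent options count as missing."""
    config = parse_kernel_config(text)
    return [opt for opt in required if config.get(opt, "n") not in ("y", "m")]


if __name__ == "__main__":
    path = Path(f"/boot/config-{platform.release()}")
    if path.is_file():
        print("not enabled:", missing_options(path.read_text(), EXAMPLE_OPTIONS))
    else:
        print(f"no kernel config found at {path}")
```

An empty list means every option passed in is enabled in the running kernel's config; anything listed would need a kernel configured with that option.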
@sdake I agree with your statement about disabling ATS not being required and I too assumed ATS is an important security feature for PCIe. And I also feel like the stock Ubuntu 22.04 LTS kernel Edit:
|
Hello, I have the As seen here: And most recently: |
NVIDIA Open GPU Kernel Modules Version
515.57 Release
Does this happen with the proprietary driver (of the same version) as well?
Yes
Operating System and Version
Ubuntu 20.04.1 LTS
Kernel Release
5.13.0-1029-aws
Hardware: GPU
NVIDIA T4
Describe the bug
Hello!
HMM support has been mentioned in several NVIDIA docs and presentations since 2017 (including the announcement of the open-source kernel modules), but it seems to be disabled here (and doesn't work when using the proprietary driver).
open-gpu-kernel-modules/kernel-open/nvidia-uvm/uvm_hmm.h
Lines 35 to 50 in d8f3bcf
I assume the referenced bug/task is internal. Is there any information you can share on what additional work needs to happen to enable UVM-HMM (or potentially a timeline?).
See references and a repro below.
To Reproduce
I'm testing HMM using the following code (from one of the presentations linked below):
It currently just prints
Results: 0
and the following message is in the output of dmesg
Bug Incidence
Always
nvidia-bug-report.log.gz
N/A
More Info
References:
Management