Simulating complex memory with Qemu
EFI attributes may be used to mark some memory ranges as "soft-reserved" instead of normal RAM so that the kernel doesn't use them by default. This is useful for memory with different performance that should be reserved for specific uses/applications. These ranges are exposed as DAX devices by default and can later be turned into NUMA nodes.
This requires booting in UEFI mode (instead of legacy BIOS), see the Qemu command-line below. For passing something like efi_fake_mem=1G@4G:0x40000 (to mark the 4-5GB range as soft-reserved), the kernel must have CONFIG_EFI_FAKE_MEMMAP=y (not enabled in Debian kernels by default).
The 0-4GB physical memory range is quite complicated when booting Qemu since it contains lots of reserved ranges, including 3-4GB reserved for PCI stuff. Large ranges of normal memory are easier to find after 4GB. So make the first NUMA node 3GB and soft-reserve only the other nodes: they will be mapped after the PCI stuff, starting at 4GB.
If a single memory range is marked as soft-reserved but covers multiple nodes, strange things happen: the kernel creates a DAX device covering both nodes (with the locality of the first), fails to entirely register it, and then creates separate DAX devices as expected. To avoid issues, it's better to specify two ranges even if they are consecutive.
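The range arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the helper name `fake_mem_ranges` and its parameters are hypothetical, and it assumes that node 0 sits entirely below the PCI hole while all other nodes are mapped contiguously starting at 4GB. It emits one range per node, so consecutive soft-reserved nodes are never merged.

```python
# 0x40000 is the attribute value used in the efi_fake_mem examples of this page
# (the EFI "specific purpose" memory attribute).
EFI_MEMORY_SP = 0x40000


def fake_mem_ranges(node_sizes_gib, soft_reserved_nodes, first_high_gib=4):
    """Build an efi_fake_mem= value with one range per soft-reserved node.

    Assumption: node 0 lives below the PCI hole and every other node is
    mapped contiguously starting at first_high_gib.
    """
    ranges = []
    start = first_high_gib
    for node, size in enumerate(node_sizes_gib):
        if node == 0:
            continue  # node 0 is below 4GB, it is not soft-reserved here
        if node in soft_reserved_nodes:
            ranges.append(f"{size}G@{start}G:{EFI_MEMORY_SP:#x}")
        start += size  # the next node starts right after this one
    return ",".join(ranges)


# Nodes of 3G, 1G, 1G and 1G; soft-reserve node#1 and node#3.
print("efi_fake_mem=" + fake_mem_ranges([3, 1, 1, 1], {1, 3}))
```

Soft-reserving two adjacent nodes, for instance `{1, 2}`, yields two separate 1G ranges rather than a single 2G one, which matches the advice above.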
kvm \
-drive if=pflash,format=raw,file=./OVMF.fd \
-drive media=disk,format=qcow2,file=efi.qcow2 \
-smp 4 -m 6G \
-object memory-backend-ram,size=3G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-object memory-backend-ram,size=1G,id=m2 \
-object memory-backend-ram,size=1G,id=m3 \
-numa node,nodeid=0,memdev=m0,cpus=0-1 \
-numa node,nodeid=1,memdev=m1,cpus=2-3 \
-numa node,nodeid=2,memdev=m2 \
-numa node,nodeid=3,memdev=m3
OVMF is required for booting in UEFI mode (during both VM install and later).
On the kernel boot command-line, pass efi_fake_mem=1G@4G:0x40000,1G@6G:0x40000 to mark NUMA node#1 (one with CPUs) and node#3 (CPU-less) as soft-reserved. Their memory disappears, and a DAX device appears for each of them.
% cat /proc/iomem
100000000-13fffffff : hmem.0 <- node #1 is soft-reserved
100000000-13fffffff : Soft Reserved
100000000-13fffffff : dax0.0
140000000-17fffffff : System RAM <- node #2 is normal memory
180000000-1bfffffff : hmem.1 <- node #3 is soft-reserved
180000000-1bfffffff : Soft Reserved
180000000-1bfffffff : dax1.0
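To check that the soft-reserved ranges landed where expected, the listing can be parsed with a small script. The following is a minimal sketch: the function name `soft_reserved_ranges` is made up for this example, and the sample text is the excerpt shown above with an assumed indentation. On a real system, /proc/iomem usually has to be read as root to see actual addresses.

```python
import re

# One /proc/iomem line: optional indentation, "start-end : name" in hexadecimal.
IOMEM_LINE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : (.+?)\s*$")
GIB = 1 << 30


def soft_reserved_ranges(iomem_text):
    """Return (start, size) in bytes for every 'Soft Reserved' entry."""
    found = []
    for line in iomem_text.splitlines():
        match = IOMEM_LINE.match(line)
        if match and match.group(3) == "Soft Reserved":
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)  # the end address is inclusive
            found.append((start, end - start + 1))
    return found


sample = """\
100000000-13fffffff : hmem.0
  100000000-13fffffff : Soft Reserved
    100000000-13fffffff : dax0.0
140000000-17fffffff : System RAM
180000000-1bfffffff : hmem.1
  180000000-1bfffffff : Soft Reserved
    180000000-1bfffffff : dax1.0
"""

for start, size in soft_reserved_ranges(sample):
    print(f"{size // GIB}G@{start // GIB}G")
```

For the excerpt above this prints `1G@4G` and `1G@6G`, the two ranges given to efi_fake_mem.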
Those DAX devices under /sys/bus/dax/devices point to platform hmem devices but there isn't much useful in there.
dax0.0 -> ../../../devices/platform/hmem.0/dax0.0
dax1.0 -> ../../../devices/platform/hmem.1/dax1.0
dax0.0 has target_node=numa_node=1 in its sysfs attributes because node#1 is online thanks to its CPUs.
The node of dax1.0 (node#3) is offline since it contains neither CPUs nor RAM. dax1.0 has target_node=3 as expected, but numa_node=0 since this must be an online node during boot. node#0 was chosen because it is the closest: we didn't specify any distance matrix on the Qemu command-line, so the default 10=local, 20=remote is used. Hence 20 is the minimal distance from node#3 to online nodes, and node#0 is the first of those.
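The selection rule described above (smallest distance, lowest node number on ties) can be modeled as follows. This is a toy model of the observed behaviour, not the kernel code; `fallback_numa_node` and `default_distance` are names invented for this sketch.

```python
def default_distance(a, b):
    """Default Qemu distances when no matrix is given: 10 local, 20 remote."""
    return 10 if a == b else 20


def fallback_numa_node(target, online_nodes, distance=default_distance):
    """Pick the online node closest to target, lowest node number on ties.

    min() returns the first minimal element, so iterating over the nodes
    in ascending order implements the tie-break.
    """
    return min(sorted(online_nodes), key=lambda node: distance(target, node))


# node#3 is offline; node#0, node#1 and node#2 are online in the config above.
print(fallback_numa_node(3, {0, 1, 2}))
```

With the default distances every online node is at distance 20 from node#3, so the helper returns 0. With a distance function that puts node#1 closer to node#3, the same helper would return 1.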
% daxctl reconfigure-device --mode=system-ram all
% cat /proc/iomem
[...]
100000000-13fffffff : hmem.0
100000000-13fffffff : Soft Reserved
100000000-13fffffff : dax0.0
100000000-13fffffff : System RAM (kmem) <- node#1 is back as a NUMA node
140000000-17fffffff : System RAM
180000000-1bfffffff : hmem.1
180000000-1bfffffff : Soft Reserved
180000000-1bfffffff : dax1.0
180000000-1bfffffff : System RAM (kmem) <- node#3 is back as a NUMA node
Add -machine pc,nvdimm=on to qemu to enable NVDIMMs, then make maxmem in -m equal to the RAM+NVDIMM size, and slots in -m equal to the number of NVDIMMs. Then create the object and device, for instance:
kvm \
-machine pc,nvdimm=on \
-drive if=pflash,format=raw,file=./OVMF.fd \
-drive media=disk,format=qcow2,file=efi.qcow2 \
-smp 4 \
-m 6G,slots=1,maxmem=7G \
-object memory-backend-ram,size=3G,id=ram0 \
-object memory-backend-ram,size=1G,id=ram1 \
-object memory-backend-ram,size=1G,id=ram2 \
-object memory-backend-ram,size=1G,id=ram3 \
-numa node,nodeid=0,memdev=ram0,cpus=0-1 \
-numa node,nodeid=1,memdev=ram1,cpus=2-3 \
-numa node,nodeid=2,memdev=ram2 \
-numa node,nodeid=3,memdev=ram3 \
-numa node,nodeid=4 \
-object memory-backend-file,id=nvdimm1,share=on,mem-path=nvdimm.img,size=1G \
-device nvdimm,id=nvdimm1,memdev=nvdimm1,unarmed=off,node=4
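The -m values used in the command above follow directly from the rule on maxmem and slots. A minimal sketch, with a hypothetical helper name and sizes expressed in GB:

```python
def memory_option(ram_gib, nvdimm_sizes_gib):
    """Build the Qemu -m option for a given RAM size and list of NVDIMM sizes."""
    slots = len(nvdimm_sizes_gib)              # one slot per NVDIMM
    maxmem = ram_gib + sum(nvdimm_sizes_gib)   # RAM plus all NVDIMMs
    return f"-m {ram_gib}G,slots={slots},maxmem={maxmem}G"


# 6G of RAM (3G+1G+1G+1G) and a single 1G NVDIMM.
print(memory_option(6, [1]))
```

This prints `-m 6G,slots=1,maxmem=7G`, the option used above.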
You'll get a pmem block device in Linux (pmem2 here), from namespace2.0 (likely not 0.0 because dax0.0 and dax1.0 are used for soft-reserved memory in this config):
% ndctl list
[
  {
    "dev":"namespace2.0",
    "mode":"fsdax",
    "map":"dev",
    "size":1054867456,
    "uuid":"937b5655-a581-4961-bbbc-f6a567a86b0f",
    "sector_size":512,
    "align":2097152,
    "blockdev":"pmem2"
  }
]
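Note that the reported size is slightly smaller than the 1G backing file. The sketch below parses the JSON printed by ndctl and computes the difference. The explanation that the missing space holds namespace metadata (because of "map":"dev") is an assumption, not something stated on this page.

```python
import json

# Output of "ndctl list" copied from above.
ndctl_output = """
[
  {
    "dev":"namespace2.0",
    "mode":"fsdax",
    "map":"dev",
    "size":1054867456,
    "uuid":"937b5655-a581-4961-bbbc-f6a567a86b0f",
    "sector_size":512,
    "align":2097152,
    "blockdev":"pmem2"
  }
]
"""

NVDIMM_SIZE = 1 << 30  # size=1G given to the memory-backend-file object
MIB = 1 << 20

namespace = json.loads(ndctl_output)[0]
# Space not available in the namespace, presumably used for metadata.
overhead_mib = (NVDIMM_SIZE - namespace["size"]) // MIB
print(f'{namespace["dev"]}: {overhead_mib} MiB not available')
```

Here the difference is 18 MiB, a multiple of the 2 MiB alignment reported in the "align" field.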
Convert it to DAX with
% ndctl create-namespace -f -e namespace2.0 -t pmem -m devdax
That DAX device now points to an ndctl region device:
/sys/bus/dax/devices/dax2.0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.0/dax2.0
That region contains a single mapping since there's only one NVDIMM here, and its type is nvdimm:
% cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/mappings
1
% cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/mapping0
nmem0,0,1073741824,0
% cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/devtype
nvdimm
As usual, that DAX can be made a NUMA node:
% daxctl reconfigure-device --mode=system-ram dax2.0
NUMA distances may be specified with -numa dist. All values must be given individually. To make node#2 (HBM) and node#4 (NVDIMM) close to node#0, and node#3 (HBM) close to node#1:
-numa dist,src=0,dst=0,val=10 -numa dist,src=0,dst=1,val=20 -numa dist,src=0,dst=2,val=12 -numa dist,src=0,dst=3,val=22 -numa dist,src=0,dst=4,val=15 \
-numa dist,src=1,dst=0,val=20 -numa dist,src=1,dst=1,val=10 -numa dist,src=1,dst=2,val=22 -numa dist,src=1,dst=3,val=12 -numa dist,src=1,dst=4,val=25 \
-numa dist,src=2,dst=0,val=12 -numa dist,src=2,dst=1,val=22 -numa dist,src=2,dst=2,val=10 -numa dist,src=2,dst=3,val=25 -numa dist,src=2,dst=4,val=30 \
-numa dist,src=3,dst=0,val=22 -numa dist,src=3,dst=1,val=12 -numa dist,src=3,dst=2,val=25 -numa dist,src=3,dst=3,val=10 -numa dist,src=3,dst=4,val=30 \
-numa dist,src=4,dst=0,val=15 -numa dist,src=4,dst=1,val=25 -numa dist,src=4,dst=2,val=30 -numa dist,src=4,dst=3,val=30 -numa dist,src=4,dst=4,val=10
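Writing all these options by hand is error-prone, so they can be generated from a matrix instead. A minimal sketch, assuming a square matrix indexed as `matrix[src][dst]`; the helper name `numa_dist_options` is hypothetical, and the values are the ones listed above.

```python
def numa_dist_options(matrix):
    """Return one '-numa dist' option per (src, dst) pair of a square matrix."""
    count = len(matrix)
    options = []
    for src, row in enumerate(matrix):
        if len(row) != count:
            raise ValueError(f"row {src} has {len(row)} entries, expected {count}")
        for dst, value in enumerate(row):
            options.append(f"-numa dist,src={src},dst={dst},val={value}")
    return options


# Same values as above: node#2 and node#4 close to node#0, node#3 close to node#1.
DISTANCES = [
    [10, 20, 12, 22, 15],
    [20, 10, 22, 12, 25],
    [12, 22, 10, 25, 30],
    [22, 12, 25, 10, 30],
    [15, 25, 30, 30, 10],
]

# Print the options with line continuations, ready to paste into a command.
print(" \\\n".join(numa_dist_options(DISTANCES)))
```

Keeping the matrix in one place also makes it easy to check that it is symmetric before booting the VM.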
-machine hmat=on requires an initiator=X attribute for each NUMA node.
It's also possible to specify HMAT values, TODO.
CXL doesn't work with Debian qemu-system-x86_64 1:6.2+dfsg-3 as of 2022/03/31 (CXL is in the manpage, but the pxb-cxl device and machine type=q35 aren't accepted).