Simulating complex memory with Qemu

Brice Goglin edited this page Mar 31, 2022 · 19 revisions

Soft-Reserved Memory

EFI attributes may be used to mark some memory ranges as "soft-reserved" instead of normal RAM so that the kernel doesn't use them by default. This is useful for memory with different performance that should be reserved for specific uses/applications. These ranges are exposed as DAX devices by default, and can later be turned into NUMA nodes.

Prerequisites

This requires booting in UEFI (instead of legacy BIOS), see the Qemu command-line below. For passing something like efi_fake_mem=1G@4G:0x40000 (to mark the 4-5GB range as soft-reserved; 0x40000 is the EFI_MEMORY_SP "specific purpose" attribute), the kernel must have CONFIG_EFI_FAKE_MEMMAP=y (not enabled in Debian kernels by default).
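
Both prerequisites can be checked from a running guest. The following is a minimal sketch: the helper names has_fake_memmap and booted_in_uefi are made up for illustration, and the location of the kernel config file is an assumption (Debian installs it as /boot/config-$(uname -r), other distributions may differ).

```shell
# Hypothetical helper: succeed if the given kernel config file
# enables CONFIG_EFI_FAKE_MEMMAP.
has_fake_memmap() {
    grep -q '^CONFIG_EFI_FAKE_MEMMAP=y' "$1"
}

# Hypothetical helper: succeed if the system was booted through UEFI.
# The kernel only creates /sys/firmware/efi on EFI boots; the optional
# argument overrides that path (useful for testing).
booted_in_uefi() {
    [ -d "${1:-/sys/firmware/efi}" ]
}
```

For instance, has_fake_memmap "/boot/config-$(uname -r)" && booted_in_uefi && echo ready prints "ready" only when both conditions hold.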

Choosing which memory range

The 0-4GB physical memory range is quite complicated when booting Qemu since it contains lots of reserved ranges, including 3-4GB reserved for PCI stuff. It's better to use ranges after 4GB to find large ranges of normal memory. So make the first NUMA node 3GB and use the other nodes: they will be mapped after the PCI stuff, after 4GB.

If two NUMA nodes whose memory ranges are consecutive are marked as soft-reserved, it looks like we get a single range with the locality of the first one. So if you want two separate memories, don't use consecutive ranges: for instance, use two non-consecutive NUMA nodes.
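
To pick non-consecutive ranges, it helps to write down where each node should land. The sketch below only encodes the assumption made in the text (a 3GB first node below the PCI hole, every other node laid out back to back from 4GB); node_ranges is a hypothetical helper and does not query Qemu.

```shell
# Print the assumed guest-physical range of each NUMA node, in GB.
# Usage: node_ranges SIZE_NODE0 SIZE_NODE1 ...   (sizes in GB)
node_ranges() {
    node=0
    start=0
    for size in "$@"; do
        echo "node$node ${start}G-$((start + size))G"
        start=$((start + size))
        # Assumption: nothing is mapped in the 3-4GB PCI hole, so the
        # node that follows the first one starts at 4GB.
        if [ "$node" -eq 0 ]; then start=4; fi
        node=$((node + 1))
    done
}
```

With node_ranges 3 1 1 1, node1 is expected at 4G-5G and node3 at 6G-7G, so these two are separated by node2 and are not consecutive.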

Configuring Qemu with 2 NUMA nodes + 2 CPU-less NUMA nodes

kvm \
 -drive if=pflash,format=raw,file=./OVMF.fd \
 -drive media=disk,format=qcow2,file=efi.qcow2 \
 -smp 4 -m 6G \
 -object memory-backend-ram,size=3G,id=m0 \
 -object memory-backend-ram,size=1G,id=m1 \
 -object memory-backend-ram,size=1G,id=m2 \
 -object memory-backend-ram,size=1G,id=m3 \
 -numa node,nodeid=0,memdev=m0,cpus=0-1 \
 -numa node,nodeid=1,memdev=m1,cpus=2-3 \
 -numa node,nodeid=2,memdev=m2 \
 -numa node,nodeid=3,memdev=m3

OVMF is required for booting in UEFI mode (during both VM install and later).

Marking NUMA nodes as soft-reserved and getting hmem DAX device

On the kernel boot command-line, pass efi_fake_mem=1G@4G:0x40000,1G@6G:0x40000 to mark NUMA node#1 (one with CPUs) and node#3 (CPU-less) as soft-reserved. Their memory disappears, and a DAX device appears for each of them.

% cat /proc/iomem
100000000-13fffffff : hmem.0              <- node #1 is soft-reserved
  100000000-13fffffff : Soft Reserved
    100000000-13fffffff : dax0.0
140000000-17fffffff : System RAM          <- node #2 is normal memory
180000000-1bfffffff : hmem.1              <- node #3 is soft-reserved
  180000000-1bfffffff : Soft Reserved
    180000000-1bfffffff : dax1.0
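
The link between an efi_fake_mem entry and the /proc/iomem range above is plain arithmetic on GB offsets. A small sketch, with hypothetical helper names, assuming sizes and offsets are whole numbers of GB and a shell with 64-bit arithmetic:

```shell
# EFI_MEMORY_SP ("specific purpose") attribute, the 0x40000 used above.
EFI_MEMORY_SP=0x40000

# Print the efi_fake_mem entry for SIZE_GB gigabytes starting at START_GB.
# Usage: fake_mem_entry SIZE_GB START_GB
fake_mem_entry() {
    echo "${1}G@${2}G:${EFI_MEMORY_SP}"
}

# Print the matching /proc/iomem range (hexadecimal, inclusive end).
# Usage: iomem_range SIZE_GB START_GB
iomem_range() {
    printf '%x-%x\n' "$(( $2 << 30 ))" "$(( (($2 + $1) << 30) - 1 ))"
}
```

Here iomem_range 1 4 gives 100000000-13fffffff (node#1) and iomem_range 1 6 gives 180000000-1bfffffff (node#3).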

Those DAX devices under /sys/bus/dax/devices point to platform hmem devices but there isn't much useful in there.

dax0.0 -> ../../../devices/platform/hmem.0/dax0.0
dax1.0 -> ../../../devices/platform/hmem.1/dax1.0

dax0.0 has target_node=numa_node=1 in its sysfs attributes because node1 is online thanks to existing CPUs.

node#3 is offline since it contains neither CPUs nor RAM. dax1.0 has target_node=3 as expected, but numa_node=0 since this must be an online node during boot. node#0 was chosen because it's close (we didn't specify any distance matrix on the Qemu command-line, the default 10=local, 20=remote is used, hence 20 is the minimal distance from node#3 to online nodes, and node#0 is the first one of those).
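
The two attributes discussed above can be dumped for every DAX device with a short loop. This is only a sketch: dax_nodes is a hypothetical helper, and the optional directory argument exists so that it can be pointed at a copy of the sysfs tree.

```shell
# Print target_node and numa_node for each DAX device.
# Usage: dax_nodes [DIR]   (DIR defaults to /sys/bus/dax/devices)
dax_nodes() {
    for dev in "${1:-/sys/bus/dax/devices}"/dax*; do
        # Skip the unexpanded pattern when no device exists.
        [ -e "$dev/target_node" ] || continue
        echo "${dev##*/} target_node=$(cat "$dev/target_node") numa_node=$(cat "$dev/numa_node")"
    done
}
```

In the configuration above, this should report target_node=1 numa_node=1 for dax0.0 and target_node=3 numa_node=0 for dax1.0.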

Making NUMA nodes out of soft-reserved memory

% daxctl reconfigure-device --mode=system-ram all
% cat /proc/iomem
[...]
100000000-13fffffff : hmem.0
  100000000-13fffffff : Soft Reserved
    100000000-13fffffff : dax0.0
      100000000-13fffffff : System RAM (kmem) <- node#1 is back as a NUMA node
140000000-17fffffff : System RAM
180000000-1bfffffff : hmem.1
  180000000-1bfffffff : Soft Reserved
    180000000-1bfffffff : dax1.0
      180000000-1bfffffff : System RAM (kmem) <- node#3 is back as a NUMA node
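
To check quickly which ranges were converted, the "System RAM (kmem)" lines can be filtered out of /proc/iomem. A minimal sketch with a hypothetical helper name; note that /proc/iomem only shows real addresses to root, unprivileged users see zeros.

```shell
# Print the address ranges that daxctl turned into "System RAM (kmem)".
# Usage: kmem_ranges [FILE]   (FILE defaults to /proc/iomem)
kmem_ranges() {
    # Keep only the kmem lines, then strip the indentation and
    # everything after the address range.
    grep 'System RAM (kmem)' "${1:-/proc/iomem}" | sed 's/^ *//; s/ :.*//'
}
```

With the output above, kmem_ranges should list 100000000-13fffffff and 180000000-1bfffffff.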

NVDIMMs

NVDIMMs in Qemu

Add -machine pc,nvdimm=on to the Qemu command-line to enable NVDIMMs, then set maxmem in -m to the RAM+NVDIMM size, and slots in -m to the number of NVDIMMs. Then create the object and device, for instance:

kvm \
 -machine pc,nvdimm=on \
 -drive if=pflash,format=raw,file=./OVMF.fd \
 -drive media=disk,format=qcow2,file=efi.qcow2 \
 -smp 4 \
 -m 6G,slots=1,maxmem=7G \
 -object memory-backend-ram,size=3G,id=ram0 \
 -object memory-backend-ram,size=1G,id=ram1 \
 -object memory-backend-ram,size=1G,id=ram2 \
 -object memory-backend-ram,size=1G,id=ram3 \
 -numa node,nodeid=0,memdev=ram0,cpus=0-1 \
 -numa node,nodeid=1,memdev=ram1,cpus=2-3 \
 -numa node,nodeid=2,memdev=ram2 \
 -numa node,nodeid=3,memdev=ram3 \
 -numa node,nodeid=4 \
 -object memory-backend-file,id=nvdimm1,share=on,mem-path=nvdimm.img,size=1G \
 -device nvdimm,id=nvdimm1,memdev=nvdimm1,unarmed=off,node=4
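
The -m value follows directly from the rule above (maxmem = RAM + NVDIMM sizes, slots = number of NVDIMMs). A small sketch, where mem_option is a hypothetical helper and all sizes are assumed to be whole numbers of GB:

```shell
# Build the argument of -m from the RAM size and the NVDIMM sizes.
# Usage: mem_option RAM_GB NVDIMM1_GB [NVDIMM2_GB ...]
mem_option() {
    ram=$1
    shift
    total=$ram
    for nv in "$@"; do
        total=$((total + nv))
    done
    # After the shift, $# is the number of NVDIMMs.
    echo "${ram}G,slots=$#,maxmem=${total}G"
}
```

mem_option 6 1 prints 6G,slots=1,maxmem=7G, which is the value used in the command-line above.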

DAX and NUMA node in Linux

You'll get a pmem block device in Linux (pmem2 in the output below), from namespace2.0 (likely not 0.0 because dax0.0 and dax1.0 are used for soft-reserved memory in this config):

% ndctl list
[
  {
    "dev":"namespace2.0",
    "mode":"fsdax",
    "map":"dev",
    "size":1054867456,
    "uuid":"937b5655-a581-4961-bbbc-f6a567a86b0f",
    "sector_size":512,
    "align":2097152,
    "blockdev":"pmem2"
  }
]

Convert it to a DAX device with:

% ndctl create-namespace -f -e namespace2.0 -p pmem -t devdax

That DAX device now points to an ndctl region device:

/sys/bus/dax/devices/dax2.0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/dax2.0/dax2.0

That region contains a single mapping since there's only one NVDIMM here, and its type is nvdimm:

% cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/mappings
1
% cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region2/mapping0
nmem0,0,1073741824,0
% cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/devtype
nvdimm

As usual, that DAX can be made a NUMA node:

% daxctl reconfigure-device --mode=system-ram dax2.0

SLIT distances

All values must be given individually. To make node#2 (HBM) and node#4 (NVDIMM) close to node#0, and node#3 (HBM) close to node#1:

 -numa dist,src=0,dst=0,val=10 -numa dist,src=0,dst=1,val=20 -numa dist,src=0,dst=2,val=12 -numa dist,src=0,dst=3,val=22 -numa dist,src=0,dst=4,val=15 \
 -numa dist,src=1,dst=0,val=20 -numa dist,src=1,dst=1,val=10 -numa dist,src=1,dst=2,val=22 -numa dist,src=1,dst=3,val=12 -numa dist,src=1,dst=4,val=25 \
 -numa dist,src=2,dst=0,val=12 -numa dist,src=2,dst=1,val=22 -numa dist,src=2,dst=2,val=10 -numa dist,src=2,dst=3,val=25 -numa dist,src=2,dst=4,val=30 \
 -numa dist,src=3,dst=0,val=22 -numa dist,src=3,dst=1,val=12 -numa dist,src=3,dst=2,val=25 -numa dist,src=3,dst=3,val=10 -numa dist,src=3,dst=4,val=30 \
 -numa dist,src=4,dst=0,val=15 -numa dist,src=4,dst=1,val=25 -numa dist,src=4,dst=2,val=30 -numa dist,src=4,dst=3,val=30 -numa dist,src=4,dst=4,val=10
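
Typing 25 options by hand is error-prone, so the list can be generated from the distance matrix instead. The sketch below uses a hypothetical helper, slit_args, that takes one quoted row per source node and prints the corresponding -numa dist options on a single line.

```shell
# Print one "-numa dist" option per matrix entry.
# Usage: slit_args "ROW0" "ROW1" ...
# Each row is a space-separated list of distances from that source
# node to nodes 0, 1, 2, ...
slit_args() {
    src=0
    for row in "$@"; do
        dst=0
        # $row is intentionally unquoted so that it splits into values.
        for val in $row; do
            printf ' -numa dist,src=%d,dst=%d,val=%d' "$src" "$dst" "$val"
            dst=$((dst + 1))
        done
        src=$((src + 1))
    done
    echo
}
```

The matrix above would be written as slit_args "10 20 12 22 15" "20 10 22 12 25" "12 22 10 25 30" "22 12 25 10 30" "15 25 30 30 10", and the result can be appended to the kvm command-line.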

HMAT

TODO
