Comment 5 for bug 1855177

Rafael David Tinoco (rafaeldtinoco) wrote : Re: qemu nvdimm virtualization + linux 5.3.0-24-generic kernel PROBE ERROR

This is way more complex than I thought, and it's not so easy to address. Let's see if I can summarize the issue here. While developing the regression tests for ndctl, I kept hitting the same backtrace, over and over, when running the tests:

----
[ 271.705646] memory add fail, invalid altmap
[ 271.705677] WARNING: CPU: 5 PID: 886 at arch/x86/mm/init_64.c:852 add_pages+0x5d/0x70
[ 271.705679] Modules linked in: nls_iso8859_1 edac_mce_amd dax_pmem_compat nd_pmem device_dax nd_btt dax_pmem_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev aesni_intel aes_x86_64 crypto_simd input_leds cryptd glue_helper serio_raw mac_hid qemu_fw_cfg nfit sch_fq_codel ip_tables x_tables autofs4 virtio_net psmouse net_failover virtio_blk i2c_piix4 failover pata_acpi floppy
[ 271.705707] CPU: 5 PID: 886 Comm: ndctl Not tainted 5.3.0-24-generic #26-Ubuntu
[ 271.705709] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 271.705720] RIP: 0010:add_pages+0x5d/0x70
[ 271.705721] Code: 33 c2 01 76 20 48 89 15 99 33 c2 01 48 89 15 a2 33 c2 01 48 c1 e2 0c 48 03 15 97 96 39 01 48 89 15 48 0e c2 01 5b 41 5c 5d c3 <0f> 0b eb ba 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44
[ 271.705722] RSP: 0018:ffffba02c0d2bbf0 EFLAGS: 00010282
[ 271.705723] RAX: 00000000ffffffea RBX: 000000000017ffc0 RCX: 0000000000000000
[ 271.705723] RDX: 0000000000000000 RSI: ffff9aaa3da97448 RDI: ffff9aaa3da97448
[ 271.705724] RBP: ffffba02c0d2bc00 R08: ffff9aaa3da97448 R09: 0000000000000004
[ 271.705724] R10: 0000000000000000 R11: 0000000000000001 R12: 000000000003fe40
[ 271.705725] R13: 0000000000000001 R14: ffffba02c0d2bc48 R15: ffff9aa975efaaf8
[ 271.705727] FS: 00007f70a62d4bc0(0000) GS:ffff9aaa3da80000(0000) knlGS:0000000000000000
[ 271.705728] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 271.705729] CR2: 00005594a0aaa158 CR3: 0000000138110000 CR4: 00000000000406e0
[ 271.705731] Call Trace:
[ 271.705734] arch_add_memory+0x41/0x50
[ 271.705737] devm_memremap_pages+0x47c/0x640
[ 271.705740] pmem_attach_disk+0x173/0x610 [nd_pmem]
[ 271.705741] ? devm_memremap+0x67/0xa0
[ 271.705743] nd_pmem_probe+0x7f/0xa0 [nd_pmem]
[ 271.705745] nvdimm_bus_probe+0x6b/0x170
[ 271.705747] really_probe+0xfb/0x3a0
[ 271.705749] driver_probe_device+0x5f/0xe0
[ 271.705750] device_driver_attach+0x5d/0x70
[ 271.705751] bind_store+0xd3/0x110
[ 271.705753] drv_attr_store+0x24/0x30
[ 271.705754] sysfs_kf_write+0x3e/0x50
[ 271.705755] kernfs_fop_write+0x11e/0x1a0
[ 271.705757] __vfs_write+0x1b/0x40
[ 271.705758] vfs_write+0xb9/0x1a0
[ 271.705759] ksys_write+0x67/0xe0
[ 271.705760] __x64_sys_write+0x1a/0x20
[ 271.705762] do_syscall_64+0x5a/0x130
[ 271.705764] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 271.705765] RIP: 0033:0x7f70a6189327
[ 271.705767] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 271.705767] RSP: 002b:00007ffc616998b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 271.705768] RAX: ffffffffffffffda RBX: 00007f70a62d4ae8 RCX: 00007f70a6189327
[ 271.705769] RDX: 0000000000000007 RSI: 00005594a0aa01a0 RDI: 0000000000000006
[ 271.705769] RBP: 0000000000000006 R08: 0000000000000006 R09: 7375622f7379732f
[ 271.705770] R10: 0000000000000000 R11: 0000000000000246 R12: 00005594a0aa01a0
[ 271.705770] R13: 0000000000000001 R14: 0000000000000007 R15: 00007ffc61699908
[ 271.705772] ---[ end trace 7ee621e68332018c ]---
----

And I realized that I could NOT re-create the SECOND namespace (the first one always worked). First I had to read about how QEMU emulates nvdimms and check why namespaces were not persistent on QEMU's nvdimm emulation. Then I had to discover why the virtual nvdimms appeared to have no labels (RAW namespaces are always created by default). Finally I had to understand why the mapping was failing, to find the real issue.

First things first.

### QEMU emulated nvdimms:

https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt

Whenever the backing filesystem is not DAX capable (as it would be on REAL NVDIMM hardware, for example), all nvdimm data (written to the backing files) is gone after the instance is shut down.

### QEMU virtual nvdimms lack of labels:

Labels are not enabled unless requested on the QEMU command line. Quoting the same document:

Label
-----

QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
To enable label on vNVDIMM devices, users can simply add
"label-size=$SZ" option to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with label, the data in the label area at the end of file
   will be inaccessible to the guest. If any useful data (e.g. the
   meta-data of the file system) was stored there, the latter usage
   may result guest data corruption (e.g. breakage of guest file
   system).
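
To make note 2 concrete, here is a minimal sketch of the arithmetic it implies: the label area is carved off the end of the backend, so the guest sees less than the full backend size. The 1 GiB backend and the 256 KiB label size are assumptions picked for illustration (neither value is stated in this comment), and guest_visible_size() is a hypothetical helper, not a QEMU API.

```python
KIB = 1024
MIB = 1024 * KIB

def guest_visible_size(backend_bytes, label_bytes):
    """Bytes left for guest data when the label area sits at the end of the backend."""
    if label_bytes < 128 * KIB:
        # Note 1 in the quoted text: the minimal label size is 128KB.
        raise ValueError("label size must be at least 128KB")
    if label_bytes >= backend_bytes:
        raise ValueError("label area does not fit in the backend")
    return backend_bytes - label_bytes

backend = 1024 * MIB  # assumed backend file size (hypothetical)
label = 256 * KIB     # assumed label-size value (hypothetical)

usable = guest_visible_size(backend, label)
print(f"guest-visible size: {usable / MIB} MiB")
print(f"label area starts at backend offset {usable:#x}")
```

With these assumed values the guest-visible size comes out as 1023.75 MiB, which happens to match the size ndctl reports for the raw namespaces further down, but I have not confirmed that this is the exact configuration used here.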

### namespace1.0 always failing (with given back trace)

This is related to:

https://github.com/pmem/ndctl/issues/76

Specifically this comment:

https://github.com/pmem/ndctl/issues/76#issuecomment-440840503

"""
Linux needs 128MB alignment for each adjacent namespace. There isn't a fix because BIOS has no visibility or responsibility for Linux alignment constraints. Going forward Linux will eventually gain the capability to support fsdax mode with namespaces that collide within a section (128MB) until then the only workarounds are "raw" mode (not useful), or requiring fsdax namespaces to be created with "--align=1GB".

We faced something similar with section collisions with System RAM, but in that case we could interrogate the collision ahead of time. As it stands we don't find out about this collision until its too late. I'll try to think of something more clever, but the solution may devolve to just teaching the tooling to require large alignments.
"""

As we can see here:

rafaeldtinoco@ndctltest:~$ sudo cat /proc/iomem
...
100000000-13fffffff : System RAM
140000000-17ffbffff : Persistent Memory
  140000000-17ffbffff : namespace0.0
17ffc0000-1bff7ffff : Persistent Memory
  17ffc0000-1bff7ffff : namespace1.0
340000000-3bfffffff : PCI Bus 0000:00

When using 2 nvdimms in QEMU, the two regions (and thus their namespaces) are adjacent: region1 starts at 0x17ffc0000, which is not 128MB aligned, so the end of region0 and the start of region1 fall within the same 128MB section. You can get a RAW namespace to work, but no other mode:

----

rafaeldtinoco@ndctltest:~$ sudo ndctl disable-region all
disabled 2 regions
rafaeldtinoco@ndctltest:~$ sudo ndctl zero-labels all
zeroed 2 nmems
rafaeldtinoco@ndctltest:~$ sudo ndctl enable-region all
enabled 2 regions
rafaeldtinoco@ndctltest:~$ sudo ndctl list -N
rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -r region0 -m raw
{
  "dev":"namespace0.0",
  "mode":"raw",
  "size":"1023.75 MiB (1073.48 MB)",
  "uuid":"54921448-1043-4779-bd77-bb77f70b11eb",
  "sector_size":512,
  "blockdev":"pmem0"
}
rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -r region1 -m raw
{
  "dev":"namespace1.0",
  "mode":"raw",
  "size":"1023.75 MiB (1073.48 MB)",
  "uuid":"c5d32b36-c4b4-4c37-a401-0209e2b2e58a",
  "sector_size":512,
  "blockdev":"pmem1"
}
----

But if I try any other namespace mode:

----
rafaeldtinoco@ndctltest:~$ sudo ndctl disable-region all
disabled 2 regions
rafaeldtinoco@ndctltest:~$ sudo ndctl zero-labels all
zeroed 2 nmems
rafaeldtinoco@ndctltest:~$ sudo ndctl enable-region all
enabled 2 regions
rafaeldtinoco@ndctltest:~$ sudo ndctl list -N
rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -r region0 -m fsdax
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"1004.00 MiB (1052.77 MB)",
  "uuid":"5c8e1059-2714-4e9a-b47f-33bb617d4489",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}
rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -r region1 -m fsdax
libndctl: ndctl_pfn_enable: pfn1.0: failed to enable
  Error: namespace1.0: failed to enable

failed to create namespace: No such device or address
----

I hit the boundary (alignment) problem.
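
The collision can be checked from the /proc/iomem output shown earlier. The following is only a sketch: it hardcodes the ranges pasted in this comment, assumes the 128MB section size quoted from the ndctl issue, and uses helper names (pmem_regions, sections, shared_sections) that are made up for illustration.

```python
import re

SECTION = 128 * 1024 * 1024  # 128MB section size, as quoted from the ndctl issue

# Ranges copied from the /proc/iomem output earlier in this comment.
IOMEM = """\
100000000-13fffffff : System RAM
140000000-17ffbffff : Persistent Memory
  140000000-17ffbffff : namespace0.0
17ffc0000-1bff7ffff : Persistent Memory
  17ffc0000-1bff7ffff : namespace1.0
340000000-3bfffffff : PCI Bus 0000:00
"""

# Only top-level (unindented) "Persistent Memory" lines match this pattern.
PMEM_LINE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) : Persistent Memory$")

def pmem_regions(text):
    """Return (start, end) for each top-level Persistent Memory range."""
    regions = []
    for line in text.splitlines():
        match = PMEM_LINE.match(line)
        if match:
            regions.append((int(match.group(1), 16), int(match.group(2), 16)))
    return regions

def sections(start, end):
    """Indexes of the 128MB sections touched by the inclusive range."""
    return set(range(start // SECTION, end // SECTION + 1))

def shared_sections(regions):
    """For each pair of neighbouring regions, the sections both of them touch."""
    return [
        sorted(sections(*first) & sections(*second))
        for first, second in zip(regions, regions[1:])
    ]

regions = pmem_regions(IOMEM)
for index, (start, end) in enumerate(regions):
    print(f"region{index}: start {start:#x}, offset into section {start % SECTION:#x}")
print("sections shared by neighbouring regions:", shared_sections(regions))
```

For the ranges above, region0 starts on a section boundary while region1 starts 0x7fc0000 bytes into section 47, which region0 also ends in. That shared section is where the fsdax mapping for namespace1.0 collides.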

SeaBIOS could be fixed as described in:

https://github.com/pmem/ndctl/issues/76#issuecomment-440848371

by making sure the alignment is correct. Since the kernel side is already being taken care of in:

https://github.com/0day-ci/linux/commit/e50ad2650daecc1135bb28befd278fa291b6afe9

it looks like QEMU would have to address this alignment in this case.

For now, the ndctl tests being written for:

https://bugs.launchpad.net/ubuntu/+source/ndctl/+bug/1853506

will have to deal with a single virtual nvdimm.