QEMU emulated nvdimm regions alignment need (128MB) or ndctl create-namespace namespace1.0 might fail

Bug #1855177 reported by Rafael David Tinoco
This bug affects 1 person
Affects          Status         Importance  Assigned to  Milestone
linux (Ubuntu)   Fix Released   Medium      Unassigned
  Focal          Fix Released   Medium      Unassigned
ndctl (Ubuntu)   Invalid        Medium      Unassigned
  Focal          Invalid        Medium      Unassigned
qemu (Ubuntu)    Invalid        Medium      Unassigned
  Focal          Invalid        Medium      Unassigned

Bug Description

I got a probe error for pfn1.0 (from both pfn0.0 and pfn1.0) when dealing with ndctl:

----
[11257.765457] memory add fail, invalid altmap
[11257.765489] WARNING: CPU: 6 PID: 5680 at arch/x86/mm/init_64.c:852 add_pages+0x5d/0x70
[11257.765489] Modules linked in: nls_iso8859_1 edac_mce_amd crct10dif_pclmul crc32_pclmul dax_pmem_compat device_dax dax_pmem_core nd_pmem nd_btt ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper input_leds joydev mac_hid nfit serio_raw qemu_fw_cfg sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover psmouse failover pata_acpi virtio_blk i2c_piix4 floppy
[11257.765505] CPU: 6 PID: 5680 Comm: ndctl Not tainted 5.3.0-24-generic #26-Ubuntu
[11257.765505] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[11257.765507] RIP: 0010:add_pages+0x5d/0x70
[11257.765509] Code: 33 c2 01 76 20 48 89 15 99 33 c2 01 48 89 15 a2 33 c2 01 48 c1 e2 0c 48 03 15 97 96 39 01 48 89 15 48 0e c2 01 5b 41 5c 5d c3 <0f> 0b eb ba 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44
[11257.765509] RSP: 0018:ffffa360c09dfbf0 EFLAGS: 00010282
[11257.765510] RAX: 00000000ffffffea RBX: 000000000017ffe0 RCX: 0000000000000000
[11257.765511] RDX: 0000000000000000 RSI: ffff8acb7db17448 RDI: ffff8acb7db17448
[11257.765512] RBP: ffffa360c09dfc00 R08: ffff8acb7db17448 R09: 0000000000000004
[11257.765512] R10: 0000000000000000 R11: 0000000000000001 R12: 000000000003fe20
[11257.765513] R13: 0000000000000001 R14: ffffa360c09dfc48 R15: ffff8acb7a7226f8
[11257.765515] FS: 00007febc9fd6bc0(0000) GS:ffff8acb7db00000(0000) knlGS:0000000000000000
[11257.765516] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11257.765517] CR2: 000055eec8aab398 CR3: 000000013a8fa000 CR4: 00000000000406e0
[11257.765519] Call Trace:
[11257.765523] arch_add_memory+0x41/0x50
[11257.765525] devm_memremap_pages+0x47c/0x640
[11257.765529] pmem_attach_disk+0x173/0x610 [nd_pmem]
[11257.765531] ? devm_memremap+0x67/0xa0
[11257.765532] nd_pmem_probe+0x7f/0xa0 [nd_pmem]
[11257.765542] nvdimm_bus_probe+0x6b/0x170
[11257.765547] really_probe+0xfb/0x3a0
[11257.765549] driver_probe_device+0x5f/0xe0
[11257.765550] device_driver_attach+0x5d/0x70
[11257.765551] bind_store+0xd3/0x110
[11257.765553] drv_attr_store+0x24/0x30
[11257.765554] sysfs_kf_write+0x3e/0x50
[11257.765555] kernfs_fop_write+0x11e/0x1a0
[11257.765557] __vfs_write+0x1b/0x40
[11257.765558] vfs_write+0xb9/0x1a0
[11257.765559] ksys_write+0x67/0xe0
[11257.765561] __x64_sys_write+0x1a/0x20
[11257.765567] do_syscall_64+0x5a/0x130
[11257.765693] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[11257.765696] RIP: 0033:0x7febc9e81327
[11257.765698] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[11257.765698] RSP: 002b:00007ffd599433f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[11257.765699] RAX: ffffffffffffffda RBX: 00007febc9fd6ae8 RCX: 00007febc9e81327
[11257.765700] RDX: 0000000000000007 RSI: 000055eec8a9bfa0 RDI: 0000000000000004
[11257.765701] RBP: 0000000000000004 R08: 0000000000000006 R09: 7375622f7379732f
[11257.765701] R10: 0000000000000000 R11: 0000000000000246 R12: 000055eec8a9bfa0
[11257.765702] R13: 0000000000000001 R14: 0000000000000007 R15: 00007ffd59943448
[11257.765703] ---[ end trace 442db04e33790cb5 ]---
[11257.782659] nd_pmem: probe of pfn1.0 failed with error -22
----

It seems that after this point I can't play with my second virtual nvdimm device (pfn1.0).

A namespace destroy works but a namespace creation does not:

rafaeldtinoco@ndctltest:~$ sudo ndctl list -B
[
  {
    "provider":"ACPI.NFIT",
    "dev":"ndbus0"
  }
]

rafaeldtinoco@ndctltest:~$ sudo ndctl list -D
[
  {
    "dev":"nmem1",
    "id":"8680-57341200",
    "handle":2,
    "phys_id":0
  },
  {
    "dev":"nmem0",
    "id":"8680-56341200",
    "handle":1,
    "phys_id":0
  }
]

rafaeldtinoco@ndctltest:~$ sudo ndctl list -R
[
  {
    "dev":"region1",
    "size":1073610752,
    "available_size":1073610752,
    "max_available_extent":1073610752,
    "type":"pmem",
    "iset_id":52512795602891997,
    "persistence_domain":"unknown"
  },
  {
    "dev":"region0",
    "size":1073610752,
    "available_size":0,
    "max_available_extent":0,
    "type":"pmem",
    "iset_id":52512752653219036,
    "persistence_domain":"unknown"
  }
]

Now, whenever trying to access namespace1.0 (from region1/nmem1/ndbus) I get:

[11257.782659] nd_pmem: probe of pfn1.0 failed with error -22
[11332.001388] pfn0.0 initialised, 257024 pages in 8ms
[11332.001818] pmem0: detected capacity change from 0 to 1052770304
[11359.739280] pfn0.1 initialised, 257024 pages in 0ms
[11362.643212] pfn0.0 initialised, 257024 pages in 0ms
[11362.644225] pmem0: detected capacity change from 0 to 1052770304
[11406.230365] pfn0.1 initialised, 257024 pages in 0ms
[11406.231281] pmem0: detected capacity change from 0 to 1052770304
[11517.785147] pfn0.0 initialised, 257024 pages in 4ms
[11517.785593] pmem0: detected capacity change from 0 to 1052770304
[11537.431697] pfn0.1 initialised, 257024 pages in 0ms
[11537.432256] pmem0: detected capacity change from 0 to 1052770304
[11627.965947] pfn0.0 initialised, 257024 pages in 0ms
[11627.966415] pmem0: detected capacity change from 0 to 1052770304
[11653.277667] pfn0.1 initialised, 257024 pages in 4ms
[11653.278086] pmem0: detected capacity change from 0 to 1052770304
[11708.696361] pfn0.0 initialised, 257024 pages in 0ms
[11708.697617] pmem0: detected capacity change from 0 to 1052770304
[11753.621295] nd_pmem btt0.0: No existing arenas
[11753.623118] pmem0s: detected capacity change from 0 to 1071484928
[11767.087424] pfn0.1 initialised, 257024 pages in 4ms
[11767.088272] pmem0: detected capacity change from 0 to 1052770304
[11775.815396] dax0.0 initialised, 257024 pages in 4ms
[12848.341346] pfn0.0 initialised, 257024 pages in 0ms
[12848.341785] pmem0: detected capacity change from 0 to 1052770304
[12851.897716] nd_pmem: probe of pfn1.0 failed with error -22
[13023.693246] pfn0.1 initialised, 257024 pages in 0ms
[13023.693662] pmem0: detected capacity change from 0 to 1052770304
[13026.517467] nd_pmem: probe of pfn1.0 failed with error -22
[13067.380701] pmem0: detected capacity change from 0 to 1073610752
[13117.568499] nd_pmem: probe of pfn1.0 failed with error -22
[13946.604199] pfn0.0 initialised, 257024 pages in 0ms
[13946.604777] pmem0: detected capacity change from 0 to 1052770304
[13957.948381] nd_pmem: probe of pfn1.0 failed with error -22

Rafael David Tinoco (rafaeldtinoco) wrote :

I'm reporting this because I'm not yet sure whether it's an emulation issue, since all the nvdimm device access I'm doing here goes through the QEMU emulation feature, as shown here:

https://raw.githubusercontent.com/rafaeldtinoco/provision/master/kvm/libvirt/nvdimm.xml

or a kernel issue:

rafaeldtinoco@ndctltest:~$ uname -a
Linux ndctltest 5.3.0-24-generic #26-Ubuntu SMP Thu Nov 14 01:33:18 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

rafaeldtinoco@ndctltest:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu Focal Fossa (development branch)
Release: 20.04
Codename: focal

Changed in qemu (Ubuntu):
importance: Undecided → Low
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Low
status: New → Confirmed
Rafael David Tinoco (rafaeldtinoco) wrote :

Even after restarting the kernel module the error persists, which suggests the emulation may be the root cause.

Rafael David Tinoco (rafaeldtinoco) wrote :

Something else is wrong: I can't play with the second nvdimm even after re-creating the nvdimm backing files. I'm investigating.

Rafael David Tinoco (rafaeldtinoco) wrote :

Alright, if someone ever faces this, the label size can be the one to blame.

Judging by the manual:

https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with label, the data in the label area at the end of file
   will be inaccessible to the guest. If any useful data (e.g. the
   meta-data of the file system) was stored there, the latter usage
   may result guest data corruption (e.g. breakage of guest file
   system).

=> 128KB was not enough. Changing the label area to 2MB "fixed" the issue. The funny thing is that I'm not even trying to use labels; I'm using full regions for namespaces, BUT it's likely that there is a single label in those cases (written at the end of the backing files).

=> I was also truncating the backing files; now I'm creating fully zeroed files (I guess that, given the MMIO nature of DAX & PMEM, having full files is either better OR mandatory).
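
The difference between a truncated (sparse) backing file and a fully zeroed one can be seen with a small sketch like the one below. The exact commands used in the report are not shown, so `truncate` and `dd` here are an assumption, and the file names and the 8 MiB size are hypothetical (the report uses 1 GiB files):

```shell
#!/bin/sh
set -eu

# Hypothetical scratch location and size, for illustration only.
workdir=$(mktemp -d)
sparse="$workdir/nvdimm-sparse.img"
full="$workdir/nvdimm-full.img"
size_mib=8

# Sparse file: the apparent size is set, but no data is written.
truncate -s "${size_mib}M" "$sparse"

# Fully written file: zeros are written for the whole length.
dd if=/dev/zero of="$full" bs=1M count="$size_mib" status=none

# Both files report the same apparent size; the allocated block
# count is what usually differs (filesystem dependent).
sparse_size=$(stat -c %s "$sparse")
full_size=$(stat -c %s "$full")
stat -c '%n: apparent=%s bytes, allocated=%b blocks' "$sparse" "$full"

rm -rf "$workdir"
```

Whether QEMU strictly requires the fully written variant is not established in this report; the sketch only shows how the two kinds of file differ.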

rafaeldtinoco@ndctltest:~$ sudo ndctl disable-namespace all
disabled 2 namespaces

rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -v -r region1 -m fsdax
{
  "dev":"namespace1.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"1006.00 MiB (1054.87 MB)",
  "uuid":"51dec4e0-1414-418a-9263-6459d5c12194",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem1"
}

rafaeldtinoco@ndctltest:~$ sudo ndctl create-namespace -v -r region0 -m fsdax
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"1006.00 MiB (1054.87 MB)",
  "uuid":"33be9543-603c-4a7f-9f2d-98a3f3ff5ec0",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}

Changed in linux (Ubuntu Focal):
status: Confirmed → Invalid
Changed in qemu (Ubuntu Focal):
status: Confirmed → Invalid
no longer affects: linux (Ubuntu Focal)
Changed in ndctl (Ubuntu):
status: New → Confirmed
Changed in qemu (Ubuntu Focal):
status: Invalid → Confirmed
Changed in linux (Ubuntu):
status: Invalid → Confirmed
no longer affects: qemu (Ubuntu Focal)
Changed in ndctl (Ubuntu):
importance: Undecided → Medium
Changed in linux (Ubuntu):
importance: Low → Medium
Changed in qemu (Ubuntu):
importance: Low → Medium
Rafael David Tinoco (rafaeldtinoco) wrote :

This is way more complex than I thought and it's not so easy to address. Let's see if I can summarize the issue here. While developing the regression tests for ndctl, I kept hitting the same backtrace, over and over, when running the tests:

----
[ 271.705646] memory add fail, invalid altmap
[ 271.705677] WARNING: CPU: 5 PID: 886 at arch/x86/mm/init_64.c:852 add_pages+0x5d/0x70
[ 271.705679] Modules linked in: nls_iso8859_1 edac_mce_amd dax_pmem_compat nd_pmem device_dax nd_btt dax_pmem_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev aesni_intel aes_x86_64 crypto_simd input_leds cryptd glue_helper serio_raw mac_hid qemu_fw_cfg nfit sch_fq_codel ip_tables x_tables autofs4 virtio_net psmouse net_failover virtio_blk i2c_piix4 failover pata_acpi floppy
[ 271.705707] CPU: 5 PID: 886 Comm: ndctl Not tainted 5.3.0-24-generic #26-Ubuntu
[ 271.705709] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 271.705720] RIP: 0010:add_pages+0x5d/0x70
[ 271.705721] Code: 33 c2 01 76 20 48 89 15 99 33 c2 01 48 89 15 a2 33 c2 01 48 c1 e2 0c 48 03 15 97 96 39 01 48 89 15 48 0e c2 01 5b 41 5c 5d c3 <0f> 0b eb ba 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44
[ 271.705722] RSP: 0018:ffffba02c0d2bbf0 EFLAGS: 00010282
[ 271.705723] RAX: 00000000ffffffea RBX: 000000000017ffc0 RCX: 0000000000000000
[ 271.705723] RDX: 0000000000000000 RSI: ffff9aaa3da97448 RDI: ffff9aaa3da97448
[ 271.705724] RBP: ffffba02c0d2bc00 R08: ffff9aaa3da97448 R09: 0000000000000004
[ 271.705724] R10: 0000000000000000 R11: 0000000000000001 R12: 000000000003fe40
[ 271.705725] R13: 0000000000000001 R14: ffffba02c0d2bc48 R15: ffff9aa975efaaf8
[ 271.705727] FS: 00007f70a62d4bc0(0000) GS:ffff9aaa3da80000(0000) knlGS:0000000000000000
[ 271.705728] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 271.705729] CR2: 00005594a0aaa158 CR3: 0000000138110000 CR4: 00000000000406e0
[ 271.705731] Call Trace:
[ 271.705734] arch_add_memory+0x41/0x50
[ 271.705737] devm_memremap_pages+0x47c/0x640
[ 271.705740] pmem_attach_disk+0x173/0x610 [nd_pmem]
[ 271.705741] ? devm_memremap+0x67/0xa0
[ 271.705743] nd_pmem_probe+0x7f/0xa0 [nd_pmem]
[ 271.705745] nvdimm_bus_probe+0x6b/0x170
[ 271.705747] really_probe+0xfb/0x3a0
[ 271.705749] driver_probe_device+0x5f/0xe0
[ 271.705750] device_driver_attach+0x5d/0x70
[ 271.705751] bind_store+0xd3/0x110
[ 271.705753] drv_attr_store+0x24/0x30
[ 271.705754] sysfs_kf_write+0x3e/0x50
[ 271.705755] kernfs_fop_write+0x11e/0x1a0
[ 271.705757] __vfs_write+0x1b/0x40
[ 271.705758] vfs_write+0xb9/0x1a0
[ 271.705759] ksys_write+0x67/0xe0
[ 271.705760] __x64_sys_write+0x1a/0x20
[ 271.705762] do_syscall_64+0x5a/0x130
[ 271.705764] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 271.705765] RIP: 0033:0x7f70a6189327
[ 271.705767] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 271.705767] RSP: 002b:00007ffc616998b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 271.705768] RAX: ffffffffffffffda RBX: 00007f70a62d4ae8 RCX: 00007f70a6189327
...


summary: - qemu nvdimm virtualization + linux 5.3.0-24-generic kernel PROBE ERROR
+ QEMU emulated nvdimm regions alignment need (128MB) or ndctl create-namespace namespace1.0 might fail
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Rafael David Tinoco (rafaeldtinoco) wrote :

The Linux Kernel fix is here:

commit 274b924088e9
Author: Jeff Moyer <email address hidden>
Date: Wed Aug 28 12:49:46 2019

    libnvdimm/pfn: Fix namespace creation on misaligned addresses

    Yi reported[1] that after commit a3619190d62e ("libnvdimm/pfn: stop
    padding pmem namespaces to section alignment"), it was no longer
    possible to create a device dax namespace with a 1G alignment. The
    reason was that the pmem region was not itself 1G-aligned. The code
    happily skips past the first 512M, but fails to account for a now
    misaligned end offset (since space was allocated starting at that
    misaligned address, and extending for size GBs). Reintroduce
    end_trunc, so that the code correctly handles the misaligned end
    address. This results in the same behavior as before the introduction
    of the offending commit.

    [1] https://lists.01.org/pipermail/linux-nvdimm/2019-July/022813.html

    Fixes: a3619190d62e ("libnvdimm/pfn: stop padding pmem namespaces ...")
    Reported-and-tested-by: Yi Zhang <email address hidden>
    Signed-off-by: Jeff Moyer <email address hidden>
    Link: https://<email address hidden>
    Signed-off-by: Dan Williams <email address hidden>

So marking kernel as Fix Released (as this is included in Focal).
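
The idea described in the commit message (skip a misaligned start, and also trim a misaligned end) can be illustrated with plain arithmetic. This is only a sketch of the concept, not the kernel code: the start address, size, and variable names other than `end_trunc` are hypothetical, chosen to reproduce the "region not 1G-aligned, first 512M skipped" situation the commit describes:

```shell
#!/bin/sh
MiB=$((1024 * 1024))
GiB=$((1024 * MiB))

# Hypothetical region: starts 512 MiB past a 1 GiB boundary.
start=$((4 * GiB + 512 * MiB))
size=$((2 * GiB))
align=$GiB
end=$((start + size))

# Round the start up and the end down to the requested alignment.
aligned_start=$(( (start + align - 1) / align * align ))
aligned_end=$(( end / align * align ))

start_pad=$((aligned_start - start))   # space skipped at the front
end_trunc=$((end - aligned_end))       # space trimmed at the back
usable=$((aligned_end - aligned_start))

echo "start_pad=$((start_pad / MiB))MiB end_trunc=$((end_trunc / MiB))MiB usable=$((usable / MiB))MiB"
# prints: start_pad=512MiB end_trunc=512MiB usable=1024MiB
```

Without the end trimming, the 2 GiB starting at the padded start would run 512 MiB past an aligned end, which is the misaligned end offset the commit message refers to.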

Rafael David Tinoco (rafaeldtinoco) wrote :

From comment #5, I was able to make changes persistent with access='shared':

    <memory model='nvdimm' access='shared'>
      <source>
        <path>/tmp/.nbdpath1.2898617</path>
      </source>
      <target>
        <size unit='KiB'>1048576</size>
        <node>0</node>
        <label>
          <size unit='KiB'>256</size>
        </label>
      </target>
      <address type='dimm' slot='0'/>
    </memory>

So now the backing files are updated (and can even be shared between 2 VMs).
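
For anyone driving QEMU directly rather than through libvirt, the XML above roughly corresponds to a `memory-backend-file` object with `share=on` plus an `nvdimm` device with a `label-size`, following the option names in QEMU's docs/nvdimm.txt. The sketch below only prints the two options; the ids `mem1` and `nvdimm1` are illustrative, the exact command line libvirt generates is an assumption, and the machine additionally needs `nvdimm=on` and suitable `-m ...,slots=,maxmem=` settings:

```shell
#!/bin/sh
# Values mirror the XML above; ids are illustrative, not from the report.
path=/tmp/.nbdpath1.2898617
size_kib=1048576
label_kib=256

size_bytes=$((size_kib * 1024))
label_bytes=$((label_kib * 1024))

# share=on is what makes guest writes reach the backing file.
backend="memory-backend-file,id=mem1,share=on,mem-path=${path},size=${size_bytes}"
device="nvdimm,id=nvdimm1,memdev=mem1,label-size=${label_bytes}"

printf '%s\n' "-object $backend" "-device $device"
```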

Rafael David Tinoco (rafaeldtinoco) wrote :

You have to make sure to set the label size to the minimum alignment required. For QEMU emulation I found it to be 2M, and with that value you can work correctly with 2 or more nvdimms. With proper alignment you can also re-create namespaces in another mode, such as devdax or fsdax, if you already have a working namespace.

Example:

    <!-- nvdimm node 0 -->
    <memory model='nvdimm' access='shared'>
      <source>
        <path>$_nvpath1</path>
      </source>
      <target>
        <size unit='KiB'>1048576</size>
        <node>0</node>
        <label>
          <size unit='KiB'>2048</size>
        </label>
      </target>
      <address type='dimm' slot='0'/>
    </memory>
    <!-- nvdimm node 1 -->
    <memory model='nvdimm' access='shared'>
      <source>
        <path>$_nvpath2</path>
      </source>
      <target>
        <size unit='KiB'>1048576</size>
        <node>1</node>
        <label>
          <size unit='KiB'>2048</size>
        </label>
      </target>
      <address type='dimm' slot='1'/>
    </memory>
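
The effect of the label size on alignment can be checked with simple arithmetic. The sketch assumes the region size is the DIMM size minus the label area, which matches the 1073610752-byte regions reported by `ndctl list -R` earlier in this report (1 GiB minus 128 KiB); the variable names are illustrative:

```shell
#!/bin/sh
dimm=$((1048576 * 1024))        # 1 GiB, as in <size unit='KiB'>
align=$((2 * 1024 * 1024))      # 2 MiB, the "align" value ndctl reports

# Region size with the original 128 KiB label area.
region_128k=$((dimm - 128 * 1024))
rem_128k=$((region_128k % align))

# Region size with the 2048 KiB label area from the example.
region_2m=$((dimm - 2048 * 1024))
rem_2m=$((region_2m % align))

echo "128KiB label: region=$region_128k remainder=$rem_128k"
echo "2MiB label:   region=$region_2m remainder=$rem_2m"
# prints:
# 128KiB label: region=1073610752 remainder=1966080
# 2MiB label:   region=1071644672 remainder=0
```

With the 128 KiB label the region size is not a multiple of 2 MiB, while with the 2 MiB label it is.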

So I'm closing this as NOT A BUG as the alignment can be configured accordingly.

For other examples you can check autopkgtests being proposed as a merge request at:

https://bugs.launchpad.net/ubuntu/+source/ndctl/+bug/1853506

Changed in ndctl (Ubuntu Focal):
status: Confirmed → Invalid
Changed in qemu (Ubuntu Focal):
status: Confirmed → Invalid