ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3g1: Migration guest running with IO stress crashed@security_file_permission+0xf4/0x160.

Bug #1768115 reported by bugproxy
This bug affects 1 person
Affects | Status | Importance | Assigned to
The Ubuntu-power-systems project | Invalid | Medium | Canonical Kernel Team
linux (Ubuntu) | Invalid | Medium | Canonical Kernel Team
linux (Ubuntu) Bionic | Invalid | Medium | Canonical Kernel Team

Bug Description

Problem Description: Migration guest running with IO stress crashed @ security_file_permission+0xf4/0x160 after a couple of migrations.

Steps to re-create:

Source host - boslcp3
Destination host - boslcp4

1. boslcp3 & boslcp4 installed with the latest kernel
root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~#

root@boslcp4:~# uname -a
Linux boslcp4 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp4:~#

2. Installed guest boslcp3g1 with the kernel below and started an LTP run from the boslcp3 host

root@boslcp3g1:~# uname -a
Linux boslcp3g1 4.15.0-15-generic #16+bug166877 SMP Wed Apr 18 14:47:30 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux

3. Started migrating the boslcp3g1 guest from source to destination and vice versa.
4. After a couple of migrations it crashed on boslcp4 and entered xmon
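For reference, the back-and-forth pattern in steps 3-4 can be modeled as a simple ping-pong plan. This is an illustrative sketch only, not the actual test harness (the function name is hypothetical):

```python
def plan_migrations(src: str, dst: str, n: int):
    """Return the ordered (source, destination) pairs for n back-and-forth
    live migrations, alternating direction each time (step 3 above)."""
    plan = []
    for _ in range(n):
        plan.append((src, dst))
        src, dst = dst, src  # the next migration goes the other way
    return plan

print(plan_migrations("boslcp3", "boslcp4", 3))
```

Each pair would then typically be driven by something like `virsh migrate --live boslcp3g1 qemu+ssh://<dest>/system`.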

8:mon> t
[c0000004f8a23d20] c0000000005a7674 security_file_permission+0xf4/0x160
[c0000004f8a23d60] c0000000003d1d30 rw_verify_area+0x70/0x120
[c0000004f8a23d90] c0000000003d375c vfs_read+0x8c/0x1b0
[c0000004f8a23de0] c0000000003d3d88 SyS_read+0x68/0x110
[c0000004f8a23e30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 000071f1779fe280
SP (7fffe99ece50) is in userspace
8:mon> S
msr = 8000000000001033 sprg0 = 0000000000000000
pvr = 00000000004e1202 sprg1 = c000000007a85800
dec = 00000000591e3e03 sprg2 = c000000007a85800
sp = c0000004f8a234a0 sprg3 = 0000000000010008
toc = c0000000016eae00 dar = 000000000000023c
srr0 = c0000000000c355c srr1 = 8000000000001033 dsisr = 40000000
dscr = 0000000000000000 ppr = 0010000000000000 pir = 00000011
amr = 0000000000000000 uamor = 0000000000000000
dpdes = 0000000000000000 tir = 0000000000000000 cir = 00000000
fscr = 0500000000000180 tar = 0000000000000000 pspb = 00000000
mmcr0 = 0000000080000000 mmcr1 = 0000000000000000 mmcr2 = 0000000000000000
pmc1 = 00000000 pmc2 = 00000000 pmc3 = 00000000 pmc4 = 00000000
mmcra = 0000000000000000 siar = 0000000000000000 pmc5 = 0000026c
sdar = 0000000000000000 sier = 0000000000000000 pmc6 = 00000861
ebbhr = 0000000000000000 ebbrr = 0000000000000000 bescr = 0000000000000000
iamr = 4000000000000000
pidr = 0000000000000034 tidr = 0000000000000000
cpu 0x8: Vector: 700 (Program Check) at [c0000004f8a23220]
    pc: c0000000000e4854: xmon_core+0x1f24/0x3520
    lr: c0000000000e4850: xmon_core+0x1f20/0x3520
    sp: c0000004f8a234a0
   msr: 8000000000041033
  current = 0xc0000004f89faf00
  paca = 0xc000000007a85800 softe: 0 irq_happened: 0x01
    pid = 24028, comm = top
Linux version 4.15.0-20-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #21-Ubuntu SMP Tue Apr 24 06:14:44 UTC 2018 (Ubuntu 4.15.0-20.21-generic 4.15.17)
cpu 0x8: Exception 700 (Program Check) in xmon, returning to main loop
[c0000004f8a23d20] c0000000005a7674 security_file_permission+0xf4/0x160
[c0000004f8a23d60] c0000000003d1d30 rw_verify_area+0x70/0x120
[c0000004f8a23d90] c0000000003d375c vfs_read+0x8c/0x1b0
[c0000004f8a23de0] c0000000003d3d88 SyS_read+0x68/0x110
[c0000004f8a23e30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 000071f1779fe280
SP (7fffe99ece50) is in userspace
8:mon> r
R00 = c00000000043b7fc R16 = 0000000000000000
R01 = c0000004f8a23c90 R17 = ffffffffffffff70
R02 = c0000000016eae00 R18 = 00000a51b4bebfc8
R03 = c000000279557200 R19 = 00007fffe99edbb0
R04 = c0000003242499c0 R20 = 00000a51b4c04db0
R05 = 0000000000020000 R21 = 00000a51b4c20e90
R06 = 0000000000000004 R22 = 0000000000040f00
R07 = ffffff8100000000 R23 = 00000a51b4c06560
R08 = ffffff8000000000 R24 = ffffffffffffff80
R09 = 0000000000000000 R25 = 00000a51b4bec2b8
R10 = 0000000000000000 R26 = 000071f177bb0b20
R11 = 0000000000000000 R27 = 0000000000000000
R12 = 0000000000002000 R28 = c000000279557200
R13 = c000000007a85800 R29 = c0000004c7734210
R14 = 0000000000000000 R30 = 0000000000000000
R15 = 0000000000000000 R31 = c0000003242499c0
pc = c00000000043b808 __fsnotify_parent+0x88/0x1a0
cfar= c0000000003f9e78 dget_parent+0xe8/0x150
lr = c00000000043b7fc __fsnotify_parent+0x7c/0x1a0
msr = 8000000000009033 cr = 28002222
ctr = c0000000006252b0 xer = 0000000000000000 trap = 300
dar = 000000000000023c dsisr = 40000000
8:mon> e
cpu 0x8: Vector: 300 (Data Access) at [c0000004f8a23a10]
    pc: c00000000043b808: __fsnotify_parent+0x88/0x1a0
    lr: c00000000043b7fc: __fsnotify_parent+0x7c/0x1a0
    sp: c0000004f8a23c90
   msr: 8000000000009033
   dar: 23c
 dsisr: 40000000
  current = 0xc0000004f89faf00
  paca = 0xc000000007a85800 softe: 0 irq_happened: 0x01
    pid = 24028, comm = top
Linux version 4.15.0-20-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #21-Ubuntu SMP Tue Apr 24 06:14:44 UTC 2018 (Ubuntu 4.15.0-20.21-generic 4.15.17)

6. Guest enters xmon after migrating from boslcp3 to boslcp4.


> BUG_ON in jbd2_journal_write_metadata_buffer

I've included xmon crash data from a more recent crash, this time a BUG_ON in fs/jbd2/journal.c:jbd2_journal_write_metadata_buffer():

int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
                                  struct journal_head *jh_in,
                                  struct buffer_head **bh_out,
                                  sector_t blocknr)
{
        int need_copy_out = 0;
        int done_copy_out = 0;
        int do_escape = 0;
        char *mapped_data;
        struct buffer_head *new_bh;
        struct page *new_page;
        unsigned int new_offset;
        struct buffer_head *bh_in = jh2bh(jh_in);
        journal_t *journal = transaction->t_journal;

        /*
         * The buffer really shouldn't be locked: only the current committing
         * transaction is allowed to write it, so nobody else is allowed
         * to do any IO.
         *
         * akpm: except if we're journalling data, and write() output is
         * also part of a shared mapping, and another thread has
         * decided to launch a writepage() against this buffer.
         */
        J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));

This is not the same as the original bug, but I suspect both are part of a class of issues we're hitting while running under very particular circumstances, ones that might not generally be seen during normal operation and that trigger various corner cases. As such, I think it makes sense to group them under this bug for the time being.

The general workload is running IO-heavy disk workloads on large guests (20 GB memory, 16 vCPUs) with SAN-based storage, and then performing migration during the workload. During migration we begin to see a high occurrence of rcu_sched stall warnings, and after 1-3 hours of operation we hit filesystem-related crashes like the ones posted. We've seen this with two separate FC cards, Emulex and QLogic, where we invoke QEMU through libvirt as:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-system-ppc64 -name guest=boslcp3g1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-7-boslcp3g1/master-key.aes -machine pseries-2.10,accel=kvm,usb=off,dump-guest-core=off,max-cpu-compat=power9 -cpu host -m 20480 -realtime mlock=off -smp 16,maxcpus=32,sockets=4,cores=8,threads=1 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/7-boslcp3g1,size=10737418240 -numa node,nodeid=0,cpus=0-7,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=10737418240 -numa node,nodeid=1,cpus=8-15,memdev=ram-node1 -uuid bd110ed9-dcfc-4470-b4ae-6166a56819f0 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-7-boslcp3g1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot menu=on,strict=on -device spapr-pci-host-bridge,index=1,id=pci.1 -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 -drive file=/home/bionic-server-ppc64el.iso,format=raw,if=none,id=drive-scsi0-0-0-2,readonly=on,cache=none -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi0-0-0-2,id=scsi0-0-0-2 -drive file=/dev/disk/by-id/dm-uuid-part1-mpath-3600507680183050d28000000000002a4,format=raw,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/disk/by-id/dm-uuid-part2-mpath-3600507680183050d28000000000002a4,format=raw,if=none,id=drive-virtio-disk1,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=/dev/disk/by-id/dm-uuid-part3-mpath-3600507680183050d28000000000002a4,format=raw,if=none,id=drive-virtio-disk2,cache=none -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk2,id=virtio-disk2 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:72:d2:69,bus=pci.0,addr=0x1,bootindex=2 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on

I will attach the libvirt XML separately.

IBM is requesting general filesystem expertise from Canonical, if available, as we continue debugging...

Revision history for this message
bugproxy (bugproxy) wrote : xmon logs

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-167290 severity-high targetmilestone-inin1804
Revision history for this message
bugproxy (bugproxy) wrote : BUG_ON in jbd2_journal_write_metadata_buffer

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : guest libvirt xml

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
tags: added: triage-g
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-04-30 15:31 EDT-------
Both logs show that the dmesg buffer has been overrun, so by the time you get to xmon and run "dl" you've lost the messages that show what happened before things went wrong. You will need to collect console output from the beginning in order to show what happened.

tags: added: kernel-da-key
Guido (mcnomad81)
description: updated
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-04 09:09 EDT-------
(In reply to comment #15)
> This is not the same as the original bug, but I suspect they are part of a
> class of issues we're hitting while running under very particular
> circumstances which might not generally be seen during normal operation and
> triggering various corner cases. As such I think it makes sense to group
> them under this bug for the time being.
>
> The general workload is running IO-heavy disk workloads on large guests
> (20GB memory, 16 vcpus) with SAN-based storage, and then performing
> migration during the workload. During migration we begin to see a high
> occurrence of rcu_sched stall warnings, and after 1-3 hours of operations
> we hit filesystem-related crashes like the ones posted. We've seen this with
> 2 separate FC cards, emulex and qlogic, where we invoke QEMU through libvirt
> as:

We've been gathering additional traces while running under this scenario, and while so far most of the traces have been filesystem-related, we now have a couple that suggest the common thread among all of these failures is that they involve RCU-managed data structures. I'll attach the summaries for these from xmon; these have the full dmesg log since guest start, as well as timestamps in dmesg noting where migration has started/stopped, and "WATCHDOG" messages to note any large jumps in wall-clock time. For example (from boslcp3g1-migtest-fail-on-lcp5):

[ 5757.347542] migration iteration 7: started at Thu May 3 05:59:14 CDT 2018
<guest will get stopped by libvirt for ~60 seconds; after migration finishes it will be resumed and we immediately see the below>
[ 5935.727884] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 5935.728567] 1-...!: (670 GPs behind) idle=486/140000000000000/0 softirq=218179/218180 fqs=0
[ 5935.730091] 2-...!: (3750 GPs behind) idle=006/140000000000000/0 softirq=203335/203335 fqs=0
[ 5935.731076] 4-...!: (96 GPs behind) idle=c2e/140000000000000/0 softirq=168607/168608 fqs=0
[ 5935.731783] 5-...!: (2270 GPs behind) idle=e16/140000000000000/0 softirq=152608/152608 fqs=1
[ 5935.732959] 6-...!: (322 GPs behind) idle=3ca/140000000000001/0 softirq=169452/169453 fqs=1
[ 5935.735061] 8-...!: (6 GPs behind) idle=c36/140000000000001/0 softirq=280514/280516 fqs=1
[ 5935.736638] 9-...!: (5 GPs behind) idle=c1e/140000000000001/0 softirq=248247/248249 fqs=1
[ 5935.738112] 10-...!: (4 GPs behind) idle=62a/1/0 softirq=228207/228208 fqs=1
[ 5935.738868] 11-...!: (32 GPs behind) idle=afe/140000000000000/0 softirq=228817/228818 fqs=1
[ 5935.739122] 12-...!: (3 GPs behind) idle=426/1/0 softirq=192716/192717 fqs=1
[ 5935.739295] 14-...!: (5 GPs behind) idle=e56/140000000000000/0 softirq=133888/133892 fqs=1
[ 5935.739486] 15-...!: (7 GPs behind) idle=36e/140000000000000/0 softirq=161010/161013 fqs=1
...
[ 5935.740031] Unable to handle kernel paging request for data at address 0x00000008
[ 5935.740128] Faulting instruction address: 0xc000000000403d04

For the prior iterations where we don't crash we'd have messages like:

[ 2997.413561] WATCHDOG (Thu May 3 05:13:18 CDT 2018): time jump of 114 seconds
[ 3023.759629] migration iteration 1: completed ...

Read more...
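The WATCHDOG lines above come from sampling the wall clock at a fixed cadence and flagging large forward jumps (such as the pause while the guest is stopped for migration). A minimal sketch of that idea; the function name, cadence, and 30-second threshold are assumptions, not the actual tooling:

```python
def detect_time_jumps(samples, expected_interval=1.0, threshold=30.0):
    """Given wall-clock samples taken every `expected_interval` seconds,
    return (index, delta) pairs where the clock moved forward by more than
    `threshold` seconds beyond the expected interval, e.g. while the guest
    VCPUs were paused for a migration."""
    jumps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta - expected_interval > threshold:
            jumps.append((i, delta))
    return jumps
```

A 114-second gap between two one-second samples, as in the log above, would be reported as a single jump.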

Revision history for this message
bugproxy (bugproxy) wrote : boslcp3g1-migtest-fail-on-lcp5

------- Comment (attachment only) From <email address hidden> 2018-05-04 09:24 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : boslcp3g1-migtest2-fail-on-lcp5

------- Comment (attachment only) From <email address hidden> 2018-05-04 09:25 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-07 14:48 EDT-------
The RCU connection is possibly a red herring. I tested the above theory about RCU timeouts/warnings being a trigger by modifying QEMU to allow the guest timebase to be advanced artificially, triggering RCU timeouts/warnings in rapid succession, and ran this for 8 hours using the same workload without seeing a crash. It seems migration is a necessary component to reproduce this.

I did further tests to capture the guest memory state before/after migration to see if there's possibly an issue with dirty-page tracking or something similar that could explain the crashes, and have data from 2 crashes that show a roughly 24-byte difference between source/target after migration, within the first 100 MB of the guest physical address range. I have more details in the summaries/logs I'm attaching, but one example is below (from the "migtest" log):

root@boslcp5:~/dumps-cmp# xxd -s 0x013e0000 -l 128 0-2.boslcp5
013e0000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
root@boslcp5:~/dumps-cmp# xxd -s 0x013e0000 -l 128 0-2.boslcp6
013e0000: 3403 0100 0002 0000 2f62 6c61 7374 2f76 4......./blast/v
013e0010: 6463 3400 0000 0000 0000 0000 0000 0000 dc4.............
013e0020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
013e0070: 0000 0000 0000 0000 0000 0000 0000 0000 ................

"blast" is part of the LTP IO test suite running in the guest. It seems some data structure related to it is present in the source guest memory, but not on the target side. Part of the structure seems to be a trigger buffer, but the preceding value might be a pointer or something else, and may explain the crashes if that ends up being zero'd on the target side. The other summary/log I'm attaching has an almost identical inconsistency between source/target, from another guest using the same workload and hitting a crash:

root@boslcp5:~/dumps-cmp-migtest2# xxd -s 38273024 -l 128 0-2.boslcp5
02480000: c000 0100 0000 0000 2f62 6c61 7374 2f76 ......../blast/v
02480010: 6462 3400 0000 0000 0000 0000 0000 0000 db4.............
02480020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
02480030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
02480040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
02480050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
02480060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
02480070: 0000 0000 0...

Read more...
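The comparisons above were done by hand with xxd against fixed offsets; the same check can be automated over whole dumps. A minimal sketch, assuming both dumps cover the same guest-physical range (the function name is hypothetical):

```python
def diff_dumps(a: bytes, b: bytes):
    """Compare two equal-length RAM dumps and group differing bytes into
    contiguous (offset, length) runs, like the ~24-byte runs seen above."""
    assert len(a) == len(b), "dumps must cover the same range"
    runs, start = [], None
    for i in range(len(a) + 1):
        differ = i < len(a) and a[i] != b[i]
        if differ and start is None:
            start = i                      # a new differing run begins
        elif not differ and start is not None:
            runs.append((start, i - start))  # run ended at i
            start = None
    return runs
```

Run against the source- and target-side dump files, this would report each inconsistent region as an (offset, length) pair for closer inspection with xxd.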

Revision history for this message
bugproxy (bugproxy) wrote : summary of crash data migrating from boslcp6 to boslcp5

------- Comment (attachment only) From <email address hidden> 2018-05-07 14:50 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : summary of crash data migrating from boslcp5 to boslcp6

------- Comment (attachment only) From <email address hidden> 2018-05-07 14:51 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : Traces of QEMU RAM migrations relative to dirty bitmap syncing

------- Comment (attachment only) From <email address hidden> 2018-05-08 15:57 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-08 16:37 EDT-------
Hit another instance of the RAM inconsistencies prior to resuming the guest on the target side (this one is migrating from boslcp6 to boslcp5 and crashing after it resumes execution on boslcp5). The signature is eerily similar to the ones above... the workload is blast from LTP, but it's strange that 3 out of 3 so far have involved the same data structure. Maybe there's a relationship between something the process is doing and dirty syncing?

root@boslcp5:~/vm_logs/1525768538/dumps# xxd -s 20250624 -l 128 0-2.vm0.iteration2a
01350000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
root@boslcp5:~/vm_logs/1525768538/dumps# xxd -s 20250624 -l 128 0-2.vm0.iteration2a.boslcp6
01350000: d603 0100 0000 0000 2f62 6c61 7374 2f76 ......../blast/v
01350010: 6463 3400 0000 0000 0000 0000 0000 0000 dc4.............
01350020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
01350070: 0000 0000 0000 0000 0000 0000 0000 0000 ................

For this run I included traces of the various stages of memory migration on the QEMU side relative to dirty bitmap sync (attached above). The 3 phases are:

"ram_save_setup": enables dirty logging, sets up the data structures used for tracking dirty pages, and does the initial bitmap sync. QEMU keeps its own copy, which gets OR'd with the one provided by KVM on each bitmap sync. There are 2 blocks (ram-node0/ram-node1), each with its own bitmap / KVM memslot, since the guest was defined with 2 NUMA nodes. Only ram-node0 would be relevant here since it has offset 0 in the guest physical memory address space.

"ram_save_pending": called before each iteration to see if there are pages still pending. When the number of dirty pages in the QEMU bitmap drops below a certain value, it does another sync with KVM's bitmap.

"ram_save_iterate": walks the QEMU dirty bitmap and sends the corresponding pages until none are left or some other limit (e.g. bandwidth throttling or max-pages-per-iteration) is hit.

"ram_save_pending"/"ram_save_iterate" keep repeating until no more pages are left.

"ram_save_complete": does a final sync with the KVM bitmap, sends the final set of pages, then disables dirty logging and completes the migration.

"vm_stop" denotes when the guest VCPUs have all exited and stopped execution.

There are 2 migrations reflected in the posted traces; the first one can be ignored (everything between the first ram_save_setup and the first ram_sav...
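The phases described above can be sketched as a toy model of the dirty-bitmap handshake. This is only an illustration of the described flow, not QEMU code; page dirtying by the running guest between syncs is omitted:

```python
def sync_bitmap(qemu, kvm):
    """A bitmap sync as described above: OR KVM's dirty bitmap into
    QEMU's own copy, clearing KVM's side."""
    for i in range(len(qemu)):
        qemu[i] |= kvm[i]
        kvm[i] = 0

def migrate_ram(pages, kvm_bitmap, pending_threshold=2):
    """Toy model of setup -> (pending/iterate)* -> complete.
    Returns the (index, page) pairs sent, in order."""
    sent = []
    qemu_bitmap = [1] * len(pages)          # ram_save_setup: all pages start dirty
    sync_bitmap(qemu_bitmap, kvm_bitmap)    # initial sync
    while True:
        if sum(qemu_bitmap) <= pending_threshold:   # ram_save_pending:
            sync_bitmap(qemu_bitmap, kvm_bitmap)    # re-sync when few pages remain
        if sum(qemu_bitmap) == 0:
            break
        for i in range(len(pages)):                 # ram_save_iterate:
            if qemu_bitmap[i]:                      # send each dirty page
                sent.append((i, pages[i]))
                qemu_bitmap[i] = 0
    sync_bitmap(qemu_bitmap, kvm_bitmap)    # ram_save_complete: final sync
    for i in range(len(pages)):             # send the final set of pages
        if qemu_bitmap[i]:
            sent.append((i, pages[i]))
    return sent
```

In this model a page whose dirty bit is lost between KVM and QEMU would simply never be sent, which matches the suspected failure mode: the target resumes with stale contents at that guest-physical offset.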

Revision history for this message
bugproxy (bugproxy) wrote : (FIXED) Traces of QEMU RAM migrations relative to dirty bitmap syncing

------- Comment on attachment From <email address hidden> 2018-05-08 21:27 EDT-------

Please disregard previous attachment, the traces were from the previous (unrelated) migration.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-09 00:25 EDT-------
Is it essential to have two NUMA nodes for the guest memory to see this bug? Can we reproduce it without the NUMA node stuff in the xml?

Revision history for this message
bugproxy (bugproxy) wrote : Patch to take mmu_lock in kvm_radix_test_clear_dirty

------- Comment on attachment From <email address hidden> 2018-05-09 01:55 EDT-------

I notice that kvm_radix_test_clear_dirty() is not taking the kvm->mmu_lock, unlike all the other code that manipulates the partition-scoped page tables for the guest. Although I haven't been able to construct a scenario where this could cause a dirty bit to get lost, I think it is worth trying this patch, which makes kvm_radix_test_clear_dirty() take the kvm->mmu_lock and therefore serialize with the other code that manipulates ptes. This would not necessarily be the final patch but it should be good enough for testing. The patch is against the Ubuntu bionic 4.15.0-21.22 kernel.

The other thing that could be worth trying is to test with a current upstream kernel. There have been changes in the partition-scoped page table management since 4.15, and it would be worth knowing whether the bug still exists upstream.

Revision history for this message
bugproxy (bugproxy) wrote : Patch to add extra tlbsyncs

------- Comment on attachment From <email address hidden> 2018-05-09 07:20 EDT-------

Another patch to try. This one adds some tlbsync instructions which we were missing. It applies to the Ubuntu bionic kernel tree and should be applied to the host kernel (it will do nothing in the guest). The other patch to take the mmu_lock also is a host kernel patch and will do nothing in the guest. I suggest applying both to start with, and if that fixes the problem, then try with just this one.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-09 08:47 EDT-------
(In reply to comment #26)
> Is it essential to have two NUMA nodes for the guest memory to see this bug?
> Can we reproduce it without the NUMA node stuff in the xml?

I haven't attempted it on my end. Can give it a try. But we suspect https://bugzilla.linux.ibm.com/show_bug.cgi?id=167036 may be the same issue (but with Pegas), since we see they're doing IO tests and hitting various IO-related failures after migration. In that particular config there were no additional NUMA nodes in the guest.

I am hoping to get the dump-bitmap-on-demand test you suggested going today, and hopefully that can reproduce at a high enough frequency that I can try the kernel patches, disabling THP, and NUMA configurations within a reasonable timeframe. The test I kicked off yesterday to capture the first 128MB of the dirty bitmap ran all night without triggering...

Revision history for this message
bugproxy (bugproxy) wrote : Patch to add extra tlbsyncs and ptesyncs

------- Comment on attachment From <email address hidden> 2018-05-10 00:32 EDT-------

Try this one instead of the "Patch to add extra tlbsyncs". Ben H pointed out that it is ptesync which is important, not tlbsync (which is actually a no-op, though architecturally required), so this version adds a ptesync to the guest exit path as well as the tlbsync.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-13 20:45 EDT-------
(In reply to comment #30)
> Created attachment 127024 [details]
> Patch to add extra tlbsyncs and ptesyncs
>
> Try this one instead of the "Patch to add extra tlbsyncs". Ben H pointed out
> that it is ptesync which is important, not tlbsync (which is actually a
> no-op, though architecturally required), so this version adds a ptesync to
> the guest exit path as well as the tlbsync.

Just a quick update: we noticed 2 other bugs pop up a few days ago where they hit very similar traces to what's been recorded here, even without doing live migration. For instance:

Bug 167748 - LP1770784- ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp4g4:ubuntu1604:P8 compat: guest crashes in apparmor_file_alloc_security()

[13244.120032] NIP [c000000000503788] aa_file_perm+0x238/0x4c0
[13244.120035] LR [c000000000503784] aa_file_perm+0x234/0x4c0
[13244.120035] Call Trace:
[13244.120039] [c000000004ec7b90] [c000000000503784] aa_file_perm+0x234/0x4c0 (unreliable)
[13244.120041] [c000000004ec7c90] [c0000000004fe784] common_file_perm+0x134/0x250
[13244.120044] [c000000004ec7d10] [c00000000048f088] security_file_permission+0x78/0x150
[13244.120047] [c000000004ec7d50] [c0000000002e49ec] rw_verify_area+0x7c/0x190
[13244.120049] [c000000004ec7d90] [c0000000002e4d4c] vfs_write+0x8c/0x230
[13244.120052] [c000000004ec7de0] [c0000000002e5dbc] SyS_write+0x6c/0x110
[13244.120055] [c000000004ec7e30] [c000000000009484] system_call+0x38/0xe4

which has a similar beginning call trace to the initial bug reported here. And:

Bug 167480 - ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3g4:ubuntu1604:P8 compat guest crashed & enters into xmon after few hours of LTP run.

[c0000002b1c13c10] c0000000002faad8 path_openat+0x88/0x3c0
[c0000002b1c13c90] c0000000002fca3c do_filp_open+0xfc/0x170
[c0000002b1c13db0] c0000000002e31d0 do_sys_open+0x1c0/0x3b0
[c0000002b1c13e30] c000000000009484 system_call+0x38/0xe4

which is similar to the crash I saw with filp_close.

The major thing those bugs have in common with this one is that both the guest and host SAN disks are using block multiqueue instead of the legacy deadline/cfq schedulers, which was seen to have issues WRT other patches like the one mentioned here:

https://bugzilla.linux.ibm.com/show_bug.cgi?id=167294#c47

We started testing using the last Ubuntu 4.15 with a set of patches that Ziviani put together here:

https://github.com/jrziviani/linux-devel/commits/BLK_MQ_CPU

but I was still eventually able to hit a crash. However, when I disabled blk-mq via "scsi_mod.use_blk_mq=n dm_mod.use_blk_mq=n" in host/guest, I didn't see any crashes after 130 migrations. I did, however, see IO errors + read-only filesystem after extended runs, which have also been seen elsewhere with no live migration involved. So there's a possibility the bug here is related to issues seen elsewhere without live migration, and some initial evidence suggesting that the crashes may be related to blk-mq, but not enough to say anything for certain.
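For anyone reproducing the blk-mq comparison, a trivial helper to confirm that both parameters actually took effect on a booted kernel. This is a sketch (the function name is hypothetical); the authoritative runtime check is the scheduler list in /sys/block/*/queue/scheduler:

```python
def blk_mq_disabled(cmdline: str) -> bool:
    """Return True if the kernel command line carries both parameters used
    above to force the legacy block layer for SCSI and device-mapper."""
    opts = set(cmdline.split())
    return {"scsi_mod.use_blk_mq=n", "dm_mod.use_blk_mq=n"} <= opts
```

This would typically be run against the contents of /proc/cmdline on both the host and the guest before starting a test pass.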

Ziviani is currently doing some testing with an upstream kernel. If that still has issues then we will move to testing potential KVM fixes, but we want to rule out blk-mq first since the issue...

Read more...

Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → Incomplete
tags: added: p9
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: Incomplete → Triaged
Frank Heimes (fheimes)
tags: added: triage-a
removed: triage-g
Changed in linux (Ubuntu):
status: New → Triaged
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → Triaged
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hello,

It looks like there has been a lot of investigation for this bug. Have any commits been identified that may resolve it?

Upstream v4.18-rc1 is now available. Maybe this kernel could be tested to see if a reverse bisect could identify a commit that fixes the issue.

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18-rc1/

Thanks!

Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury)
Changed in ubuntu-power-systems:
status: Triaged → Incomplete
Changed in linux (Ubuntu):
status: Triaged → Incomplete
Changed in linux (Ubuntu Bionic):
status: Triaged → Incomplete
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This bug has not had any activity in a while, can any update be provided?

Changed in linux (Ubuntu Bionic):
importance: High → Medium
Changed in linux (Ubuntu):
importance: High → Medium
Changed in ubuntu-power-systems:
importance: High → Medium
Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu Bionic):
assignee: Joseph Salisbury (jsalisbury) → Canonical Kernel Team (canonical-kernel-team)
Revision history for this message
Manoj Iyer (manjo) wrote :

IBM, could you please test with the hwe (4.18) kernel and report back if this bug is still valid?

Revision history for this message
Manoj Iyer (manjo) wrote :

Closing this bug since there is no activity or response for the last 3 months. Please re-open if you think this is still an issue that needs to be resolved.

Changed in ubuntu-power-systems:
status: Incomplete → Invalid
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in linux (Ubuntu Bionic):
status: Incomplete → Invalid