Comment 5 for bug 1768115

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-05-04 09:09 EDT-------
(In reply to comment #15)
> This is not the same as the original bug, but I suspect they are part of a
> class of issues we're hitting while running under very particular
> circumstances which might not generally be seen during normal operation and
> triggering various corner cases. As such I think it makes sense to group
> them under this bug for the time being.
>
> The general workload is running IO-heavy disk workloads on large guests
> (20GB memory, 16 vcpus) with SAN-based storage, and then performing
> migration during the workload. During migration we begin to see a high
> occurrence of rcu_sched stall warnings, and after 1-3 hours of operations
> we hit filesystem-related crashes like the ones posted. We've seen this with
> 2 separate FC cards, emulex and qlogic, where we invoke QEMU through libvirt
> as:

We've been gathering additional traces while running under this scenario, and while so far most of the traces have been filesystem-related, we now have a couple that suggest the common thread between all of these failures is related to RCU-managed data structures. I'll attach the xmon summaries for these; they include the full dmesg log since guest start, timestamps in dmesg noting where each migration started/stopped, and "WATCHDOG" messages noting any large jumps in wall-clock time. For example (from boslcp3g1-migtest-fail-on-lcp5):

[ 5757.347542] migration iteration 7: started at Thu May 3 05:59:14 CDT 2018
<the guest gets stopped by libvirt after 60 seconds; once migration finishes it is resumed and we immediately see the below>
[ 5935.727884] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 5935.728567] 1-...!: (670 GPs behind) idle=486/140000000000000/0 softirq=218179/218180 fqs=0
[ 5935.730091] 2-...!: (3750 GPs behind) idle=006/140000000000000/0 softirq=203335/203335 fqs=0
[ 5935.731076] 4-...!: (96 GPs behind) idle=c2e/140000000000000/0 softirq=168607/168608 fqs=0
[ 5935.731783] 5-...!: (2270 GPs behind) idle=e16/140000000000000/0 softirq=152608/152608 fqs=1
[ 5935.732959] 6-...!: (322 GPs behind) idle=3ca/140000000000001/0 softirq=169452/169453 fqs=1
[ 5935.735061] 8-...!: (6 GPs behind) idle=c36/140000000000001/0 softirq=280514/280516 fqs=1
[ 5935.736638] 9-...!: (5 GPs behind) idle=c1e/140000000000001/0 softirq=248247/248249 fqs=1
[ 5935.738112] 10-...!: (4 GPs behind) idle=62a/1/0 softirq=228207/228208 fqs=1
[ 5935.738868] 11-...!: (32 GPs behind) idle=afe/140000000000000/0 softirq=228817/228818 fqs=1
[ 5935.739122] 12-...!: (3 GPs behind) idle=426/1/0 softirq=192716/192717 fqs=1
[ 5935.739295] 14-...!: (5 GPs behind) idle=e56/140000000000000/0 softirq=133888/133892 fqs=1
[ 5935.739486] 15-...!: (7 GPs behind) idle=36e/140000000000000/0 softirq=161010/161013 fqs=1
...
[ 5935.740031] Unable to handle kernel paging request for data at address 0x00000008
[ 5935.740128] Faulting instruction address: 0xc000000000403d04

For the prior iterations where we didn't crash, we'd see messages like:

[ 2997.413561] WATCHDOG (Thu May 3 05:13:18 CDT 2018): time jump of 114 seconds
[ 3023.759629] migration iteration 1: completed at Thu May 3 05:13:25 CDT 2018
[ 3239.678964] migration iteration 2: started at Thu May 3 05:16:45 CDT 2018

The WATCHDOG message notes how much time the guest sees jump forward after it resumes execution. These jumps are generally on the order of 1-2 minutes here, since we're doing migration via virsh migrate ... --timeout 60, which manually stops the guest if it hasn't finished migration within 60s.
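For reference, the invocation looks roughly like the following (the domain and destination host names are placeholders, and the exact options may differ slightly from our libvirt setup):

virsh migrate --live --verbose --timeout 60 guest01 qemu+ssh://desthost/system

With --timeout 60, libvirt suspends the guest once live migration has run for 60 seconds, and the remaining memory is then copied with the guest stopped, which is what produces the 1-2 minute gap the guest observes when it resumes on the destination.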

We now know that the skip in time actually originates from behavior on the source side of migration, due to handling within QEMU, and that the guest is simply reacting to it after it wakes up from migration. A patch has been sent which changes the behavior so that the guest doesn't see a jump in time after resuming:

http://lists.nongnu.org/archive/html/qemu-devel/2018-05/msg00928.html

The patch is still under discussion and it's not yet clear whether this is actually a QEMU bug or intended behavior. I'm still testing the patch in conjunction with the original workload and would like to see it run over the weekend before I can say anything with certainty, but so far it has run overnight, whereas prior to the change it would crash after an hour or two (though we have seen runs survive as long as 8 hours, so that's not definitive).

If that run survives, it would suggest that the RCU-related crashes occur as a result of a jump in the guest VCPU's timebase register. One interesting thing I've noticed is that with a QEMU that *doesn't have the patch above*, disabling RCU stall warning messages via:

echo 1 >/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress

allowed the workload to run for 16 hours without crashing. This may suggest that the warning messages themselves, in conjunction with rcu_cpu_stall_timeout being exceeded due to the jump in the timebase register, are triggering issues with RCU. What I plan to try next is raising rcu_cpu_stall_timeout to a much higher value (it currently defaults to 21 seconds on Ubuntu 18.04, it seems) and seeing if that can survive a run over the weekend.
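The plan is just to bump the module parameter at runtime, along the lines of the following (assuming the parameter sits alongside the suppress knob under /sys/module/rcupdate/parameters/; the value 300 is an arbitrary choice, picked to be well above the time jumps we see during migration):

# check the current stall timeout (seconds)
cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
# raise it well beyond the observed 1-2 minute jumps
echo 300 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout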

All this also suggests it may be possible to trigger the failures synthetically by running a high-IO workload, pausing the guest, advancing the VCPU tb_offset, and then resuming. This is something I plan to try soon to help narrow things down; a rough sketch of the idea is below.
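Roughly (the domain name and sleep duration are placeholders, and advancing tb_offset isn't something virsh exposes, so that step would have to be done inside QEMU itself, e.g. with an instrumented build):

# run heavy disk IO inside the guest against the SAN-backed disks (e.g. fio/dd)
# then, from the host:
virsh suspend guest01
# <advance the VCPU tb_offset via an instrumented QEMU here -- not shown>
sleep 120     # emulate the ~1-2 minute stop we see with --timeout 60
virsh resume guest01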