crash on s390 in kvm run due to background load on postcopy

Bug #1704829 reported by Christian Ehrhardt 
Affects                  Status   Importance  Assigned to  Milestone
Ubuntu on IBM z Systems  Invalid  Undecided   bugproxy
qemu (Ubuntu)            Invalid  Undecided   Unassigned

Bug Description

Hi,
I ran rather often (though not 100% reproducibly) into an issue that I wanted to document and to ask whether it is some sort of known issue.

On migration with options like:
$ virsh migrate --live --postcopy --postcopy-after-precopy kvmguest-zesty-postcopy qemu+ssh://10.93.175.192/system

Note: All other migration types we test are working.
Even postcopy is fine without background workload.
Only this combination of postcopy-after-precopy plus background workload seems to make it fail.
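
For reference, the automated tests drive a few hundred of these migrations; a minimal back-and-forth loop would look roughly like this (a sketch, not the actual test code; SRC/TGT and passwordless ssh between the hosts are assumptions):

$ SRC=$(hostname -f); TGT=10.93.175.192
$ while true; do
    virsh migrate --live --postcopy --postcopy-after-precopy \
      kvmguest-zesty-postcopy qemu+ssh://$TGT/system
    ssh $TGT virsh migrate --live --postcopy --postcopy-after-precopy \
      kvmguest-zesty-postcopy qemu+ssh://$SRC/system
  done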

FYI - the background load on a 4-vcpu guest is (bundled into one helper script in the sketch after this list):
- nohup stress-ng -m 1 --vm-keep --vm-bytes 256M 1>/dev/null 2>&1 &
- nohup md5sum /dev/urandom 1>/dev/null 2>&1 &
- nohup bash -c "while /bin/true; do dd if=/dev/urandom of=/var/tmp/mjb.1 bs=4M count=100; done" 1>/dev/null 2>&1 &
That load runs in 3 of those guests on an 8-CPU host,
so the load alone keeps more than the 8 CPUs we have busy.
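
For convenience, that can be bundled into one helper script run inside each guest (a sketch; the name start-bg-load.sh is made up, and stress-ng is assumed to be installed):

#!/bin/bash
# start-bg-load.sh - start the three background load generators listed above
nohup stress-ng -m 1 --vm-keep --vm-bytes 256M 1>/dev/null 2>&1 &
nohup md5sum /dev/urandom 1>/dev/null 2>&1 &
nohup bash -c "while /bin/true; do dd if=/dev/urandom of=/var/tmp/mjb.1 bs=4M count=100; done" 1>/dev/null 2>&1 &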

The migration is accounted a success on the initiator, but the guest ends up paused on the target:
 State: paused
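
On the target, virsh can also show why the domain is paused, which makes spotting a failed postcopy easier (the exact reason string is from memory and may differ):

$ virsh domstate --reason kvmguest-zesty-postcopy
paused (post-copy failed)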

The error I get from qemu on kvm run looks like this:
cat /var/log/libvirt/qemu/kvmguest-zesty-postcopy.log
[...]
error: kvm run failed Bad address
PSW=mask 0404d00180000000 addr 0000000000831996 cc 00
R00=0000000021109f7a R01=000000002b218d03 R02=00000000f8e7ec87 R03=0000000053e86c4c
R04=000000005a446a8d R05=0000000099f29f74 R06=0000000037fee6fa R07=000000009640eb9d
R08=00000000ac47c987 R09=0000000089a8182d R10=070000004656507d R11=00000000b1edcf28
R12=00000000d915d7c0 R13=00000000008a5eb0 R14=000000000060f146 R15=000000001da3bc58
F00=000003ffc0f7eb58 F01=000002aa112cc260 F02=000002aa10c88b40 F03=0000000000008000
F04=0000000000008000 F05=000003ffc0f7eeb0 F06=000002aa112cc030 F07=000003ffc0f7ebfc
F08=000002aa10c8d100 F09=000003ffa9b92200 F10=0000000021deb968 F11=000002aa3f7a9820
F12=0000000021dea7c8 F13=000003ffcdcfeaa8 F14=000003ffefc7f390 F15=000003ffc0f7eea8
V00=000003ffc0f7eb580000000000000000 V01=000002aa112cc2600000000000000000
V02=000002aa10c88b400000000000000000 V03=00000000000080000000000000000000
V04=00000000000080000000000000000000 V05=000003ffc0f7eeb00000000000000000
V06=000002aa112cc0300000000000000000 V07=000003ffc0f7ebfc0000000000000000
V08=000002aa10c8d1000000000000000000 V09=000003ffa9b922000000000000000000
V10=0000000021deb9680000000000000000 V11=000002aa3f7a98200000000000000000
V12=0000000021dea7c80000000000000000 V13=000003ffcdcfeaa80000000000000000
V14=000003ffefc7f3900000000000000000 V15=000003ffc0f7eea80000000000000000
V16=00000000000000050000000000000000 V17=00000000000000060000000000000000
V18=40404040404040404040404040404040 V19=00000000000000050000000000000000
V20=0f0e0d0c0b0a09080706050403020100 V21=ffffffff00ffff000000000000000000
V22=0000ff00000000000000000000000000 V23=00000000000000000000000000000000
V24=00000000000000000000000000000000 V25=00000000000000000000000000000000
V26=00000000000000000000000000000000 V27=00000000000000000000000000000000
V28=00000000000000000000000000000000 V29=00000000000000000000000000000000
V30=000002aa0ba5bc300000000000000000 V31=00000000010e14190000000000000001
C00=0080000014866a10 C01=000000001d3d41c7 C02=0000000000011140 C03=0000000000000000
C04=0000000000000a74 C05=0000000000000400 C06=0000000010000000 C07=000000001d3d41c7
C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
C12=0000000000000000 C13=0000000000d6c007 C14=00000000db000000 C15=0000000000011280
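
To decode the message: "Bad address" is the strerror() text for EFAULT, which QEMU prints when the KVM_RUN ioctl fails, so presumably a guest page had no valid mapping on the target; that would fit a postcopy page that was never delivered. The errno mapping can be confirmed quickly (the errno tool ships in the moreutils package):

$ errno EFAULT
EFAULT 14 Bad address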

FYI: Our machine is generally very slow, especially on I/O, but also on CPU when the builders are busy. The same test ran fine a few days ago; the failure seems to depend on the overall machine load adding up with the background load of the migration test, which in turn is enough to break it on s390x.
Note: It is also a very unfair comparison; we have 8 cores on s390x, while on x86 and ppc we have far more.

I haven't caught it "live" so far to debug it any further; only through automated testing did I realize that it occurs at least once every other week.

Affected releases seem to be Yakkety (libvirt 2.1 / qemu 2.6.1) and Zesty (libvirt 2.5 / qemu 2.8).
As soon as our Artful stack is fully done I'll add results for it as well.
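
For reference, the exact package versions can be read via dpkg (package names as they are in these releases; later releases split libvirt-bin up):

$ dpkg-query -W -f='${Package} ${Version}\n' libvirt-bin qemu-system-s390x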

For now, a check against known issues would be nice.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
assignee: nobody → bugproxy (bugproxy)
bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-156764 severity-high targetmilestone-inin1704
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-07-18 03:33 EDT-------
I assume the crash is on the target (not the source). Do you have any dmesg messages from that system? Does the target system have enough memory/swap?

Frank Heimes (fheimes)
tags: added: s390x
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Yes the crash was on the target.
The system has been rebooted, but since the issue didn't crash the host the logs should be intact.
In /var/log/kern.log there are other kernel messages around that time, but no crash or similar (though I'd have to check whether a crash would even end up there).

If it stumbles over the same issue again I'll report back while the system is still up.

On the memory: the host has ~10G and has to hold 0-3 guests with 2GB each (not fully used).
There is also a 1G safety swap disk.
So memory might be tight, but should not be exhausted.

I'll check that in detail as well once it fails again, e.g. with a quick snapshot like the one below.
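
Something like this on the target right after a failure should settle the memory question (standard tools, nothing bug-specific):

$ free -m && swapon --show && dmesg -T | tail -n 50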

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-08-08 09:08 EDT-------
@paelzer: any news? Or still no recurring error?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

It has been good for the last three weeks, with twice a few hundred automated migrations in tests.
Closing for now; thanks for the ping, Heinz-Werner.

Changed in qemu (Ubuntu):
status: New → Invalid
Changed in ubuntu-z-systems:
status: New → Invalid
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-08-09 04:42 EDT-------
IBM Bugzilla Status-> closed, not a Bug
