(arm64) VM fails to properly reboot

Bug #1731051 reported by Sean Feole on 2017-11-08
Affects                 Importance   Assigned to
Ubuntu Cloud Archive    Undecided    Unassigned
  Pike                  Undecided    Unassigned
qemu (Ubuntu)           Undecided    Unassigned
  Artful                Undecided    Unassigned

Bug Description

[Impact]

 * Newer qemu crashes on older kernels (on arm) for using a feature that
   was not supported by these older kernels.

 * Backport of a fix - also the detection code itself already exists in
   qemu - this just makes sure that if the feature is not available that
   the related function is not queued to prevent a crash.

[Test Case]

 * (on arm64 for the actual case - is a no-change everywhere else)
   1. create a virtual machine that runs fine
   2. suspend it
      $ sudo virsh dompmsuspend ubuntu1710 --target mem
   3. wake it up
      $ sudo virsh dompmwakeup ubuntu1710
   => Before the fix this sequence crashed qemu as outlined in the initial
      report below

[Regression Potential]

 * This only affects arm (thereby limiting regression for other
   architectures) and is a backport rather than a "change from scratch"
   (limiting risk again). Furthermore, "all it does" is stop adding the ITS
   action, a feature only added in Artful's qemu. That said, if there
   were a case where the detection is imperfect, even then the user
   would just fall back to how it worked in Zesty. That is a lot of
   IFs (=unlikely), and even then the impact would hopefully be minimal.
   So I think the regression risk is very low for this change.

[Other Info]

 * Even more important for backports of this like Ubuntu Cloud Archive

---

The Pike cloud archive has a regression compared to Ocata: rebooting a VM via virsh causes the VM to power down and then exit. The VM does not automatically power back up, but it can be restarted.

Repro:

Install 16.04.3 on an ARM64 host
Fully update the install
add-apt-repository cloud-archive:pike
apt-get update
apt-get install qemu-efi virt-manager libvirt-bin qemu-guest-agent qemu-system-aarch64
wget http://cdimage.ubuntu.com/ubuntu/releases/17.10/release/ubuntu-17.10-server-arm64.iso
create a new session via ssh (session B)
In session B: virt-install --accelerate --cdrom ubuntu-17.10-server-arm64.iso --disk size=10 --name ubuntu1710 --os-type linux --ram 1024
Once the install completes and the guest is at the login prompt, in session A: virsh reboot ubuntu1710 --mode acpi

Observed result:
The guest will power down as expected (from logs on session B), and then session B will be dumped back to the host shell. "virsh list" will not show the ubuntu1710 domain.

Expected result:
The guest powers back on, and boots back to the login prompt.
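
The observed/expected check above could be automated roughly as follows. This is a minimal sketch, assuming the text layout of "virsh list --all" shown elsewhere in this report; parse_virsh_list and rebooted_ok are hypothetical helper names, not part of libvirt.

```python
def parse_virsh_list(output):
    """Return {domain_name: state} from `virsh list --all` text output."""
    domains = {}
    for line in output.splitlines():
        # Split into at most three fields so "shut off" stays one state.
        fields = line.split(None, 2)
        # Skip blank lines, the dashed separator and the header row.
        if len(fields) < 3 or fields[0] == "Id":
            continue
        _dom_id, name, state = fields
        domains[name] = state.strip()
    return domains


def rebooted_ok(output, domain):
    """True if the domain is still listed and running after the reboot."""
    return parse_virsh_list(output).get(domain) == "running"


# Sample taken from the listing format quoted in this report.
SAMPLE = """\
 Id Name State
----------------------------------------------------
 1 ubuntu1710 running
 - instance-00000004 shut off
"""

if __name__ == "__main__":
    print(rebooted_ok(SAMPLE, "ubuntu1710"))
```

With the failure described here, the domain would be missing from the listing entirely, so rebooted_ok would return False for it.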

Analysis:
We observe these errors in various logs:

Nov 1 13:29:16 ubuntu libvirtd[2441]: 2017-11-01 20:29:16.882+0000: 2441: error : qemuMonitorIORead:595 : Unable to read from monitor: Connection reset by peer
Nov 1 13:29:16 ubuntu libvirtd[2441]: 2017-11-01 20:29:16.882+0000: 3101: error : qemuMonitorJSONCommandWithFd:309 : internal error: Missing monitor reply object

2017-11-01T20:29:16.538762Z qemu-system-aarch64: KVM_SET_DEVICE_ATTR failed: Group 4 attr 0x0000000000000001: No such device or address
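
For anyone scanning many guest logs for the same failure, the QEMU line above can be matched mechanically. The following is only a sketch: the regular expression is derived from the single line quoted in this report, and parse_kvm_attr_failure is a hypothetical helper name.

```python
import re

# Pattern derived from the one QEMU log line quoted in this report.
KVM_ATTR_FAILURE = re.compile(
    r"KVM_SET_DEVICE_ATTR failed: Group (?P<group>\d+) "
    r"attr (?P<attr>0x[0-9a-fA-F]+): (?P<reason>.+)$"
)


def parse_kvm_attr_failure(line):
    """Return (group, attr, reason) for a matching line, else None."""
    match = KVM_ATTR_FAILURE.search(line)
    if match is None:
        return None
    return (
        int(match.group("group")),
        int(match.group("attr"), 16),
        match.group("reason").strip(),
    )


LOG_LINE = (
    "2017-11-01T20:29:16.538762Z qemu-system-aarch64: "
    "KVM_SET_DEVICE_ATTR failed: Group 4 attr 0x0000000000000001: "
    "No such device or address"
)

if __name__ == "__main__":
    print(parse_kvm_attr_failure(LOG_LINE))
```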

We debugged this to an issue in the QEMU in Pike being incompatible with the 4.10 kernel of 16.04.3. The QEMU in this version attempts to use the ITS migration functionality during reboot. 4.10 does not support this. When the IOCTL fails, QEMU calls abort(), thus killing the VM.

We believe QEMU should not attempt to use this functionality if the host kernel does not support it. We suggest the attached patch to resolve the issue.
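
The idea of "probe first, only then hook up the action" can be illustrated with a toy model. This is not QEMU code and not the attached patch: FakeKvmDevice, has_attr, set_attr and register_state_handlers are hypothetical names, and the group/attr numbers are simply the ones from the log line above.

```python
class FakeKvmDevice:
    """Toy stand-in for a KVM device; not a real KVM binding."""

    def __init__(self, supported_attrs):
        self._supported = set(supported_attrs)

    def has_attr(self, group, attr):
        # Models a capability probe that reports whether the host supports it.
        return (group, attr) in self._supported

    def set_attr(self, group, attr):
        # Models the call that fails on a host lacking the feature.
        if not self.has_attr(group, attr):
            raise OSError(
                "KVM_SET_DEVICE_ATTR failed: Group %d attr 0x%016x" % (group, attr)
            )


# Values taken from the log line in this report; the names are assumptions.
ITS_GROUP = 4
ITS_SAVE_ATTR = 1


def register_state_handlers(dev, handlers):
    """Queue the ITS save action only if the host reports support for it."""
    if dev.has_attr(ITS_GROUP, ITS_SAVE_ATTR):
        handlers.append(lambda: dev.set_attr(ITS_GROUP, ITS_SAVE_ATTR))
    return handlers


if __name__ == "__main__":
    old_host = FakeKvmDevice([])                      # e.g. a host without ITS save/restore
    new_host = FakeKvmDevice([(ITS_GROUP, ITS_SAVE_ATTR)])
    print(len(register_state_handlers(old_host, [])))
    print(len(register_state_handlers(new_host, [])))
```

On the modelled older host nothing is queued, so the failing call is never reached during a reboot or suspend; on the newer host the handler is queued and runs without error.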

Sean Feole (sfeole) wrote :

The attachment "qemu-patch.txt" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch

Hi Sean,
thanks for the details and the suggested fix.

This is pretty much intertwined with the two bugs:
- bug 1710019 - Dannf trying to enable just said ITS feature in Xenial (for save restore)
- bug 1731066 - seems to be the same issue just on save/restore

As far as I understand the situation we want to ensure:
1. that once Dannf enables ITS support on older qemu in Xenial it does not trigger the same issue (so whatever we come up with here is needed in Xenial as well if Dannf ports the other bug there)
2. That the solution to detect and not use ITS is the accepted one.

So for now I'll subscribe Dannf to be aware of this as well.

Furthermore the next step is to drive the change upstream into qemu 2.11 before backporting.
Because if we end up with multiple conditions for ITS detection we might be in a bad love triangle of LTS/UCA/Devel.
Therefore since you can test it and wrote the change I wanted to ask you Sean if you want to drive this into qemu upstream or if you want me to try to do so?
The change itself LGTM on a logical level, some whitespace damage though.

Actually do you want/need a ppa with that?
Without checking any further I have thrown something into [1] - but no guarantees on build/work.
Just a polished version of your patch - but we can iterate on that as needed.

Once it is in QEMU we can check where we backport the fix and whether it has to come prior to, or be combined with, the fixes in discussion at 1710019.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3032/+packages

incomplete waiting for Sean to clarify next steps

Changed in qemu (Ubuntu):
status: New → Incomplete

Arr, I remember there was something with fdt going back to Xenial. Let me know if you need the PPA and I'll fix it up, Sean.
But there are so many open questions that I'll wait rather than go in the wrong direction atm.

Sean Feole (sfeole) wrote :

Hey Christian, I was going to test your build in the PPA but it failed, that would be great if you wouldn't mind fixing it up. I could easily verify it as I already have the test env built. The patch was proposed via a private LP bug, and is already being driven upstream here @ https://patchwork.kernel.org/patch/10039969/

Changed in qemu (Ubuntu):
status: Incomplete → Triaged

Hey,
great to hear that this is already on its way to upstream - with that hint I found it.
It is even further already in 2.11-rc0 as [1].

With that in mind it would be easy to pick that for Artful, but you need a Xenial-UCA setup based on Pike for your verification.

Instead of modifying the Artful/Bionic 2.10 I'll take the source from cloud-archive and add it there - a few dependency magic retries later it built - that should work better for your verification now. It built for three arches already, arm takes some more time - but it should be good.

[1]: https://git.qemu.org/?p=qemu.git;a=commit;h=3a575cd2c2411f139a95ace4b2523bc1dfd21755

Waiting for Sean to confirm the ppa fixes the issue to then SRU for Artful.
Marking Task to be clear on that.

Changed in qemu (Ubuntu):
status: Triaged → Incomplete
Sean Feole (sfeole) wrote :

It appears my update to this bug never saved properly:

I was able to test the following PPA: ppa:ci-train-ppa-service/3032

QEMU: 1:2.10+dfsg-0ubuntu4~ppa7

After executing the steps listed in my description,

The Virtual Machine properly reboots as expected.

$ sudo virsh reboot ubuntu1710 --mode acpi
Domain ubuntu1710 is being rebooted

$ sudo virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     ubuntu1710                     running
 -     instance-00000004              shut off

I have also tested suspend capabilities which also appear to be somewhat working, at least better than where I was before:

$ sudo virsh dompmsuspend ubuntu1710 --target mem
Domain ubuntu1710 successfully suspended

$ sudo virsh list --all
 Id    Name                           State
----------------------------------------------------
 2     ubuntu1710                     running
 -     instance-00000004              shut off

In the logs I see after the suspend is issued:

Nov 14 03:12:05 awrep4 libvirtd[3451]: 2017-11-14 03:12:05.265+0000: 3787: error : qemuDomainAgentAvailable:6078 : Guest agent is not responding: QEMU guest agent is not connected

Which I believe is the instance suspending itself thus disconnecting the guest agent, so probably nothing to worry about there.

But when the wakeup is issued: I can't reconnect to the instance:

$ sudo virsh dompmwakeup ubuntu1710
Domain ubuntu1710 successfully woken up

So I believe that this should be chased down in a different bug, and the original problem for which this bug was open (VM Reboot) has been fixed!!
Thanks Christian

Ok, thanks for the check Sean!
I also found no issues with the code in a little regression check against an Artful PPA of this.

Now the plan of action from here is:
1. fix it in Bionic
2. fix it in Artful as SRU
3. UCA Team will pick up the change into Pike where you need it.

For 1 there is a current upload still in flight for bug LP: #1726394.
That was blocked from migrating by a known issue that is now resolved; it should be out of the way any minute, and then I can upload the fix for this to Bionic.

Changed in qemu (Ubuntu):
status: Incomplete → Triaged
Changed in qemu (Ubuntu Artful):
status: New → Triaged
Changed in qemu (Ubuntu):
status: Triaged → In Progress

Ok, the former upload in Bionic migrated.
Pushing this fix now, we will continue once it is fully migrated (still somewhat long test queues atm).

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.10+dfsg-0ubuntu5

---------------
qemu (1:2.10+dfsg-0ubuntu5) bionic; urgency=medium

  * d/p/detect-ITS-and-skip-usage-on-older-kernel.patch to avoid crashes
    on arm64 when doing suspend/resume and reboots due to older kernels not
    supporting ITS (LP: #1731051).

 -- Christian Ehrhardt <email address hidden> Tue, 14 Nov 2017 08:30:29 +0100

Changed in qemu (Ubuntu):
status: In Progress → Fix Released

Uploaded to Artful-unapproved and added SRU Template - waiting on SRU Team now.

description: updated
Changed in qemu (Ubuntu Artful):
status: Triaged → In Progress

Also subscribed a Cloud Archive task as they are the most relevant "victim" of the bug by having a qemu backport to Xenial.

Hello Sean, or anyone else affected,

Accepted qemu into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.10+dfsg-0ubuntu3.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in qemu (Ubuntu Artful):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-artful

Hi Sean,
I know you usually run this with UCA on 16.04.
It will be a bit of special modification - e.g. an older kernel to trigger or so, but should work.
Do you think you can test this on Artful as well?

Sean Feole (sfeole) wrote :

Hey Christian,

I was able to verify the packages in artful: QEMU, 1:2.10+dfsg-0ubuntu3.1

The reboot command works as expected: $ sudo virsh reboot <DOMAIN> --mode acpi

The VM will also suspend as expected; however, as I experienced on Xenial, I was unable to contact the VM after a wakeup.

Sean Feole (sfeole) wrote :

ubuntu@lundmark:~$ dpkg -l | grep qemu
ii ipxe-qemu 1.0.0+git-20161027.b991c67+really20150424.a25a16d-1ubuntu2 all PXE boot firmware - ROM images for qemu
ii qemu-block-extra:arm64 1:2.10+dfsg-0ubuntu3.1 arm64 extra block backend modules for qemu-system and qemu-utils
ii qemu-efi 0~20170911.5dfba97c-1ubuntu0.1 all transitional dummy package
ii qemu-efi-aarch64 0~20170911.5dfba97c-1ubuntu0.1 all UEFI firmware for 64-bit ARM virtual machines
ii qemu-guest-agent 1:2.10+dfsg-0ubuntu3.1 arm64 Guest-side qemu-system agent
ii qemu-kvm 1:2.10+dfsg-0ubuntu3.1 arm64 QEMU Full virtualization
ii qemu-system-aarch64 1:2.10+dfsg-0ubuntu3.1 arm64 QEMU full system emulation binaries (aarch64)
ii qemu-system-arm 1:2.10+dfsg-0ubuntu3.1 arm64 QEMU full system emulation binaries (arm)
ii qemu-system-common 1:2.10+dfsg-0ubuntu3.1 arm64 QEMU full system emulation binaries (common files)
ii qemu-utils 1:2.10+dfsg-0ubuntu3.1 arm64 QEMU utilities
ubuntu@lundmark:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 17.10
Release: 17.10
Codename: artful
ubuntu@lundmark:~$
ubuntu@lundmark:~$
ubuntu@lundmark:~$ sudo virsh list --all
 Id    Name                           State
----------------------------------------------------
 7     ubuntu1710111                  running
 -     ubuntu1710                     shut off

ubuntu@lundmark:~$ sudo virsh destroy ubuntu1710111
Domain ubuntu1710111 destroyed

ubuntu@lundmark:~$ sudo virsh start ubuntu1710111
Domain ubuntu1710111 started

ubuntu@lundmark:~$ sudo virsh reboot ubuntu1710111 --mode acpi
Domain ubuntu1710111 is being rebooted

VM Console
<SNIP>

[ OK ] Started Set console scheme.
[ OK ] Created slice system-getty.slice.
[ OK ] Started Getty on tty1.
[ OK ] Reached target Login Prompts.
[ OK ] Started LSB: QEMU Guest Agent startup script.
[ OK ] Started LSB: automatic crash report generation.

Ubuntu 17.10 ubuntu ttyAMA0

ubuntu login: [ OK ] Closed Load/Save RF Kill Switch Status /dev/rfkill Watch.
[ OK ] Stopped target Graphical Interface.
[ OK ] Stopped target Multi-User System.
[ OK ] Stopped target Login Prompts.
         Stopping Serial Getty on ttyAMA0...
         Stopping Snappy daemon...
         Stopping Getty on tty1...
         Stopping System Logging Service...
         Stopping Authorization Manager...

</SNIP>

Hi Sean, thanks for the verification.
Setting tags accordingly.

For the issue after wakeup (which without this change never worked at all) you might file a new bug. I have no idea yet what it is about, but it is unrelated to this fix, which at least fixes your reboot issues and doesn't make wakeup worse :-)

tags: added: verification-done verification-done-artful
removed: verification-needed verification-needed-artful
Changed in cloud-archive:
status: New → Invalid
status: Invalid → Fix Released
Corey Bryant (corey.bryant) wrote :

Hello Sean, or anyone else affected,

Accepted qemu into pike-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:pike-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-pike-needed to verification-pike-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-pike-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-pike-needed
Sean Feole (sfeole) wrote :

Hey Corey,

All set with Xenial + pike-proposed with the 4.10.0-40-generic aarch64 kernel

  Installed: 1:2.10+dfsg-0ubuntu3.1~cloud0
  Candidate: 1:2.10+dfsg-0ubuntu3.1~cloud0
  Version table:
 *** 1:2.10+dfsg-0ubuntu3.1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu xenial-proposed/pike/main arm64 Packages
        100 /var/lib/dpkg/status

tags: added: verification-done-pike
removed: verification-pike-needed
Robie Basak (racb) wrote :

Please could someone check the autopkgtest failures listed against this SRU in http://people.canonical.com/~ubuntu-archive/pending-sru.html?

Sure Robie, thanks for the ping on this older SRU.
In general, since this is an arm64-only change it won't be the trigger of those test failures (none is on arm).
Nevertheless I'll check in more detail and let you know then.

Hi,
I got all resolved now except systemd on s390x which is bug 1736955.
Therefore I ask to migrate this SRU.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.10+dfsg-0ubuntu3.1

---------------
qemu (1:2.10+dfsg-0ubuntu3.1) artful; urgency=medium

  * d/p/detect-ITS-and-skip-usage-on-older-kernel.patch to avoid crashes
    on arm64 when doing suspend/resume and reboots due to older kernels not
    supporting ITS (LP: #1731051).

 -- Christian Ehrhardt <email address hidden> Wed, 15 Nov 2017 07:49:33 +0100

Changed in qemu (Ubuntu Artful):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package qemu - 1:2.10+dfsg-0ubuntu3.1~cloud0
---------------

 qemu (1:2.10+dfsg-0ubuntu3.1~cloud0) xenial-pike; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 qemu (1:2.10+dfsg-0ubuntu3.1) artful; urgency=medium
 .
   * d/p/detect-ITS-and-skip-usage-on-older-kernel.patch to avoid crashes
     on arm64 when doing suspend/resume and reboots due to older kernels not
     supporting ITS (LP: #1731051).
