QEMU

[arm64 ocata] newly created instances are unable to raise network interfaces

Bug #1719196 reported by Sean Feole on 2017-09-24

This bug affects 2 people

	Status	Importance	Assigned to
QEMU	Fix Released	Undecided	Unassigned
Ubuntu Cloud Archive	Fix Released	Undecided	Unassigned
Ocata	Fix Committed	High	Unassigned
libvirt	New	Undecided	Unassigned
libvirt (Ubuntu)	Invalid	Undecided	Unassigned
qemu (Ubuntu)	Fix Released	Undecided	Unassigned
Zesty	Fix Released	Undecided	Unassigned

Bug Description

[Impact]

* A change in qemu 2.8 (83d768b virtio: set ISR on dataplane
notifications) broke virtio handling on platforms without a
controller. Those encounter flaky networking due to missed IRQs

* Fix is a backport of the upstream fix b4b9862b: virtio: Fix no
interrupt when not creating msi controller

[Test Case]

* On Arm with Zesty (or Ocata) run a guest without PCI based devices

* Example in e.g. c#23

* Without the fix the networking does not work reliably (as it losses
IRQs), with the fix it works fine.

[Regression Potential]

* Changing the IRQ handling of virtio could affect virtio in general.
   But when reviwing the patch you'll see that it is small and actually
   only changes to enable IRQ on one more place. That could cause more
   IRQs than needed in the worst case, but those are usually not
   breaking but only slowing things down. Also this fix is upstream
   quite a while, increasing confidence.

[Other Info]

* There is currently 1720397 in flight in the SRU queue, so acceptance
of this upload has to wait until that completes.

---

arm64 Ocata ,

I'm testing to see I can get Ocata running on arm64 and using the openstack-base bundle to deploy it. I have added the bundle to the log file attached to this bug.

When I create a new instance via nova, the VM comes up and runs, however fails to raise its eth0 interface. This occurs on both internal and external networks.

ubuntu@openstackaw:~$ nova list
+--------------------------------------+---------+--------+------------+-------------+--------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+---------+--------+------------+-------------+--------------------+
| dcaf6d51-f81e-4cbd-ac77-0c5d21bde57c | sfeole1 | ACTIVE | - | Running | internal=10.5.5.3 |
| aa0b8aee-5650-41f4-8fa0-aeccdc763425 | sfeole2 | ACTIVE | - | Running | internal=10.5.5.13 |
+--------------------------------------+---------+--------+------------+-------------+--------------------+
ubuntu@openstackaw:~$ nova show aa0b8aee-5650-41f4-8fa0-aeccdc763425
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | awrep3 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | awrep3.maas |
| OS-EXT-SRV-ATTR:instance_name | instance-00000003 |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2017-09-24T14:23:08.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2017-09-24T14:22:41Z |
| flavor | m1.small (717660ae-0440-4b19-a762-ffeb32a0575c) |
| hostId | 5612a00671c47255d2ebd6737a64ec9bd3a5866d1233ecf3e988b025 |
| id | aa0b8aee-5650-41f4-8fa0-aeccdc763425 |
| image | zestynosplash (e88fd1bd-f040-44d8-9e7c-c462ccf4b945) |
| internal network | 10.5.5.13 |
| key_name | mykey |
| metadata | {} |
| name | sfeole2 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | ACTIVE |
| tenant_id | 9f7a21c1ad264fec81abc09f3960ad1d |
| updated | 2017-09-24T14:23:09Z |
| user_id | e6bb6f5178a248c1b5ae66ed388f9040 |
+--------------------------------------+----------------------------------------------------------+

As seen above the instances boot an run. Full Console output is attached to this bug.

[ OK ] Started Initial cloud-init job (pre-networking).
[ OK ] Reached target Network (Pre).
[ OK ] Started AppArmor initialization.
Starting Raise network interfaces...
[FAILED] Failed to start Raise network interfaces.
See 'systemctl status networking.service' for details.
Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Reached target Network.
[ 315.051902] cloud-init[881]: Cloud-init v. 0.7.9 running 'init' at Fri, 22 Sep 2017 18:29:15 +0000. Up 314.70 seconds.
[ 315.057291] cloud-init[881]: ci-info: +++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++
[ 315.060338] cloud-init[881]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 315.063308] cloud-init[881]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 315.066304] cloud-init[881]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 315.069303] cloud-init[881]: ci-info: | eth0: | True | . | . | . | fa:16:3e:39:4c:48 |
[ 315.072308] cloud-init[881]: ci-info: | eth0: | True | . | . | d | fa:16:3e:39:4c:48 |
[ 315.075260] cloud-init[881]: ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . | . |
[ 315.078258] cloud-init[881]: ci-info: | lo: | True | . | . | d | . |
[ 315.081249] cloud-init[881]: ci-info: +--------+------+-----------+-----------+-------+-------------------+
[ 315.084240] cloud-init[881]: 2017-09-22 18:29:15,393 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0xffffb10794e0>: Failed to establish a new connection: [Errno 101] Network is unreachable',))]

----------------

I have checked all services in neutron and made sure that they are running and restarted in neutron-gateway

[ + ] neutron-dhcp-agent
[ + ] neutron-l3-agent
[ + ] neutron-lbaasv2-agent
[ + ] neutron-metadata-agent
[ + ] neutron-metering-agent
[ + ] neutron-openvswitch-agent
[ + ] neutron-ovs-cleanup

and have also restarted and checked the nova-compute logs for their neutron counterparts.

[ + ] neutron-openvswitch-agent
[ + ] neutron-ovs-cleanup

There are some warnings/errors in the neutron-gateway logs which I have attached the full tarball in a separate attachment to this bug:

2017-09-24 14:39:33.152 10322 INFO ryu.base.app_manager [-] instantiating app ryu.controller.ofp_handler of OFPHandler
2017-09-24 14:39:33.153 10322 INFO ryu.base.app_manager [-] instantiating app ryu.app.ofctl.service of OfctlService
2017-09-24 14:39:39.577 10322 ERROR neutron.agent.ovsdb.impl_vsctl [req-2f084ae8-13dc-47dc-b351-24c8f3c57067 - - - - -] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--id=@manager', 'create', 'Manager', 'target="ptcp:6640:127.0.0.1"', '--', 'add', 'Open_vSwitch', '.', 'manager_options', '@manager']. Exception: Exit code: 1; Stdin: ; Stdout: ; Stderr: ovs-vsctl: transaction error: {"details":"Transaction causes multiple rows in \"Manager\" table to have identical values (\"ptcp:6640:127.0.0.1\") for index on column \"target\". First row, with UUID e02a5f7f-bfd2-4a1d-ae3c-0321db4bd3fb, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 6e9aba3a-471a-4976-bffd-b7131bbe5377, was inserted by this transaction.","error":"constraint violation"}

These warnings/errors also occur on the nova-compute hosts in /var/log/neutron/

2017-09-22 18:54:52.130 387556 INFO ryu.base.app_manager [-] instantiating app ryu.app.ofctl.service of OfctlService
2017-09-22 18:54:56.124 387556 ERROR neutron.agent.ovsdb.impl_vsctl [req-e291c2f9-a123-422c-be7c-fadaeb5decfa - - - - -] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--id=@manager', 'create', 'Manager', 'target="ptcp:6640:127.0.0.1"', '--', 'add', 'Open_vSwitch', '.', 'manager_options', '@manager']. Exception: Exit code: 1; Stdin: ; Stdout: ; Stderr: ovs-vsctl: transaction error: {"details":"Transaction causes multiple rows in \"Manager\" table to have identical values (\"ptcp:6640:127.0.0.1\") for index on column \"target\". First row, with UUID 9f27ddee-9881-4cbc-9777-2f42fee735e9, was inserted by this transaction. Second row, with UUID ccf0e097-09d5-449c-b353-6b69781dc3f7, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}

I'm not sure if the above error could be pertaining to the failure, I have also attached those logs to this bug as well...

Should neutron-openvswitch be assigned to the 'nova' availability zone?

I have also attached the ovs-vsctl show from a nova-compute host and the neutron-gateway host to ensure that the open v switch routes are correct.

I can supply more logs if required.

See original description

Tags:

Revision history for this message

Sean Feole (sfeole) wrote on 2017-09-24:

neutron-logs.tar Edit (400.0 KiB, application/x-tar)

tags:

added: arm64

Revision history for this message

Sean Feole (sfeole) wrote on 2017-09-24:

neutron-logs-for-novacompute.tar Edit (70.0 KiB, application/x-tar)

Revision history for this message

Sean Feole (sfeole) wrote on 2017-09-24:

ovs-vsctl-show.txt Edit (3.4 KiB, text/plain)

Revision history for this message

Sean Feole (sfeole) wrote on 2017-09-24:

ocata-arm64-neutronbug.txt Edit (76.7 KiB, text/plain)

Revision history for this message

Sean Feole (sfeole) wrote on 2017-09-24:

juju status
bundle file
vm console output
and nova service list

can be found in ocata-arm64-neutronbug.txt

Revision history for this message

Liam Young (gnuoy) wrote on 2017-09-28:

Running tcpdump on the guests tap device shows the dhcp request leaving and the reply coming back.

Revision history for this message

Andrew McLeod (admcleod) wrote on 2017-09-29:

This may be qemu related, rather than libvirt - further investigation required - but unlikely to be a problem with neutron-gateway charm

affects:

charm-neutron-gateway → libvirt

Revision history for this message

Andrew McLeod (admcleod) wrote on 2017-09-29:

I have confirmed this bug with another arm64 + ocata deployment.

Revision history for this message

Andrew McLeod (admcleod) wrote on 2017-09-29:

Further summary:

we can see dhcp requests leaving and returning to the tap interface associated with the guest, but the guest does not register the returned packet (does not acquire an address).

tags:

added: libvirt qemu uosci

Revision history for this message

Sean Feole (sfeole) wrote on 2017-10-02:

#10

I was also able to reproduce this again on Cavium ThunderX hardware,

ii ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu1 all PXE boot firmware - ROM images for qemu
ii qemu-block-extra:arm64 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 extra block backend modules for qemu-system and qemu-utils
ii qemu-efi 0~20160408.ffea0a2c-2 all UEFI firmware for virtual machines
ii qemu-kvm 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 QEMU Full virtualization
ii qemu-system-arm 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 QEMU full system emulation binaries (arm)
ii qemu-system-common 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 QEMU full system emulation binaries (common files)
ii qemu-utils 1:2.8+dfsg-3ubuntu2.3~cloud0 arm64 QEMU utilities

-------------------

ii libvirt-bin 2.5.0-3ubuntu5.4~cloud0 arm64 programs for the libvirt library
ii libvirt-clients 2.5.0-3ubuntu5.4~cloud0 arm64 Programs for the libvirt library
ii libvirt-daemon 2.5.0-3ubuntu5.4~cloud0 arm64 Virtualization daemon
ii libvirt-daemon-system 2.5.0-3ubuntu5.4~cloud0 arm64 Libvirt daemon configuration files
ii libvirt0:arm64 2.5.0-3ubuntu5.4~cloud0 arm64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 2:15.0.6-0ubuntu1~cloud0 all OpenStack Compute - compute node libvirt support
ii python-libvirt 3.0.0-2~cloud0 arm64 libvirt Python bindings

Revision history for this message

Sean Feole (sfeole) wrote on 2017-10-02:

#11

Just to further add to comment #6, That this is not a neutron issue, Here is the tcpdump output of the guests tap device shows the dhcp request leaving and the reply coming back. For anyways that may be curious.

Guest MAC:

$ sudo ip netns exec qdhcp-a9958ab4-8a7e-4ded-b9a0-860bc42f79d9 tcpdump -A -l -i ns-c751afb3-2b
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ns-c751afb3-2b, link-type EN10MB (Ethernet), capture size 262144 bytes
13:51:11.458941 IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
`....$..................................:...................................
13:51:11.462966 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from fa:16:3e:1d:a8:82 (oui Unknown), length 300
E..H......9..........D.C.4q.....5QN{......................>.............................................................................................................................................................................................................c.Sc5....ubuntu7.......w.,/.y*..................................
13:51:11.463331 IP 10.5.5.2.bootps > 10.5.5.9.bootpc: BOOTP/DHCP, Reply, length 328
E..dY'..@...

Revision history for this message

Sean Feole (sfeole) wrote on 2017-10-04:

#12

notes1.txt Edit (37.8 KiB, text/plain)

Today I ran some tests and installed a Newton Deployment on arm64, which we already know works. I upgraded QEMU and Libvirt on one of the hypervisors from the xenial-updates/ocata cloud-archive.

See attached notes.

Libvirt - 1.3.1-1ubuntu10.14 -> 2.5.0-3ubuntu5.5~cloud0
QEMU - 1:2.5+dfsg-5ubuntu10.16 -> 1:2.8+dfsg-3ubuntu2.3~cloud0

I was able to reset the already built instance on the hypervisor and was able to receive a dhcp response from the ovs tap device. Eth0 came up as expected with an internal tenant IP.

Steps to reproduce.

1.) Install Newton & start a few VM's
2.) Choose 1 hypervisor , upgrade qemu & libvirt to versions from the Ocata cloud-archive.
3.) Reset the running VM so that it now runs with the latest QEMU/Libvirt
4.) Reset the Instance, see if it boots and network can be reached.

Revision history for this message

Sean Feole (sfeole) wrote on 2017-10-04:

#14

Taken from the upgraded hypervisor

ubuntu@aw3:/var/lib/nova/instances/2cec409e-de92-4d29-ad68-3f1d1f8be7fc$ sudo qemu-system-aarch64 --version
QEMU emulator version 2.8.0(Debian 1:2.8+dfsg-3ubuntu2.3~cloud0)
Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers

ubuntu@aw3:/var/lib/nova/instances/2cec409e-de92-4d29-ad68-3f1d1f8be7fc$ sudo libvirtd --version
libvirtd (libvirt) 2.5.0

Revision history for this message

Sean Feole (sfeole) wrote on 2017-10-05:

#15

xmlnotes.txt Edit (8.8 KiB, text/plain)

Download full text (3.7 KiB)

I was able to gather libvirt XMLs from both Newton and Ocata Instances, see attached

sfeole@BSG-75:~$ diff xmlocata xmlnewton
1,2c1,2
<
< main type='kvm' id='1'>
---
> ubuntu@lundmark:/var/lib/nova/instances/358596e4-135d-461d-a514-84116440014c$ sudo virsh dumpxml instance-00000001
> <domain type='kvm' id='1'>
4c4
< <uuid>7c0dcd78-d6b4-4575-a882-ee5d29c64fe0</uuid>
---
> <uuid>358596e4-135d-461d-a514-84116440014c</uuid>
7,9c7,9
< <nova:package version="15.0.6"/>
< <nova:name>sfeole14</nova:name>
< <nova:creationTime>2017-10-05 00:58:15</nova:creationTime>
---
> <nova:package version="14.0.8"/>
> <nova:name>sfeole-newton</nova:name>
> <nova:creationTime>2017-10-05 01:40:39</nova:creationTime>
18,19c18,19
< <nova:user uuid="d9f92a61c37948d9a29b8cc37e1bca05">admin</nova:user>
< <nova:project uuid="701441267bd148d3842f1696530b1c92">admin</nova:project>
---
> <nova:user uuid="12d13712253141ab845b89406507cd6c">admin</nova:user>
> <nova:project uuid="8b8fcaf183954f45b3eb15b27c52ec94">admin</nova:project>
21c21
< <nova:root type="image" uuid="f329117f-5da2-4d61-8341-89a969bf00e7"/>
---
> <nova:root type="image" uuid="fcdf7c26-4238-4594-b8ee-fd59f601fdcb"/>
34c34
< <type arch='aarch64' machine='virt-2.8'>hvm</type>
---
> <type arch='aarch64' machine='virt'>hvm</type>
58c58
< <source file='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/disk'/>
---
> <source file='/var/lib/nova/instances/358596e4-135d-461d-a514-84116440014c/disk'/>
61c61
< <source file='/var/lib/nova/instances/_base/035458feead7f83be80ed020442ebf815a4067ea'/>
---
> <source file='/var/lib/nova/instances/_base/ab7429e8558ea27fb54f70eb92fbd0c1c07dd595'/>
70c70
< <source file='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/disk.eph0'/>
---
> <source file='/var/lib/nova/instances/358596e4-135d-461d-a514-84116440014c/disk.eph0'/>
82a83,93
> <controller type='pci' index='1' model='dmi-to-pci-bridge'>
> <model name='i82801b11-bridge'/>
> <alias name='pci.1'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
> </controller>
> <controller type='pci' index='2' model='pci-bridge'>
> <model name='pci-bridge'/>
> <target chassisNr='2'/>
> <alias name='pci.2'/>
> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/>
> </controller>
84,86c95,97
< <mac address='fa:16:3e:a6:4b:d4'/>
< <source bridge='qbra7012530-32'/>
< <target dev='tapa7012530-32'/>
---
> <mac address='fa:16:3e:8e:fc:48'/>
> <source bridge='qbrb5d335fc-46'/>
> <target dev='tapb5d335fc-46'/>
92c103
< <source path='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/console.log'/>
---
> <source path='/var/lib/nova/instances/358596e4-135d-461d-a514-84116440014c/console.log'/>
95a107,111
> <serial type='pty'>
> <source path='/dev/pts/3'/>
> <target port='1'/>
> <alias name='serial1'/>
> </serial>
97c113
< <source path='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/consol...

$ cat arm-template.xml 
<domain type='kvm'>
        <os>
                <type arch='aarch64' machine='virt'>hvm</type>
                <loader readonly='yes' type='pflash'>/usr/share/AAVMF/AAVMF_CODE.fd</loader>
                <nvram template='/usr/share/AAVMF/AAVMF_CODE.fd'>/tmp/AAVMF_CODE.fd</nvram>
                <boot dev='hd'/>
        </os>
        <features>
                <acpi/>
                <apic/>
                <pae/>
        </features>
        <cpu mode='custom' match='exact'>
                <model fallback='allow'>host</model>
        </cpu>
        <devices>
                <interface type='network'>
                        <source network='default'/>
                        <model type='virtio'/>
                </interface>
                <serial type='pty'>
                        <source path='/dev/pts/3'/>
                        <target port='0'/>
                </serial>
        </devices>
</domain>

$ uvt-kvm create --template arm-template.xml --password=ubuntu artful-test4 release=artful arch=arm64 label=daily

The biggest and obviously most important difference is in the networking that is used:
    <interface type='network'>
      <mac address='52:54:00:b1:db:89'/>
      <source network='default' bridge='virbr0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>

While openstack generates a type="bridge" network where the bridge is not managed by libvirt (compared to the default net I use).
Never the less both setups create a bridge and tap the guest on it.
So this should functionally be equivalent other than the bridge setup right?

I tried to make a bridge type config matching to what I had before but close to yours:
    <interface type='bridge'>
      <mac address='52:54:00:b1:db:89'/>
      <source bridge='virbr0'/>
      <target dev='vnet0'/>   
      <model type='virtio'/>
      <alias name='net0'/>                
      <address type='virtio-mmio'/>
    </interface>

But this again works:
ubuntu@artful-test4:~$ journalctl -u systemd-networkd --no-pager
-- Logs begin at Wed 2017-10-11 11:12:10 UTC, end at Wed 2017-10-11 11:13:00 UTC. --
Oct 11 11:12:17 artful-test4 systemd[1]: Starting Network Service...
Oct 11 11:12:17 artful-test4 systemd-networkd[604]: Enumeration completed
Oct 11 11:12:17 artful-test4 systemd[1]: Started Network Service.
Oct 11 11:12:17 artful-test4 systemd-networkd[604]: enp1s0: IPv6 successfully enabled
Oct 11 11:12:17 artful-test4 systemd-networkd[604]: enp1s0: Gained carrier
Oct 11 11:12:18 artful-test4 systemd-networkd[604]: enp1s0: Gained IPv6LL
Oct 11 11:12:20 artful-test4 systemd-networkd[604]: enp1s0: DHCPv4 address 192.168.122.14/24 via 192.168.122.1
Oct 11 11:12:20 artful-test4 systemd-networkd[604]: Not connected to system bus, ignoring transient hostname.
Oct 11 11:12:30 artful-test4 systemd-networkd[604]: enp1s0: Configured

Dumping the XML to check if it might have rewritten a lot shows me that my config is good:
    <interface type='bridge'>
      <mac address='52:54:00:b1:db:89'/>
      <source bridge='virbr0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='virtio-mmio'/>
    </interface>

So TL;DR a config almost the same as yours works.
Yet the difference is that livbirt has set up 'virbr0' in my case and I'd assume openstack did create qbrb5d335fc-46 on its own.

As next step I'd recommend iterating your config around different bridge scenarios to find what breaks it.
Then if reasonable try to exclude openstack from the equation as I shown above (or not if you think the fix needs to be in openstack).

I hope this helped to get closer to the root casue, but I thought that a working example on the new Ocata stack might be the best start with.

Note this was done on:
libvirt-daemon-system          3.6.0-1ubuntu4
qemu-kvm                       1:2.10+dfsg-0ubuntu1

Changed in libvirt (Ubuntu):
status:	New → Incomplete
Changed in qemu (Ubuntu):
status:	New → Incomplete

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-11:

#18

Note - be careful if you mean upstream qemu/libvirt (as the bug was filed) or the packages in Ubuntu (I added tasks for these).
I try to spot both, but will be more on-track for the latter.

Revision history for this message

Sean Feole (sfeole) wrote on 2017-10-11:

#19

I spent some time today trying to modify the ocata xml templates produced in /etc/libvirt/qemu/<INSTANCE-ID>.xml for each of the generated instances. Destroying & Undefining the existing instance, making some changes and redefining the xml, however it appears that nova regenerates these templates upon instance reset, thus removing any changes made to the xml template. Is there any way to disable this feature that does not require some serious modifications to the nova underpinnings?

Revision history for this message

Andrew McLeod (admcleod) wrote on 2017-10-17:

#20

The problem is with virtio-mmio.

https://bugzilla.redhat.com/show_bug.cgi?id=1422413

Instances launched with virtio-mmio on aarch64 will not get DHCP (will not have a nic)

xml with libvirt 2.5.0

I have updated libvirt-daemon to 3.6.0 on a particular compute node - when an instance is booted now, the nic section of the virsh xml looks like this:

The instance then gets a NIC and is able to get DHCP and complete cloud-init successfully.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-18:

#21

So,
libvirt knew about some change and picks the right default if you do not specify it.
But if you (Openstack in this case) specify virtio-mmio as type, then it fails - is that correct?

The patch in the referred RH-BZ is already in the qemu we have in Artful (and thereby Ocata).
So if that is supposed to be the issue there has to be a new one after that fix.

Also as I've shown in c#17 (yes it is long sorry) - virtio-mmio works with the bridge that libvirt is usually creating - again my assumption is that it is somehow related to how this bridge is created (openvswitch in your case I assume).

So is the real error "network fails when using virtio-mmio on openvswitch set up by openstack"?
I have no OVS around to quickly try something around that atm.

Would it be reasonable to teach Openstack to not define virtio-mmio in this case?
Libvirt will make the right default (hopefully also when driven by Openstack which sets some force options), and just work then.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-18:

#22

Hi,
@admcleod as discussed on IRC I realized Ocata maps to Zesty so that is qemu 2.8.
Therefore the referred patch is released in Artful, but not in Zesty.

I tagged up the bug tasks and provide a fix in [1] to test.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2995

Changed in qemu (Ubuntu):
status:	Incomplete → Fix Released
no longer affects:	libvirt (Ubuntu Zesty)
Changed in qemu (Ubuntu Zesty):
status:	New → Incomplete
status:	Incomplete → In Progress

Revision history for this message

Sean Feole (sfeole) wrote on 2017-10-18:

#23

To further reinforce what Christian said, the Newton cloud-archive also uses virtio-mmio for its address type. (See my comment#15)

Newton XML:

but we have proven this works with Newton. Is it fair to say this could be attributed to a change in OVS, between Newton and Ocata?

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2017-10-18:

#24

This appears to be: https://bugzilla.redhat.com/show_bug.cgi?id=1422413

Changed in libvirt (Ubuntu):
status:	Incomplete → New
status:	New → Incomplete

Corey Bryant (corey.bryant) on 2017-10-18

no longer affects:	cloud-archive/pike
Changed in cloud-archive:
status:	New → Fix Released

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-19:

#25

Yeah, the offending patch in RH-BZ 1422413 appeared in qemu 2.8.
So it would make sense to work with newtwon (2.6.1), and pike (2.10), but fail on Ocata (2.8).

I checked the ppa, in general it seems to work for me, so I'm now waiting for your verification if that really addresses the issue you are facing.
If that would be the case I can pass it through regression tests and afterwards to SRU (which currently holds another update that has to clear before doing so)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-19:

#26

Since we wait for the zesty verification I'm setting the status to incomplete for now.

Changed in qemu (Ubuntu Zesty):
status:	In Progress → Incomplete
Changed in libvirt (Ubuntu):
status:	Incomplete → Invalid

Revision history for this message

Thomas Huth (th-huth) wrote on 2017-10-19:

#27

Since the fix mentioned in the previous comments is already in upstream QEMU, I'm setting the upstream status to "Fix released".

Changed in qemu:
status:	New → Fix Released

Revision history for this message

Andrew McLeod (admcleod) wrote on 2017-10-19:

#28

I've tested with the packages from the ppa:

https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2995

qemu:
  Installed: 1:2.8+dfsg-3ubuntu2.7~ppa5cloud
qemu-system-arm:
  Installed: 1:2.8+dfsg-3ubuntu2.7~ppa5cloud
qemu-system-aarch64:
  Installed: 1:2.8+dfsg-3ubuntu2.7~ppa5cloud

Rebooted the instance and it aquired an IP address and booted.

more info, virsh dumpxml excerpt:

Revision history for this message

Sean Feole (sfeole) wrote on 2017-10-19:

#29

will test these and report back shortly.

Revision history for this message

Sean Feole (sfeole) wrote on 2017-10-19:

#30

novaout.txt Edit (6.7 KiB, text/plain)

I've testing with the same packages listed in comment #28, Confirmed that this now works..

See attached log

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-20:

#31

Ok, driving that into an SRU then - thanks for verifying.

Christian Ehrhardt  (paelzer) on 2017-10-20

description:

updated

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-20:

#32

Regression tests on the ppa are good as well, we need to double check the auto-run on proposed then to ensure this doesn't behave differently on other arches.
I cleaned up the changelog and UCA backport noise and made a proper SRU for zesty.

Note: there is currently another SRU in flight (already in verified state for 2 days), so acceptance from zesty-unapproved likely has to wait a bit until the former one clears completely

Brian Murray (brian-murray) on 2017-10-26

Changed in qemu (Ubuntu Zesty):
status:	Incomplete → Fix Committed

Revision history for this message

Brian Murray (brian-murray) wrote on 2017-10-26:

#33

I accepted this but Launchpad timed out when the SRU testing comment was being added.

Christian Ehrhardt  (paelzer) on 2017-10-27

tags:

added: verification-needed verification-needed-zesty

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-27:

#34

Thanks for the FYI Brian.
I see it in pending-SRUs as it should be.
I added the -needed tags to be "correct".

@Andrew/Sean - could you test proposed and set verified then (assuming it works like the ppa did)

Odd - this really seems to hit everything, not only LP timeouts.
There are also dep8 regressions listed on systemd which make no sense in relation to the fix.
The test history on arm is borked since February [1], so I ask you to override and ignore that.
On s390x it at least works sometimes, but the log [2] doesn't look much better, but there at least it hits a different issue than before - I'm retriggering this for now - but likely ask to ignore that as well - but I'll take a look at the retry first.

[1]: http://autopkgtest.ubuntu.com/packages/s/systemd/zesty/armhf
[2]: http://autopkgtest.ubuntu.com/packages/s/systemd/zesty/s390x

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-27:

#35

As assumed s390x passed now (so a flaky test), and as outlined before armhf we just have to give up.
Looking at the history an override might be the right thing to do.

Other than that all looks good, waiting for the verify by Sean/Andrew now.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-10-27:

#36

Ok, Zesty actually had an override for the arm test (had to learn the placement of those for SRUs).
So only up to the verification now.

Revision history for this message

Corey Bryant (corey.bryant) wrote on 2017-10-31: Please test proposed package

#37

Hello Sean, or anyone else affected,

Accepted qemu into ocata-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

sudo add-apt-repository cloud-archive:ocata-proposed
sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ocata-needed to verification-ocata-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ocata-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags:

added: verification-ocata-needed

Revision history for this message

dann frazier (dannf) wrote on 2017-10-31:

#38

Thanks Christian - I've now verified this. I took a stepwise approach:

1) We originally observed this issue w/ the ocata cloud archive on xenial, so I redeployed that. I verified that I was still seeing the problem. I then created a PPA[*] w/ an arm64 build of QEMU from the ocata-staging PPA, which is a backport of the zesty-proposed package, and upgraded my nova-compute nodes to this version. I rebooted my test guests, and the problem was resolved.

2) I then updated my sources.list to point to zesty (w/ proposed enabled), and upgraded qemu-system-arm. This way I could test the actual build in zesty-proposed, as opposed to my backport. This continued to work.

3) Finally, I dist-upgraded this system from xenial to zesty - so that I'm actually testing the zesty build in a zesty environment, and rebooted. Still worked :)

[*] https://launchpad.net/~dannf/+archive/ubuntu/lp1719196

dann frazier (dannf) on 2017-11-01

tags:

added: verification-done-zesty
removed: verification-needed-zesty

Revision history for this message

Launchpad Janitor (janitor) wrote on 2017-11-02:

#39

This bug was fixed in the package qemu - 1:2.8+dfsg-3ubuntu2.7

---------------
qemu (1:2.8+dfsg-3ubuntu2.7) zesty; urgency=medium

  * d/p/ubuntu/virtio-Fix-no-interrupt-when-not-creating-msi-contro.patch:
    on Arm fix no interrupt when not creating msi controller. That fixes
    broken networking if running with virtio-mmio only (LP: #1719196).

-- Christian Ehrhardt <email address hidden> Wed, 18 Oct 2017 16:17:34 +0200

Changed in qemu (Ubuntu Zesty):
status:	Fix Committed → Fix Released

Revision history for this message

Brian Murray (brian-murray) wrote on 2017-11-02: Update Released

#40

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message

Corey Bryant (corey.bryant) wrote on 2017-12-11:

#41

Regression testing has passed successfully.

zesty-ocata-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1897.0150 sec.
- Passed: 93
- Skipped: 9
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 0
Sum of execute time for each test: 1011.5607 sec.

zesty-ocata-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 1933.5299 sec.
- Passed: 93
- Skipped: 9
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 0
Sum of execute time for each test: 963.0546 sec.

xenial-ocata-proposed with stable charms:

======
Totals
======
Ran: 102 tests in 1767.3787 sec.
- Passed: 93
- Skipped: 9
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 0
Sum of execute time for each test: 906.2188 sec.

xenial-ocata-proposed with dev charms:

======
Totals
======
Ran: 102 tests in 2051.1377 sec.
- Passed: 93
- Skipped: 9
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 0
Sum of execute time for each test: 998.9001 sec.