libvirt-xen: Race between nova and a Xen script for updating the iptables

Bug #1461642 reported by Anthony PERARD on 2015-06-03
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Unassigned

Bug Description

This is with nova-network.

When we create an instance, libxl (used by libvirt) calls a script to set up the vif, add it to the bridge, and update the iptables. Sometimes the iptables call in the script fails with exit status 4, and this results in an instance creation failure. (Nova only reports: "libvirtError: internal error: libxenlight failed to create new domain")

The script is:
/etc/xen/scripts/vif-bridge
(or xen.git/tools/hotplug/Linux/vif-bridge)

One way of fixing this would be to have libxl call a different script provided by OpenStack, which could take a lock.
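A minimal sketch of that locking idea, assuming a lock file path both sides agree on (the path and the idea of an OpenStack-provided script are hypothetical here, not anything nova or Xen actually ship):

```python
import fcntl
from contextlib import contextmanager

# Hypothetical lock path: the replacement vif script and nova-network's
# iptables code would both have to take this same lock for it to help.
LOCK_PATH = '/tmp/nova-iptables.lock'

@contextmanager
def iptables_lock(path=LOCK_PATH):
    """Hold an exclusive flock for the duration of an iptables update,
    serializing the vif hotplug script against nova. Without this, two
    concurrent iptables invocations can race and one exits with status 4."""
    with open(path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# usage sketch:
# with iptables_lock():
#     subprocess.check_call(['iptables', '-A', 'FORWARD', ...])
```

The lock is advisory, so it only works if every writer of the relevant iptables chains goes through it.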

Anthony PERARD (anthony-perard) wrote :

One can work around this issue with this patch:
https://marc.info/?l=xen-devel&m=143317087603573

Bob Ball (bob-ball) wrote :

Confirmed as seen several times in the libvirt+xen CI, e.g. http://d7013eaae7e632dff837-028d11a4a642ead4d20755bd13d99a1b.r55.cf5.rackcdn.com/31/189731/1/check/dsvm-tempest-xen/f59dee5/logs/xen/index.html

Medium, as it's a race condition which will affect any nova-network + libvirt+xen deployment and needs a fix to be scheduled.

This workaround was not deemed suitable by Xen, as it doesn't solve the conceptual issue of both Xen and OpenStack trying to update iptables at the same time. Hopefully the concurrent updating isn't needed; but if it is, an OpenStack script will likely be needed, passed to libxl as a parameter, to ensure the correct updates are made and OpenStack remains in control of the networking.

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Matt Riedemann (mriedem) wrote :

The Xen project CI fails pretty regularly with this, which is annoying since we don't have something like elastic-recheck on third-party CI to tell us what the failure is.

Changed in nova:
importance: Medium → High
tags: added: xen-ci
Changed in nova:
assignee: nobody → Anthony PERARD (anthony-perard)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/199092
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cd1766287862162aadf1c111a4807f7618f34578
Submitter: Jenkins
Branch: master

commit cd1766287862162aadf1c111a4807f7618f34578
Author: Anthony PERARD <email address hidden>
Date: Mon Jul 6 17:47:17 2015 +0100

    libvirt-vif: Allow to configure a script on bridge interface

    While running with the libvirt-xen driver, it is possible to have the Xen
    toolstack run a different script than the default on a vif. This patch
    allows Nova to change this script.

    Also, do not set script to the empty string '' in designer.py for a Linux
    bridge. The empty string for script does not appear to be used anywhere in
    the libvirt code when the vif is a bridge.

    Change-Id: Ib6d6542d22decccfa68a058d362a42d60e6c2cca
    Partial-Bug: #1461642
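With the change above, a deployment can tell libvirt to run a custom vif script; the generated interface XML would then carry a script element along these lines (the script path and bridge name are hypothetical examples, not something nova ships):

```xml
<interface type='bridge'>
  <source bridge='br100'/>
  <script path='/etc/xen/scripts/vif-openstack'/>
</interface>
```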

Russell Bryant (russellb) wrote :

This merged patch caused some breakage for me ... see http://paste.openstack.org/show/467063/. Reverting it fixes it for me.

Anthony PERARD (anthony-perard) wrote :

Hi Russell,

In your paste, you have:

    <interface type='bridge'>
      <mac address='fa:16:3e:76:d4:40'/>
      <source bridge='br-int'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='9942b499-62e9-4cc2-ad9a-9ecf93b9ada7'/>
      </virtualport>
      <script path=''/>
      <target dev='tap9942b499-62'/>
      <model type='virtio'/>
      <driver name='qemu'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

The problem is the script node set to an empty string, but I don't understand where this empty string could come from. I removed one in nova/virt/libvirt/designer.py.

Do you have something in your environment, maybe a patch on Nova, that would set script?

Prinika (nairprinikasankaran) wrote :

I am seeing the same issue. I have libvirt version 1.2.2.
Do we need a libvirt version higher than 1.2.2 for this change to work?

2015-09-18 01:30:29.723 ERROR nova.compute.manager [req-c5b00bd7-943b-44cd-847b-064286501d6a admin admin] [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] Instance failed to spawn
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] Traceback (most recent call last):
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] File "/opt/stack/nova/nova/compute/manager.py", line 2152, in _build_resources
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] yield resources
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] File "/opt/stack/nova/nova/compute/manager.py", line 2006, in _build_and_run_instance
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] block_device_info=block_device_info)
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2451, in spawn
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] block_device_info=block_device_info)
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 4522, in _create_domain_and_network
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] xml, pause=pause, power_on=power_on)
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 4452, in _create_domain
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] guest.launch(pause=pause)
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 141, in launch
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] self._encoded_xml, errors='ignore')
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] six.reraise(self.type_, self.value, self.tb)
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 136, in launch
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] return self._domain.createWithFlags(flags)
2015-09-18 01:30:29.723 TRACE nova.compute.manager [instance: 6d9297cc-d97f-40f9-a504-a17b79d279a5] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183,...


Prinika (nairprinikasankaran) wrote :

That appears to be the issue; I did not see that one. Thanks, and sorry.

Anyway, there is already a patch to fix this:
https://review.openstack.org/#/c/225585/2

Russell Bryant (russellb) wrote :

Yeah, that fix makes sense. In my case I'm in an env doing direct ovs plugging.

Matt Riedemann (mriedem) wrote :

The XenProject CI still has a pretty high failure rate on this bug; what else needs to be done here?

Matt Riedemann (mriedem) wrote :

If the problem is a latent bug in older libvirt, is there a way we can work around it? Can we catch and detect the error and retry? Or could the xenproject CI set network_allocate_retries>1 so the compute manager would retry on failure?
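A catch-and-retry of the kind suggested here could be sketched as follows (this is not nova's actual code, and matching on the libxl error-message string is an assumption and inherently fragile):

```python
import time

def retry_on_race(fn, is_transient, attempts=3, delay=1.0):
    """Call fn(), retrying up to `attempts` times when is_transient(exc)
    classifies the failure as the transient iptables race. Any other
    exception, or exhausting the attempts, re-raises."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not is_transient(exc) or attempt == attempts:
                raise
            time.sleep(delay)

# e.g. wrapping the domain launch, treating the libxl message as transient:
# retry_on_race(
#     lambda: domain.createWithFlags(flags),
#     lambda e: 'libxenlight failed to create new domain' in str(e))
```

The retry only papers over the race; the underlying double-writer problem in iptables would remain.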

Matt Riedemann (mriedem) wrote :

I guess network_allocate_retries won't help since that's not in the driver spawn code path that's failing:

2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [req-2b591b2c-36f0-494b-a5b2-06dc2a783c49 tempest-MultipleCreateTestJSON-64962745 tempest-MultipleCreateTestJSON-1348265215] [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] Instance failed to spawn
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] Traceback (most recent call last):
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] File "/opt/stack/new/nova/nova/compute/manager.py", line 2193, in _build_resources
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] yield resources
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] File "/opt/stack/new/nova/nova/compute/manager.py", line 2039, in _build_and_run_instance
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] block_device_info=block_device_info)
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2767, in spawn
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] block_device_info=block_device_info)
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4906, in _create_domain_and_network
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] xml, pause=pause, power_on=power_on)
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4837, in _create_domain
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] guest.launch(pause=pause)
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 142, in launch
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] self._encoded_xml, errors='ignore')
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] self.force_reraise()
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-03-04 07:17:13.370 8173 ERROR nova.compute.manager [instance: eb3ec27f-53c0-4ce7-a6ad-58f9bb37b25d] six.reraise(self.type_, self.value, self.tb)
2016-03-04 07:17:13...


Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/199093
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/201257
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Sean Dague (sdague) wrote :

Patch is stalled. Also, as it's nova-net only, and that's going away, is this relevant to neutron? If not, we should just let it die off.

Changed in nova:
status: In Progress → Incomplete
importance: High → Medium
assignee: Anthony PERARD (anthony-perard) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired