libvirt error libvirtError: Requested operation is not valid: nwfilter is in use hit in _post_live_migration

Bug #1438803 reported by Joe Gordon
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Matt Riedemann

Bug Description

http://logs.openstack.org/18/169318/1/check/check-tempest-dsvm-aiopcpu-full/0f015f3/logs/10.209.65.164-subnode/screen-n-cpu.txt.gz#_2015-03-31_14_41_02_677

2015-03-31 14:41:02.677 20527 WARNING nova.virt.libvirt.driver [-] [instance: bc1bfb78-d8b0-4fad-9a6c-aa9e5c228f8c] Error monitoring migration: Requested operation is not valid: nwfilter is in use
2015-03-31 14:41:02.678 20527 DEBUG nova.virt.libvirt.driver [-] [instance: bc1bfb78-d8b0-4fad-9a6c-aa9e5c228f8c] Live migration monitoring is all done _live_migration /opt/stack/new/nova/nova/virt/libvirt/driver.py:5653
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/poll.py", line 115, in wait
    listener.cb(fileno)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5646, in _live_migration
    dom, finish_event)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5576, in _live_migration_monitor
    migrate_data)
  File "/opt/stack/new/nova/nova/exception.py", line 88, in wrapped
    payload)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 85, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/new/nova/nova/exception.py", line 71, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 352, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 85, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 340, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 5281, in _post_live_migration
    network_info)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 6094, in unfilter_instance
    network_info=network_info)
  File "/opt/stack/new/nova/nova/virt/libvirt/firewall.py", line 343, in unfilter_instance
    self.nwfilter.unfilter_instance(instance, network_info)
  File "/opt/stack/new/nova/nova/virt/libvirt/firewall.py", line 274, in unfilter_instance
    _nw.undefine()
  File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 4753, in undefine
    if ret == -1: raise libvirtError ('virNWFilterUndefine() failed')
libvirtError: Requested operation is not valid: nwfilter is in use
Removing descriptor: 4

It looks like this error is causing live migration to fail

Tags: libvirt
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/169429

Joe Gordon (jogo)
tags: added: libvirt
Revision history for this message
Matt Riedemann (mriedem) wrote :

From the comment in the firewall driver code it's an expected situation it looks like:

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/firewall.py#L277

The libvirtError shouldn't be getting back up to the compute manager though, that should be translated to a NovaException before the virt driver sends it back up to the compute manager, but that's a side issue.

Maybe the unfilter_instance method just needs a retry loop?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Just checking logstash on the job failing, 88% of the time this is in the nova-network job:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwidGVtcGVzdC5hcGkuY29tcHV0ZS50ZXN0X2xpdmVfYmxvY2tfbWlncmF0aW9uLkxpdmVCbG9ja01pZ3JhdGlvblRlc3RKU09OLnRlc3RfbGl2ZV9ibG9ja19taWdyYXRpb25cIiBBTkQgbWVzc2FnZTpcIkZBSUxFRFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDI3ODIyNTYxNDc0LCJtb2RlIjoidHJlbmQiLCJhbmFseXplX2ZpZWxkIjoiYnVpbGRfbmFtZSJ9

I remember a while back (several months, maybe juno?) there was a bad ssh bug in the gate and sdague and dansmith added some logging and locking to the firewall code in nova, maybe there is a lock we should be using in nova.virt.libvirt.firewall.unfilter_instance?

Revision history for this message
Matt Riedemann (mriedem) wrote :

This was around the time that dansmith was added stuff in this flow: https://review.openstack.org/#/c/104325/

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is when we refresh the network info in the compute manager _post_live_migration method:

http://logs.openstack.org/18/169318/1/check/check-tempest-dsvm-aiopcpu-full/0f015f3/logs/10.209.65.164-subnode/screen-n-cpu.txt.gz#_2015-03-31_14_41_02_446

Then there are some iptables changes..

Then less than a second later we blow up on that nwfilter.undefine() in-use failure:

http://logs.openstack.org/18/169318/1/check/check-tempest-dsvm-aiopcpu-full/0f015f3/logs/10.209.65.164-subnode/screen-n-cpu.txt.gz#_2015-03-31_14_41_02_677

I noticed in the n-net logs that 6 seconds later the fixed_ip that the instances are using during the migration is leased for dhcp:

http://logs.openstack.org/18/169318/1/check/check-tempest-dsvm-aiopcpu-full/0f015f3/logs/10.209.65.164-subnode/screen-n-net.txt.gz#_2015-03-31_14_41_08_169

I'm not familiar enough with the networking flow to know if that's a problem or not.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Another question here is, should we rollback the live migration when we fail here in _post_live_migration? As in, should the driver catch the libvirtError, call recover_method, and then re-raise the post-live migration error?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/169520

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

We're not going to mark this as an rc blocker for kilo since we just recently got the aiocpu job running so we're shaking out bugs. This isn't a regression and it might be due to the older versions of libvirt/qemu we're running with in the gate, so we'll mark this high but no an rc blocker - we can backport a fix to stable/kilo if we get a workaround working later.

Changed in nova:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/169429
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=42c6438ee3639309f70f22aa89856ccdcfd986c6
Submitter: Jenkins
Branch: master

commit 42c6438ee3639309f70f22aa89856ccdcfd986c6
Author: Joe Gordon <email address hidden>
Date: Tue Mar 31 10:06:59 2015 -0700

    Stacktrace on live migration monitoring

    If we catch a error during the live migration monitoring we should
    stacktrace since this is an unexpected error.

    Just add exc_info in order to not change the translated string since we
    are in a string freeze.

    Change-Id: I75dacb49b1f2881b6030e0e1ac3e577debc4942c
    Related-Bug: #1438803

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/169520
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=20a95915c96643c8a07aad984989d104a3d2a696
Submitter: Jenkins
Branch: master

commit 20a95915c96643c8a07aad984989d104a3d2a696
Author: Matt Riedemann <email address hidden>
Date: Tue Mar 31 14:38:47 2015 -0700

    libvirt: retry to undefine network filters during _post_live_migration

    Sometimes post live migration fails because libvirt raises an error
    saying the network filter is still in use.

    Use the live_migration_retry_count config option (like in
    pre-live-migrate) to retry the operation until it's successful or we
    timeout.

    Also adds some debug logging to _post_live_migration in the compute
    manager before calling driver.unfilter_instance and driver.cleanup,
    which calls unfilter_instance, so that when we hit this we can see
    which path we're coming from.

    Closes-Bug: #1438803

    Change-Id: Idffbe2857fbb23fafab1591dea82f5d64edac4bc

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-1 → 12.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.