Timeout waiting for vif plugging callback for instance

Bug #1333654 reported by Salvatore Orlando
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
OpenStack Compute (nova) | Fix Released | Undecided | Salvatore Orlando |
Icehouse | Fix Released | Undecided | Unassigned |

Bug Description

The neutron full job is exhibiting a rather high number of cases where network-vif-plugged timeouts are reported.
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOiBcIlRpbWVvdXQgd2FpdGluZyBmb3IgdmlmIHBsdWdnaW5nIGNhbGxiYWNrIGZvciBpbnN0YW5jZVwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDAzNjA5MTk0NDg4LCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

95.78% of messages of this kind appear in the neutron full job. However, only a fraction of them cause build failures, and that is only because of the way the tests are executed.
This error is currently being masked by another bug as tempest tries to get the console log of a VM in error state: https://bugs.launchpad.net/tempest/+bug/1332414

This bug will target both neutron and nova pending a better triage.
Fixing this is of paramount importance to get the full job running.

Note: This is different from https://bugs.launchpad.net/nova/+bug/1321872 and https://bugs.launchpad.net/nova/+bug/1329546

Changed in neutron:
importance: Undecided → High
assignee: nobody → Salvatore Orlando (salvatore-orlando)
milestone: none → juno-2
Salvatore Orlando (salvatore-orlando) wrote :

It seems the root cause of this bug is a missing check in the server_external_events extension.
The server_external_events extension handles notifications from Neutron for VIF plug events.
When a VIF is plugged, a network-vif-plugged event is sent; when a VIF is unplugged neutron sends a network-vif-unplugged.
In order to optimize communication between these two services, Neutron packs events together when possible.
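As an illustration of what "packing events together" means, here is a minimal sketch of a single request body carrying two events. The field names (name, server_uuid, tag, status) follow the external events API as commonly documented; the helper functions, UUIDs and port ids are made up for the example and are not taken from the Neutron code.

```python
import json


def build_event(name, server_uuid, port_id, status="completed"):
    """Build one event entry; 'tag' carries the port id (illustrative values)."""
    return {
        "name": name,
        "server_uuid": server_uuid,
        "tag": port_id,
        "status": status,
    }


def pack_events(events):
    """Pack several events into one request body, so one call replaces many."""
    return {"events": list(events)}


# Hypothetical identifiers: one unplug for a shelved instance, one plug for a
# booting instance, both travelling in the same request.
batch = pack_events([
    build_event("network-vif-unplugged", "uuid-shelved-instance", "port-a"),
    build_event("network-vif-plugged", "uuid-booting-instance", "port-b"),
])
print(json.dumps(batch, indent=2))
```

The point of the sketch is only that events for unrelated instances can share one request, which is why a failure on one of them can affect the other.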

On the other hand, when multiple events are processed on the nova side, if processing fails for one of those events for any reason, none of the remaining events are processed.

In the specific case of this bug, we always observe the network-vif-plugged event being correctly sent to nova. However, this event is not processed because the same request also contained a network-vif-unplugged event whose processing failed. The traceback in nova looks like this [1].

The failure happens because the server-external-events controller tries to dispatch this event to a compute node; unfortunately, the instance for which the VIF is being unplugged does not have a host. This apparently odd condition can nevertheless occur in a few cases. Here, it is being triggered by a shelve action [2].

In this case the network-vif-unplugged event should not be processed, simply because there is nothing to do. In general, when several events are packed into a single call to server-external-events, a failure in processing one event should not stop the processing of the others. This API extension is not supposed to have all-or-none behaviour, so it is fine to proceed in case of failures.
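The behaviour argued for above can be sketched as a loop that isolates each event's failure. This is a simplified illustration, not the nova controller code: dispatch_all, fake_dispatch and the "host" key on the event dicts are hypothetical names introduced for the example.

```python
import logging

LOG = logging.getLogger(__name__)


def dispatch_all(events, dispatch):
    """Dispatch every event; log a failure and move on to the next event."""
    results = []
    for event in events:
        try:
            dispatch(event)
            results.append((event, True))
        except Exception:
            # One bad event must not prevent processing of the others.
            LOG.exception("Failed to dispatch event %s", event.get("name"))
            results.append((event, False))
    return results


def fake_dispatch(event):
    """Stand-in for the real dispatch: fails when the instance has no host."""
    if event["host"] is None:
        raise ValueError("instance has no host")


events = [
    {"name": "network-vif-unplugged", "host": None},
    {"name": "network-vif-plugged", "host": "compute-1"},
]
outcome = [ok for _, ok in dispatch_all(events, fake_dispatch)]
```

With the failing unplug event placed first, the plug event that follows it is still dispatched, which is the property the current code lacks.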

Q: So what are you going to do?
A: log an exception in case of failure while dispatching an event and move on to the next event.
Q: Why are you doing this?
A: To prevent an error occurring on one instance from affecting the correct spawning of another instance. Also to remove the number one cause of failure of the neutron full job.
Q: Man, you're ignoring an exception. This is bad. In some communities you might be hanged for this!
A: If you look at the current code there's no handling of any exception whatsoever - and neutron does not care whether it receives a 200 or a 500, so in my opinion exceptions are already ignored.
Q: Yeah but it seems there might be some design flaw here, and you should address that rather than applying some tape or sweeping the dust under the carpet.
A: I agree with the first part of the statement, not really with the latter. Neutron packs events together only to minimize the number of calls to nova. It is a bug if a failure in one event prevents processing of the others. It is however fair to say that neutron should perhaps be informed of which events were processed successfully and which were not.
Q: How do I know you're not lying? Show me a logstash query!
A: This is not easy. It's not easy to build a query to find situations where more than one event is packed together and then one of those events fails. The failure itself is rather common - and in most cases it does not cause a build failure [3]. However, looking only at failed builds where this exception appears, it's very likely to find exactly this failure mode [4].
Q: I've seen it happen only with the full job. Why is that?
A:...


Salvatore Orlando (salvatore-orlando) wrote :

At the end of the day, the exception was raised even before the call was dispatched to the RPC layer, so the bug should be fixable in an even easier way - simply by not processing events for instances without a host.

Looking at how the notification process works, it seems it won't do any harm to skip them.
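A minimal sketch of that simpler approach: check for a host up front and set aside events whose instance has none, so nothing is ever dispatched for them. The function name and the host_for_instance mapping are hypothetical stand-ins for the real instance lookup.

```python
def split_by_host(events, host_for_instance):
    """Separate events that can be dispatched from those with no host."""
    accepted, skipped = [], []
    for event in events:
        # A missing or empty host means there is no compute node to notify.
        if host_for_instance.get(event["server_uuid"]):
            accepted.append(event)
        else:
            skipped.append(event)
    return accepted, skipped


# Hypothetical data: the shelved instance has no host, the other one does.
hosts = {"uuid-shelved": None, "uuid-booting": "compute-1"}
events = [
    {"name": "network-vif-unplugged", "server_uuid": "uuid-shelved"},
    {"name": "network-vif-plugged", "server_uuid": "uuid-booting"},
]
accepted, skipped = split_by_host(events, hosts)
```

Only the events in accepted would go on to the RPC layer; the skipped ones never reach the code path that raised the exception.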

Salvatore Orlando (salvatore-orlando) wrote :

Addressed by: https://review.openstack.org/#/c/103865/

I'm not sure if I'm missing something in the commit message, but it was not automatically added.

no longer affects: neutron
Changed in nova:
status: New → In Progress
assignee: nobody → Salvatore Orlando (salvatore-orlando)
Dan Smith (danms)
tags: added: icehouse-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/103865
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4f8ccd7b95c27180a1cfe689e3c6f46bde5f803b
Submitter: Jenkins
Branch: master

commit 4f8ccd7b95c27180a1cfe689e3c6f46bde5f803b
Author: Salvatore Orlando <email address hidden>
Date: Mon Jun 30 16:29:32 2014 -0700

    Do not process events for instances without host

    In some cases Neutron might send events such as 'VIF unplugged'
    for instances which are either being deleted or shelved. When
    that happens there will be a failure in dispatching the event
    to the appropriate compute node - as there is no host for the
    instance.

    As multiple neutron events can be stashed in a single call
    it is important to avoid that this kind of errors will prevent
    processing of other events in the same call.

    This patch does not process events for instances without a host,
    marking them as failed.

    When the above condition occurs, the create event request will
    return a 207 response code. For specific events, a 422
    unprocessable entity code will be set.

    This patch also preserve the characteristic that events are
    returned in the response in the same order they were found in
    the request.

    Change-Id: I18062b81e50c722ec96b4296ac39384493683ae3
    Closes-Bug: #1333654
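
The response behaviour described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged patch: build_response and host_for_instance are hypothetical names, and the 200 code used for events that succeed (and for a fully successful request) is an assumption, since the commit message only names the 207 and 422 codes.

```python
def build_response(events, host_for_instance):
    """Return (status_code, body), keeping events in request order."""
    response_events = []
    any_failed = False
    for event in events:
        result = dict(event)  # copy so the request data is left untouched
        if host_for_instance.get(event["server_uuid"]):
            result["code"] = 200  # assumed success code
        else:
            # No host: do not process the event, mark it as failed.
            result["code"] = 422
            result["status"] = "failed"
            any_failed = True
        response_events.append(result)
    # 207 signals that per-event codes differ from plain success.
    status_code = 207 if any_failed else 200
    return status_code, {"events": response_events}


hosts = {"uuid-shelved": None, "uuid-booting": "compute-1"}
request_events = [
    {"name": "network-vif-unplugged", "server_uuid": "uuid-shelved"},
    {"name": "network-vif-plugged", "server_uuid": "uuid-booting"},
]
status_code, body = build_response(request_events, hosts)
```

A caller that wants to know which events were handled can then inspect the per-event code instead of relying on the overall status alone.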

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/108806

Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/109798

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/109798
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=811cab7f3d2cfd4488dd29df59a95df6e1f85a06
Submitter: Jenkins
Branch: stable/icehouse

commit 811cab7f3d2cfd4488dd29df59a95df6e1f85a06
Author: Salvatore Orlando <email address hidden>
Date: Mon Jun 30 16:29:32 2014 -0700

    Do not process events for instances without host

    In some cases Neutron might send events such as 'VIF unplugged'
    for instances which are either being deleted or shelved. When
    that happens there will be a failure in dispatching the event
    to the appropriate compute node - as there is no host for the
    instance.

    As multiple neutron events can be stashed in a single call
    it is important to avoid that this kind of errors will prevent
    processing of other events in the same call.

    This patch does not process events for instances without a host,
    marking them as failed.

    When the above condition occurs, the create event request will
    return a 207 response code. For specific events, a 422
    unprocessable entity code will be set.

    This patch also preserve the characteristic that events are
    returned in the response in the same order they were found in
    the request.

    Change-Id: I18062b81e50c722ec96b4296ac39384493683ae3
    Closes-Bug: #1333654
    (cherry picked from commit 4f8ccd7b95c27180a1cfe689e3c6f46bde5f803b)

tags: added: in-stable-icehouse
Chuck Short (zulcss)
tags: removed: icehouse-backport-potential
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-2 → 2014.2
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/108806
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
