Evacuate test intermittently fails with network-vif-plugged timeout exception

Bug #1813789 reported by Matt Riedemann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Lee Yarwood

Bug Description

The nova-live-migration job has 2 nodes and in addition to running live migration tests it also runs evacuate tests for image-backed and volume-backed servers.

Seeing intermittent failures, with debug in this change:

https://review.openstack.org/#/c/571325/

It looks like the network-vif-plugged event is coming to the API before nova-compute has registered a callback for that event, so nova-compute does not process the event and then times out waiting for the event to complete by the time it registered.

The API processes the network-vif-plugged for that server here:

http://logs.openstack.org/25/571325/10/gate/nova-live-migration/52e1cd0/logs/screen-n-api.txt.gz#_Jan_29_01_11_49_707004

Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: DEBUG nova.api.openstack.wsgi [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed", "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]} {{(pid=22322) _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}

Jan 29 01:11:49.765687 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: INFO nova.api.openstack.compute.server_external_events [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Creating event network-vif-plugged:e241f79f-fb0d-4961-b0c8-aea9de2755bf for instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 on ubuntu-xenial-rax-ord-0002201758
Jan 29 01:11:49.776578 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: DEBUG nova.compute.api [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 is migrating, copying events to all relevant hosts: set([u'ubuntu-xenial-rax-ord-0002201758', u'ubuntu-xenial-rax-ord-0002201755']) {{(pid=22322) _get_relevant_hosts /opt/stack/new/nova/nova/compute/api.py:4818}}
Jan 29 01:11:49.786947 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: INFO nova.api.openstack.requestlog [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] 10.210.224.40 "POST /compute/v2.1/os-server-external-events" status: 200 len: 183 microversion: 2.1 time: 0.084393

Which is weird because that's before the vif plugging timeout in the compute logs. From the compute logs, we plugged the vif here:

Jan 29 01:11:55.571807 ubuntu-xenial-rax-ord-0002201755 nova-compute[15903]: INFO os_vif [None req-252304a1-6eff-4ff3-aa4d-b4e0ab87601c demo admin] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:66:03:76,bridge_name='br-int',has_traffic_filtering=True,id=e241f79f-fb0d-4961-b0c8-aea9de2755bf,network=Network(2812d9e2-bfa8-4397-9a81-5dab9fe5a03e),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tape241f79f-fb')

So it looks like maybe we're getting the network-vif-plugged event before we're ready for it and so we miss the event and then timeout? That's pretty weird though because the libvirt driver's spawn method is what should be registering and waiting for the vif plugged event:

https://github.com/openstack/nova/blob/c134feda3d9527dbc9735e4ae9cd35c4782f1fb4/nova/virt/libvirt/driver.py#L5673

Looks like the compute does register the event callback for network-vif-plugged here:

Jan 29 01:11:55.567189 ubuntu-xenial-rax-ord-0002201755 nova-compute[15903]: DEBUG nova.compute.manager [None req-252304a1-6eff-4ff3-aa4d-b4e0ab87601c demo admin] [instance: 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67] Preparing to wait for external event network-vif-plugged-e241f79f-fb0d-4961-b0c8-aea9de2755bf {{(pid=15903) prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:325}}

Which is too late:

Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: DEBUG nova.api.openstack.wsgi [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed", "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]} {{(pid=22322) _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timeout%20waiting%20for%20%5B('network-vif-plugged'%5C%22%20AND%20message%3A%5C%22for%20instance%20with%20vm_state%20error%20and%20task_state%20rebuild_spawning.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

4 hits in the last 7 days, check and gate, all failures.

Revision history for this message
Artom Lifshitz (notartom) wrote :

We're hitting this in our internal CI, but for revert_resize instead of evacuate [1]. Same root cause though: by the time we've started listening to the event, it's already been received and discarded.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1678681

Revision history for this message
Matt Riedemann (mriedem) wrote :

What's probably happening is we're getting the network-vif-plugged event due to the port binding change here:

https://github.com/openstack/nova/blob/a4e6340bdcc2fda8fed41abc64ea554d9a8ab713/nova/compute/manager.py#L3212

That happens before we spawn the instance which plugs VIFs and would also register it's own network-vif-plugged callback code:

https://github.com/openstack/nova/blob/a4e6340bdcc2fda8fed41abc64ea554d9a8ab713/nova/compute/manager.py#L3277

The same issue exists in finish_revert_resize:

https://github.com/openstack/nova/blob/a4e6340bdcc2fda8fed41abc64ea554d9a8ab713/nova/compute/manager.py#L4089

https://github.com/openstack/nova/blob/a4e6340bdcc2fda8fed41abc64ea554d9a8ab713/nova/compute/manager.py#L4108

Which was reported in a downstream Red Hat BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1678681

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/639396

Changed in nova:
assignee: nobody → Artom Lifshitz (notartom)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/644881

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: master
Review: https://review.opendev.org/639396
Reason: Abandoned, see commit message of [1] for explanation. Looks like we still need to wait in the virt driver, just not in all cases. Patch on top will be modified.

[1] https://review.opendev.org/#/c/660782/1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: master
Review: https://review.opendev.org/639396

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/713342

Changed in nova:
assignee: Artom Lifshitz (notartom) → Lee Yarwood (lyarwood)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/713674
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b097959c1cbc9af1d90c7502286bc3e20972201f
Submitter: Zuul
Branch: master

commit b097959c1cbc9af1d90c7502286bc3e20972201f
Author: Lee Yarwood <email address hidden>
Date: Wed Mar 18 15:17:27 2020 +0000

    nova-live-migration: Ensure subnode is fenced during evacuation testing

    As stated in the forced-down API [1]:

    > Setting a service forced down without completely fencing it will
    > likely result in the corruption of VMs on that host.

    Previously only the libvirtd service was stopped on the subnode prior to
    calling this API, allowing n-cpu, q-agt and the underlying guest domains
    to continue running on the host.

    This change now ensures all devstack services are stopped on the subnode
    and all active domains destroyed.

    It is hoped that this will resolve bug #1813789 where evacuations have
    timed out due to VIF plugging issues on the new destination host.

    [1] https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down

    Related-Bug: #1813789
    Change-Id: I8af2ad741ca08c3d88efb9aa817c4d1470491a23

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/713961

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/713962

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/713961
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=91d410b92f86524dffb781c7a49ecb3ec8ac8a56
Submitter: Zuul
Branch: stable/train

commit 91d410b92f86524dffb781c7a49ecb3ec8ac8a56
Author: Lee Yarwood <email address hidden>
Date: Wed Mar 18 15:17:27 2020 +0000

    nova-live-migration: Ensure subnode is fenced during evacuation testing

    As stated in the forced-down API [1]:

    > Setting a service forced down without completely fencing it will
    > likely result in the corruption of VMs on that host.

    Previously only the libvirtd service was stopped on the subnode prior to
    calling this API, allowing n-cpu, q-agt and the underlying guest domains
    to continue running on the host.

    This change now ensures all devstack services are stopped on the subnode
    and all active domains destroyed.

    It is hoped that this will resolve bug #1813789 where evacuations have
    timed out due to VIF plugging issues on the new destination host.

    [1] https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down

    NOTE(lyarwood): The following change is squashed here to allow both to
    pass the gate without encoutering additional failures.

    nova-live-migration: Only stop n-cpu and q-agt during evacuation testing

    I8af2ad741ca08c3d88efb9aa817c4d1470491a23 started to correctly fence the
    subnode ahead of evacuation testing but missed that c-vol and g-api
    where also running on the host. As a result the BFV evacuation test will
    fail if the volume being used is created on the c-vol backend hosted on
    the subnode.

    This change now avoids this by limiting the services stopped ahead of
    the evacuation on the subnode to n-cpu and q-agt.

    Change-Id: Ia7c317e373e4037495d379d06eda19a71412d409
    Closes-Bug: #1868234
    (cherry picked from commit 1e16b3184d4e298c454ede7c56040f6d70276a0c)

    Related-Bug: #1813789
    Change-Id: I8af2ad741ca08c3d88efb9aa817c4d1470491a23
    (cherry picked from commit b097959c1cbc9af1d90c7502286bc3e20972201f)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/713962
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e7cd8b49e21852b5fafec7215d40e758cde92a35
Submitter: Zuul
Branch: stable/stein

commit e7cd8b49e21852b5fafec7215d40e758cde92a35
Author: Lee Yarwood <email address hidden>
Date: Wed Mar 18 15:17:27 2020 +0000

    nova-live-migration: Ensure subnode is fenced during evacuation testing

    As stated in the forced-down API [1]:

    > Setting a service forced down without completely fencing it will
    > likely result in the corruption of VMs on that host.

    Previously only the libvirtd service was stopped on the subnode prior to
    calling this API, allowing n-cpu, q-agt and the underlying guest domains
    to continue running on the host.

    This change now ensures all devstack services are stopped on the subnode
    and all active domains destroyed.

    It is hoped that this will resolve bug #1813789 where evacuations have
    timed out due to VIF plugging issues on the new destination host.

    [1] https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down

    NOTE(lyarwood): The following change is squashed here to allow both to
    pass the gate without encoutering additional failures.

    nova-live-migration: Only stop n-cpu and q-agt during evacuation testing

    I8af2ad741ca08c3d88efb9aa817c4d1470491a23 started to correctly fence the
    subnode ahead of evacuation testing but missed that c-vol and g-api
    where also running on the host. As a result the BFV evacuation test will
    fail if the volume being used is created on the c-vol backend hosted on
    the subnode.

    This change now avoids this by limiting the services stopped ahead of
    the evacuation on the subnode to n-cpu and q-agt.

    Change-Id: Ia7c317e373e4037495d379d06eda19a71412d409
    Closes-Bug: #1868234
    (cherry picked from commit 1e16b3184d4e298c454ede7c56040f6d70276a0c)

    Related-Bug: #1813789
    Change-Id: I8af2ad741ca08c3d88efb9aa817c4d1470491a23
    (cherry picked from commit b097959c1cbc9af1d90c7502286bc3e20972201f)
    (cherry picked from commit 91d410b92f86524dffb781c7a49ecb3ec8ac8a56)

tags: added: in-stable-stein
Lee Yarwood (lyarwood)
Changed in nova:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.