evacuate test fails due to timeout waiting for evacuate to complete

Bug #1806925 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Matt Riedemann

Bug Description

In the post-test hook in the nova-live-migration job where we test evacuate, we're doing the following:

1. create an image-backed server and a volume-backed server on the subnode
2. stop libvirtd on the local node
3. run evacuate to see it fail because nova-compute is disabled on the local node
4. restart libvirtd, wait for the local nova-compute service to be enabled, and then evacuate each server

In this failure, the evacuate times out because libvirtd is still unavailable on the local node after we started the evacuate:

http://logs.openstack.org/54/620154/1/gate/nova-live-migration/f040b76/logs/devstack-gate-post_test_hook.txt.gz#_2018-12-05_10_05_50_130

2018-12-05 10:05:50.130 | + /opt/stack/new/nova/gate/test_evacuate.sh:evacuate_and_wait_for_active:114 : nova evacuate evacuate-test

nova-compute on the local host is back up here:

Dec 05 10:05:49.341595 ubuntu-xenial-ovh-bhs1-0000944602 nova-compute[16115]: INFO nova.virt.libvirt.driver [None req-e14feea2-2abc-43cc-b51f-f416f9dd5692 None None] Connection event '1' reason 'None'

The evacuate starts here:

http://logs.openstack.org/54/620154/1/gate/nova-live-migration/f040b76/logs/screen-n-cpu.txt.gz#_Dec_05_10_05_54_156579

Dec 05 10:05:54.156579 ubuntu-xenial-ovh-bhs1-0000944602 nova-compute[16115]: INFO nova.compute.manager [None req-c2f2a1d3-527f-4885-8e4f-e82003a6d472 demo admin] [instance: 19ef59e3-de5a-42b2-b0aa-d069702deedf] Evacuating instance

After that I don't see any failures, but the evacuation doesn't complete within the 30-second timeout. Maybe the timeout isn't long enough?
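
To make the timeout behavior concrete, here is a minimal Python sketch of a poll-until-active loop with an overall deadline. It is not the code from test_evacuate.sh (that is a shell script); `wait_for_status`, its parameters, and the "REBUILD" placeholder status are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    wanted: str = "ACTIVE",
    timeout: float = 30.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns `wanted` or `timeout` seconds pass.

    Hypothetical helper: `clock` and `sleep` are injectable so the loop can
    be exercised without really waiting.
    """
    deadline = clock() + timeout
    while True:
        if get_status() == wanted:
            return True
        if clock() >= deadline:
            # Give up: the server never reached the wanted status in time.
            return False
        sleep(interval)
```

With a loop like this, a server that needs, say, 45 seconds to become ACTIVE fails the check at `timeout=30` even though nothing is actually broken, and passes once the timeout is raised.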

It looks like when we time out, we're still waiting for the network-vif-plugged event from neutron:

http://logs.openstack.org/54/620154/1/gate/nova-live-migration/f040b76/logs/screen-n-cpu.txt.gz#_Dec_05_10_06_04_554322

Dec 05 10:06:04.554322 ubuntu-xenial-ovh-bhs1-0000944602 nova-compute[16115]: DEBUG nova.compute.manager [None req-c2f2a1d3-527f-4885-8e4f-e82003a6d472 demo admin] [instance: 19ef59e3-de5a-42b2-b0aa-d069702deedf] Preparing to wait for external event network-vif-plugged-7d5ba599-9c7a-4e41-9fe4-3aff44a75458 {{(pid=16115) prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:327}}

The VIF is plugged here:

Dec 05 10:06:04.620986 ubuntu-xenial-ovh-bhs1-0000944602 nova-compute[16115]: INFO os_vif [None req-c2f2a1d3-527f-4885-8e4f-e82003a6d472 demo admin] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:e5:b1:9f,bridge_name='br-int',has_traffic_filtering=True,id=7d5ba599-9c7a-4e41-9fe4-3aff44a75458,network=Network(22273876-0d80-4450-8913-0102f3f79ccf),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap7d5ba599-9c')

We time out about a second later, but it usually takes about 5 seconds to get the network-vif-plugged event back from neutron after the VIF is plugged, and this is a slower ovh node, so our timeout is likely just not long enough. For comparison, tempest's compute build_timeout is 300 seconds:

https://github.com/openstack/tempest/blob/eac094a8cf834d035316a900107f601adcc42ff5/tempest/config.py#L288
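
For reference, the elapsed times between the log lines quoted in this description can be worked out with a short Python sketch. The timestamps are copied from those lines; the `EVENTS` list and `offsets` helper are illustrative names, and the sketch assumes the hook log and the n-cpu log share the same clock.

```python
from datetime import datetime

# Timestamps copied from the quoted log lines (all on 2018-12-05).
EVENTS = [
    ("nova evacuate issued by test_evacuate.sh", "10:05:50.130000"),
    ("nova-compute logs 'Evacuating instance'", "10:05:54.156579"),
    ("nova-compute prepares to wait for network-vif-plugged", "10:06:04.554322"),
    ("os_vif reports the VIF plugged", "10:06:04.620986"),
]


def offsets(events):
    """Return (label, seconds since the first event) for each event."""
    parsed = [(label, datetime.strptime(ts, "%H:%M:%S.%f")) for label, ts in events]
    start = parsed[0][1]
    return [(label, (when - start).total_seconds()) for label, when in parsed]


for label, seconds in offsets(EVENTS):
    print(f"+{seconds:7.3f}s  {label}")
```

By this arithmetic, about 14.5 seconds pass between issuing the evacuate command and the VIF being plugged, before any wait for the neutron event.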

Matt Riedemann (mriedem) wrote :

Actually devstack configures the tempest build_timeout for compute to 196 seconds:

http://logs.openstack.org/54/620154/1/gate/nova-live-migration/f040b76/logs/tempest_conf.txt.gz

[compute]
min_compute_nodes = 2
max_microversion = latest
flavor_ref_alt = 84
flavor_ref = 42
image_ref_alt = f6457f66-bae9-4618-931e-5211ce99a478
image_ref = f6457f66-bae9-4618-931e-5211ce99a478
build_timeout = 196
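
As a rough illustration of reusing that value, the following Python sketch reads `build_timeout` from the `[compute]` section with the standard-library `configparser`. This is an assumption-laden sketch: the evacuate test itself is a shell script, tempest does its own config parsing, and `read_build_timeout`, `SAMPLE_TEMPEST_CONF`, and the fallback constant are hypothetical names.

```python
import configparser

# Abbreviated copy of the [compute] section shown above.
SAMPLE_TEMPEST_CONF = """
[compute]
min_compute_nodes = 2
max_microversion = latest
build_timeout = 196
"""

# Fallback mirrors the 300-second tempest value cited in the bug description.
DEFAULT_BUILD_TIMEOUT = 300


def read_build_timeout(conf_text: str) -> int:
    """Return [compute]/build_timeout, or the fallback if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint("compute", "build_timeout", fallback=DEFAULT_BUILD_TIMEOUT)


print(read_build_timeout(SAMPLE_TEMPEST_CONF))
```

Reading the timeout from the same file tempest uses keeps the evacuate wait in line with whatever the job already considers a reasonable build time on that node.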

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/623011

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/623011
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3b1463b968426ba8db1c5d458f6461e2375045e8
Submitter: Zuul
Branch: master

commit 3b1463b968426ba8db1c5d458f6461e2375045e8
Author: Matt Riedemann <email address hidden>
Date: Wed Dec 5 10:46:06 2018 -0500

    Use tempest [compute]/build_timeout in evacuate tests

    Waiting 30 seconds for an evacuate to complete is not enough
    time on some slower CI test nodes. This change uses the
    same build timeout configuration from tempest to determine
    the overall evacuate timeout in our evacuate tests.

    Change-Id: Ie5935ae54d2cbf1a4272e93815ee5f67d3ffe2eb
    Closes-Bug: #1806925

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
