DVR multinode job: test_shelve_instance failure due to SSHTimeout

Bug #1522824 reported by Oleg Bondarev on 2015-12-04
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
High
shihanzhang

Bug Description

gate-tempest-dsvm-neutron-multinode-full fails from time to time due to tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance failure:

Captured traceback:
2015-12-04 01:17:12.569 | ~~~~~~~~~~~~~~~~~~~
2015-12-04 01:17:12.569 | Traceback (most recent call last):
2015-12-04 01:17:12.570 | File "tempest/test.py", line 127, in wrapper
2015-12-04 01:17:12.570 | return f(self, *func_args, **func_kwargs)
2015-12-04 01:17:12.570 | File "tempest/scenario/test_shelve_instance.py", line 101, in test_shelve_instance
2015-12-04 01:17:12.570 | self._create_server_then_shelve_and_unshelve()
2015-12-04 01:17:12.570 | File "tempest/scenario/test_shelve_instance.py", line 93, in _create_server_then_shelve_and_unshelve
2015-12-04 01:17:12.570 | private_key=keypair['private_key'])
2015-12-04 01:17:12.570 | File "tempest/scenario/manager.py", line 645, in get_timestamp
2015-12-04 01:17:12.571 | private_key=private_key)
2015-12-04 01:17:12.571 | File "tempest/scenario/manager.py", line 383, in get_remote_client
2015-12-04 01:17:12.571 | linux_client.validate_authentication()
2015-12-04 01:17:12.571 | File "tempest/common/utils/linux/remote_client.py", line 63, in validate_authentication
2015-12-04 01:17:12.571 | self.ssh_client.test_connection_auth()
2015-12-04 01:17:12.571 | File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/tempest_lib/common/ssh.py", line 167, in test_connection_auth
2015-12-04 01:17:12.571 | connection = self._get_ssh_connection()
2015-12-04 01:17:12.572 | File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/tempest_lib/common/ssh.py", line 87, in _get_ssh_connection
2015-12-04 01:17:12.572 | password=self.password)
2015-12-04 01:17:12.572 | tempest_lib.exceptions.SSHTimeout: Connection to the 172.24.5.209 via SSH timed out.
2015-12-04 01:17:12.572 | User: cirros, Password: None

Oleg Bondarev (obondarev) wrote :

Instance console logs shows that there was a problem with fetching metadata:

2015-12-04 01:17:12.854 | checking http://169.254.169.254/2009-04-04/instance-id
2015-12-04 01:17:12.854 | failed 1/20: up 186.40. request failed
2015-12-04 01:17:12.855 | failed 2/20: up 188.70. request failed
2015-12-04 01:17:12.855 | failed 3/20: up 190.82. request failed
2015-12-04 01:17:12.855 | failed 4/20: up 192.95. request failed
2015-12-04 01:17:12.855 | failed 5/20: up 195.09. request failed
2015-12-04 01:17:12.855 | failed 6/20: up 197.36. request failed
2015-12-04 01:17:12.855 | failed 7/20: up 199.50. request failed
2015-12-04 01:17:12.855 | failed 8/20: up 201.62. request failed
2015-12-04 01:17:12.856 | failed 9/20: up 203.74. request failed
2015-12-04 01:17:12.856 | failed 10/20: up 205.88. request failed
2015-12-04 01:17:12.856 | failed 11/20: up 208.15. request failed
2015-12-04 01:17:12.856 | failed 12/20: up 210.28. request failed
2015-12-04 01:17:12.856 | failed 13/20: up 212.42. request failed
2015-12-04 01:17:12.856 | failed 14/20: up 214.54. request failed
2015-12-04 01:17:12.856 | failed 15/20: up 216.66. request failed
2015-12-04 01:17:12.857 | failed 16/20: up 218.94. request failed
2015-12-04 01:17:12.857 | failed 17/20: up 221.07. request failed
2015-12-04 01:17:12.857 | failed 18/20: up 223.21. request failed
2015-12-04 01:17:12.857 | failed 19/20: up 225.34. request failed
2015-12-04 01:17:12.857 | failed 20/20: up 227.52. request failed
2015-12-04 01:17:12.858 | failed to read iid from metadata. tried 20

description: updated
Oleg Bondarev (obondarev) wrote :

 - the test is doing shelve and then unshelve for a vm;
 - the vm is destroyed and then can be spawned on a different host;
 - the neutron port of the vm is not deleted however - when this vm is unshelved, nova just updates port's host if needed.
 - nova should wait for network-vif-plugged event from neutron just like in usual vm creation scenario;

Logs show that nova does not wait for network-vif-plugged event in failure cases - this causes vm to be spawned before neutron ovs agent processes new port, which leads to problems with fetching metadata.

Oleg Bondarev (obondarev) wrote :

The race is:
 - while shelving vm, libvirt driver deletes vm vifs on original compute;
 - ovs agent needs time to detect devices removal and to report update_device_down to neutron server;
 - meanwhile the test respawns vm on new compute and asks neutron to update vm port's host;
 - by the time neutron server receives update_device_down from ovs agent, vm port is already bound to another host;
 - so neutron server skips port status update in this case - port stays in ACTIVE state;
 - nova only waits for neutron events for non-ACTIVE ports;

I think the fix should be on nova side: wait for network-vif-unplugged when unplugging vifs. Neutron already sends such events when port status is changed form ACTIVE to DOWN.

Oleg Bondarev (obondarev) wrote :

Changing project to nova due to reasons described in comment #3

affects: neutron → nova
Oleg Bondarev (obondarev) wrote :

On a second thought it might be not fair to require nova to wait for some events from neutron on cleanup. Also in case of live migration vifs on source node are deleted after vm is already migrated and ports are active on destination node, so neutron will not send any network-vif-unplugged events in this case. Shelve-unshelve seems a corner case and I'd like to avoid hacks in vm cleanup logic.

The other idea for the fix (on neutron side now) would be to change port status to smth like PENDING_BUILD right after db update. Nova will count such ports as non-ACTIVE and will wait for network-vif-plugged events for them. When agent requests info for the port, neutron server will update status to BUILD. Later when agent reports device up, the port will be put back into ACTIVE state and network-vif-plugged event will be sent to nova.

Changing project back to neutron.

affects: nova → neutron

Fix proposed to branch: master
Review: https://review.openstack.org/253569

Changed in neutron:
status: New → In Progress

Change abandoned by Oleg Bondarev (<email address hidden>) on branch: master
Review: https://review.openstack.org/253569
Reason: in favor of https://review.openstack.org/#/c/215467/

Changed in neutron:
assignee: Oleg Bondarev (obondarev) → shihanzhang (shihanzhang)

Reviewed: https://review.openstack.org/215467
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c5fa665de3173f3ad82cc3e7624b5968bc52c08d
Submitter: Jenkins
Branch: master

commit c5fa665de3173f3ad82cc3e7624b5968bc52c08d
Author: shihanzhang <email address hidden>
Date: Fri Aug 21 09:51:59 2015 +0800

    ML2: update port's status to DOWN if its binding info has changed

    This fixes the problem that when two or more ports in a network
    are migrated to a host that did not previously have any ports in
    the same network, the new host is sometimes not told about the
    IP/MAC addresses of all the other ports in the network. In other
    words, initial L2population does not work, for the new host.

    This is because the l2pop mechanism driver only sends catch-up
    information to the host when it thinks it is dealing with the first
    active port on that host; and currently, when multiple ports are
    migrated to a new host, there is always more than one active port so
    the condition above is never triggered.

    The fix is for the ML2 plugin to set a port's status to DOWN when
    its binding info changes.

    This patch also fixes the bug when nova thinks it should not wait
    for any events from neutron because all ports are already active.

    Closes-bug: #1483601
    Closes-bug: #1443421
    Closes-Bug: #1522824
    Related-Bug: #1450604

    Change-Id: I342ad910360b21085316c25df2154854fd1001b2

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 8.0.0.0b3 development milestone.

Reviewed: https://review.openstack.org/300539
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a38cb93dde1633005e9e66e6b7ecec9e726304bb
Submitter: Jenkins
Branch: stable/liberty

commit a38cb93dde1633005e9e66e6b7ecec9e726304bb
Author: venkata anil <email address hidden>
Date: Fri Apr 1 14:52:01 2016 +0000

    ML2: update port's status to DOWN if its binding info has changed

    This fixes the problem that when two or more ports in a network
    are migrated to a host that did not previously have any ports in
    the same network, the new host is sometimes not told about the
    IP/MAC addresses of all the other ports in the network. In other
    words, initial L2population does not work, for the new host.

    This is because the l2pop mechanism driver only sends catch-up
    information to the host when it thinks it is dealing with the first
    active port on that host; and currently, when multiple ports are
    migrated to a new host, there is always more than one active port so
    the condition above is never triggered.

    The fix is for the ML2 plugin to set a port's status to DOWN when
    its binding info changes.

    This patch also fixes the bug when nova thinks it should not wait
    for any events from neutron because all ports are already active.

    Closes-bug: #1483601
    Closes-bug: #1443421
    Closes-Bug: #1522824
    Related-Bug: #1450604
    (cherry picked from commit c5fa665de3173f3ad82cc3e7624b5968bc52c08d)

    Conflicts: neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Change-Id: I342ad910360b21085316c25df2154854fd1001b2

tags: added: in-stable-liberty

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/300559

Change abandoned by Dave Walker (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/306300
Reason:
stable/kilo closed for 2015.1.4

This release is now pending its final release and no freeze exception has
been seen for this changeset. Therefore, I am now abandoning this change.

If this is not correct, please urgently raise a thread on openstack-dev.

More details at: https://wiki.openstack.org/wiki/StableBranch

This issue was fixed in the openstack/neutron 7.1.0 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers