[swarm][ostf] Instance live migration (failure)

Bug #1650545 reported by Valentyn Yakovlev
This bug affects 3 people
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Critical
Assigned to: Timofey Durakov
Milestone: 9.2

Bug Description

Steps to reproduce:
            1. Create cluster (HA) with Neutron VLAN
            2. Add 1 controller + cinder nodes
            3. Add 2 compute + cinder nodes
            4. Upload network template
            5. Delete 'management' network-group
            6. Run network verification
            7. Deploy cluster
            8. Run network verification
            9. Run health checks (OSTF)

Expected result:
 OSTF tests pass

Actual result:
AssertionError: Failed 1 OSTF tests; should fail 0 tests. Names of failed tests:
  - Instance live migration (failure) Time limit exceeded while waiting for Instance migration to finish. Please refer to OpenStack logs for more details.

Version: 9.2 snapshot #638

summary: - [gate][ostf] Instance live migration (failure)
+ [ostf] Instance live migration (failure)
summary: - [ostf] Instance live migration (failure)
+ [swarm][ostf] Instance live migration (failure)
Changed in fuel:
milestone: none → 9.2
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: swarm-fail
Changed in fuel:
importance: Undecided → High
description: updated
Changed in fuel:
status: New → Confirmed
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

So the traceback is the following:

2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher incoming.message))
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 185, in _dispatch
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 110, in wrapped
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher payload)
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher self.force_reraise()
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 89, in wrapped
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher return f(self, context, *args, **kw)
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 409, in decorated_function
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 387, in decorated_function
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher self.force_reraise()
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-12-19 19:19:20.114 30169 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 375, in decorated_functio...


Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

What we essentially see is the following in /var/log/nova-all.log:

<180>Dec 19 19:54:36 node-1 nova-compute: 2016-12-19 19:54:36.416 6272 WARNING nova.virt.libvirt.driver [req-a8978d02-be68-4fa8-b754-f484027dd583 126223f68fdc44528b2d1ba0a6b454b9 a04bbf46f0df45bea28ef07b6fe6736c - - -] [instance: d9ee6d48-e00f-4795-bc24-e9d45f29a458] Timeout waiting for vif plugging callback for instance d9ee6d48-e00f-4795-bc24-e9d45f29a458

while the whole block of

            with self.virtapi.wait_for_instance_event(
                    instance, events, deadline=timeout,
                    error_callback=self._neutron_failed_callback):
                self._plug_vifs_with_retries(instance, network_info,
                                             CONF.live_migration_retry_count)
                LOG.debug('FFFFUU')

actually gets to the end. We can see the debug log print, and then control is passed back to the context manager, which should go through its __exit__ phase, but it stalls there.
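
As an aside, the following is a minimal, hypothetical sketch of this wait-for-event pattern (not Nova's actual implementation; the wait_for_events helper and its names are made up for illustration). It shows why the block body can finish while the migration still stalls: the blocking wait happens on exit from the context manager, not inside its body.

import contextlib
import threading


class EventTimeout(Exception):
    pass


@contextlib.contextmanager
def wait_for_events(expected_names, deadline):
    # Hypothetical, simplified stand-in for virtapi.wait_for_instance_event.
    received = {name: threading.Event() for name in expected_names}

    def deliver(name):
        # Called by whatever receives the external event (e.g. an RPC handler).
        if name in received:
            received[name].set()

    # The body of the 'with' block runs first (e.g. plugging the VIFs)...
    yield deliver

    # ...and only on exit do we block, waiting for every expected event.
    for name, event in received.items():
        if not event.wait(timeout=deadline):
            raise EventTimeout('timed out waiting for %s' % name)


# Usage sketch: the body completes (hence the debug print mentioned above),
# but exiting the 'with' block stalls until the
# 'network-vif-plugged-<port-id>' event is delivered or the deadline passes:
# with wait_for_events(['network-vif-plugged-<port-id>'], deadline=300) as deliver:
#     plug_vifs()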

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

We can clearly see that wait_for_instance_event waits for the network-vif-plugged-a7f7db42-a283-4138-a845-c6fa678e863e event and times out.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

At the same time, Neutron sends the event to Nova, but it gets jammed somewhere:

/var/log/remote/node-4.test.domain.local/neutron-server.log:2016-12-19T20:14:37.565613+00:00 info: 2016-12-19 20:14:37.564 19788 INFO neutron.notifiers.nova [-] Nova event response: {u'status': u'completed', u'tag': u'a7f7db42-a283-4138-a845-c6fa678e863e', u'name': u'network-vif-plugged', u'server_uuid': u'd9ee6d48-e00f-4795-bc24-e9d45f29a458', u'code': 200}
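
For reference, the notification above is just an authenticated POST to Nova's os-server-external-events API. A rough sketch of the request is below; the payload fields are taken from the log lines in this bug, while the endpoint URL, port, and token handling are assumptions for illustration only.

import json
import urllib.request

NOVA_API = 'http://10.109.31.11:8774/v2.1'   # assumed nova-api endpoint
TOKEN = '<keystone token>'                   # assumed; obtained from Keystone beforehand

payload = {
    'events': [{
        'name': 'network-vif-plugged',
        'status': 'completed',
        'tag': 'a7f7db42-a283-4138-a845-c6fa678e863e',          # the port UUID
        'server_uuid': 'd9ee6d48-e00f-4795-bc24-e9d45f29a458',  # the instance UUID
    }]
}

request = urllib.request.Request(
    NOVA_API + '/os-server-external-events',
    data=json.dumps(payload).encode(),
    headers={'Content-Type': 'application/json', 'X-Auth-Token': TOKEN},
    method='POST',
)
# The "code: 200" seen in neutron-server.log above is Nova's response to
# exactly this kind of request.
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read())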

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Nova-api gets the notification but for some reason does not pass it on to the compute node:

2016-12-19T20:14:37.303624+00:00 debug: 2016-12-19 20:14:37.301 10260 DEBUG nova.osapi_compute.wsgi.server [req-7c32cc93-6dd9-4f9f-a076-69cd58d3397d - - - - -] (10260) accepted ('10.109.31.11', 49392) server /usr/lib/python2.7/dist-packages/eventlet/wsgi.py:867
2016-12-19T20:14:37.315661+00:00 debug: 2016-12-19 20:14:37.309 10260 DEBUG keystoneauth.session [-] REQ: curl -g -i -X GET http://10.109.31.11:35357/v3/auth/tokens -H "X-Subject-Token: {SHA1}b5863c80860e1e3a4aeb18e429b0acb0a8bf3aff" -H "User-Agent: python-keystoneclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}b8fa776b06ec1631ab4a64ebdc42f04a4c76e950" _http_log_request /usr/lib/python2.7/dist-packages/keystoneauth1/session.py:248
2016-12-19T20:14:37.462264+00:00 debug: 2016-12-19 20:14:37.460 10260 DEBUG keystoneauth.session [-] RESP: [200] Content-Length: 9037 X-Subject-Token: {SHA1}b5863c80860e1e3a4aeb18e429b0acb0a8bf3aff Vary: X-Auth-Token Server: Apache Connection: close Date: Mon, 19 Dec 2016 20:14:37 GMT Content-Type: application/json x-openstack-request-id: req-8388e89b-af4b-4f7b-8f6c-5ade68fe9f30
RESP BODY: {"token": {"methods": ["password"], "roles": [{"id": "3658f873bbff4800a6831c809ca12655", "name": "admin"}], "expires_at": "2016-12-19T20:49:40.000000Z", "project": {"domain": {"id": "default", "name": "Default"}, "id": "f861e139f0e1422fb9bfe053ecbcb609", "name": "services"}, "catalog": "<removed>", "user": {"domain": {"id": "default", "name": "Default"}, "id": "61d1df16b531499c80519dfb02a721a7", "name": "nova"}, "audit_ids": ["GN1b97EUQ9ycoSVx3ElmPA"], "issued_at": "2016-12-19T19:49:40.000000Z"}}
 _http_log_response /usr/lib/python2.7/dist-packages/keystoneauth1/session.py:277
2016-12-19T20:14:37.480592+00:00 debug: 2016-12-19 20:14:37.478 10260 DEBUG nova.api.openstack.wsgi [req-17063fb1-bb80-4652-a13c-d44599a1cbe1 61d1df16b531499c80519dfb02a721a7 f861e139f0e1422fb9bfe053ecbcb609 - - -] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fb6408790d0>>, body: {"events": [{"status": "completed", "tag": "a7f7db42-a283-4138-a845-c6fa678e863e", "name": "network-vif-plugged", "server_uuid": "d9ee6d48-e00f-4795-bc24-e9d45f29a458"}]} _process_stack /usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py:696
2016-12-19T20:14:37.545120+00:00 info: 2016-12-19 20:14:37.543 10260 INFO nova.api.openstack.compute.server_external_events [req-17063fb1-bb80-4652-a13c-d44599a1cbe1 61d1df16b531499c80519dfb02a721a7 f861e139f0e1422fb9bfe053ecbcb609 - - -] Creating event network-vif-plugged:a7f7db42-a283-4138-a845-c6fa678e863e for instance d9ee6d48-e00f-4795-bc24-e9d45f29a458
2016-12-19T20:14:37.549065+00:00 debug: 2016-12-19 20:14:37.546 10260 DEBUG oslo_messaging._drivers.amqpdriver [req-17063fb1-bb80-4652-a13c-d44599a1cbe1 61d1df16b531499c80519dfb02a721a7 f861e139f0e1422fb9bfe053ecbcb609 - - -] CAST unique_id: 9ff0ec1172154252ab3354d8529f883e size: 3727 exchange: nova topic: compute _send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpd...


Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Therefore I am reassigning this bug to the MOS Nova team for further investigation.

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → MOS Nova (mos-nova)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

What is also interesting is that the source host, node-5, did get the event notification:

2016-12-19T20:14:37.569210+00:00 debug: 2016-12-19 20:14:37.568 30207 DEBUG nova.compute.manager [req-17063fb1-bb80-4652-a13c-d44599a1cbe1 61d1df16b531499c80519dfb02a721a7 f861e139f0e1422fb9bfe053ecbcb609 - - -] [instance: d9ee6d48-e00f-4795-bc24-e9d45f29a458] Received event network-vif-plugged-a7f7db42-a283-4138-a845-c6fa678e863e external_instance_event /usr/lib/python2.7/dist-packages/nova/compute/manager.py:6739
2016-12-19T20:14:37.571089+00:00 debug: 2016-12-19 20:14:37.570 30207 DEBUG oslo_concurrency.lockutils [req-17063fb1-bb80-4652-a13c-d44599a1cbe1 61d1df16b531499c80519dfb02a721a7 f861e139f0e1422fb9bfe053ecbcb609 - - -] Lock "d9ee6d48-e00f-4795-bc24-e9d45f29a458-events" acquired by "nova.compute.manager._pop_event" :: waited 0.000s inner /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270
2016-12-19T20:14:37.572376+00:00 debug: 2016-12-19 20:14:37.571 30207 DEBUG oslo_concurrency.lockutils [req-17063fb1-bb80-4652-a13c-d44599a1cbe1 61d1df16b531499c80519dfb02a721a7 f861e139f0e1422fb9bfe053ecbcb609 - - -] Lock "d9ee6d48-e00f-4795-bc24-e9d45f29a458-events" released by "nova.compute.manager._pop_event" :: held 0.001s inner /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

But the destination host is also waiting for it and times out when it does not come. It looks like either nova-api does not send events to all of the relevant queues, or the nova-compute instances do not subscribe to the appropriate queues.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This seems to be a duplicate of https://bugs.launchpad.net/mos/+bug/1590490, but I am not sure. The fix for https://bugs.launchpad.net/mos/+bug/1590490 is in place, but the event notification is still sent only to the source host's nova-compute daemon. This seems to be a critical issue, since this test has not been green since July 19th.

Changed in fuel:
importance: High → Critical
Revision history for this message
Timofey Durakov (tdurakov) wrote :

The root cause of the issue is the way Nova routes external events from Neutron. There have been several upstream patches that fix the same problem for other migration types. I am now working on getting https://review.openstack.org/#/c/244489/50 merged; after that we can replicate the fix on the stable branches.
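
To make the routing concrete, here is a heavily simplified, hypothetical model of the decision described here and in the fix's commit message further down; the class, field, and function names are assumptions, not Nova's actual code. The set of hosts to notify is derived from the instance's host plus its migration_context, so a live migration that never created a migration_context ends up notifying only the source host.

from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class MigrationContext:
    source_compute: str
    dest_compute: str


@dataclass
class Instance:
    host: str
    migration_context: Optional[MigrationContext] = None


def hosts_for_external_event(instance: Instance) -> Set[str]:
    # Which compute hosts should receive an external event for this instance?
    hosts = {instance.host}                      # the source host, always
    if instance.migration_context is not None:
        # A migration context makes the destination host visible to routing.
        hosts.add(instance.migration_context.dest_compute)
    return hosts


# Before the fix: live migration created no migration_context, so only the
# source host (node-5 in the logs above) was notified, while the destination
# compute kept waiting for 'network-vif-plugged' until it timed out.
print(hosts_for_external_event(Instance(host='node-5')))
# -> {'node-5'}
print(hosts_for_external_event(
    Instance(host='node-5',
             migration_context=MigrationContext('node-5', 'dest-node'))))
# -> {'node-5', 'dest-node'}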

Changed in fuel:
assignee: MOS Nova (mos-nova) → Timofey Durakov (tdurakov)
tags: added: blocker-for-qa
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

This commit, https://review.fuel-infra.org/#/c/28377/, introduced the regression; we need to roll back this change to fix the issue.

Revision history for this message
Timofey Durakov (tdurakov) wrote :

@Timur, that commit fixes another issue, so instead of reverting it I am working on a fix for this problem.

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Timofey Durakov (tdurakov) wrote :

https://review.openstack.org/#/c/413555/ - upstream fix for the bug

tags: added: tempest
tags: removed: tempest
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/29577
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: f423376952b81f8fe469a0c9741e5070b1d89a13
Author: Timofey Durakov <email address hidden>
Date: Tue Dec 27 11:48:00 2016

Create migration context for live-migration

Since a5b920a external instance event routing is
done by checking migration_context for instance.
The problem here is that claims aren't used for
live-migration yet, and no migration_context is created.

This patch fixes this problem by explicitly creating
migration_context on conductor's side, to ensure routing will
work properly.

While this change is a hotfix, the rationale for doing it this way is its backportable nature.

Change-Id: Id4d3e856c33d16fe80af46aa3c3f6edb77ab2269
Related-Bug: #1414559
Closes-Bug: #1650545
(cherry picked from commit c9839ce2da3b0c7f0e8467d69c695ea0e6236705)
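
In terms of the simplified routing model sketched earlier in this bug, the fix amounts to something like the following. This is a hypothetical sketch only; the real change lives in Nova's conductor live-migration task and uses Nova's own objects.

from dataclasses import dataclass
from typing import Optional


@dataclass
class MigrationContext:
    source_compute: str
    dest_compute: str


@dataclass
class Instance:
    host: str
    migration_context: Optional[MigrationContext] = None

    def save(self):
        # Stand-in for persisting the instance, so nova-api sees the
        # migration context when it routes external events.
        pass


def prepare_live_migration(instance: Instance, dest_host: str) -> None:
    # Explicitly attach a migration context before the live migration starts,
    # so that event routing also reaches the destination host.
    if instance.migration_context is None:
        instance.migration_context = MigrationContext(
            source_compute=instance.host,
            dest_compute=dest_host,
        )
        instance.save()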

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification
tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (mcp/newton)

Fix proposed to branch: mcp/newton
Change author: Timofey Durakov <email address hidden>
Review: https://review.fuel-infra.org/33578

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (11.0/ocata)

Fix proposed to branch: 11.0/ocata
Change author: Timofey Durakov <email address hidden>
Review: https://review.fuel-infra.org/34439

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (mcp/ocata)

Fix proposed to branch: mcp/ocata
Change author: Timofey Durakov <email address hidden>
Review: https://review.fuel-infra.org/34781

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/nova (11.0/ocata)

Change abandoned by Roman Podoliaka <email address hidden> on branch: 11.0/ocata
Review: https://review.fuel-infra.org/34439
Reason: 11.0/ocata is deprecated in favor of mcp/ocata

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/nova (mcp/ocata)

Change abandoned by Roman Podoliaka <email address hidden> on branch: mcp/ocata
Review: https://review.fuel-infra.org/34781
Reason: already included to https://review.fuel-infra.org/#/c/34846/

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/nova (mcp/newton)

Change abandoned by Roman Podoliaka <email address hidden> on branch: mcp/newton
Review: https://review.fuel-infra.org/33578
Reason: already included to https://review.fuel-infra.org/#/c/33658/
