live migration failing when nova-conductor is started before other services

Bug #1598860 reported by Dave Walker on 2016-07-04
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
John Garbutt

Bug Description

Deploying OpenStack current master via kolla source, with kvm and ceph/rbd enabled and attempting to perform a live-migration, it becomes stuck in state migrating with this appearing in the nova-compute log.

Looking at nova git log, i'm seeing a bunch of live-migration changes in the last few days. I suspect there might have been a regression.

2016-07-04 10:00:15.643 1 INFO nova.compute.resource_tracker [req-1f6df1d8-a88a-4f27-8ce2-609ea25da4e2 - - - - -] Compute_service record updated for compute01:compute01
2016-07-04 10:00:56.731 1 ERROR root [req-a2b97b2c-f479-4b7c-b3d9-142c1ab1b25f b3bedc85b7674ce2b7e0589a7427dc81 22ab55c3ffe343639d872b1e4db6abb4 - - -] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/exception_wrapper.py", line 66, in wrapped\n return f(self, context, *args, **kw)\n', ' File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/utils.py", line 608, in decorated_function\n *args, **kwargs)\n', ' File "/usr/lib64/python2.7/inspect.py", line 980, in getcallargs\n \'arguments\' if num_required > 1 else \'argument\', num_total))\n', 'TypeError: live_migration() takes exactly 7 arguments (6 given)\n']
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server [req-a2b97b2c-f479-4b7c-b3d9-142c1ab1b25f b3bedc85b7674ce2b7e0589a7427dc81 22ab55c3ffe343639d872b1e4db6abb4 - - -] Exception during message handling
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/exception_wrapper.py", line 71, in wrapped
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server f, self, context, *args, **kw)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/exception_wrapper.py", line 85, in _get_call_dict
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server context, *args, **kw)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/usr/lib64/python2.7/inspect.py", line 980, in getcallargs
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server 'arguments' if num_required > 1 else 'argument', num_total))
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server TypeError: live_migration() takes exactly 7 arguments (6 given)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server

Daniel Berrange (berrange) wrote :

Can you get the original exception, not the re-wrapped exception, as this stack trace is useless.

Dave Walker (davewalker) wrote :

This seems to be a racey nova-conductor negotiating RPC version.

...
Fresh deployment
nova boot ..
nova live-migration ..
<wedged at migrating, with trace from above>

2016-07-04 09:40:43.989 1 INFO nova.compute.rpcapi [req-b498fc12-4264-4422-b7f6-6d8567787ac1 - - - - -] Automatically selected compute RPC version 4.0 from minimum service version 0
...

...
nova reset-state --active ..
restart nova-conductor(s)
nova live-migration ..
<success>

2016-07-04 11:47:29.647 1 INFO nova.compute.rpcapi [req-3d408658-67d2-401e-9f33-00bdd80f3c31 - - - - -] Automatically selected compute RPC version 4.13 from minimum service version 14
...

summary: - live migration failing
+ live migration failing when nova-conductor is started before other
+ services
John Garbutt (johngarbutt) wrote :

So I think this happens because the RPC version goes very low.

tags: added: live-migration
John Garbutt (johngarbutt) wrote :

I should be clearer, I think this is where we call with a really old version of live-migrate that no longer works, because nova-conductor thinks no min_service version is 0, so uses a version of RPC that is too old for the current version of Nova.

When nova-compute starts up layer, we should now have an up to date min_version, but looks like we cached the older one somewhere.

Although looking at the code, it doesn't seem like this should be possible. Need some more digging.

Changed in nova:
importance: Undecided → High
status: New → Triaged
assignee: nobody → John Garbutt (johngarbutt)

Fix proposed to branch: master
Review: https://review.openstack.org/339072

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/339072
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc6b3ab6ef59b4d3f9e540f1cc07326494cd8565
Submitter: Jenkins
Branch: master

commit dc6b3ab6ef59b4d3f9e540f1cc07326494cd8565
Author: John Garbutt <email address hidden>
Date: Thu Jul 7 16:20:27 2016 +0100

    Don't cache RPC pin when service_version is 0

    Its possible for RPC calls to be made before a nova-compute process
    starts up. When this happens we cache the RPC pin of kilo. This causes
    breakages in live-migrate, but in Mitaka we only work with Liberty.

    Change-Id: I5f1e0904c34c4fe26fc75998441602925d2f0593
    Closes-Bug: #1598860

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova 14.0.0.0b3 development milestone.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers