cold migration fails for ceph volume instances

Bug #1777157 reported by Vladislav Belogrudov
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Queens release.

Running an instance booted from a Ceph volume; in Horizon: 'Migrate'.

Instance is in ERROR state. Horizon reports:

Error: Failed to perform requested operation on instance "instance3", the
instance has an error status: Please try again later [Error: list index out
of range].

Yet another instance got:

Error: Failed to perform requested operation on instance "instance4", the
instance has an error status: Please try again later [Error: Conflict
updating instance 6b837382-2a75-46a6-9a09-8c3f90f0ffd7. Expected:
{'task_state': [u'resize_prep']}. Actual: {'task_state': None}].

Revision history for this message
Matt Riedemann (mriedem) wrote :

Hmm, do you only have a single compute host? It appears that you're trying to resize the instance to the same host it's already on:

2018-06-15 16:23:46.876 7 ERROR nova.compute.manager [req-69050681-be73-48c5-be25-993f05b0b47e e511323a68a740f597880a1b056e3918 28f4e0797d544d758700e5be42aa82c6 - default default] [instance: daa5b10b-0655-4178-9413-66697004d7f8] Error: Unable to migrate instance (daa5b10b-0655-4178-9413-66697004d7f8) to current host (compute6).: UnableToMigrateToSelf: Unable to migrate instance (daa5b10b-0655-4178-9413-66697004d7f8) to current host (compute6).
2018-06-15 16:23:46.876 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] Traceback (most recent call last):
2018-06-15 16:23:46.876 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py", line 4141, in prep_resize
2018-06-15 16:23:46.876 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] node, migration, clean_shutdown)
2018-06-15 16:23:46.876 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py", line 4085, in _prep_resize
2018-06-15 16:23:46.876 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] instance_id=instance.uuid, host=self.host)
2018-06-15 16:23:46.876 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] UnableToMigrateToSelf: Unable to migrate instance (daa5b10b-0655-4178-9413-66697004d7f8) to current host (compute6).
2018-06-15 16:23:46.876 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8]

And then due to that, we hit this:

2018-06-15 16:23:46.954 7 ERROR nova.compute.manager [req-69050681-be73-48c5-be25-993f05b0b47e e511323a68a740f597880a1b056e3918 28f4e0797d544d758700e5be42aa82c6 - default default] [instance: daa5b10b-0655-4178-9413-66697004d7f8] Setting instance vm_state to ERROR: IndexError: list index out of range
2018-06-15 16:23:46.954 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] Traceback (most recent call last):
2018-06-15 16:23:46.954 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py", line 7445, in _error_out_instance_on_exception
2018-06-15 16:23:46.954 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] yield
2018-06-15 16:23:46.954 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py", line 4162, in prep_resize
2018-06-15 16:23:46.954 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] filter_properties, host_list)
2018-06-15 16:23:46.954 7 ERROR nova.compute.manager [instance: daa5b10b-0655-4178-9413-66697004d7f8] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py", line 4221, in _reschedule_resize_or_reraise
2018-06-15 16:23:46.954 7 ERROR nova.compute.manager [insta...


Revision history for this message
Matt Riedemann (mriedem) wrote :

I opened bug 1777540 to track the IndexError issue.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Matt: This could not be a resize, as the flavor_id does not change: https://github.com/openstack/nova/blob/530c24596a74a6115f2e944176c46e4f3d199811/nova/compute/manager.py#L4075
But I have no idea how we can ask the Nova API to migrate an instance to itself.
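
The capability check referenced above is what raises UnableToMigrateToSelf. A self-contained paraphrase (not the verbatim nova source; names and signature simplified) of the Queens-era logic in _prep_resize:

    # Hedged paraphrase of nova/compute/manager.py _prep_resize (Queens).
    class UnableToMigrateToSelf(Exception):
        pass

    def check_migrate_to_self(instance_host, current_host,
                              new_flavor_id, current_flavor_id,
                              driver_capabilities):
        same_host = instance_host == current_host
        # Same flavor id on the same host means cold migration, not resize;
        # it is only allowed if the virt driver advertises support for it.
        if same_host and new_flavor_id == current_flavor_id:
            if not driver_capabilities.get(
                    'supports_migrate_to_same_host', False):
                raise UnableToMigrateToSelf(
                    'Unable to migrate instance to current host (%s).'
                    % current_host)

    # The libvirt driver does not advertise this capability, hence the
    # error in the logs when the scheduler picks the source host first.
    check_migrate_to_self('compute6', 'compute6', 42, 42,
                          {'supports_migrate_to_same_host': False})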

Revision history for this message
Vladislav Belogrudov (vlad-belogrudov) wrote :

I have 3 computes and can migrate ephemeral-disk instances; live migration also works for all of them.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I've managed to push the code down the same code path that the logs show. I used 17.0.2 Queens, as its line numbers matched the numbers in the attached log. I used the Python 2.7 and 3.5 nova functional environments for the reproduction.

* define two compute hosts (host1, host2)
* boot an instance on host1
* make sure that host2 is less desirable for the scheduler as a migration target by consuming resources on it, while keeping enough free that it can still be an allocation candidate
* make sure that allow_resize_to_same_host = True so that the scheduler will also consider host1 as a migration target
* make sure that the virt driver on host1 does not have the capability to migrate to the same host
* call migrate without forcing a host

Now the following happens:
* the scheduler returns allocation candidates in the order (host1, host2)
* conductor tries to migrate the instance to host1, but that fails in host1's compute.manager._prep_resize with UnableToMigrateToSelf, as the virt driver lacks the capability [1]
* UnableToMigrateToSelf is handled in the exception block in prep_resize, which calls _reschedule_resize_or_reraise [2]
* that does the reschedule; conductor now selects host2 and makes the allocation successfully, so _reschedule returns True
* this means that nova ends up sending a resize.error notification (about the failure to migrate to host1) at [3]
* and this leads to the 'inspect.trace()[-1]' call at [4], which fails for the bug author but does not fail for me. inspect.trace() should return a non-empty list [5][6] when called from an exception-handling context, and we are in an except block since we are executing [2]. This is also proven by the fact that sys.exc_info() returns a non-(None, None, None) result at [7], which is printed at [8] and is visible both in the bug reporter's logs and in mine.

So I'm clueless about what is happening.
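
To illustrate the expectation in the last bullet above, a minimal standalone sketch (not nova code) of how inspect.trace() behaves inside and outside an exception handler:

    import inspect

    print(inspect.trace())  # [] -- nothing is being handled yet, so
                            # inspect.trace()[-1] would raise IndexError

    def inside_handler():
        try:
            raise ValueError("boom")
        except ValueError:
            # While an exception is being handled, inspect.trace()
            # returns a non-empty list of frame records, so [-1] is safe.
            return inspect.trace()[-1]

    print(inside_handler())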

@Vladislav: Could you provide all three compute logs and the conductor log? Could you please leave a bit more context in the logs before the first ERROR line?

@Vladislav: What is your exact environment? Which version of Queens? Do you have any custom nova code modifications on top of the upstream Queens version?

[1] https://github.com/openstack/nova/blob/307382f58d38778b480d2d030e427759a44c204b/nova/compute/manager.py#L4085
[2] https://github.com/openstack/nova/blob/307382f58d38778b480d2d030e427759a44c204b/nova/compute/manager.py#L4162
[3] https://github.com/openstack/nova/blob/307382f58d38778b480d2d030e427759a44c204b/nova/compute/manager.py#L4221
[4] https://github.com/openstack/nova/blob/307382f58d38778b480d2d030e427759a44c204b/nova/notifications/objects/exception.py#L42
[5] https://docs.python.org/2.7/library/inspect.html#inspect.trace
[6] https://docs.python.org/3.5/library/inspect.html#inspect.trace
[7] https://github.com/openstack/nova/blob/307382f58d38778b480d2d030e427759a44c204b/nova/compute/manager.py#L4159
[8] https://github.com/openstack/nova/blob/307382f58d38778b480d2d030e427759a44c204b/nova/compute/manager.py#L1313

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Here is the functional test I used for the trials: https://review.openstack.org/#/c/576593

Changed in nova:
status: New → Incomplete
Revision history for this message
Vladislav Belogrudov (vlad-belogrudov) wrote :

My setup:

+----------------+---------------------+
| Host | Groups |
+----------------+---------------------+
| 10.196.244.201 | [control,storage] |
| 10.196.244.202 | [control,storage] |
| 10.196.244.203 | [control,cephstore] |
| 10.196.244.204 | [compute,cephstore] |
| 10.196.244.205 | [compute] |
| 10.196.244.206 | [compute,cephstore] |
| 10.196.244.207 | [network,database] |
| 10.196.244.208 | [database] |
| 10.196.244.209 | [network] |
| 10.196.244.211 | [storage] |
+----------------+---------------------+

The instance starts on 10.196.244.206; right after the migrate command it fails with:

| fault | {u'message': u"Conflict updating instance dfad1511-daf4-4b4b-8c6d-ae9edead7d8f. Expected: {'task_state': [u'resize_prep']}. Actual: {'task_state': None}", u'code': 500, u'details': u' File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py", line 202, in decorated_function\n return function(self, context, *args, **kwargs)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py", line 4242, in resize_instance\n self._revert_allocation(context, instance, migration)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py", line 4239, in resize_instance\n instance_type, clean_shutdown)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py", line 4256, in _resize_instance\n instance.save(expected_task_state=task_states.RESIZE_PREP)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper\n ctxt, self, fn.__name__, args, kwargs)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 245, in object_action\n objmethod=objmethod, args=args, kwargs=kwargs)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 174, in call\n retry=self.retry)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/transport.py", line 131, in _send\n timeout=timeout, retry=retry)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send\n retry=retry)\n File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 550, in _send\n raise result\n', u'created': u'2018-06-22T10:25:51Z'}
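
The "Conflict updating instance" fault above comes from the expected_task_state guard on instance.save(): nova applies the update only if the row's current task_state matches the expected value. A minimal standalone sketch (not nova code) of that compare-and-swap behaviour:

    # Illustrative only: mimics the guard that produces the fault above.
    class UnexpectedTaskStateError(Exception):
        pass

    def save_instance(row, updates, expected_task_state=None):
        if expected_task_state is not None:
            expected = (list(expected_task_state)
                        if isinstance(expected_task_state, (list, tuple))
                        else [expected_task_state])
            if row.get('task_state') not in expected:
                raise UnexpectedTaskStateError(
                    "Expected: {'task_state': %s}. Actual: {'task_state': %s}"
                    % (expected, row.get('task_state')))
        row.update(updates)

    # Matches the fault: something had already reset task_state to None
    # before resize_instance tried to save with expected 'resize_prep'.
    row = {'task_state': None}
    save_instance(row, {'task_state': 'resize_migrating'},
                  expected_task_state=['resize_prep'])  # raises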

This is a kolla-based environment:

(nova-compute)[nova@compute5 /]$ pip freeze | grep nova
nova==13.0.0.0rc4.dev10675
python-novaclient==10.1.0

Revision history for this message
Vladislav Belogrudov (vlad-belogrudov) wrote :

Instances can live-migrate from host to host without problems:

compute with storage <-> compute with storage
compute with storage <-> compute without storage

The same is true for instances with ephemeral disks in Ceph.

Changed in nova:
status: Incomplete → New
Revision history for this message
Vladislav Belogrudov (vlad-belogrudov) wrote :

Cannot reproduce it; I will reopen and recollect logs with debug enabled if it happens again.

Changed in nova:
status: New → Invalid
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

> (nova-compute)[nova@compute5 /]$ pip freeze | grep nova
> nova==13.0.0.0rc4.dev10675

13.0.0 means Mitaka, and dev10675 means you have 10675 commits on top of Mitaka, so my failed reproduction attempts on top of Queens 17.0.2 are explained. Could you please try to reproduce the problem on top of an upstream Nova version (or at least with a limited amount of forking in the environment where you reproduce)? It is hard to tell what happens in your environment when it runs different code than what I can see in mine.

Revision history for this message
TingtingYu (qearl) wrote :

I am hitting this problem too, and it is the same one you describe, @Balazs Gibizer (balazs-gibizer).
My nova version is Queens 17.0.5:

pip freeze | grep nova
nova==17.0.5

But how can it be solved?

TingtingYu (qearl)
Changed in nova:
status: Invalid → Confirmed
Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

We've reproduced it on Queens. It seems this happens when resize to the same host is enabled in Nova and the scheduler actually attempts to resize to the same host during the migration.

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

Correction: resize to the same host is disabled.

In the logs of the nova-compute where the instance was/is running before the migration, we see a (supposedly harmless in this case) error:

2018-11-08 21:24:17,128.128 29103 ERROR nova.compute.manager [req-ff9ca9cf-fdb8-4fcc-b1e8-95863516948f d3b371ed5f8c46e29553a76d758a50d2 dd818e8db2704171b3efa0292df48bbb - default default] [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] Error: Unable to migrate instance (dd82d475-2d2a-4471-8b2d-7022a2237ebf) to current host (cmp2).: UnableToMigrateToSelf: Unable to migrate instance (dd82d475-2d2a-4471-8b2d-7022a2237ebf) to current host (cmp2).
2018-11-08 21:24:17,128.128 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] Traceback (most recent call last):
2018-11-08 21:24:17,128.128 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4166, in prep_resize
2018-11-08 21:24:17,128.128 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] node, migration, clean_shutdown)
2018-11-08 21:24:17,128.128 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4110, in _prep_resize
2018-11-08 21:24:17,128.128 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] instance_id=instance.uuid, host=self.host)
2018-11-08 21:24:17,128.128 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] UnableToMigrateToSelf: Unable to migrate instance (dd82d475-2d2a-4471-8b2d-7022a2237ebf) to current host (cmp2).
2018-11-08 21:24:17,128.128 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf]

and then immediately this:

2018-11-08 21:24:17,187.187 29103 ERROR nova.compute.manager [req-ff9ca9cf-fdb8-4fcc-b1e8-95863516948f d3b371ed5f8c46e29553a76d758a50d2 dd818e8db2704171b3efa0292df48bbb - default default] [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] Setting instance vm_state to ERROR: IndexError: list index out of range
2018-11-08 21:24:17,187.187 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] Traceback (most recent call last):
2018-11-08 21:24:17,187.187 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 7545, in _error_out_instance_on_exception
2018-11-08 21:24:17,187.187 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] yield
2018-11-08 21:24:17,187.187 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4187, in prep_resize
2018-11-08 21:24:17,187.187 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] filter_properties, host_list)
2018-11-08 21:24:17,187.187 29103 ERROR nova.compute.manager [instance: dd82d475-2d2a-4471-8b2d-7022a2237ebf] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4246, in _reschedule_res...


Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

Sorry for the confusion: allow_migrate_to_same_host is set to True (which is why UnableToMigrateToSelf is raised at all).

Apparently the traceback disappears from the exception context somewhere around this call:

https://github.com/openstack/nova/blob/b5e8702f5737fff432cd0e4b461f80b368ac42ff/nova/compute/manager.py#L1341

I'll investigate further.
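
One way inspect.trace() can come back empty by the time the notification payload is built is if the thread's exception state has been cleared in between. A contrived standalone demonstration (Python 2 only; sys.exc_clear() merely stands in for whatever actually clears the state in nova, which is exactly what remains to be found here):

    import inspect
    import sys

    def build_payload():
        frames = inspect.trace()
        return frames[-1]  # IndexError when frames == []

    try:
        raise RuntimeError("boom")
    except RuntimeError:
        sys.exc_clear()    # Python 2 only: clears the exception state
        build_payload()    # IndexError: list index out of range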

Revision history for this message
Margarita Shakhova (shakhova-margarita) wrote :

The option allow_migrate_to_same_host has been deprecated since Liberty:
 https://bugs.launchpad.net/openstack-manuals/+bug/1557032

Revision history for this message
Margarita Shakhova (shakhova-margarita) wrote :

In my case cold migration works if allow_resize_to_same_host is set to False and fails if it is set to True.
With allow_resize_to_same_host=False, a resize always moves the instance to another host.
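
That matches how the option is consumed: when it is False, the instance's current host is excluded from scheduling, so a cold migration can never land on the source host and never hits this bug. A self-contained sketch (paraphrasing, not quoting, the Queens-era logic in nova.compute.api):

    # Hedged paraphrase: with allow_resize_to_same_host=False the current
    # host is added to the scheduler's ignore list.
    def build_ignore_hosts(allow_resize_to_same_host, instance_host):
        ignore_hosts = []
        if not allow_resize_to_same_host:
            ignore_hosts.append(instance_host)
        return ignore_hosts

    print(build_ignore_hosts(False, 'compute6'))  # ['compute6']
    print(build_ignore_hosts(True, 'compute6'))   # [] -> source host eligible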

Revision history for this message
Margarita Shakhova (shakhova-margarita) wrote :

Looks like this bug is a duplicate of bug #1790204.
