Instances whose evacuation failed are destroyed when the nova-compute service is restarted.

Bug #1583504 reported by alex on 2016-05-19
This bug affects 2 people
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Margarita Shakhova

Bug Description

Description
===========
Normally, after an instance has been evacuated, nova-compute destroys the leftover instance on the faulty host when the nova-compute service there is restarted, based on migration records whose status is "done" or "accepted". However, if nova-scheduler fails to schedule a destination host (e.g. no valid host), it does not update the instance's migration status to "failed", so nova-compute mistakenly destroys the instance after it restarts.
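The cleanup path described above can be sketched as follows. This is a minimal, self-contained simplification, not the actual nova code; the plain-dict migration records and the function name are hypothetical stand-ins:

```python
# Simplified sketch of the evacuation cleanup nova-compute runs on restart.
# Migration records are modeled as plain dicts (hypothetical stand-ins for
# the real Migration objects).

def destroy_evacuated_instances(local_instance_uuids, migrations):
    """Return the UUIDs of local instances that have an evacuation
    migration record whose status is 'accepted' or 'done'."""
    # Only evacuation records in these states are treated as "moved away".
    evacuations = {
        m["instance_uuid"]: m
        for m in migrations
        if m["migration_type"] == "evacuation"
        and m["status"] in ("accepted", "done")
    }
    destroyed = []
    for uuid in local_instance_uuids:
        if uuid in evacuations:
            # BUG described in this report: if scheduling failed, the record
            # is still 'accepted' (never set to 'failed'), so the instance
            # is destroyed even though it was never evacuated anywhere.
            destroyed.append(uuid)
    return destroyed
```

With a record stuck in 'accepted' after a scheduling failure, the instance is wrongly selected for destruction, which is exactly the reported behavior.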

Steps to reproduce
==================
1. Create some instances on a specific host.
2. Put this host into a fault state.
3. Disable the nova-compute service on the other hosts, to simulate nova-scheduler failing to schedule.
4. Evacuate the instances (the scheduling will fail).
5. Recover the fault host and restart nova-compute.
6. Check whether the instances still exist in KVM.

Expected result
===============
All instances should still exist in KVM.

Actual result
=============
All instances are destroyed by nova-compute.

alex (chinabjalex) on 2016-05-19
Changed in nova:
assignee: nobody → alex (chinabjalex)
Sarafraj Singh (sarafraj-singh) wrote :

Can you please attach logs to this bug?

tags: added: compute
Changed in nova:
status: New → Incomplete
alex (chinabjalex) wrote :

nova-api.log

/var/log/nova/nova-api.log:43606:2016-05-30 14:15:24.424 3987 DEBUG nova.api.openstack.wsgi [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Action: 'action', calling method: <bound method EvacuateController._evacuate of <nova.api.openstack.compute.evacuate.EvacuateController object at 0x639b310>>, body: {"evacuate": {"onSharedStorage": true}} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:789
/var/log/nova/nova-api.log:43607:2016-05-30 14:15:24.425 3987 DEBUG nova.compute.api [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] [instance: eb46eb30-08ae-4faa-a62e-b5f100bcddf5] Fetching instance by UUID get /usr/lib/python2.7/site-packages/nova/compute/api.py:2077
/var/log/nova/nova-api.log:43608:2016-05-30 14:15:24.462 3987 DEBUG nova.compute.api [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] [instance: eb46eb30-08ae-4faa-a62e-b5f100bcddf5] vm evacuation scheduled evacuate /usr/lib/python2.7/site-packages/nova/compute/api.py:3433
/var/log/nova/nova-api.log:43609:2016-05-30 14:15:24.465 3987 DEBUG nova.servicegroup.drivers.db [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Seems service is down. Last heartbeat was 2016-05-30 06:14:08. Elapsed time is 76.465787 is_up /usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py:80
/var/log/nova/nova-api.log:43610:2016-05-30 14:15:24.515 3987 DEBUG nova.objects.instance [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Lazy-loading `flavor' on Instance uuid eb46eb30-08ae-4faa-a62e-b5f100bcddf5 obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:884
/var/log/nova/nova-api.log:43611:2016-05-30 14:15:24.570 3987 DEBUG nova.objects.instance [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Lazy-loading `flavor' on Instance uuid eb46eb30-08ae-4faa-a62e-b5f100bcddf5 obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:884
/var/log/nova/nova-api.log:43612:2016-05-30 14:15:24.617 3987 INFO nova.osapi_compute.wsgi.server [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] 172.16.197.31 "POST /v2/2c2643bad4d6489a99be97fd69c4fdcc/servers/eb46eb30-08ae-4faa-a62e-b5f100bcddf5/action HTTP/1.1" status: 200 len: 191 time: 0.1989682

nova-scheduler.log

/var/log/nova/nova-scheduler.log:9490:2016-05-30 14:15:24.642 1425 WARNING nova.scheduler.host_manager [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] No compute service record found for host compute2
/var/log/nova/nova-scheduler.log:9491:2016-05-30 14:15:24.642 1425 INFO nova.scheduler.host_manager [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Host filter ignoring hosts: compute1
/var/log/nova/nov...


Changed in nova:
status: Incomplete → In Progress
alex (chinabjalex) wrote :

Below are the relevant rows from the migrations table in the nova database.
| created_at | updated_at | deleted_at | id | source_compute | dest_compute | dest_host | status | instance_uuid | old_instance_type_id | new_instance_type_id | source_node | dest_node | deleted | migration_type | hidden |

After executing the evacuate action:
| 2016-05-30 06:15:24 | NULL | NULL | 127 | compute1 | NULL | NULL | accepted | eb46eb30-08ae-4faa-a62e-b5f100bcddf5 | NULL | NULL | compute1 | NULL | 0 | evacuation | 0 |

After restarting the nova-compute service on the source node:
| 2016-05-30 06:15:24 | 2016-05-30 06:17:52 | NULL | 127 | compute1 | NULL | NULL | completed | eb46eb30-08ae-4faa-a62e-b5f100bcddf5 | NULL | NULL | compute1 | NULL | 0 | evacuation | 0 |
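On restart, nova-compute selects rows like the ones above by source host, migration type, and status. A rough stand-alone sketch of that filter, with the table rows modeled as plain dicts (illustrative only, not the real Migration objects or query API):

```python
# The 'accepted' snapshot above, modeled as a plain dict (illustrative only).
rows = [
    {"id": 127, "source_compute": "compute1", "status": "accepted",
     "instance_uuid": "eb46eb30-08ae-4faa-a62e-b5f100bcddf5",
     "migration_type": "evacuation"},
]

def evacuation_rows(rows, host):
    # Same criteria the restart cleanup uses: evacuations away from this
    # host whose status is 'accepted' or 'done'.
    return [r for r in rows
            if r["source_compute"] == host
            and r["migration_type"] == "evacuation"
            and r["status"] in ("accepted", "done")]

print([r["id"] for r in evacuation_rows(rows, "compute1")])  # [127]
```

This also shows why the second snapshot is inert: once the status flips to 'completed', the row no longer matches the filter, so the next restart leaves the instance alone.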

alex (chinabjalex) wrote :

Below is the instance detail. Its host is still compute1 (the source node).
[root@controller ~]# nova show eb46eb30-08ae-4faa-a62e-b5f100bcddf5
+--------------------------------------+------------------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | aaaaaaa |
| OS-EXT-SRV-ATTR:host | compute1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute1 |
| OS-EXT-SRV-ATTR:instance_name | instance-0000003b |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2016-05-30T06:11:41.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| admin-net network | 192.168.0.61 |
| config_drive | True |
| created | 2016-05-30T02:53:37Z |
| flavor | mini (100000000) |
| hostId | da6fc182d7a4592413365df255c835eb99d233d8e1c479642caa7a9f |
| id | eb46eb30-08ae-4faa-a62e-b5f100bcddf5 |
| image | cirros-0.3.1-x86_32 (6e1c83af-e57e-44eb-8a7d-0296b5f464ef) |
| key_name | - |
| metadata | {"admin_pass": "a", "admin_user": "root"} |
| name | vm1 |
| os-extended-volumes:volumes_attached | [] |
| os_type | linux |
| progress | 0 |
| security_groups | default |
| status | ACTIVE ...


alex (chinabjalex) on 2016-06-02
Changed in nova:
status: In Progress → New
Changed in nova:
status: New → In Progress

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/318731
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Anusha Unnam (anusha-unnam) wrote :

Since there is no activity on this bug and the patch has been abandoned, I am removing the assignee and setting the status to New, because the bug was not confirmed by anyone other than the reporter.

Changed in nova:
assignee: alex (chinabjalex) → nobody
status: In Progress → New
Anusha Unnam (anusha-unnam) wrote :

I tried to reproduce this bug and I think it is valid. I followed the steps below:

Multinode environment with a controller and a compute node:

1. Boot an instance vm1 on the compute node.
2. Put the compute node into a fault state.
3. Disable the compute service on the controller node.
4. Evacuate vm1.
5. Recover the fault host and restart the compute service.

As you said, vm1 was deleted on the compute node. I saw these messages in the n-cpu log:
"[instance: 36663d00-fad8-468e-9a13-462b2bbfbd37] Deleting instance as it has been evacuated from this host"
"INFO nova.virt.libvirt.driver [-] [instance: 36663d00-fad8-468e-9a13-462b2bbfbd37] Instance destroyed successfully."

When I ran "nova show vm1", it wrongly showed the instance still on the compute node, with vm_state 'active' and power_state 'NOSTATE'.
So the VM is not running, and it was not evacuated to the controller node even after I enabled the compute service on the controller node.

I read that nova evacuate is useful with shared storage, but in my environment I don't have shared storage. I see from your logs that you have "onSharedStorage": true.

Changed in nova:
status: New → Confirmed
Changed in nova:
assignee: nobody → Margarita Shakhova (shakhova-margarita)

In fact, this bug reproduces only when there has been more than one evacuation attempt, so that more than one row matches the filter instance_uuid + 'evacuation' + 'accepted'.
On the first start after the failure, the nova manager marks the migration as 'completed', but because of an error in the code it finds only the last row; therefore the instance will be rebooted on as many nova-compute restarts as there were 'accepted' evacuation attempts.
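The overwrite described here comes from keying a dict by instance_uuid: each later row with the same UUID replaces the earlier one. A tiny stand-alone demonstration, with migration rows modeled as plain dicts:

```python
# Two 'accepted' evacuation rows for the same instance, as can happen after
# repeated failed evacuation attempts (rows modeled as plain dicts).
evacuations = [
    {"id": 127, "instance_uuid": "u1", "status": "accepted"},
    {"id": 128, "instance_uuid": "u1", "status": "accepted"},
]

# A dict comprehension keyed by UUID keeps only the LAST row per instance:
by_uuid = {m["instance_uuid"]: m for m in evacuations}
print(by_uuid["u1"]["id"])  # 128 -- row 127 is never marked 'completed'
```

So only row 128 gets updated to 'completed'; row 127 still matches the cleanup filter on the next restart, which is the repeated-reboot behavior described above.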

Matt Riedemann (mriedem) wrote :

This was fixed under a different bug: https://review.openstack.org/#/c/499237/

melanie witt (melwitt) wrote :

This bug has likely been fixed by https://review.openstack.org/#/c/499237

but we should have some functional test coverage to be sure. mriedem suggested we could probably add a nova-compute restart to this functional test to verify:

https://review.openstack.org/#/c/499237/4/nova/tests/functional/regressions/test_bug_1713783.py

Yes, this patch fixes the problem. We just tested almost the same change in our lab (using 'failed' instead of 'error').

Still, I see one more potential source of problems here:

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L648

The filter could return more than one row; in that case only the last one will be updated. It would be better to have a dict of lists here.

melanie witt (melwitt) wrote :

What do you mean? The line:

 evacuations = {mig.instance_uuid: mig for mig in evacuations}

will create a dict that contains all of the found migrations, keyed by instance UUID.

I mean that in the preceding lines we get:

evacuations = objects.MigrationList.get_by_filters(context, filters)

there could be more than one row whose status is 'accepted' or 'done' and migration_type='evacuation'.

If there is more than one row with the same instance_uuid, the dict entry is first created from the first row, and each subsequent row with that UUID simply overwrites it. Therefore we always end up with just one value per instance_uuid key.

At L701 the migration status is updated to 'completed'; if there is more than one accepted/done migration for the same instance_uuid, this code updates only the last one, and at the next nova-compute restart the instance will be rebooted again.

I am trying to investigate in what case it is possible to have several such records for the same instance on the same host in the DB. Before the fix https://review.openstack.org/#/c/499237/ it was very easy to reproduce: just call 'evacuate' several times and you would see several 'accepted' migrations of the same instance from the same host. I am not sure whether this is still likely to happen, but just in case it would be better to look for all such records in Migrations.

melanie witt (melwitt) wrote :

Ah, I understand now, thanks for the detailed explanation.

I agree, seems like it would be best to ensure we can handle more than one migration record per instance.

Should I report a separate bug to fix this?

melanie witt (melwitt) wrote :

Yes, I think that would be best. As you mentioned earlier, since the fix [0] it's not clear whether multiple migrations per instance can still happen, but there could be old migrations still around from back when it was possible, and those should be cleaned up.

But, looking at the code, it doesn't look like we can do much about it other than mark the extra migrations as 'completed', because we can only destroy the local instance, clean up allocations, etc., once per instance anyway.
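The "dict of lists" approach suggested above could be sketched like this: group migration records per instance, destroy each local instance only once, but mark every matching record 'completed'. This is a hedged illustration, not the actual nova patch; the records are again plain-dict stand-ins:

```python
from collections import defaultdict

def cleanup_evacuations(migrations):
    """Group evacuation records per instance; destroy each local instance
    once, but mark *all* of its stale migration rows 'completed'."""
    by_uuid = defaultdict(list)
    for m in migrations:
        by_uuid[m["instance_uuid"]].append(m)

    destroyed = []
    for uuid, migs in by_uuid.items():
        destroyed.append(uuid)          # destroy once per instance
        for m in migs:                  # but complete every stale record
            m["status"] = "completed"
    return destroyed

migs = [
    {"instance_uuid": "u1", "status": "accepted"},
    {"instance_uuid": "u1", "status": "accepted"},
]
assert cleanup_evacuations(migs) == ["u1"]
assert all(m["status"] == "completed" for m in migs)
```

Because every record is completed in the same pass, no stale 'accepted' row survives to trigger the cleanup again on the next restart.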

[0] https://review.openstack.org/#/c/499237
