Instances whose evacuation failed are destroyed when the nova-compute service is restarted.

Bug #1583504 reported by alex on 2016-05-19
This bug affects 2 people
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Margarita Shakhova

Bug Description

Description
===========
Normally, after an instance has been evacuated, nova-compute destroys the leftover instance on the faulty host when the nova-compute service there is restarted, based on migration records whose status is "done" or "accepted". However, if nova-scheduler fails to schedule a destination host (e.g. no valid host), it does not update the instance's migration status to "failed", so nova-compute mistakenly destroys the instance after it restarts.
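The cleanup path described above can be sketched as follows. This is a minimal, self-contained simplification, not the actual nova code; the plain-dict migration records and the function name are hypothetical stand-ins:

```python
# Simplified sketch of the evacuation cleanup nova-compute runs on restart.
# Migration records are modeled as plain dicts (hypothetical stand-ins for
# the real Migration objects).

def destroy_evacuated_instances(local_instance_uuids, migrations):
    """Return the UUIDs of local instances that have an evacuation
    migration record whose status is 'accepted' or 'done'."""
    # Only evacuation records in these states are treated as "moved away".
    evacuations = {
        m["instance_uuid"]: m
        for m in migrations
        if m["migration_type"] == "evacuation"
        and m["status"] in ("accepted", "done")
    }
    destroyed = []
    for uuid in local_instance_uuids:
        if uuid in evacuations:
            # BUG described in this report: if scheduling failed, the record
            # is still 'accepted' (never set to 'failed'), so the instance
            # is destroyed even though it was never evacuated anywhere.
            destroyed.append(uuid)
    return destroyed
```

With a record stuck in 'accepted' after a scheduling failure, the instance is wrongly selected for destruction, which is exactly the reported behavior.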

Steps to reproduce
==================
1. Create some instances on a specific host.
2. Put this host into a fault state.
3. Disable the nova-compute service on the other hosts, to simulate nova-scheduler failing to schedule.
4. Evacuate the instances (the scheduling will fail).
5. Recover the fault host and restart nova-compute.
6. Check whether the instances still exist in KVM.

Expected result
===============
All instances should still exist in KVM.

Actual result
=============
All instances are destroyed by nova-compute.

alex (chinabjalex) on 2016-05-19
Changed in nova:
assignee: nobody → alex (chinabjalex)
Sarafraj Singh (sarafraj-singh) wrote :

Can you please attach logs to this bug?

tags: added: compute
Changed in nova:
status: New → Incomplete
alex (chinabjalex) wrote :

nova-api.log

/var/log/nova/nova-api.log:43606:2016-05-30 14:15:24.424 3987 DEBUG nova.api.openstack.wsgi [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Action: 'action', calling method: <bound method EvacuateController._evacuate of <nova.api.openstack.compute.evacuate.EvacuateController object at 0x639b310>>, body: {"evacuate": {"onSharedStorage": true}} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:789
/var/log/nova/nova-api.log:43607:2016-05-30 14:15:24.425 3987 DEBUG nova.compute.api [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] [instance: eb46eb30-08ae-4faa-a62e-b5f100bcddf5] Fetching instance by UUID get /usr/lib/python2.7/site-packages/nova/compute/api.py:2077
/var/log/nova/nova-api.log:43608:2016-05-30 14:15:24.462 3987 DEBUG nova.compute.api [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] [instance: eb46eb30-08ae-4faa-a62e-b5f100bcddf5] vm evacuation scheduled evacuate /usr/lib/python2.7/site-packages/nova/compute/api.py:3433
/var/log/nova/nova-api.log:43609:2016-05-30 14:15:24.465 3987 DEBUG nova.servicegroup.drivers.db [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Seems service is down. Last heartbeat was 2016-05-30 06:14:08. Elapsed time is 76.465787 is_up /usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py:80
/var/log/nova/nova-api.log:43610:2016-05-30 14:15:24.515 3987 DEBUG nova.objects.instance [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Lazy-loading `flavor' on Instance uuid eb46eb30-08ae-4faa-a62e-b5f100bcddf5 obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:884
/var/log/nova/nova-api.log:43611:2016-05-30 14:15:24.570 3987 DEBUG nova.objects.instance [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Lazy-loading `flavor' on Instance uuid eb46eb30-08ae-4faa-a62e-b5f100bcddf5 obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:884
/var/log/nova/nova-api.log:43612:2016-05-30 14:15:24.617 3987 INFO nova.osapi_compute.wsgi.server [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] 172.16.197.31 "POST /v2/2c2643bad4d6489a99be97fd69c4fdcc/servers/eb46eb30-08ae-4faa-a62e-b5f100bcddf5/action HTTP/1.1" status: 200 len: 191 time: 0.1989682

nova-scheduler.log

/var/log/nova/nova-scheduler.log:9490:2016-05-30 14:15:24.642 1425 WARNING nova.scheduler.host_manager [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] No compute service record found for host compute2
/var/log/nova/nova-scheduler.log:9491:2016-05-30 14:15:24.642 1425 INFO nova.scheduler.host_manager [req-72eab156-cfd1-483b-beab-2a164361071c 9646cdbcb7644ee79d80b0b54a6fcebe 2c2643bad4d6489a99be97fd69c4fdcc - - -] Host filter ignoring hosts: compute1
/var/log/nova/nov...


Changed in nova:
status: Incomplete → In Progress
alex (chinabjalex) wrote :

Below are the relevant rows from the migrations table in the nova database.
| created_at | updated_at | deleted_at | id | source_compute | dest_compute | dest_host | status | instance_uuid | old_instance_type_id | new_instance_type_id | source_node | dest_node | deleted | migration_type | hidden |

After executing the evacuate action:
| 2016-05-30 06:15:24 | NULL | NULL | 127 | compute1 | NULL | NULL | accepted | eb46eb30-08ae-4faa-a62e-b5f100bcddf5 | NULL | NULL | compute1 | NULL | 0 | evacuation | 0 |

After restarting the nova-compute service on the source node:
| 2016-05-30 06:15:24 | 2016-05-30 06:17:52 | NULL | 127 | compute1 | NULL | NULL | completed | eb46eb30-08ae-4faa-a62e-b5f100bcddf5 | NULL | NULL | compute1 | NULL | 0 | evacuation | 0 |
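On restart, nova-compute selects rows like the ones above by source host, migration type, and status. A rough stand-alone sketch of that filter, with the table rows modeled as plain dicts (illustrative only, not the real Migration objects or query API):

```python
# The 'accepted' snapshot above, modeled as a plain dict (illustrative only).
rows = [
    {"id": 127, "source_compute": "compute1", "status": "accepted",
     "instance_uuid": "eb46eb30-08ae-4faa-a62e-b5f100bcddf5",
     "migration_type": "evacuation"},
]

def evacuation_rows(rows, host):
    # Same criteria the restart cleanup uses: evacuations away from this
    # host whose status is 'accepted' or 'done'.
    return [r for r in rows
            if r["source_compute"] == host
            and r["migration_type"] == "evacuation"
            and r["status"] in ("accepted", "done")]

print([r["id"] for r in evacuation_rows(rows, "compute1")])  # [127]
```

This also shows why the second snapshot is inert: once the status flips to 'completed', the row no longer matches the filter, so the next restart leaves the instance alone.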

alex (chinabjalex) wrote :

Below is the instance detail. Its host is still compute1 (the source node).
[root@controller ~]# nova show eb46eb30-08ae-4faa-a62e-b5f100bcddf5
+--------------------------------------+------------------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | aaaaaaa |
| OS-EXT-SRV-ATTR:host | compute1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute1 |
| OS-EXT-SRV-ATTR:instance_name | instance-0000003b |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2016-05-30T06:11:41.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| admin-net network | 192.168.0.61 |
| config_drive | True |
| created | 2016-05-30T02:53:37Z |
| flavor | mini (100000000) |
| hostId | da6fc182d7a4592413365df255c835eb99d233d8e1c479642caa7a9f |
| id | eb46eb30-08ae-4faa-a62e-b5f100bcddf5 |
| image | cirros-0.3.1-x86_32 (6e1c83af-e57e-44eb-8a7d-0296b5f464ef) |
| key_name | - |
| metadata | {"admin_pass": "a", "admin_user": "root"} |
| name | vm1 |
| os-extended-volumes:volumes_attached | [] |
| os_type | linux |
| progress | 0 |
| security_groups | default |
| status | ACTIVE ...


alex (chinabjalex) on 2016-06-02
Changed in nova:
status: In Progress → New
Changed in nova:
status: New → In Progress

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/318731
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Anusha Unnam (anusha-unnam) wrote :

Since there is no activity on this bug and the patch has been abandoned, I am removing the assignee and setting the status to New, because the bug was not confirmed by anyone other than the reporter.

Changed in nova:
assignee: alex (chinabjalex) → nobody
status: In Progress → New
Anusha Unnam (anusha-unnam) wrote :

I tried to reproduce this bug and I think it is valid. I followed the steps below:

Multinode environment with a controller and a compute node:

1. Boot an instance vm1 on the compute node.
2. Put the compute node into a fault state.
3. Disable the compute service on the controller node.
4. Evacuate vm1.
5. Recover the fault host and restart the compute service.

As you said, vm1 was deleted on the compute node. I saw these messages in the n-cpu log:
"[instance: 36663d00-fad8-468e-9a13-462b2bbfbd37] Deleting instance as it has been evacuated from this host"
"INFO nova.virt.libvirt.driver [-] [instance: 36663d00-fad8-468e-9a13-462b2bbfbd37] Instance destroyed successfully."

When I ran "nova show vm1", it wrongly showed the instance still on the compute node, with vm_state 'active' and power_state 'NOSTATE'.
So the VM is not running, and it was not evacuated to the controller node even after I enabled the compute service on the controller node.

I read that nova evacuate is useful with shared storage, but in my environment I don't have shared storage. I see from your logs that you have "onSharedStorage": true.

Changed in nova:
status: New → Confirmed
Changed in nova:
assignee: nobody → Margarita Shakhova (shakhova-margarita)

In fact, this bug reproduces only when there has been more than one evacuation attempt, so that more than one row matches the filter instance_uuid + 'evacuation' + 'accepted'.
On the first start after the failure, the nova manager marks the migration as 'completed', but because of an error in the code it finds only the last row; therefore the instance will be rebooted on as many nova-compute restarts as there were 'accepted' evacuation attempts.
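The overwrite described here comes from keying a dict by instance_uuid: each later row with the same UUID replaces the earlier one. A tiny stand-alone demonstration, with migration rows modeled as plain dicts:

```python
# Two 'accepted' evacuation rows for the same instance, as can happen after
# repeated failed evacuation attempts (rows modeled as plain dicts).
evacuations = [
    {"id": 127, "instance_uuid": "u1", "status": "accepted"},
    {"id": 128, "instance_uuid": "u1", "status": "accepted"},
]

# A dict comprehension keyed by UUID keeps only the LAST row per instance:
by_uuid = {m["instance_uuid"]: m for m in evacuations}
print(by_uuid["u1"]["id"])  # 128 -- row 127 is never marked 'completed'
```

So only row 128 gets updated to 'completed'; row 127 still matches the cleanup filter on the next restart, which is the repeated-reboot behavior described above.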

Matt Riedemann (mriedem) wrote :

This was fixed under a different bug: https://review.openstack.org/#/c/499237/

melanie witt (melwitt) wrote :

This bug has likely been fixed by https://review.openstack.org/#/c/499237

but we should have some functional test coverage to be sure. mriedem suggested we could probably add a nova-compute restart to this functional test to verify:

https://review.openstack.org/#/c/499237/4/nova/tests/functional/regressions/test_bug_1713783.py

Yes, this patch fixes the problem. We just tested almost the same change in our lab (using 'failed' instead of 'error').

Still, I see one more potential source of problems here:

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L648

The filter could return more than one row; in that case only the last one will be updated. It would be better to have a dict of lists here.

melanie witt (melwitt) wrote :

What do you mean? The line:

 evacuations = {mig.instance_uuid: mig for mig in evacuations}

will create a dict that contains all of the found migrations, keyed by instance UUID.

I mean that in the preceding lines we get:

evacuations = objects.MigrationList.get_by_filters(context, filters)

there could be more than one row whose status is 'accepted' or 'done' and migration_type='evacuation'.

If there is more than one row with the same instance_uuid, the dict entry is first created from the first row, and each subsequent row with that UUID simply overwrites it. Therefore we always end up with just one value per instance_uuid key.

At L701 the migration status is updated to 'completed'; if there is more than one accepted/done migration for the same instance_uuid, this code updates only the last one, and at the next nova-compute restart the instance will be rebooted again.

I am trying to investigate in what case it is possible to have several such records for the same instance on the same host in the DB. Before the fix https://review.openstack.org/#/c/499237/ it was very easy to reproduce: just call 'evacuate' several times and you would see several 'accepted' migrations of the same instance from the same host. I am not sure whether this is still likely to happen, but just in case it would be better to look for all such records in Migrations.

melanie witt (melwitt) wrote :

Ah, I understand now, thanks for the detailed explanation.

I agree, seems like it would be best to ensure we can handle more than one migration record per instance.

Should I report a separate bug to fix this?

melanie witt (melwitt) wrote :

Yes, I think that would be best. As you mentioned earlier, since the fix [0] it's not clear whether multiple migrations per instance can still happen, but there could be old migrations still around from back when it was possible, and those should be cleaned up.

But, looking at the code, it doesn't look like we can do much about it other than mark the extra migrations as 'completed', because we can only destroy the local instance, clean up allocations, etc., once per instance anyway.
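The "dict of lists" approach suggested above could be sketched like this: group migration records per instance, destroy each local instance only once, but mark every matching record 'completed'. This is a hedged illustration, not the actual nova patch; the records are again plain-dict stand-ins:

```python
from collections import defaultdict

def cleanup_evacuations(migrations):
    """Group evacuation records per instance; destroy each local instance
    once, but mark *all* of its stale migration rows 'completed'."""
    by_uuid = defaultdict(list)
    for m in migrations:
        by_uuid[m["instance_uuid"]].append(m)

    destroyed = []
    for uuid, migs in by_uuid.items():
        destroyed.append(uuid)          # destroy once per instance
        for m in migs:                  # but complete every stale record
            m["status"] = "completed"
    return destroyed

migs = [
    {"instance_uuid": "u1", "status": "accepted"},
    {"instance_uuid": "u1", "status": "accepted"},
]
assert cleanup_evacuations(migs) == ["u1"]
assert all(m["status"] == "completed" for m in migs)
```

Because every record is completed in the same pass, no stale 'accepted' row survives to trigger the cleanup again on the next restart.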

[0] https://review.openstack.org/#/c/499237
