VM failed to recover after VM host reboot

Bug #1857462 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Low
Assigned to: zhipeng liu

Bug Description

Brief Description
-----------------
In a DX system, boot up a few VMs on the active controller, then reboot that controller to evacuate the VMs. After all VMs reach "ERROR" or "REBUILD" status, some VMs remain in ERROR after the host has recovered.

Severity
--------
Major

Steps to Reproduce
------------------
In a DX system, boot up 4 VMs on the active controller
reboot -f the active controller
wait for the host to recover
check VM status

TC-name: functional/mtc/test_evacuate.py::TestTisGuest::()::test_evacuate_vms

Expected Behavior
------------------
All VMs return to "ACTIVE" status

Actual Behavior
---------------
Some VMs remain in "ERROR" status after the host recovers.

Reproducibility
---------------
Unknown - first time this has been seen in sanity; will monitor

System Configuration
--------------------
Two node system

Lab-name: WCP_76-77

Branch/Pull Time/Commit
-----------------------
20191224T000000Z

Last Pass
---------
20191004T013000Z

Timestamp/Logs
--------------
[2019-12-24 11:28:46,325] 181 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-1
[2019-12-24 11:28:46,325] 311 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-12-24 11:31:08,122] 311 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-12-24 11:31:10,756] 433 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| 7c2f6a94-5527-48f4-93a6-ba9a1d9b003b | tenant1-image_vol-14 | ACTIVE | tenant1-net0=172.16.0.132; tenant2-mgmt-net=192.168.241.68 | tis-centos-guest | flv_nolocaldisk |
| be351b26-1cc6-4e35-8d68-b780cdd17051 | tenant1-image_novol-13 | ACTIVE | tenant1-net0=172.16.0.238; tenant2-mgmt-net=192.168.241.83 | tis-centos-guest | flv_nolocaldisk |
| ade5e0d7-d0e7-4e6d-aa95-34b5893c292e | tenant1-vol_local-12 | ACTIVE | tenant1-net0=172.16.0.220; tenant2-mgmt-net=192.168.241.92 | | flv_localdisk |
| 3c4b2e77-7efe-4075-aff8-7bc40444a0f1 | tenant1-vol_nolocal-11 | ACTIVE | tenant1-net0=172.16.0.163; tenant2-mgmt-net=192.168.241.77 | | flv_nolocaldisk |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
controller-0:~$
[2019-12-24 11:31:10,757] 311 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-12-24 11:31:10,861] 433 DEBUG MainThread ssh.expect :: Output:
0
controller-0:~$
[2019-12-24 11:31:13,865] 1599 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_76_77
[2019-12-24 11:31:13,865] 476 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-12-24 11:31:13,865] 311 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-12-24 11:31:16,793] 433 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------+---------+------------------------------------------------------------+------------------+-----------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+---------+------------------------------------------------------------+------------------+-----------------+
| 7c2f6a94-5527-48f4-93a6-ba9a1d9b003b | tenant1-image_vol-14 | ERROR | tenant1-net0=172.16.0.132; tenant2-mgmt-net=192.168.241.68 | tis-centos-guest | flv_nolocaldisk |
| be351b26-1cc6-4e35-8d68-b780cdd17051 | tenant1-image_novol-13 | ERROR | tenant1-net0=172.16.0.238; tenant2-mgmt-net=192.168.241.83 | tis-centos-guest | flv_nolocaldisk |
| ade5e0d7-d0e7-4e6d-aa95-34b5893c292e | tenant1-vol_local-12 | REBUILD | tenant1-net0=172.16.0.220; tenant2-mgmt-net=192.168.241.92 | | flv_localdisk |
| 3c4b2e77-7efe-4075-aff8-7bc40444a0f1 | tenant1-vol_nolocal-11 | ERROR | tenant1-net0=172.16.0.163; tenant2-mgmt-net=192.168.241.77 | | flv_nolocaldisk |
+--------------------------------------+------------------------+---------+------------------------------------------------------------+------------------+-----------------+

[2019-12-24 11:41:15,652] 311 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-12-24 11:41:18,325] 433 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| 7c2f6a94-5527-48f4-93a6-ba9a1d9b003b | tenant1-image_vol-14 | ACTIVE | tenant1-net0=172.16.0.132; tenant2-mgmt-net=192.168.241.68 | tis-centos-guest | flv_nolocaldisk |
| be351b26-1cc6-4e35-8d68-b780cdd17051 | tenant1-image_novol-13 | ERROR | tenant1-net0=172.16.0.238; tenant2-mgmt-net=192.168.241.83 | tis-centos-guest | flv_nolocaldisk |
| ade5e0d7-d0e7-4e6d-aa95-34b5893c292e | tenant1-vol_local-12 | ACTIVE | tenant1-net0=172.16.0.220; tenant2-mgmt-net=192.168.241.92 | | flv_localdisk |
| 3c4b2e77-7efe-4075-aff8-7bc40444a0f1 | tenant1-vol_nolocal-11 | ERROR | tenant1-net0=172.16.0.163; tenant2-mgmt-net=192.168.241.77 | | flv_nolocaldisk |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
controller-0:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to the distro.openstack PL for review/triage

tags: added: stx.distro.openstack
Changed in starlingx:
assignee: nobody → yong hu (yhu6)
zhipeng liu (zhipengs)
Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Peng,

I have found the root cause on the nfv-vim side. We might need a fix in the nfv module.

After the controller-1 reboot, nfv-vim triggers the VM evacuations one by one:
tenant1-vol_local-12
tenant1-image_vol-14
tenant1-image_novol-13
tenant1-vol_nolocal-11

However, there is a precondition for evacuation in the nfv-vim code.
In _instance_task_work.py, line 728: # We must wait for the compute host to go offline or power off.
By the time the first 2 VMs finished evacuating, controller-1 had already moved from offline to the intest state after its reboot.
The evacuation of the other 2 VMs was therefore blocked and eventually failed.
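
For illustration, a minimal sketch of the precondition referenced above at _instance_task_work.py line 728. The function name and status strings are assumptions for illustration only, not the actual nfv-vim source: evacuation is only issued while the source host is offline or powered off.

    # Minimal sketch of the evacuation precondition described above.
    # Names and status strings are illustrative assumptions, not nfv-vim code.
    def can_evacuate(host_avail_status):
        """Only evacuate while the source host is offline or powered off."""
        return host_avail_status in ('offline', 'power-off')

    # Once controller-1 moves from 'offline' to 'intest' during its reboot,
    # the remaining evacuations are no longer attempted.
    for status in ('offline', 'offline', 'intest', 'intest'):
        print(status, '->', 'evacuate' if can_evacuate(status) else 'blocked')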

Some log markers below FYI.
ALL_NODES_20191224.154508/controller-0_20191224.154508/var/log$ vi nfv-vim.log
============================================================================
2019-12-24T11:29:55.628 controller-0 VIM_Thread[2421691] DEBUG _host.py.666 Host State-Change detected: nfvi_admin_state=unlocked host_admin_state=unlocked, nfvi_oper_state=disabled host_oper_state=enabled, nfvi_avail_state=offline host_avail_status=available, locking=False unlocking=False fsm current_state=enabled for controller-1.
2019-12-24T11:29:55.643 controller-0 VIM_Thread[2421691] INFO _host_director.py.434 Notify other directors that the host controller-1 is offline.
2019-12-24T11:31:12.888 controller-0 VIM_Thread[2421691] INFO _instance_state_evacuate.py.80 Evacuate starting for tenant1-vol_local-12.
2019-12-24T11:31:13.039 controller-0 VIM_Thread[2421691] DEBUG _instance_task_work.py.679 Evacuate-Instance callback for tenant1-vol_local-12, response={'completed': True, 'reason': ''}.
2019-12-24T11:32:40.601 controller-0 VIM_Thread[2421691] INFO _instance_director.py.1012 Evacuate of instance tenant1-vol_local-12 from host controller-1 succeeded.
2019-12-24T11:32:41.278 controller-0 VIM_Thread[2421691] INFO _instance_state_evacuate.py.80 Evacuate starting for tenant1-image_vol-14.
2019-12-24T11:32:41.464 controller-0 VIM_Thread[2421691] DEBUG _instance_task_work.py.679 Evacuate-Instance callback for tenant1-image_vol-14, response={'completed': True, 'reason': ''}.
2019-12-24T11:34:37.041 controller-0 VIM_Thread[2421691] DEBUG _host.py.666 Host State-Change detected: nfvi_admin_state=unlocked host_admin_state=unlocked, nfvi_oper_state=disabled host_oper_state=disabled, nfvi_avail_state=intest host_avail_status=offline, locking=False unlocking=False fsm current_state=disabling for controller-1.
2019-12-24T11:35:04.436 controller-0 VIM_Thread[2421691] INFO _instance_director.py.1012 Evacuate of instance tenant1-image_vol-14 from host controller-1 succeeded.
2019-12-24T11:35:04.448 controller-0 VIM_Thread[2421691] DEBUG _instance_task_work.py.733 Evacuate-Instance for tenant1-image_novol-13, but host controller-1 is not offline or power-off.
2019-12-24T11:35:50.203 controller-0 VIM_Thread[2421691] DEBUG _host.py.666 Host State-Change detected: nfvi_admin_state=unlocke...


Changed in starlingx:
status: New → In Progress
status: In Progress → Confirmed
Revision history for this message
Bart Wensley (bartwensley) wrote :

The VIM is behaving correctly in this case. When a host with instances on it goes offline, the behaviour is as follows (a simplified sketch of this logic is shown after the list):
- As long as the host is disabled, attempt to evacuate the instances from the host (one at a time).
- When the host goes enabled (e.g. after it reboots) do not do any further evacuations (this could lead to an instance running on multiple hosts).
- Once the hypervisor on the host goes enabled, recover any failed instances by rebooting (or rebuilding if the reboot fails).
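
A rough sketch of that decision logic, assuming a simplified two-flag view of the host and hypervisor states. The helper and its return values are hypothetical, not the VIM source:

    # Simplified view of the recovery policy described in the list above.
    # Function name and return values are illustrative assumptions only.
    def recovery_action(host_enabled, hypervisor_enabled):
        if not host_enabled:
            return 'evacuate'           # host disabled: evacuate instances one at a time
        if not hypervisor_enabled:
            return 'wait'               # host enabled, hypervisor still disabled: no action yet
        return 'reboot-or-rebuild'      # hypervisor enabled: recover remaining failed instances

    # Timeline from this bug: two evacuations, then the host re-enabled while
    # the hypervisor was still disabled, so the last two VMs waited until ~11:41.
    for state in [(False, False), (False, False), (True, False), (True, True)]:
        print(state, '->', recovery_action(*state))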

That is what happened in this case. The VIM sees the host go offline and the hypervisor go disabled here:
2019-12-24T11:29:55.628 controller-0 VIM_Thread[2421691] DEBUG _host.py.666 Host State-Change detected: nfvi_admin_state=unlocked host_admin_state=unlocked, nfvi_oper_state=disabled host_oper_state=enabled, nfvi_avail_state=offline host_avail_status=available, locking=False unlocking=False fsm current_state=enabled for controller-1.
2019-12-24T11:30:00.066 controller-0 VIM_Thread[2421691] INFO _hypervisor.py.133 Hypervisor controller-1 state change was unlocked-disabled now locked-disabled

The first instance is successfully evacuated:
2019-12-24T11:32:40.601 controller-0 VIM_Thread[2421691] INFO _instance_director.py.1012 Evacuate of instance tenant1-vol_local-12 from host controller-1 succeeded.

The second instance is successfully evacuated:
2019-12-24T11:35:04.436 controller-0 VIM_Thread[2421691] INFO _instance_director.py.1012 Evacuate of instance tenant1-image_vol-14 from host controller-1 succeeded.

The host then goes enabled, so no more evacuations can be done:
2019-12-24T11:35:50.203 controller-0 VIM_Thread[2421691] DEBUG _host.py.666 Host State-Change detected: nfvi_admin_state=unlocked host_admin_state=unlocked, nfvi_oper_state=enabled host_oper_state=disabled, nfvi_avail_state=available host_avail_status=intest, locking=False unlocking=False fsm current_state=disabling for controller-1.

But the hypervisor is still disabled:
2019-12-24T11:35:56.257 controller-0 VIM_Thread[2421691] INFO _hypervisor.py.133 Hypervisor controller-1 state change was locked-disabled now unlocked-disabled

The hypervisor goes enabled here:
2019-12-24T11:41:27.482 controller-0 VIM_Thread[2421691] INFO _hypervisor.py.133 Hypervisor controller-1 state change was unlocked-disabled now unlocked-enabled

This allows the last two instances to recover:
2019-12-24T11:42:01.530 controller-0 VIM_Thread[2421691] INFO _instance_director.py.1601 Instance tenant1-image_novol-13 has recovered on host controller-1.
2019-12-24T11:42:03.775 controller-0 VIM_Thread[2421691] INFO _instance_director.py.1601 Instance tenant1-vol_nolocal-11 has recovered on host controller-1.
...


Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Barton,

Thanks for your clarification of the current VIM design!

Hi Peng,

Could you retest and double-check it again? You need to wait long enough, as mentioned in Barton's comment.
From the current log, 2 VMs had already resumed on controller-1 after it rebooted.

Thanks!
Zhipeng

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Closing it now due to inactivity.

Changed in starlingx:
status: Confirmed → Invalid