DX VMs stayed on the same host after host reboot

Bug #1867009 reported by Peng Peng
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Stefan Dinescu

Bug Description

Brief Description
-----------------
With VMs running on the DX active controller, a forced reboot ('reboot -f') of the active controller was performed. The VMs failed to evacuate to the other host.

Severity
--------
Major

Steps to Reproduce
------------------
1. Boot VMs on the DX active controller
2. Run 'reboot -f' on the active controller
3. Check which host the VMs are on (see the command sketch below)

TC-name: /mtc/test_evacuate.py::TestTisGuest::()::test_evacuate_vms
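
For reference, a minimal version of the check in steps 1-3, assuming admin/tenant credentials are already sourced; the VM UUID is the one from this report's logs:

  # record the VM's host before the reboot
  openstack server show cac2c22f-0f82-4ca8-aeb8-d9119945e501 -c OS-EXT-SRV-ATTR:host

  # on the active controller
  sudo reboot -f

  # after the node goes down, from the surviving controller, check whether the VM moved
  openstack server show cac2c22f-0f82-4ca8-aeb8-d9119945e501 -c OS-EXT-SRV-ATTR:host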

Expected Behavior
------------------
VMs are successfully evacuated and the host is recovered after the reboot

Actual Behavior
----------------
VMs stayed on the same host after host reboot

Reproducibility
---------------
Unknown - first time this is seen in sanity, will monitor

System Configuration
--------------------
Two node system

Lab-name: WCP_76-77

Branch/Pull Time/Commit
-----------------------
20200311T013001Z

Last Pass
---------
Lab: PV1
Load: 20200219T023000Z

Timestamp/Logs
--------------
[2020-03-11 10:40:06,467] 314 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show cac2c22f-0f82-4ca8-aeb8-d9119945e501'
[2020-03-11 10:40:09,942] 436 DEBUG MainThread ssh.expect :: Output:
+-------------------------------------+------------------------------------------------------------+
| Field | Value |
+-------------------------------------+------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | controller-1 |

[2020-03-11 10:48:12,128] 181 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-1
[2020-03-11 10:48:12,128] 314 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2020-03-11 10:52:15,987] 314 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2020-03-11 10:52:22,820] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------+---------+------------------------------------------------------------+------------------+-----------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+---------+------------------------------------------------------------+------------------+-----------------+
| 8de4313f-6646-4278-b024-f28fb6471021 | tenant1-image_vol-14 | ERROR | tenant1-mgmt-net=192.168.141.39; tenant1-net1=172.16.1.247 | tis-centos-guest | flv_nolocaldisk |
| 7d4505c9-5626-4ef2-97a1-3b257ccc68f9 | tenant1-image_novol-13 | ERROR | tenant1-mgmt-net=192.168.141.46; tenant1-net1=172.16.1.195 | tis-centos-guest | flv_nolocaldisk |
| 285a93f7-c21c-4538-981d-a651c73cef8e | tenant1-vol_local-12 | REBUILD | tenant1-mgmt-net=192.168.141.50; tenant1-net1=172.16.1.176 | | flv_localdisk |
| cac2c22f-0f82-4ca8-aeb8-d9119945e501 | tenant1-vol_nolocal-11 | ERROR | tenant1-mgmt-net=192.168.141.49; tenant1-net1=172.16.1.153 | | flv_nolocaldisk |
+--------------------------------------+------------------------+---------+------------------------------------------------------------+------------------+-----------------+

[2020-03-11 10:57:42,756] 314 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2020-03-11 10:57:46,426] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| 8de4313f-6646-4278-b024-f28fb6471021 | tenant1-image_vol-14 | ACTIVE | tenant1-mgmt-net=192.168.141.39; tenant1-net1=172.16.1.247 | tis-centos-guest | flv_nolocaldisk |
| 7d4505c9-5626-4ef2-97a1-3b257ccc68f9 | tenant1-image_novol-13 | ACTIVE | tenant1-mgmt-net=192.168.141.46; tenant1-net1=172.16.1.195 | tis-centos-guest | flv_nolocaldisk |
| 285a93f7-c21c-4538-981d-a651c73cef8e | tenant1-vol_local-12 | ACTIVE | tenant1-mgmt-net=192.168.141.50; tenant1-net1=172.16.1.176 | | flv_localdisk |
| cac2c22f-0f82-4ca8-aeb8-d9119945e501 | tenant1-vol_nolocal-11 | ACTIVE | tenant1-mgmt-net=192.168.141.49; tenant1-net1=172.16.1.153 | | flv_nolocaldisk |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
controller-0:~$
[2020-03-11 10:57:46,426] 314 DEBUG MainThread ssh.send :: Send 'echo $?'
[2020-03-11 10:57:46,531] 436 DEBUG MainThread ssh.expect :: Output:
0
controller-0:~$
[2020-03-11 10:57:46,532] 1604 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_76_77
[2020-03-11 10:57:46,532] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2020-03-11 10:57:46,532] 314 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show cac2c22f-0f82-4ca8-aeb8-d9119945e501'
[2020-03-11 10:57:50,379] 436 DEBUG MainThread ssh.expect :: Output:
+-------------------------------------+------------------------------------------------------------+
| Field | Value |
+-------------------------------------+------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | controller-1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | controller-1 |

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - VM evacuation issue.
Needs investigation in the VIM area, as the VIM is what triggers the evacuations.

tags: added: stx.nfv
Changed in starlingx:
status: New → Triaged
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Stefan Dinescu (stefandinescu)
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
2020-06-27_18-35-20
WCP_69-70

new log added at
https://files.starlingx.kube.cengn.ca/launchpad/1867009

Revision history for this message
Frank Miller (sensfan22) wrote :

Because some VMs do evacuate, and the ones that don't are recovered once the host comes back up, this issue is not serious enough to warrant a cherry-pick back to r/stx.4.0 when the fix is ready. Re-gating this to stx.5.0.

tags: added: stx.5.0
removed: stx.4.0
Revision history for this message
Stefan Dinescu (stefandinescu) wrote :

Rejecting this issue as the test results are non-deterministic due to inconsistent reboot times. Depending on how long a host takes to fully reboot and come back up, the platform code may not have enough time to evacuate all instances from the rebooted node. When the rebooted node comes back up, the instances that did not have enough time to be evacuated are recovered on the rebooted node anyway, so the end result is the same: all instances are recovered (just not evacuated, as the test expected).

Test setup run:
- 4 instances launched: 2 from volume and 2 from image.
- one of the instances launched from image had an empty volume attached to it
- one of the instances launched from volume had a flavor with swap and ephemeral disks
- instances were always running on the active controller
- during the testing, the active controller role switches between the two controllers after each reboot, so the reboots alternate between controllers

I ran these instances through "normal reboot" testing and "long reboot" testing:
- "normal reboot": run "reboot -f" on the active controller and wait for the node to come back up on its own
- "long reboot": run "reboot -f" on the active controller, then power the node off immediately after it reboots, to simulate a longer reboot time (see the sketch below)

"Normal reboot" test results:
- test1: fail, 1 volume instance failed to evacuate due to compute services coming back up
  compute services fail: 2020-08-05 17:53:16.448518
  compute services come back: 2020-08-05 17:58:33.264947
- test2: fail, 1 volume instance and 1 image instance failed to evacuate due to compute services coming back up
  compute services fail: 2020-08-05 18:12:35.777400
  compute services come back: 2020-08-05 18:19:03.734006
- test3: success, all instances evacuate
  compute services fail: 2020-08-05 18:25:09.833459
  compute services come back: 2020-08-05 18:36:08.501073
- test4: success, all instances evacuate
  compute services fail: 2020-08-05 18:39:31.652509
  compute services come back up: 2020-08-05 18:49:50.87526
- test5: success, all instances evacuate
  compute services fail: 2020-08-05 18:57:44.042584
  compute services come back up: 2020-08-05 19:07:21.471962
- test6: fail, 1 image instance fails to evacuate
  compute services fail: 2020-08-05 19:11:41.626354
  compute services come back up: 2020-08-05 19:18:12.168744

"Long reboot" tests:
- test1: success
- test2: success
- test3: success
- test4: success
- test5: success
- test6: success

Observations:
- even though all tests were run on the same AIO-DX setup, reboot times for a host are not consistent across tests. When the reboot time is shorter, the platform code doesn't have enough time to evacuate all 4 instances from the failed node.
- in order to evacuate all instances from a node, the node must be down for about 3 minutes per VM (see the rough check below).
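
As a rough cross-check of the numbers above: with 4 instances at roughly 3 minutes per VM, the failed node needs to stay down for about 4 x 3 = 12 minutes. In the failing "normal reboot" runs (test1, test2, test6) compute services were back within roughly 5-7 minutes, while in the passing runs (test3-test5) they stayed down for roughly 10-11 minutes, which is broadly consistent with that estimate.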

I talked with the reporter of this bug about changing the test scenario to increase the time the rebooted node stays down, so that the platform has enough time to evacuate all instances.
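
A minimal sketch of what the adjusted wait could look like in the test, assuming the rebooted node is held down while polling (the UUIDs are placeholders and the 3-minute-per-VM budget follows the observation above):

  FAILED_HOST=controller-1
  for vm in <uuid1> <uuid2> <uuid3> <uuid4>; do
      # allow up to ~3 minutes per VM for it to land on the other host
      for i in $(seq 1 36); do
          host=$(openstack server show "$vm" -c OS-EXT-SRV-ATTR:host -f value)
          [ "$host" != "$FAILED_HOST" ] && break
          sleep 5
      done
  done
  # only then allow the rebooted controller to power back on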

Changed in starlingx:
status: Triaged → Invalid
Peng Peng (ppeng)
tags: removed: stx.retestneeded