SX host-lock failed due to failure to stop VM

Bug #1830430 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Austin Sun

Bug Description

Brief Description
-----------------
In a single-node system, a host-lock was attempted; after 1 minute, the host was still in the unlocked state.

Severity
--------
Major

Steps to Reproduce
------------------
As per the brief description above: with VMs running on the single (simplex) host, attempt to lock the host.

TC-name: TestLockWithVMs::test_lock_with_max_vms_simplex

Expected Behavior
------------------
The host locks successfully.

Actual Behavior
----------------
The host remains in the unlocked state.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
One node system

Lab-name: SM-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-23_18-37-00

Last Pass
---------
20190506T233000Z

Timestamp/Logs
--------------

[2019-05-24 15:17:50,348] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'
[2019-05-24 15:17:52,481] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------+
| action | none |
| administrative | unlocked |

[2019-05-24 15:18:43,952] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2019-05-24 15:18:45,769] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+-------------------------------------------------------------------------+
| Property | Value |
+---------------------+-------------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |
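
For illustration, a minimal sketch of the kind of polling check used here to verify the lock, assuming a sourced openrc and the system CLI on the PATH (the actual TestLockWithVMs framework helper differs):

import subprocess
import time

def wait_for_host_locked(host="controller-0", timeout=60, interval=10):
    # Illustrative only: poll 'system host-show <host>' until the
    # administrative field reports 'locked', or give up after 'timeout' secs.
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(["system", "host-show", host],
                             capture_output=True, text=True).stdout
        admin_row = next((line for line in out.splitlines()
                          if line.startswith("| administrative")), "")
        if "| locked" in admin_row:
            return True
        time.sleep(interval)
    return False

In the output above, the lock is issued at 15:17:50 and host-show at 15:18:45 still reports administrative=unlocked, so a check of this kind times out.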

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Numan Waheed (nwaheed) wrote :

As per Peng, this issue has only been seen once.

Ghada Khalil (gkhalil) wrote :

Please monitor to get a better understanding of the frequency.

Changed in starlingx:
status: New → Incomplete
Frank Miller (sensfan22) wrote :

Looking into the nfv-vim logs, the host-lock failure is due to being unable to stop one of the VMs:

2019-05-24T15:17:52.249 controller-0 VIM_Thread[441639] DEBUG _vim_nfvi_events.py.63 Host action, host_uuid=d3a0401f-ed79-4465-b5ae-6743438cf6e3, host_name=controller-0, do_action=lock.

2019-05-24T15:17:55.332 controller-0 VIM_Thread[441639] INFO _host_task_work.py.1467 Host controller-0 instance stopped extending timeout.

2019-05-24T15:18:13.111 controller-0 VIM_Thread[441639] INFO _task_worker_pool.py.73 Timeout worker Compute-Worker-0
2019-05-24T15:18:14.117 controller-0 VIM_Thread[441639] ERROR _task.py.200 Task(stop_instance) work (stop_server) timed out, id=1019.
2019-05-24T15:18:14.119 controller-0 VIM_Thread[441639] DEBUG _instance_task_work.py.856 Stop-Instance callback for tenant2-vol_ephemswap-5, response={'completed': False, 'reason': ''}.
2019-05-24T15:18:14.120 controller-0 VIM_Thread[441639] DEBUG _instance_tasks.py.596 Task (stop-instance_tenant2-vol_ephemswap-5) complete.
2019-05-24T15:18:14.120 controller-0 VIM_Thread[441639] INFO _instance_state_stop.py.80 Stop failed for tenant2-vol_ephemswap-5.
2019-05-24T15:18:14.121 controller-0 VIM_Thread[441639] DEBUG _instance.py.2158 Stop failed for instance tenant2-vol_ephemswap-5, reason=.
2019-05-24T15:18:14.122 controller-0 VIM_Thread[441639] INFO _instance_director.py.1071 Stop of instance tenant2-vol_ephemswap-5 on host controller-0 failed.
2019-05-24T15:18:14.123 controller-0 VIM_Thread[441639] INFO _host_task_work.py.1477 Host controller-0 instances-stopped, success=False.
2019-05-24T15:18:14.123 controller-0 VIM_Thread[441639] INFO _host_task_work.py.1507 Notify-Instances-Host-Disabling failed for controller-0.

2019-05-24T15:18:14.129 controller-0 VIM_Thread[441639] INFO _instance_director.py.1332 Canceling host operation host-lock for host controller-0.
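
For reference, a conceptual sketch of the flow these log lines describe (this is not the actual nfv-vim code; the names and signatures below are illustrative only):

# Conceptual sketch of the flow in the nfv-vim logs above -- NOT the actual
# nfv-vim implementation; names and signatures are illustrative.
def lock_host(host_name, instances, stop_instance, cancel_host_lock):
    # Stop every instance on the host; cancel the host-lock if any stop fails.
    failures = []
    for inst in instances:
        # stop_instance is assumed to return a callback payload like the one
        # logged above: {'completed': bool, 'reason': str}
        response = stop_instance(inst)
        if not response.get("completed", False):
            # corresponds to "Stop failed for <instance>" and
            # "instances-stopped, success=False"
            failures.append(inst)
    if failures:
        # corresponds to "Notify-Instances-Host-Disabling failed" followed by
        # "Canceling host operation host-lock for host <host>"
        cancel_host_lock(host_name,
                         reason="failed to stop: " + ", ".join(failures))
        return False
    return True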

Ghada Khalil (gkhalil) wrote :

Possibly a nova issue; needs further investigation to determine the frequency and decide whether this gates stx.2.0 or not.

tags: added: stx.2.0 stx.distro.openstack
summary: - SX host-lock failed
+ SX host-lock failed due to failure to stop VM
Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
assignee: nobody → Bruce Jones (brucej)
tags: removed: stx.2.0
Bruce Jones (brucej) wrote :

Cindy, please have someone reproduce and root-cause this issue.

Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Austin Sun (sunausti)
Ghada Khalil (gkhalil) wrote :

Marking as release gating until further investigation.

In addition to attempting to reproduce, the attached logs should be reviewed as well.

tags: added: stx.2.0
Austin Sun (sunausti) wrote :

From the log, the stop of this VM timed out:
2019-05-24T15:18:13.111 controller-0 VIM_Thread[441639] INFO _task_worker_pool.py.73 Timeout worker Compute-Worker-0
2019-05-24T15:18:14.117 controller-0 VIM_Thread[441639] ERROR _task.py.200 Task(stop_instance) work (stop_server) timed out, id=1019.
2019-05-24T15:18:14.119 controller-0 VIM_Thread[441639] DEBUG _instance_task_work.py.856 Stop-Instance callback for tenant2-vol_ephemswap-5, response={'completed': False, 'reason': ''}.
2019-05-24T15:18:14.120 controller-0 VIM_Thread[441639] DEBUG _instance_tasks.py.596 Task (stop-instance_tenant2-vol_ephemswap-5) complete.
2019-05-24T15:18:14.120 controller-0 VIM_Thread[441639] INFO _instance_state_stop.py.80 Stop failed for tenant2-vol_ephemswap-5.
2019-05-24T15:18:14.121 controller-0 VIM_Thread[441639] DEBUG _instance.py.2158 Stop failed for instance tenant2-vol_ephemswap-5, reason=.

tenant2-vol_ephemswap-5 timed out on stop, but the other 3 VMs were stopped successfully, and there is no abnormal log in the nova container.

I tried to reproduce this issue 10 times; the lock/unlock always succeeded and the VMs could be shut down and brought back to active.

It might be the same as https://bugs.launchpad.net/starlingx/+bug/1817212 and https://bugs.launchpad.net/starlingx/+bug/1817936.

yong hu (yhu6) wrote :

@austin, please check it on the latest build. If it still cannot be reproduced, we can close it.

Austin Sun (sunausti) wrote :

This is a duplicate of bug #1817936. Once bug #1817936 is fixed, please verify whether this issue is fixed as well.

Austin Sun (sunausti) wrote :

In kern.log:
2019-05-24T15:15:51.686 controller-0 kernel: err [ 3291.728637] kvm [898299]: vcpu0 disabled perfctr wrmsr: 0xc2 data 0xffff
2019-05-24T15:16:35.921 controller-0 kernel: err [ 3335.963042] kvm [907375]: vcpu0 disabled perfctr wrmsr: 0xc2 data 0xffff
2019-05-24T15:17:15.933 controller-0 kernel: err [ 3375.974180] kvm [916407]: vcpu0 disabled perfctr wrmsr: 0xc2 data 0xffff

It seems to be a kvm issue. Need to double-check whether it is really the same as bug #1817936.
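
For completeness, a trivial sketch of pulling these messages out of kern.log so their timestamps can be compared with the 15:17:52-15:18:14 stop-failure window (the log path is illustrative):

# List the kvm perfctr messages in kern.log; the path is illustrative.
with open("/var/log/kern.log") as f:
    for line in f:
        if "disabled perfctr wrmsr" in line:
            print(line.rstrip())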

yong hu (yhu6) wrote :

Even if https://bugs.launchpad.net/bugs/1817936 is fixed, there might be other issues involved.

Austin Sun (sunausti) wrote :

From the nova-api-osapi stop-server log:
{"log":"2019-05-24 15:17:52.756 1 INFO nova.osapi_compute.wsgi.server [req-b1bb493c-20ec-4b9d-a82d-a87cc952a09a a4519283ce234c0783623cfc68e5548d f51c7914e02748c8b8636d3b99a81808 - default default] 128.224.150.222 \"POST /v2.1/f51c7914e02748c8b8636d3b99a81808/servers/ba1bb1c4-a43a-456c-81a8-c667c41f90bd/action HTTP/1.1\" status: 202 len: 400 time: 0.3294051\n","stream":"stdout","time":"2019-05-24T15:17:52.757648791Z"}
{"log":"2019-05-24 15:17:52.756 1 INFO nova.osapi_compute.wsgi.server [req-b1bb493c-20ec-4b9d-a82d-a87cc952a09a a4519283ce234c0783623cfc68e5548d f51c7914e02748c8b8636d3b99a81808 - default default] 128.224.150.222 \"POST /v2.1/f51c7914e02748c8b8636d3b99a81808/servers/ba1bb1c4-a43a-456c-81a8-c667c41f90bd/action HTTP/1.1\" status: 202 len: 400 time: 0.3294051\n","stream":"stdout","time":"2019-05-24T15:17:52.757755745Z"}
{"log":"2019-05-24 15:17:52.794 1 INFO nova.osapi_compute.wsgi.server [req-607803b9-b5ee-4b2f-ae22-9cd9b39fc7ef a4519283ce234c0783623cfc68e5548d f51c7914e02748c8b8636d3b99a81808 - default default] 128.224.150.222 \"POST /v2.1/f51c7914e02748c8b8636d3b99a81808/servers/9c80943a-e08d-4cc3-8024-7e6a8f4dc073/action HTTP/1.1\" status: 202 len: 400 time: 0.3579650\n","stream":"stdout","time":"2019-05-24T15:17:52.795767252Z"}
{"log":"2019-05-24 15:17:52.794 1 INFO nova.osapi_compute.wsgi.server [req-607803b9-b5ee-4b2f-ae22-9cd9b39fc7ef a4519283ce234c0783623cfc68e5548d f51c7914e02748c8b8636d3b99a81808 - default default] 128.224.150.222 \"POST /v2.1/f51c7914e02748c8b8636d3b99a81808/servers/9c80943a-e08d-4cc3-8024-7e6a8f4dc073/action HTTP/1.1\" status: 202 len: 400 time: 0.3579650\n","stream":"stdout","time":"2019-05-24T15:17:52.795801805Z"}
{"log":"2019-05-24 15:17:52.999 1 INFO nova.osapi_compute.wsgi.server [req-56ca1e4a-7377-4261-8406-162491d85fae a4519283ce234c0783623cfc68e5548d f51c7914e02748c8b8636d3b99a81808 - default default] 128.224.150.222 \"POST /v2.1/f51c7914e02748c8b8636d3b99a81808/servers/00feca69-2c51-4ece-a1a5-56cbcf4c9d8f/action HTTP/1.1\" status: 202 len: 400 time: 0.1993840\n","stream":"stdout","time":"2019-05-24T15:17:53.000355211Z"}
{"log":"2019-05-24 15:17:52.999 1 INFO nova.osapi_compute.wsgi.server [req-56ca1e4a-7377-4261-8406-162491d85fae a4519283ce234c0783623cfc68e5548d f51c7914e02748c8b8636d3b99a81808 - default default] 128.224.150.222 \"POST /v2.1/f51c7914e02748c8b8636d3b99a81808/servers/00feca69-2c51-4ece-a1a5-56cbcf4c9d8f/action HTTP/1.1\" status: 202 len: 400 time: 0.1993840\n","stream":"stdout","time":"2019-05-24T15:17:53.000549102Z"}

tenant2-image_root_attachvol-7 is ba1bb1c4-a43a-456c-81a8-c667c41f90bd,
tenant2-image_root-6 is 9c80943a-e08d-4cc3-8024-7e6a8f4dc073,
tenant2-vol_root-4 is 00feca69-2c51-4ece-a1a5-56cbcf4c9d8f.
These 3 instances, together with tenant2-vol_ephemswap-5 (id c83fb306-4d83-466a-b254-283270bf0dd1), were being stopped, but only 3 POST /action requests appear in nova-api-osapi. So this is suspected to be the same as bug #1817936: the stop command for the fourth VM was never sent to the nova-api-osapi container.
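
A rough sketch of this cross-check, assuming the docker json-file log format shown above (the log path and helper name are illustrative):

import json
import re

STOPPING = {
    "ba1bb1c4-a43a-456c-81a8-c667c41f90bd",  # tenant2-image_root_attachvol-7
    "9c80943a-e08d-4cc3-8024-7e6a8f4dc073",  # tenant2-image_root-6
    "00feca69-2c51-4ece-a1a5-56cbcf4c9d8f",  # tenant2-vol_root-4
    "c83fb306-4d83-466a-b254-283270bf0dd1",  # tenant2-vol_ephemswap-5
}
ACTION_RE = re.compile(r"POST /v2\.1/[0-9a-f]+/servers/([0-9a-f-]+)/action")

def servers_with_action_requests(log_path):
    # Collect the server UUIDs that actually received a POST .../action request.
    seen = set()
    with open(log_path) as f:
        for raw in f:
            match = ACTION_RE.search(json.loads(raw).get("log", ""))
            if match:
                seen.add(match.group(1))
    return seen

# missing = STOPPING - servers_with_action_requests("nova-api-osapi.log")
# Only c83fb306-... (tenant2-vol_ephemswap-5) is missing, i.e. its stop request
# never reached the container.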

Austin Sun (sunausti) wrote :

for "2019-05-24T15:15:51.686 controller-0 kernel: err [ 3291.728637] kvm [898299]: vcpu0 disabled perfctr wrmsr: 0xc2 data 0xffff"

similar reported https://access.redhat.com/solutions/2188061 , the wrmsr is 0xc1.
The explanation is
This warning message is reporting that the operating system cannot access the HW PMU counters and should not be of any concern. HP has advisories describing these messages and their recommended solution.

Ghada Khalil (gkhalil) wrote :

Changing the importance to match the duplicate bug.

Changed in starlingx:
importance: Medium → High
Ghada Khalil (gkhalil) wrote :

The fix for the duplicate bug was merged on 2019-09-16; marking as Fix Released.

Peng Peng (ppeng)
tags: removed: stx.retestneeded
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → Fix Released