Critical level CPU threshold alarm raised on compute node after instances evacuation/deletion (with force lock)

Bug #1797948 reported by Wendy Mitchell
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
StarlingX | Fix Released | Medium | Tee Ngo |

Bug Description

Brief Description
-----------------
STX: A critical-level CPU threshold alarm is raised after instances are evacuated (with force lock) and then deleted.

Severity
--------
Major

Steps to Reproduce
------------------
1. Instances are running on the host (e.g. compute-1)

instances
c37e4e5d-3272-4590-a82e-8faf5794b2c4 | tenant2-image_ephemswap-126
8ddfe354-87d3-4028-91cc-4f1b28b84b6b | tenant2-image_root-124
16332260-3886-4e02-be62-0c0e61f9cdaa | tenant2-image_root_attachvol-125
7887c9e0-6f28-437e-83f4-9904055755f3 | tenant2-vol_ephemswap-123
a670f39a-2021-480e-991d-5b7820786b4f | tenant2-vol_root-122

2. The host (e.g. compute-1) is force locked

[2018-10-14 00:57:27,457] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://<address>:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-region-name RegionOne host-lock compute-1 --force'

3. The compute host (e.g. compute-1) is unlocked
[2018-10-14 01:02:26,604] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://<address>:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-region-name RegionOne host-unlock compute-1'

4. The hypervisor for compute-1 becomes enabled
2018-10-14T01:07:21.000 controller-0 fmManager: info { "event_log_id" : "275.001", "reason_text" : "Host compute-1 hypervisor is now unlocked-enabled", "entity_instance_id" : "host=compute-1.hypervisor=427e041b-f5d3-4c89-b3f6-1110e89ea3b1", "severity" : "critical", "state" : "msg", "timestamp" : "2018-10-14 01:07:21.726220" }

5. Instances are deleted after the hypervisor on compute-1 reaches the enabled state
[2018-10-14 01:07:36,031] 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://<address>:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-region-name RegionOne delete a670f39a-2021-480e-991d-5b7820786b4f 7887c9e0-6f28-437e-83f4-9904055755f3 8ddfe354-87d3-4028-91cc-4f1b28b84b6b 16332260-3886-4e02-be62-0c0e61f9cdaa c37e4e5d-3272-4590-a82e-8faf5794b2c4'

6. The three cinder volumes are deleted
[2018-10-14 01:07:56,502] 'cinder --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://<address>:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-region-name RegionOne delete 3c468cb1-bae0-4ed3-8a81-d3d21ea9ad3e b9756a8d-3b00-421e-bb6e-1688cebd1133 becadf06-7a8f-4cf1-87ca-0bdf03d781e3'
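
To make the timing of the steps above easier to read, here is a minimal Python sketch that computes the gap between consecutive steps. The timestamps are copied from the log excerpts in this report; the names `STEPS`, `parse_stamp` and `gaps` are hypothetical helpers, not part of any StarlingX tool. It also assumes the test-log clock and the controller event-log clock are in sync, which the report does not confirm.

```python
from datetime import datetime

# (timestamp, action) pairs copied from the log excerpts in the steps above.
STEPS = [
    ("2018-10-14 00:57:27,457", "host-lock compute-1 --force"),
    ("2018-10-14 01:02:26,604", "host-unlock compute-1"),
    ("2018-10-14 01:07:21.726220", "hypervisor unlocked-enabled"),
    ("2018-10-14 01:07:36,031", "nova delete (5 instances)"),
    ("2018-10-14 01:07:56,502", "cinder delete (3 volumes)"),
]


def parse_stamp(stamp):
    """Parse a timestamp whose fractional seconds follow ',' or '.'."""
    return datetime.strptime(stamp.replace(".", ","), "%Y-%m-%d %H:%M:%S,%f")


def gaps(steps):
    """Return (action, seconds elapsed since the previous step) pairs."""
    times = [parse_stamp(stamp) for stamp, _ in steps]
    return [
        (action, (cur - prev).total_seconds())
        for (_, action), prev, cur in zip(steps[1:], times, times[1:])
    ]


for action, seconds in gaps(STEPS):
    print(f"{seconds:8.1f} s  {action}")
```

Under these assumptions, the instance and volume deletions both fall within roughly half a minute of the hypervisor being reported as enabled.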

Expected Behavior
------------------
Successful force lock and evacuation of instances
Successful unlock on the compute followed by successful deletion of instances and cinder volumes

Actual Behavior
----------------
The test fails after the host is unlocked because a new critical CPU threshold alarm is raised and persists long afterward without clearing.

The alarm was raised at 01:07:58 and was still present at [2018-10-14 01:13:40,563]

07e07516-3788-4ad7-b2a7-ccaae3f6464b | 100.101 | Platform CPU threshold exceeded; 95%, actual 100% | host=compute-1 | critical | 2018-10-14T01:07:58.466584
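
As a rough illustration of how long the alarm had persisted at the last observation, the following Python sketch splits the alarm row above into fields and subtracts the raise time from the last-seen time. The column meanings (UUID, alarm ID, reason, entity, severity, timestamp) are inferred from the row itself, and `parse_alarm` and `LAST_SEEN` are hypothetical names introduced for this example.

```python
import re
from datetime import datetime

# Alarm row and last-observed timestamp copied from the report above.
ALARM_ROW = (
    "07e07516-3788-4ad7-b2a7-ccaae3f6464b | 100.101 | "
    "Platform CPU threshold exceeded; 95%, actual 100% | "
    "host=compute-1 | critical | 2018-10-14T01:07:58.466584"
)
LAST_SEEN = "2018-10-14 01:13:40,563"


def parse_alarm(row):
    """Split a pipe-separated alarm row; column meanings are assumed."""
    uuid, alarm_id, reason, entity, severity, raised = [
        cell.strip() for cell in row.split("|")
    ]
    # Pull the threshold and measured percentages out of the reason text.
    match = re.search(r"(\d+)%, actual (\d+)%", reason)
    return {
        "uuid": uuid,
        "alarm_id": alarm_id,
        "entity": entity,
        "severity": severity,
        "threshold_pct": int(match.group(1)),
        "actual_pct": int(match.group(2)),
        "raised": datetime.strptime(raised, "%Y-%m-%dT%H:%M:%S.%f"),
    }


alarm = parse_alarm(ALARM_ROW)
last_seen = datetime.strptime(LAST_SEEN, "%Y-%m-%d %H:%M:%S,%f")
persisted_seconds = (last_seen - alarm["raised"]).total_seconds()
print(f"{alarm['alarm_id']} on {alarm['entity']}: "
      f"{alarm['actual_pct']}% > {alarm['threshold_pct']}%, "
      f"still raised after {persisted_seconds:.0f} s")
```

The result is only a lower bound on the alarm's lifetime, since the report does not say when, or whether, the alarm eventually cleared.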

Reproducibility
---------------
Reproducible

System Configuration
--------------------
2 controllers, 3 computes

Branch/Pull Time/Commit
-----------------------
Master as of: 2018-10-12_20-18-00

Timestamp/Logs
--------------
see inline

Ghada Khalil (gkhalil) wrote :

Targeting stx.2019.03 -- issue reported on only one system using a build from stx master. The issue has not been reported on the stx.2018.10 release branch

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.03 stx.metal
Ghada Khalil (gkhalil) wrote :

Assigning to Tee to triage

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
Ghada Khalil (gkhalil) wrote :

Based on Tee's investigation, this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1794366

Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ghada Khalil (gkhalil) wrote :

The duplicate launchpad has been addressed by https://review.openstack.org/613668
Updating the status to match the duplicate bug

Changed in starlingx:
status: Triaged → Fix Released
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05