Brief Description
-----------------
Platform CPU threshold on compute-1 100% after unlock of the other compute node (compute-0)
Severity
--------
Major: The compute-1 remain in the degraded state after unlock
Steps to Reproduce
------------------
1. system host-lock compute-0 (member of quorum so 800.001 alarm raised)
2. system host-unlock compute-0
3. confirm 800.001 alarm clears and confirm other worker host does not get critical CPU alarm
Expected Behavior
------------------
The compute-0 unlock and 800.001 alarm clear without effecting CPU on other worker
Actual Behavior
----------------
critical CPU threshold alarm on her worker (compute-1)
100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00% host=compute-1 critical 2019-08-20T13:33:27
$ fm alarm-list; date
+----------+-----------------------------------------------------------------------+-----------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-----------------------------------------------------------------------+-----------------------+----------+----------------+
| 100.101 | Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00% | host=compute-1 | critical | 2019-08-20T17: |
| | | | | 33:27.367058 |
| | | | | |
| 100.114 | NTP address 2607:5300:60:33 is not a valid or a reachable NTP server. | host=controller-1.ntp | minor | 2019-08-20T14: |
| | | =2607:5300:60:33 | | 53:58.877494 |
| | | | | |
$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | enabled | available |
| 4 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | degraded |
+----+--------------+-------------+----------------+-------------+--------------+
Reproducibility
---------------
CPU critical threshold reached and did not clear when a different compute was locked/unlocked
System Configuration
--------------------
standard System
Branch/Pull Time/Commit
-----------------------
20190819T033000Z"
wcp 63-66
nova/test_cpu_thread.py::TestHTDisabled::test_boot_vm_cpu_thread_ht_disabled[2-require-None-CPUThreadErr.HT_HOST_UNAVAIL]
Last Pass
---------
Timestamp/Logs
--------------
fm-event.log
2019-08-20T16:26:06.000 controller-0 fmManager: info { "event_log_id" : "200.001", "reason_text" : "compute-0 was administratively locked to take it out-of-service.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-0", "severity" : "warning", "state" : "clear", "timestamp" : "2019-08-20 16:26:06.862116" }
2019-08-20T16:26:06.000 controller-0 fmManager: info { "event_log_id" : "200.021", "reason_text" : "compute-0 manual 'unlock' request", "entity_instance_id" : "host=compute-0.command=unlock", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-08-20 16:26:06.867437" }
2019-08-20T16:26:06.000 controller-0 fmManager: info { "event_log_id" : "750.004", "reason_text" : "Application Apply In Progress", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.k8s_application=stx-openstack", "severity" : "warning", "state" : "set", "timestamp" : "2019-08-20 16:26:06.767427" }
2019-08-20T16:26:11.000 controller-0 fmManager: info { "event_log_id" : "200.022", "reason_text" : "compute-0 is now 'offline'", "entity_instance_id" : "host=compute-0.status=offline", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-08-20 16:26:11.876915" }
2019-08-20T16:26:37.000 controller-0 fmManager: info { "event_log_id" : "750.004", "reason_text" : "Application Apply In Progress", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.k8s_application=stx-openstack", "severity" : "warning", "state" : "clear", "timestamp" : "2019-08-20 16:26:37.485687" }
2019-08-20T16:26:37.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-0 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=controller-0", "severity" : "major", "state" : "set", "timestamp" : "2019-08-20 16:26:37.414867" }
2019-08-20T16:26:37.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-1 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=controller-1", "severity" : "major", "state" : "set", "timestamp" : "2019-08-20 16:26:37.535612" }
2019-08-20T16:26:51.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-0 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=controller-0", "severity" : "major", "state" : "clear", "timestamp" : "2019-08-20 16:26:51.309396" }
2019-08-20T16:26:51.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-1 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=controller-1", "severity" : "major", "state" : "clear", "timestamp" : "2019-08-20 16:26:51.513353" }
2019-08-20T16:32:10.000 controller-0 fmManager: info { "event_log_id" : "200.022", "reason_text" : "compute-0 is now 'online'", "entity_instance_id" : "host=compute-0.status=online", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-08-20 16:32:10.442120" }
2019-08-20T16:32:43.000 controller-0 fmManager: info { "event_log_id" : "800.001", "reason_text" : "Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.cluster=b6d66f65-4a9b-4af2-83a8-5f83898abb47", "severity" : "warning", "state" : "clear", "timestamp" : "2019-08-20 16:32:43.254150" }
2019-08-20T16:33:06.000 controller-0 fmManager: info { "event_log_id" : "200.006", "reason_text" : "compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-0.process=pci-irq-affinity-agent", "severity" : "major", "state" : "set", "timestamp" : "2019-08-20 16:33:06.085508" }
2019-08-20T16:33:08.000 controller-0 fmManager: info { "event_log_id" : "200.022", "reason_text" : "compute-0 is now 'enabled'", "entity_instance_id" : "host=compute-0.state=enabled", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-08-20 16:33:08.286408" }
2019-08-20T16:33:12.000 controller-0 fmManager: info { "event_log_id" : "275.001", "reason_text" : "Host compute-0 hypervisor is now unlocked-disabled", "entity_instance_id" : "host=compute-0.hypervisor=79bcdc89-28a2-4925-86fa-9f1e9cb93510", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-20 16:33:11.910350" }
2019-08-20T16:34:46.000 controller-0 fmManager: info { "event_log_id" : "200.006", "reason_text" : "compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-0.process=pci-irq-affinity-agent", "severity" : "major", "state" : "clear", "timestamp" : "2019-08-20 16:34:46.721902" }
2019-08-20T16:35:23.000 controller-0 fmManager: info { "event_log_id" : "275.001", "reason_text" : "Host compute-0 hypervisor is now unlocked-enabled", "entity_instance_id" : "host=compute-0.hypervisor=79bcdc89-28a2-4925-86fa-9f1e9cb93510", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-20 16:35:23.061089" }
2019-08-20T16:35:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 99.85%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:35:27.367770" }
2019-08-20T16:37:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:37:27.366975" }
2019-08-20T16:39:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:39:27.367986" }
2019-08-20T16:41:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:41:27.368017" }
2019-08-20T16:43:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:43:27.367715" }
2019-08-20T16:47:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:47:27.367035" }
2019-08-20T16:49:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:49:27.367982" }
2019-08-20T16:57:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:57:27.368275" }
2019-08-20T16:59:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:59:27.367815" }
2019-08-20T17:01:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:01:27.367033" }
2019-08-20T17:03:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:03:27.369314" }
2019-08-20T17:05:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 99.99%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:05:27.374079" }
2019-08-20T17:07:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:07:27.379294" }
2019-08-20T17:11:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.02%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:11:27.368355" }
2019-08-20T17:13:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:13:27.374143" }
2019-08-20T17:15:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:15:27.383858" }
2019-08-20T17:17:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:17:27.367882" }
2019-08-20T17:23:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:23:27.367097" }
Test Activity
-------------
regression
(Note: This test is not running CPU load)
compute-0 has a 800.001 alarm ie. health warn after the compute-0 is locked (as the compute-0 is one of the monitors)
This should not cause critical CPU on the other compute.