Platform CPU threshold exceeded in compute-1 after lock/unlock member of quorum compute-0

Bug #1840831 reported by Wendy Mitchell
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Lin Shuicheng

Bug Description

Brief Description
-----------------
Platform CPU threshold exceeded (100%) on compute-1 after unlock of the other compute node (compute-0)

Severity
--------
Major: compute-1 remains in the degraded state after the unlock

Steps to Reproduce
------------------
1. system host-lock compute-0 (member of quorum so 800.001 alarm raised)
2. system host-unlock compute-0
3. confirm 800.001 alarm clears and confirm other worker host does not get critical CPU alarm

Expected Behavior
------------------
compute-0 unlocks and the 800.001 alarm clears without affecting CPU on the other worker

Actual Behavior
----------------
Critical CPU threshold alarm raised on the other worker (compute-1):
100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00% host=compute-1 critical 2019-08-20T13:33:27

$ fm alarm-list; date
+----------+-----------------------------------------------------------------------+-----------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-----------------------------------------------------------------------+-----------------------+----------+----------------+
| 100.101 | Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00% | host=compute-1 | critical | 2019-08-20T17: |
| | | | | 33:27.367058 |
| | | | | |
| 100.114 | NTP address 2607:5300:60:33 is not a valid or a reachable NTP server. | host=controller-1.ntp | minor | 2019-08-20T14: |
| | | =2607:5300:60:33 | | 53:58.877494 |
| | | | | |

$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | enabled | available |
| 4 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | degraded |
+----+--------------+-------------+----------------+-------------+--------------+

Reproducibility
---------------
The critical CPU threshold was reached and the alarm did not clear when a different compute was locked/unlocked

System Configuration
--------------------
Standard system

Branch/Pull Time/Commit
-----------------------
20190819T033000Z
wcp 63-66
nova/test_cpu_thread.py::TestHTDisabled::test_boot_vm_cpu_thread_ht_disabled[2-require-None-CPUThreadErr.HT_HOST_UNAVAIL]

Last Pass
---------

Timestamp/Logs
--------------

fm-event.log
2019-08-20T16:26:06.000 controller-0 fmManager: info { "event_log_id" : "200.001", "reason_text" : "compute-0 was administratively locked to take it out-of-service.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-0", "severity" : "warning", "state" : "clear", "timestamp" : "2019-08-20 16:26:06.862116" }
2019-08-20T16:26:06.000 controller-0 fmManager: info { "event_log_id" : "200.021", "reason_text" : "compute-0 manual 'unlock' request", "entity_instance_id" : "host=compute-0.command=unlock", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-08-20 16:26:06.867437" }
2019-08-20T16:26:06.000 controller-0 fmManager: info { "event_log_id" : "750.004", "reason_text" : "Application Apply In Progress", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.k8s_application=stx-openstack", "severity" : "warning", "state" : "set", "timestamp" : "2019-08-20 16:26:06.767427" }
2019-08-20T16:26:11.000 controller-0 fmManager: info { "event_log_id" : "200.022", "reason_text" : "compute-0 is now 'offline'", "entity_instance_id" : "host=compute-0.status=offline", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-08-20 16:26:11.876915" }
2019-08-20T16:26:37.000 controller-0 fmManager: info { "event_log_id" : "750.004", "reason_text" : "Application Apply In Progress", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.k8s_application=stx-openstack", "severity" : "warning", "state" : "clear", "timestamp" : "2019-08-20 16:26:37.485687" }
2019-08-20T16:26:37.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-0 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=controller-0", "severity" : "major", "state" : "set", "timestamp" : "2019-08-20 16:26:37.414867" }
2019-08-20T16:26:37.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-1 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=controller-1", "severity" : "major", "state" : "set", "timestamp" : "2019-08-20 16:26:37.535612" }
2019-08-20T16:26:51.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-0 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=controller-0", "severity" : "major", "state" : "clear", "timestamp" : "2019-08-20 16:26:51.309396" }
2019-08-20T16:26:51.000 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-1 Configuration is out-of-date.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=controller-1", "severity" : "major", "state" : "clear", "timestamp" : "2019-08-20 16:26:51.513353" }
2019-08-20T16:32:10.000 controller-0 fmManager: info { "event_log_id" : "200.022", "reason_text" : "compute-0 is now 'online'", "entity_instance_id" : "host=compute-0.status=online", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-08-20 16:32:10.442120" }
2019-08-20T16:32:43.000 controller-0 fmManager: info { "event_log_id" : "800.001", "reason_text" : "Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.cluster=b6d66f65-4a9b-4af2-83a8-5f83898abb47", "severity" : "warning", "state" : "clear", "timestamp" : "2019-08-20 16:32:43.254150" }
2019-08-20T16:33:06.000 controller-0 fmManager: info { "event_log_id" : "200.006", "reason_text" : "compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-0.process=pci-irq-affinity-agent", "severity" : "major", "state" : "set", "timestamp" : "2019-08-20 16:33:06.085508" }
2019-08-20T16:33:08.000 controller-0 fmManager: info { "event_log_id" : "200.022", "reason_text" : "compute-0 is now 'enabled'", "entity_instance_id" : "host=compute-0.state=enabled", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-08-20 16:33:08.286408" }
2019-08-20T16:33:12.000 controller-0 fmManager: info { "event_log_id" : "275.001", "reason_text" : "Host compute-0 hypervisor is now unlocked-disabled", "entity_instance_id" : "host=compute-0.hypervisor=79bcdc89-28a2-4925-86fa-9f1e9cb93510", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-20 16:33:11.910350" }
2019-08-20T16:34:46.000 controller-0 fmManager: info { "event_log_id" : "200.006", "reason_text" : "compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-0.process=pci-irq-affinity-agent", "severity" : "major", "state" : "clear", "timestamp" : "2019-08-20 16:34:46.721902" }
2019-08-20T16:35:23.000 controller-0 fmManager: info { "event_log_id" : "275.001", "reason_text" : "Host compute-0 hypervisor is now unlocked-enabled", "entity_instance_id" : "host=compute-0.hypervisor=79bcdc89-28a2-4925-86fa-9f1e9cb93510", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-20 16:35:23.061089" }
2019-08-20T16:35:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 99.85%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:35:27.367770" }
2019-08-20T16:37:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:37:27.366975" }
2019-08-20T16:39:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:39:27.367986" }
2019-08-20T16:41:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:41:27.368017" }
2019-08-20T16:43:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:43:27.367715" }
2019-08-20T16:47:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:47:27.367035" }
2019-08-20T16:49:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:49:27.367982" }
2019-08-20T16:57:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:57:27.368275" }
2019-08-20T16:59:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 16:59:27.367815" }
2019-08-20T17:01:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:01:27.367033" }
2019-08-20T17:03:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:03:27.369314" }
2019-08-20T17:05:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 99.99%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:05:27.374079" }
2019-08-20T17:07:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:07:27.379294" }
2019-08-20T17:11:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.02%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:11:27.368355" }
2019-08-20T17:13:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:13:27.374143" }
2019-08-20T17:15:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:15:27.383858" }
2019-08-20T17:17:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:17:27.367882" }
2019-08-20T17:23:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 100.01%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-20 17:23:27.367097" }
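For triage, the stream of 100.101 events above can be reduced programmatically. A minimal sketch (assuming only the fmManager line format shown in this log, with the JSON payload starting at the first "{"):

```python
import json

def unresolved_alarms(lines, event_id="100.101"):
    """Return alarms that were 'set' but never cleared, keyed by entity."""
    open_alarms = {}
    for line in lines:
        brace = line.find("{")
        if brace < 0:
            continue  # not an fmManager event line
        event = json.loads(line[brace:])
        if event.get("event_log_id") != event_id:
            continue
        key = event["entity_instance_id"]
        if event["state"] == "set":
            open_alarms[key] = event["timestamp"]  # keep latest set time
        elif event["state"] == "clear":
            open_alarms.pop(key, None)
    return open_alarms
```

Run against the fm-event.log excerpt above, this leaves the compute-1 entry outstanding, matching the degraded state shown in `system host-list`.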

Test Activity
-------------
regression

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

(Note: this test is not running CPU load.)
compute-0 has an 800.001 alarm (i.e. health warn) after compute-0 is locked, as compute-0 is one of the monitors.
This should not cause critical CPU on the other compute.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - requires further investigation.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.3.0 stx.config
Changed in starlingx:
status: New → Triaged
assignee: nobody → Cindy Xie (xxie1)
Revision history for this message
Cindy Xie (xxie1) wrote :

Is this similar to LP https://bugs.launchpad.net/starlingx/+bug/1839181? Assigning Shuicheng to compare the logs.

Changed in starlingx:
assignee: Cindy Xie (xxie1) → Lin Shuicheng (shuicheng)
Revision history for this message
Lin Shuicheng (shuicheng) wrote :

From controller-0_20190820.165053/var/log/fm-event.log, we can see compute-1 had CPU spikes during 15:33:27.371002 -- 15:35:27.564281 and 16:35:27.367770 -- 16:49:27.367982.
Log messages look like the one below:
2019-08-20T15:33:27.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 90.00%, actual 92.48%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-33_36.host=compute-1", "severity" : "major", "state" : "set", "timestamp" : "2019-08-20 15:33:27.371002" }

From compute-1_20190820.165053/var/extra/process.info, we can see the ovs-vswitchd main process (running on processor 0, which should be the platform CPU) consumed 12 minutes of CPU time, which is most of processor 0's time.
The DPDK polling threads run on processors 1 and 2, which should be dedicated to vSwitch.

cat compute-1_20190820.165053/var/extra/process.info | grep -v "00:00" | less
--------------------------------------------------------------------
Tue Aug 20 16:53:52 UTC 2019 : : ps -eL -o pid,lwp,ppid,state,class,nice,rtprio,priority,psr,stime,etime,time,wchan:16,tty,comm,command
--------------------------------------------------------------------
    PID LWP PPID S CLS NI RTPRIO PRI PSR STIME ELAPSED TIME WCHAN TT COMMAND COMMAND
  43223 43223 1 R TS -10 - 10 0 14:59 01:54:34 00:12:53 - ? ovs-vswitchd ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  43223 43686 1 R TS -10 - 10 1 14:59 01:54:27 01:54:25 - ? pmd64 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  43223 43694 1 R TS -10 - 10 2 14:59 01:54:27 01:54:25 - ? pmd65 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  87706 87706 87689 S TS 0 - 20 0 15:33 01:20:28 00:01:31 ep_poll ? /var/lib/openst /var/lib/openstack/bin/python /var/lib/openstack/bin/neutron-dhcp-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-file /etc/neutron/metadata_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini
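The affinity analysis above can be reproduced with a short filter over the collected process.info (a sketch; the column positions match the `ps -eL` output format shown, and the path is the tarball location from this report):

```shell
# List non-idle threads pinned to processor 0 (the platform CPU).
# In this ps layout: PSR is field 9, cumulative TIME field 12, COMM field 15.
awk '$1 ~ /^[0-9]+$/ && $9 == 0 && $12 != "00:00:00" { print $12, $15 }' \
    compute-1_20190820.165053/var/extra/process.info
```

This surfaces ovs-vswitchd at the top, consistent with the 12 minutes of CPU time noted above.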

And in compute-1_20190820.165053/var/log/openvswitch/ovs-vswitchd.log, there are the following error messages during the CPU spike:
2019-08-20T16:35:08.662Z|01058|dpif_netdev|ERR|error receiving data from tap95772ad7-94: File descriptor in bad state
2019-08-20T16:35:08.850Z|01059|dpif_netdev|ERR|error receiving data from tap95772ad7-94: File descriptor in bad state
2019-08-20T16:35:09.143Z|01060|connmgr|INFO|br-int<->unix#566: 6 flow_mods in the ...
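To gauge how this error rate lines up with the alarm window, the messages can be bucketed per minute (a sketch over the same collected log path):

```shell
# Count "File descriptor in bad state" errors per minute in the vswitchd log;
# the first 16 characters of each line are the timestamp down to the minute.
grep 'File descriptor in bad state' \
    compute-1_20190820.165053/var/log/openvswitch/ovs-vswitchd.log \
  | cut -c1-16 | sort | uniq -c
```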


Revision history for this message
Lin Shuicheng (shuicheng) wrote :

It should be the same issue as https://bugs.launchpad.net/starlingx/+bug/1839181, as the logs are similar.
Will mark this as a duplicate later.

Suspect it is related to OVS/DPDK; already asked the Networking team for help.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Invalid to match the duplicate LP

Changed in starlingx:
status: Triaged → Invalid
Revision history for this message
Ghada Khalil (gkhalil) wrote :

LP was closed as invalid a long time ago; removing the stx.retestneeded tag

tags: removed: stx.retestneeded