Platform CPU threshold exceeded in compute after lock/unlock a different compute host (Storage System)

Bug #1839181 reported by Senthil Mukundakumar on 2019-08-06
This bug affects 2 people
Affects: StarlingX
Importance: High
Assigned to: ChenjieXu

Bug Description

Brief Description
-----------------
Platform CPU usage exceeded the 95% threshold during a compute node lock & unlock in a storage system. This has been reproduced multiple times in daily sanity.

Severity
--------
Critical: the compute remains in the degraded state after unlock.

Steps to Reproduce
------------------
1. system host-lock compute
2. system host-unlock compute

This alarm is reproduced only when a lock/unlock of the controller is executed prior to this test case.
PASS test_horizon_host_inventory_display
PASS test_lock_active_controller_reject
PASS test_lock_unlock_host[controller]
FAIL test_lock_unlock_host[compute]

Expected Behavior
------------------
The compute is expected to unlock and return to the available state, with no CPU threshold alarm raised.

Actual Behavior
----------------
| Platform CPU threshold exceeded; threshold 95.00%, actual 99.99%

Reproducibility
---------------
CPU threshold issue is reproduced multiple times in daily sanity. Also seen in load 201907290421.

System Configuration
--------------------
Storage System

Branch/Pull Time/Commit
-----------------------
20190804T233000Z

Last Pass
---------
20190728T013000Z

Timestamp/Logs
--------------
====================== Test Step 2: Lock compute host - compute-0 and ensure it is successfully locked
[2019-08-05 08:30:30,888] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0'
====================== Test Step 3: Unlock compute host - compute-0 and ensure it is successfully unlocked
[2019-08-05 08:31:15,904] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-0'

Test Activity
-------------
Sanity

Senthil Mukundakumar (smukunda) wrote :
description: updated
Yang Liu (yliu12) wrote :

Some characteristics of this issue:
- It has been seen 5/5 times since the 20190728T233000Z load.
- The host that was locked/unlocked was compute-0, but the alarm was raised against compute-1.
- The alarm seems to be a stale alarm: there was no VM hosted on compute-1, and the alarm stayed uncleared.
- The alarm on compute-1 eventually cleared after another lock/unlock of compute-0.

The title has been updated based on the above observations.

summary: - Platform CPU threshold exceeded in compute after unlock (Storage System)
+ Platform CPU threshold exceeded in compute after lock/unlock a different
+ compute host (Storage System)
Frank Miller (sensfan22) wrote :

Assigning to Cindy and requesting assistance to assign a prime to investigate this issue. Based on the above info, this appears to be related to storage systems, but that should be confirmed.

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Cindy Xie (xxie1)
tags: added: stx.config
tags: added: stx.2.0
tags: added: stx.storage
Numan Waheed (nwaheed) on 2019-08-07
tags: added: stx.retestneeded
Cindy Xie (xxie1) on 2019-08-07
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Lin Shuicheng (shuicheng)
Yang Liu (yliu12) wrote :

This issue has been seen consistently in dedicated storage sanity since this LP was opened.
It has never been seen on a standard system.

Cindy Xie (xxie1) on 2019-08-14
Changed in starlingx:
importance: Medium → High
Lin Shuicheng (shuicheng) wrote :

I will try to reproduce the issue with a 2+2+2 VM setup. Does it occur on bare metal only?

Lin Shuicheng (shuicheng) wrote :

Hi Yang,
What does the sentence below mean?
"This alarm is reproduced only when a lock/unlock of the controller is executed prior to this test case."

I cannot reproduce the issue in my 2+2+2 VM environment. I have tried locking/unlocking the compute, storage, and controller nodes.

From the attached log, apart from the dpdk process, no other process consumed too much CPU according to the /var/extra/process.info log. So I guess that when the alarm occurs it is just a spike/jitter that will be cleared later. Is that right?
Could you help log in to the compute node and check which process consumes the CPU when the issue occurs?
Thanks.

Yang Liu (yliu12) wrote :

Hi Shuicheng, we are no longer seeing this issue in recent sanity runs.

Lin Shuicheng (shuicheng) wrote :

Let's mark it as Incomplete first; we may close it later if it doesn't occur again.
Thanks for the status update.

Changed in starlingx:
status: Triaged → Incomplete
Lin Shuicheng (shuicheng) wrote :

It shares the same logs as https://bugs.launchpad.net/starlingx/+bug/1840831
It seems the ovs/dpdk thread is in an abnormal state and consumed most of the CPU on processor 0, which is the platform CPU.
Here is the log from process.info on compute-1, where we can see that the ovs-vswitchd main thread (not a polling thread), running on processor 0, consumed 25 minutes of CPU time. pmd76/pmd77 are the dpdk polling threads, which run on processors 1/2.
--------------------------------------------------------------------
Mon Aug 5 09:18:35 UTC 2019 : : ps -eL -o pid,lwp,ppid,state,class,nice,rtprio,priority,psr,stime,etime,time,wchan:16,tty,comm,command
--------------------------------------------------------------------
    PID LWP PPID S CLS NI RTPRIO PRI PSR STIME ELAPSED TIME WCHAN TT COMMAND COMMAND
  43819 43819 1 R TS -10 - 10 0 07:30 01:47:36 00:25:38 - ? ovs-vswitchd ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  43819 44289 1 R TS -10 - 10 1 07:31 01:47:28 01:47:27 - ? pmd76 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  43819 44293 1 R TS -10 - 10 2 07:31 01:47:28 01:47:27 - ? pmd77 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  71853 71853 71834 S TS 0 - 20 0 07:52 01:25:43 00:01:32 ep_poll ? /var/lib/openst /var/lib/openstack/bin/python /var/lib/openstack/bin/neutron-dhcp-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-file /etc/neutron/metadata_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini

And when the alarm occurs, ovs-vswitchd.log on compute-1 also shows ovs/dpdk in an error state:
"
2019-08-05T08:40:07.513Z|00901|netdev_linux|WARN|error receiving Ethernet packet on tap6c26786b-f7: File descriptor in bad state
2019-08-05T08:40:07.514Z|00902|connmgr|INFO|br-int<->unix#485: 8 flow_mods in the last 0 s (8 deletes)
2019-08-05T08:40:09.218Z|00903|connmgr|INFO|br-int<->unix#488: 2 flow_mods in the last 0 s (2 adds)
2019-08-05T08:40:11.828Z|00904|poll_loop|INFO|Dropped 47 log messages in last 4144 seconds (most recently, 4142 seconds ago) due to excessive rate
2019-08-05T08:40:11.828Z|00905|poll_loop|INFO|wakeup due to [POLLERR] on fd 171 (character device /dev/net/tun) at lib/netdev-linux.c:1346 (71% CPU usage)
"

Lin Shuicheng (shuicheng) wrote :

There is similar issue reported in OVS community.
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-May/048608.html
https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264
Both are reported against ovs 2.10, while StarlingX uses ovs 2.11.
So it seems 2.11 still has a similar issue.

ChenjieXu (midone) wrote :

Hi Senthil/Yang,

Before the lock/unlock of compute-0, were any VMs running on compute-0? Was any network configured on compute-0? Was SR-IOV configured?

Hi Shuicheng,
nova-compute seems not to be working; could you please help check the nova-compute logs for compute-0 and compute-1?
{"log":"2019-08-05 08:39:46.118 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:02.0: PciDeviceNotFoundById: PCI device 0000:09:02.0 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.118800276Z"}
{"log":"2019-08-05 08:39:46.118 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:02.0: PciDeviceNotFoundById: PCI device 0000:09:02.0 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.120392857Z"}
{"log":"2019-08-05 08:39:46.136 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.4: PciDeviceNotFoundById: PCI device 0000:09:01.4 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.136395168Z"}
{"log":"2019-08-05 08:39:46.136 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.4: PciDeviceNotFoundById: PCI device 0000:09:01.4 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.138388079Z"}
{"log":"2019-08-05 08:39:46.146 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.5: PciDeviceNotFoundById: PCI device 0000:09:01.5 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.146464898Z"}
{"log":"2019-08-05 08:39:46.146 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.5: PciDeviceNotFoundById: PCI device 0000:09:01.5 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.147416674Z"}
{"log":"2019-08-05 08:39:46.165 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:04.6: PciDeviceNotFoundById: PCI device 0000:09:04.6 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.1656974Z"}

{"log":"2019-08-05 09:05:47.648 51452 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer\n","stream":"stdout","time":"2019-08-05T09:05:47.649210486Z"}
{"log":"2019-08-05 09:06:14.357 51452 ERROR oslo.messaging._drivers.impl_rabbit [req-c4cf8b01-06dc-4ac2-a4e0-eb6ef853a971 - - - - -] [e90a3df3-0a0b-4910-84de-ef1c6ebe8fa9] AMQP server on rabbitmq.openstack.svc.cluster.local:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: IOError: Server unexpectedly closed connection\n","stream":"stdout","time":"2019-08-05T09:06:14.357618236Z"}

ChenjieXu (midone) wrote :

Hi Senthil/Yang,

Could you please help check the status of the files (filenames starting with "tap" or "qr") in the directories below:
   /sys/devices/virtual/net/
   /sys/class/net/

The following log from ovs-vswitchd on compute-1 shows those files in a bad state, which may be what puts ovs-vswitchd into a bad state:
2019-08-05T08:40:03.921Z|00881|dpif_netdev|ERR|error receiving data from tap588a7711-e2: File descriptor in bad state
2019-08-05T08:40:04.006Z|00882|dpif_netdev|ERR|error receiving data from tap588a7711-e2: File descriptor in bad state
2019-08-05T08:40:04.020Z|00888|dpif_netdev|ERR|error receiving data from tap588a7711-e2: File descriptor in bad state
2019-08-05T08:40:04.167Z|00889|dpif_netdev|ERR|error receiving data from tap6c26786b-f7: File descriptor in bad state
2019-08-05T08:40:04.167Z|00890|dpif_netdev|ERR|error receiving data from tap588a7711-e2: File descriptor in bad state
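The requested sysfs check could be scripted roughly as follows. This is my own sketch, not from the report; the `SYSFS_NET` override is an added assumption so the loop can also be pointed at a test directory, the real paths being the two sysfs directories above:

```shell
# Print the operstate of every tap*/qr* interface under the sysfs net tree.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"
check_taps() {
    for dev in "$SYSFS_NET"/tap* "$SYSFS_NET"/qr*; do
        [ -e "$dev" ] || continue            # skip unexpanded globs
        state=$(cat "$dev/operstate" 2>/dev/null || echo "missing")
        printf '%s: %s\n' "$(basename "$dev")" "$state"
    done
}
check_taps
```

An interface reporting something other than "up"/"unknown" (or a missing operstate file) would corroborate the "File descriptor in bad state" errors above.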

Lin Shuicheng (shuicheng) wrote :

Hi Chenjie,
About the warning message in nova-compute: it is related to the QAT VF devices, and the log shows up after the nova-compute container is running on both compute-0 and compute-1.
It appears to be a 1-minute periodic task: there are 64 log lines per minute (2 duplicated lines for each QAT VF device, and there are 32 QAT VF devices).
It is likely because the QAT VF devices are created by the QAT driver but are not configured/used by QEMU/libvirt.
It should be just a warning message; similar logs can be found on a normal system as well.
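The per-minute arithmetic (2 duplicated lines per VF, 32 VFs) can be checked mechanically. A sketch against a small inline sample; on a real system the input would be the nova-compute log itself rather than this hypothetical snippet:

```shell
# Count the distinct VF PCI addresses behind the duplicated warnings.
sample='No net device was found for VF 0000:09:02.0
No net device was found for VF 0000:09:02.0
No net device was found for VF 0000:09:01.4
No net device was found for VF 0000:09:01.4'
# Extract the "VF <pci-address>" tokens, deduplicate, and count them.
echo "$sample" | grep -o 'VF [0-9a-f:.]*' | sort -u | wc -l
```

If the full log yields 32 distinct addresses and 64 warning lines per minute, that matches the periodic-task explanation above.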

ChenjieXu (midone) wrote :

Hi Shuicheng,

Thank you for your explanation!

ChenjieXu (midone) on 2019-09-04
Changed in starlingx:
assignee: Lin Shuicheng (shuicheng) → ChenjieXu (midone)
Cindy Xie (xxie1) on 2019-09-04
tags: added: stx.networking
removed: stx.storage
Ghada Khalil (gkhalil) wrote :

@Senthil, what is the NIC type used for data interfaces in the lab reporting this issue? Is it a Fortville XL710?

Yang Liu (yliu12) wrote :

It's X710.

Ghada Khalil (gkhalil) wrote :

Marking as Incomplete. The developer was not able to reproduce.
We need the reporter to reproduce the issue and collect the data requested above.

I think the stx.retestneeded tag can be removed.
