Bug #1839181 “Platform CPU threshold exceeded in compute after l...” : Bugs : StarlingX

Revision history for this message

Senthil Mukundakumar (smukunda) wrote on 2019-08-06:

#1

CPU_thershold_exceeded_storage_system.tar Edit (86.5 MiB, application/x-tar)

Senthil Mukundakumar (smukunda) on 2019-08-06

description:

updated

Revision history for this message

Yang Liu (yliu12) wrote on 2019-08-07:

#2

Some characters of this issue:
- It is seen 5/5 times since 20190728T233000Z load.
- The host got lock/unlocked was compute-0, but the alarm was against compute-1
- The alarm seems to a stale alarm, there was no vm hosted on compute-1 and the alarm stays uncleared
- The alarm on compute-1 eventually got cleared after another lock/unlock of compute-0

Title is updated based on above observations.

summary:

- Platform CPU threshold exceeded in compute after unlock (Storage System)
+ Platform CPU threshold exceeded in compute after lock/unlock a different
+ compute host (Storage System)

Revision history for this message

Frank Miller (sensfan22) wrote on 2019-08-07:

#3

Assigning to Cindy and requesting assistance to assign to a prime to investigate this issue. Based on the above info this appears to be related to a storage system but that should be confirmed.

Changed in starlingx:
status:	New → Triaged
importance:	Undecided → Medium
assignee:	nobody → Cindy Xie (xxie1)
tags:	added: stx.config
tags:	added: stx.2.0
tags:	added: stx.storage

Numan Waheed (nwaheed) on 2019-08-07

tags:

added: stx.retestneeded

Cindy Xie (xxie1) on 2019-08-07

Changed in starlingx:
assignee:	Cindy Xie (xxie1) → Lin Shuicheng (shuicheng)

Revision history for this message

Yang Liu (yliu12) wrote on 2019-08-09:

#4

This issue is seen consistently in dedicated storage sanity since the LP opens.
And it is never seen on a standard system.

Cindy Xie (xxie1) on 2019-08-14

Changed in starlingx:
importance:	Medium → High

Revision history for this message

Lin Shuicheng (shuicheng) wrote on 2019-08-19:

#5

Try to reproduce the issue with 2+2+2 VM. Does it occur with bare metal only?

Revision history for this message

Lin Shuicheng (shuicheng) wrote on 2019-08-20:

#6

Hi Yang,
What does below sentence mean?
"This alarm is reproduced only when lock/unlock on controller executed prior to this test case."

I cannot reproduce the issue with my 2+2+2 VM environment. I try to lock/unlock compute/storage/controller node.

From attached log, except dpdk process, it seems there is no other process consumed too much CPU per /var/extra/process.info log. So I guess when the alarm occur, it is just a spike/jitter, and will be cleared later. Is it right?
Could you help login to the compute node to check which process consume the CPU when issue occur?
Thanks.

Revision history for this message

Yang Liu (yliu12) wrote on 2019-08-20:

#7

Hi Shuicheng, we are no longer seeing this issue in recent sanity runs.

Revision history for this message

Lin Shuicheng (shuicheng) wrote on 2019-08-21:

#8

So let's mark it as in-complete first. And may close it later if it doesn't occur again.
Thanks for the status update.

Changed in starlingx:
status:	Triaged → Incomplete

Revision history for this message

Lin Shuicheng (shuicheng) wrote on 2019-08-30:

#9

It share the same log as https://bugs.launchpad.net/starlingx/+bug/1840831
It seems ovs/dpdk thread is in abnormal status and consumed most CPU of processor 0, which is the platform CPU.
Here is the log from process.info of compute-1, which we could find ovs-vswitchd main thread (not polling thread) run in processor 0 consumed 25 min CPU time. pmd76/pmd77 are the dpdk polling thread which run in processor 1/2.
--------------------------------------------------------------------
Mon Aug 5 09:18:35 UTC 2019 : : ps -eL -o pid,lwp,ppid,state,class,nice,rtprio,priority,psr,stime,etime,time,wchan:16,tty,comm,command
--------------------------------------------------------------------
    PID LWP PPID S CLS NI RTPRIO PRI PSR STIME ELAPSED TIME WCHAN TT COMMAND COMMAND
  43819 43819 1 R TS -10 - 10 0 07:30 01:47:36 00:25:38 - ? ovs-vswitchd ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  43819 44289 1 R TS -10 - 10 1 07:31 01:47:28 01:47:27 - ? pmd76 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  43819 44293 1 R TS -10 - 10 2 07:31 01:47:28 01:47:27 - ? pmd77 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  71853 71853 71834 S TS 0 - 20 0 07:52 01:25:43 00:01:32 ep_poll ? /var/lib/openst /var/lib/openstack/bin/python /var/lib/openstack/bin/neutron-dhcp-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-file /etc/neutron/metadata_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini

It share the same log as https://bugs.launchpad.net/starlingx/+bug/1840831
It seems ovs/dpdk thread is in abnormal status and consumed most CPU of processor 0, which is the platform CPU.
Here is the log from process.info of compute-1, which we could find ovs-vswitchd main thread (not polling thread) run in processor 0 consumed 25 min CPU time. pmd76/pmd77 are the dpdk polling thread which run in processor 1/2.
--------------------------------------------------------------------
Mon Aug  5 09:18:35 UTC 2019 :  : ps -eL -o pid,lwp,ppid,state,class,nice,rtprio,priority,psr,stime,etime,time,wchan:16,tty,comm,command
--------------------------------------------------------------------
    PID     LWP    PPID S CLS  NI RTPRIO PRI PSR STIME     ELAPSED     TIME WCHAN            TT       COMMAND         COMMAND
  43819   43819       1 R TS  -10      -  10   0 07:30    01:47:36 00:25:38 -                ?        ovs-vswitchd    ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  43819   44289       1 R TS  -10      -  10   1 07:31    01:47:28 01:47:27 -                ?        pmd76           ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  43819   44293       1 R TS  -10      -  10   2 07:31    01:47:28 01:47:27 -                ?        pmd77           ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
  71853   71853   71834 S TS    0      -  20   0 07:52    01:25:43 00:01:32 ep_poll          ?        /var/lib/openst /var/lib/openstack/bin/python /var/lib/openstack/bin/neutron-dhcp-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-file /etc/neutron/metadata_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini

Revision history for this message

Lin Shuicheng (shuicheng) wrote on 2019-08-30:

#10

There is similar issue reported in OVS community.
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-May/048608.html
https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1827264
Both are reported with ovs 2.10. While StarlingX use ovs 2.11
So it seems 2.11 still have similar issue.

Revision history for this message

ChenjieXu (midone) wrote on 2019-09-02:

#11

Hi Senthil/Yang,

Before lock/unlock compute-0, is there any VM running on compute-0? Is there any network configured on compute-0? Is there SR-IOV configured?

Hi Shuicheng,
The nova-compute seems not working, could you please help check the nova-compute logs for compute-0 and compute-1?
{"log":"2019-08-05 08:39:46.118 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:02.0: PciDeviceNotFoundById: PCI device 0000:09:02.0 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.118800276Z"}
{"log":"2019-08-05 08:39:46.118 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:02.0: PciDeviceNotFoundById: PCI device 0000:09:02.0 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.120392857Z"}
{"log":"2019-08-05 08:39:46.136 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.4: PciDeviceNotFoundById: PCI device 0000:09:01.4 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.136395168Z"}
{"log":"2019-08-05 08:39:46.136 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.4: PciDeviceNotFoundById: PCI device 0000:09:01.4 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.138388079Z"}
{"log":"2019-08-05 08:39:46.146 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.5: PciDeviceNotFoundById: PCI device 0000:09:01.5 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.146464898Z"}
{"log":"2019-08-05 08:39:46.146 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.5: PciDeviceNotFoundById: PCI device 0000:09:01.5 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.147416674Z"}
{"log":"2019-08-05 08:39:46.165 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:04.6: PciDeviceNotFoundById: PCI device 0000:09:04.6 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.1656974Z"}

{"log":"2019-08-05 09:05:47.648 51452 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer\n","stream":"stdout","time":"2019-08-05T09:05:47.649210486Z"}
{"log":"2019-08-05 09:06:14.357 51452 ERROR oslo.messaging._drivers.impl_rabbit [req-c4cf8b01-06dc-4ac2-a4e0-eb6ef853a971 - - - - -] [e90a3df3-0a0b-4910-84de-ef1c6ebe8fa9] AMQP server on rabbitmq.openstack.svc.cluster.local:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: IOError: Server unexpectedly closed connection\n","stream":"stdout","time":"2019-08-05T09:06:14.357618236Z"}

Hi Senthil/Yang,

Before lock/unlock compute-0, is there any VM running on compute-0? Is there any network configured on compute-0? Is there SR-IOV configured?

Hi Shuicheng,
The nova-compute seems not working, could you please help check the nova-compute logs for compute-0 and compute-1?
{"log":"2019-08-05 08:39:46.118 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:02.0: PciDeviceNotFoundById: PCI device 0000:09:02.0 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.118800276Z"}
{"log":"2019-08-05 08:39:46.118 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:02.0: PciDeviceNotFoundById: PCI device 0000:09:02.0 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.120392857Z"}
{"log":"2019-08-05 08:39:46.136 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.4: PciDeviceNotFoundById: PCI device 0000:09:01.4 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.136395168Z"}
{"log":"2019-08-05 08:39:46.136 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.4: PciDeviceNotFoundById: PCI device 0000:09:01.4 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.138388079Z"}
{"log":"2019-08-05 08:39:46.146 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.5: PciDeviceNotFoundById: PCI device 0000:09:01.5 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.146464898Z"}
{"log":"2019-08-05 08:39:46.146 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:01.5: PciDeviceNotFoundById: PCI device 0000:09:01.5 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.147416674Z"}
{"log":"2019-08-05 08:39:46.165 51452 WARNING nova.pci.utils [req-1e61dcf6-2bc5-48d5-945a-b88d5d81f561 - - - - -] No net device was found for VF 0000:09:04.6: PciDeviceNotFoundById: PCI device 0000:09:04.6 not found\n","stream":"stdout","time":"2019-08-05T08:39:46.1656974Z"}

{"log":"2019-08-05 09:05:47.648 51452 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer\n","stream":"stdout","time":"2019-08-05T09:05:47.649210486Z"}
{"log":"2019-08-05 09:06:14.357 51452 ERROR oslo.messaging._drivers.impl_rabbit [req-c4cf8b01-06dc-4ac2-a4e0-eb6ef853a971 - - - - -] [e90a3df3-0a0b-4910-84de-ef1c6ebe8fa9] AMQP server on rabbitmq.openstack.svc.cluster.local:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: IOError: Server unexpectedly closed connection\n","stream":"stdout","time":"2019-08-05T09:06:14.357618236Z"}

Revision history for this message

ChenjieXu (midone) wrote on 2019-09-02:

#12

Hi Senthil/Yang,

Could you please help check the file (filename starts with "tap" or "qr") status in the below directory:
/sys/devices/virtual/net/
/sys/class/net/

The following log in ovs-vswitchd in compute-1 shows those file are in wrong status which may cause ovs-vswitchd wrong:
2019-08-05T08:40:03.921Z|00881|dpif_netdev|ERR|error receiving data from tap588a7711-e2: File descriptor in bad state
2019-08-05T08:40:04.006Z|00882|dpif_netdev|ERR|error receiving data from tap588a7711-e2: File descriptor in bad state
2019-08-05T08:40:04.020Z|00888|dpif_netdev|ERR|error receiving data from tap588a7711-e2: File descriptor in bad state
2019-08-05T08:40:04.167Z|00889|dpif_netdev|ERR|error receiving data from tap6c26786b-f7: File descriptor in bad state
2019-08-05T08:40:04.167Z|00890|dpif_netdev|ERR|error receiving data from tap588a7711-e2: File descriptor in bad state

Revision history for this message

Lin Shuicheng (shuicheng) wrote on 2019-09-03:

#13

Hi Chenjie,
About the warning message in nova-compute, it is relate to QAT VF device. And the log shows up after nova compute container is running in both compute-0 and compute-1.
It seems it is a 1 minute periodic task. There are 64 line log (2 duplicated line for each QAT VF device, and there are 32 QAT VF device) for each minute.
It should be due to QAT VF device is created by QAT driver, but is not configured/used by QEMU/libvirt.
It should be just a warning message. We could find similar log in normal system also.

Revision history for this message

ChenjieXu (midone) wrote on 2019-09-04:

#14

Hi Shuicheng,

Thank you for your explanation!

ChenjieXu (midone) on 2019-09-04

Changed in starlingx:
assignee:	Lin Shuicheng (shuicheng) → ChenjieXu (midone)

Cindy Xie (xxie1) on 2019-09-04

tags:

added: stx.networking
removed: stx.storage

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-09-05:

#15

@Senthil, what is the NIC type used for data interfaces in the lab reporting this issue? Is it a Fortville XL710?

Revision history for this message

Yang Liu (yliu12) wrote on 2019-09-05:

#16

It's X710.

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-10-17:

#17

Marking as Incomplete. Developer not able to reproduce.
Need the reporter to reproduce and collect the data requested above.

Revision history for this message

Chris Winnicki (chriswinnicki) wrote on 2019-10-17:

#18

I think that stx.retestneeded can be removed.

Yang Liu (yliu12) on 2019-10-31

tags:

removed: stx.retestneeded

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-11-14:

#19

Confirmed with Yang Liu that this issue is not reproducible in recent testing done for stx.2.0.1. Closing.

Changed in starlingx:
status:	Incomplete → Invalid

Revision history for this message

Yang Liu (yliu12) wrote on 2020-02-10:

#20

Reopen this one as it is reproduced in https://bugs.launchpad.net/starlingx/+bug/1840831.
Please see new LP and comment #5 for more details.

Changed in starlingx:
status:	Invalid → Confirmed
tags:	added: stx.retestneeded

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2020-02-21:

#21

Yang, I don't see new comments in https://bugs.launchpad.net/starlingx/+bug/1840831
Also this issue was reported on stx.2.0. If the recent issue is seen on stx master, it makes more sense to open a new LP unless you have verified thru the logs that this is indeed the same issue.

Revision history for this message

Yang Liu (yliu12) wrote on 2020-02-21:

#22

I must have commented on the wrong LP. Closing this one.
We have not seen this issue in recent sanity and regression testing.

Changed in starlingx:
status:	Confirmed → Invalid
tags:	removed: stx.retestneeded

StarlingX

Platform CPU threshold exceeded in compute after lock/unlock a different compute host (Storage System)

Bug Description

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches