ovs-vswitchd thread consuming 100% CPU

Bug #1827264 reported by Jamon Camisso
This bug affects 11 people
Affects: openvswitch (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have an ovs-vswitchd process consuming 100% CPU in a very lightly used Openstack Rocky cloud running on Bionic. The version in question is openvswitch-switch 2.10.0-0ubuntu2~cloud0

ovs-vswitchd is running alongside various neutron processes (lbaasv2-agent, metadata-agent, l3-agent, dhcp-agent, openvswitch-agent) inside an LXC container on a physical host. There is a single neutron router, and the entire environment including br-tun, br-ex, and br-int traffic barely goes over 200KiB/s TX/RX combined.

The issue appears to have arisen on its own after the host was hard rebooted. On restart, ovs-vswitchd came up with high CPU load, and the load has not diminished since.

The thread that is consuming CPU looks like this in ps and shows up with the name 'handler89':

UID PID SPID PPID C STIME TTY TIME CMD
root 7267 7454 1 99 Apr23 ? 8-03:38:14 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
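
For reference, a per-thread listing along the following lines shows each ovs-vswitchd thread with its CPU time and thread name; this is a generic sketch, and thread IDs and names will of course differ per host:

ps -T -o pid,spid,%cpu,time,comm -p "$(pidof ovs-vswitchd)"   # SPID/COMM show e.g. 7454 / handler89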

Logs in /var/log/openvswitch/ovs-vswitchd.log show ~14000 entries like this per day, roughly one every six seconds (i.e. once per rate-limit interval):

2019-05-01T18:34:30.110Z|237220|poll_loop(handler89)|INFO|Dropped 1092844 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate

2019-05-01T18:34:30.110Z|237221|poll_loop(handler89)|INFO|wakeup due to [POLLIN] on fd 42 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
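
Since every wakeup points at fd 42, the descriptor and the spinning thread can be inspected directly; PID 7267 and fd 42 are the values from this host, so substitute your own:

ls -l /proc/7267/fd/42   # resolves to anon_inode:[eventpoll]
top -H -p 7267           # handler89 pinned at ~100% CPU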

Logs in /var/log/neutron/neutron-openvswitch-agent.log show entries like this, again around 14000 sets per day:

2019-05-01 18:35:31.174 13621 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-7e354ade-55a3-46ba-adf4-db49a51e5218 - - - - -] Agent rpc_loop - iteration:355070 started rpc_loop /usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2086

2019-05-01 18:35:31.177 13621 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-7e354ade-55a3-46ba-adf4-db49a51e5218 - - - - -] ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0x81826d95,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) result [OFPFlowStatsReply(body=[OFPFlowStats(byte_count=0,cookie=13408676230369642694,duration_nsec=724000000,duration_sec=710697,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],length=56,match=OFPMatch(oxm_fields={}),packet_count=0,priority=0,table_id=23)],flags=0,type=1)] _send_msg /usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:114

2019-05-01 18:35:31.179 13621 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-7e354ade-55a3-46ba-adf4-db49a51e5218 - - - - -] Agent rpc_loop - iteration:355070 completed. Processed ports statistics: {'regular': {'added': 0, 'updated': 0, 'removed': 0}}. Elapsed:0.005 loop_count_and_wait /usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1875

I'm not sure what other information to add here, but I am happy to gather more diagnostic data to try to pin this down. I did come across https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047558.html, which describes the same version as mine, and the symptoms appear to be the same.

Below are various ovs-dpctl/ofctl show reports:

root@juju-df624b-4-lxd-10:~# ovs-dpctl show
system@ovs-system:
  lookups: hit:223561120 missed:5768546 lost:798
  flows: 131
  masks: hit:2284286371 total:15 hit/pkt:9.96
  port 0: ovs-system (internal)
  port 1: br-ex (internal)
  port 2: eth1
  port 3: gre_sys (gre: packet_type=ptap)
  port 4: br-tun (internal)
  port 5: br-int (internal)
  port 6: tapa062d6f1-40
  port 7: tapf90e3ab6-13
  port 8: tap45ba891c-4c
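
For completeness, the upcall handler state can also be queried with the standard tools below (output not captured here):

ovs-appctl upcall/show        # per-datapath flow counts and handler/revalidator threads
ovs-dpctl dump-flows | head   # sample of the 131 datapath flows counted above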

root@juju-df624b-4-lxd-10:~# ovs-ofctl show br-int
OFPT_FEATURES_REPLY (xid=0x2): dpid:00003643edb09542
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(int-br-ex): addr:4a:b4:cc:dd:aa:ac
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 2(patch-tun): addr:7a:36:76:1f:47:6e
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 3(tapa062d6f1-40): addr:92:d4:6c:26:55:0a
     config: 0
     state: 0
     current: 10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 4(tapf90e3ab6-13): addr:9e:8b:4f:ae:8f:ba
     config: 0
     state: 0
     current: 10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 5(tap45ba891c-4c): addr:76:45:39:c8:d7:e0
     config: 0
     state: 0
     current: 10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:36:43:ed:b0:95:42
     config: PORT_DOWN
     state: LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

root@juju-df624b-4-lxd-10:~# ovs-ofctl show br-tun
OFPT_FEATURES_REPLY (xid=0x2): dpid:000092c6c8f4a545
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(patch-int): addr:02:1d:bc:ec:0c:3b
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 2(gre-0a30029b): addr:fe:09:61:11:92:8b
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 3(gre-0a3002a2): addr:a2:64:93:48:e2:82
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 4(gre-0a30029d): addr:12:b1:73:32:19:d4
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 5(gre-0a3002ce): addr:ca:10:51:f2:0f:05
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 6(gre-0a3002a1): addr:de:fe:95:33:d9:67
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 7(gre-0a30029e): addr:5a:fa:ce:15:7b:5c
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 8(gre-0a3002c3): addr:b6:87:66:09:fc:04
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 10(gre-0a3002a0): addr:ce:79:f2:bf:a8:94
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 14(gre-0a30029f): addr:b6:b0:84:7d:9f:aa
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 18(gre-0a30029c): addr:da:14:12:54:1c:ec
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-tun): addr:92:c6:c8:f4:a5:45
     config: PORT_DOWN
     state: LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

root@juju-df624b-4-lxd-10:~# ovs-ofctl show br-ex
OFPT_FEATURES_REPLY (xid=0x2): dpid:000000163e9a47a2
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(eth1): addr:00:16:3e:9a:47:a2
     config: 0
     state: 0
     current: 10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 2(phy-br-ex): addr:ce:a8:6a:25:55:0e
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-ex): addr:00:16:3e:9a:47:a2
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

hard (ihard) wrote :

I have the same issue with high CPU usage on a network node running OpenStack Rocky on Ubuntu 18.04.
Kernel 4.15.0-48-generic #51-Ubuntu

Packages:
ii neutron-openvswitch-agent 2:13.0.2-0ubuntu3.1~cloud0 all Neutron is a virtual network service for Openstack - Open vSwitch plugin agent
ii openvswitch-common 2.10.0-0ubuntu2~cloud0 amd64 Open vSwitch common components
ii openvswitch-switch 2.10.0-0ubuntu2~cloud0 amd64 Open vSwitch switch implementations

The system was not rebooted.
From ovs-vswitchd.log:
2019-05-08T15:30:13.280Z|00726|connmgr|INFO|br-tun<->tcp:127.0.0.1:6633: 12 flow_mods in the 9 s starting 10 s ago (9 adds, 3 deletes)
2019-05-08T15:30:14.232Z|00727|bridge|INFO|bridge br-tun: added interface vxlan-0ac84a99 on port 55
2019-05-08T15:30:17.456Z|00002|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:17.456Z|00003|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:17.456Z|00004|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:17.456Z|00005|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:17.456Z|00006|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:17.456Z|00007|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:17.456Z|00008|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:17.456Z|00009|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:17.456Z|00010|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:17.456Z|00011|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)
2019-05-08T15:30:23.456Z|00012|poll_loop(handler68)|INFO|Dropped 2841636 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2019-05-08T15:30:23.456Z|00013|poll_loop(handler68)|INFO|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (99% CPU usage)

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openvswitch (Ubuntu):
status: New → Confirmed

Jamon Camisso (jamon) wrote :

A bit of follow-up on this:

strace on the thread in question shows the following:

13:35:47 poll([{fd=23, events=POLLIN}], 1, 0) = 0 (Timeout) <0.000018>
13:35:47 epoll_wait(42, [{EPOLLIN, {u32=3, u64=3}}], 9, 0) = 1 <0.000018>
13:35:47 recvmsg(417, {msg_namelen=0}, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
13:35:47 poll([{fd=11, events=POLLIN}, {fd=42, events=POLLIN}, {fd=23, events=POLLIN}], 3, 2147483647) = 1 ([{fd=42, revents=POLLIN}]) <0.000019>
13:35:47 getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=490842, tv_usec=749026}, ru_stime={tv_sec=710657, tv_usec=442946}, ...}) = 0 <0.000018>
13:35:47 poll([{fd=23, events=POLLIN}], 1, 0) = 0 (Timeout) <0.000018>
13:35:47 epoll_wait(42, [{EPOLLIN, {u32=3, u64=3}}], 9, 0) = 1 <0.000019>
13:35:47 recvmsg(417, {msg_namelen=0}, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
13:35:47 poll([{fd=11, events=POLLIN}, {fd=42, events=POLLIN}, {fd=23, events=POLLIN}], 3, 2147483647) = 1 ([{fd=42, revents=POLLIN}]) <0.000019>
13:35:47 getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=490842, tv_usec=749108}, ru_stime={tv_sec=710657, tv_usec=442946}, ...}) = 0 <0.000017>

And if I strace with -c to collect a summary, after 4-5 seconds it shows the following:

sudo strace -c -p 1658344
strace: Process 1658344 attached
^Cstrace: Process 1658344 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         9           write
  0.00    0.000000           0     56397           poll
  0.00    0.000000           0     28198     28198 recvmsg
  0.00    0.000000           0     28199           getrusage
  0.00    0.000000           0       148        56 futex
  0.00    0.000000           0     28199           epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                141150     28254 total
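
To quantify the per-thread CPU outside of strace, something like pidstat from the sysstat package should also show the stuck handler directly (a sketch; output not captured here):

pidstat -t -p "$(pidof ovs-vswitchd)" 1 5   # per-thread %CPU, five one-second samples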

Dincer Celik (dincercelik) wrote :

Same on Bionic with Stein and OVS 2.11.0.

hard (ihard) wrote :

We hit this problem about a couple of times a week and have found a fairly non-obvious way to fix it quickly.

We run this watchdog_openvswitch.sh script from crontab:

#!/bin/bash
# If ovs-vswitchd stops answering ovs-appctl within 10 seconds, assume a
# handler thread is stuck and briefly attach strace to the daemon, which
# (oddly enough) unsticks it.

if ! timeout 10 ovs-appctl version &>/dev/null; then
    echo "running strace to unstick openvswitch"
    timeout 5 strace -f -p "$(cat /var/run/openvswitch/ovs-vswitchd.pid)" &>/dev/null
fi
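
A crontab entry for it might look like the following; the every-minute interval and the install path are just examples:

* * * * * /usr/local/bin/watchdog_openvswitch.sh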

Magnus Bergman (magnusbe) wrote :

We are also seeing this on Bionic with Rocky neutron-openvswitch-agents. We haven't found any pattern in when it occurs; even after emptying out a compute node with live migrations and rebooting it, we keep seeing the problem. The amount of traffic doesn't seem to be related.

What we are seeing is an increasing number of handler threads using 100% CPU, which in our setup can mean several thousand percent in total being used by ovs-vswitchd. While investigating, we found that the handler threads seemed to spend quite a lot of time in native_queued_spin_lock_slowpath, which made us suspect that too many threads were competing for the same locks, so we tried lowering other_config:n-handler-threads. That initially appeared to fix the issue, but it seems to have been only a temporary fix: we still see the problem, and it may simply be that changing other_config:n-handler-threads kills the handlers stuck at 100% CPU and spins up fresh ones.
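
The setting we lowered is sketched below; the value 4 is only an example (left unset, OVS sizes the handler thread pool from the number of CPU cores):

ovs-vsctl get Open_vSwitch . other_config
ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=4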

I find it strange, however, that a rebooted node doesn't show the same behaviour, which makes me think there is something going on during startup of either OVS or the neutron openvswitch agent that causes this in some circumstances.

Investigation is ongoing; I just wanted to share our observations so far.

Jeremy Phillips (jeremy.phillips) wrote :

I know this has been open for a year now, but for what it's worth, we're seeing the same issue on two of our three Neutron nodes. There are a whole lot of entries in /var/log/openvswitch/ovs-vswitchd.log like the following:

2020-05-22T00:19:25.581Z|18410|poll_loop(handler74)|INFO|Dropped 2973816 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
2020-05-22T00:19:25.581Z|18411|poll_loop(handler74)|INFO|wakeup due to [POLLIN] on fd 35 (unknown anon_inode:[eventpoll]) at ../lib/dpif-netlink.c:2786 (98% CPU usage)

We're running Neutron in Rocky on Ubuntu 18.04:

# uname -r
4.15.0-99-generic

# dpkg -l | grep openvswitch-
ii neutron-openvswitch-agent 2:13.0.6-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - Open vSwitch plugin agent
ii openvswitch-common 2.10.0-0ubuntu2~cloud0 amd64 Open vSwitch common components
ii openvswitch-switch 2.10.0-0ubuntu2~cloud0 amd64 Open vSwitch switch implementations

Looking at the ovs-vswitchd process itself:

# timeout 10 strace -c -p 1227
strace: Process 1227 attached
strace: Process 1227 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.28    9.059391       28852       314           poll
  3.03    0.284833      284833         1           restart_syscall
  0.28    0.026164          13      1968           sendmsg
  0.27    0.025726          67       385       208 recvfrom
  0.07    0.006302           3      2220       252 recvmsg
  0.02    0.002327           7       315        69 read
  0.02    0.002157           2       930       171 futex
  0.02    0.002075           5       378       378 accept
  0.01    0.000532           2       316           getrusage
  0.00    0.000238          14        17           sendto
  0.00    0.000034          11         3           write
------ ----------- ----------- --------- --------- ----------------
100.00    9.409779                  6847      1078 total

zgyzgy3 (zgyzgy3)
information type: Public → Public Security

hard (ihard) wrote :

We managed to solve the problem by installing the libc6 version from the Ubuntu disco (19.04) repository.

zhangyi (zhangyi410692) wrote :

libc6? From the Ubuntu 19.04 (disco) sources?

hard (ihard) wrote :

Yes:
dpkg -l | grep libc6
ii libc6:amd64 2.29-0ubuntu2 amd64 GNU C Library: Shared libraries
ii libc6-dev:amd64 2.29-0ubuntu2 amd64 GNU C Library: Development Libraries and Header Files

Paul Collins (pjdc)
information type: Public Security → Public

Trent Lloyd (lathiat) wrote :

It seems there is a good chance that at least some of the people commenting on or affected by this bug are hitting a duplicate of Bug #1839592, essentially a libc6 bug that meant threads weren't woken up when they should have been. It was fixed by the libc6 upgrade to 2.27-3ubuntu1.3 in bionic.
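
For anyone wanting to check whether a given host already carries the fix, something along these lines should do it (version string taken from the fix mentioned above):

dpkg-query -W -f='${Version}\n' libc6
dpkg --compare-versions "$(dpkg-query -W -f='${Version}' libc6)" ge 2.27-3ubuntu1.3 && echo "fixed libc6 present" || echo "libc6 needs upgrading"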

James Page (james-page) wrote :

I think that's a fairly safe assumption, @lathiat, so I'm going to dupe this bug against that one.
