High CPU usage in the ovn-controller process

Bug #1536003 reported by Xiao Li Xu
This bug affects 1 person
Affects           Status         Importance   Assigned to       Milestone
networking-ovn    Fix Released   High         Russell Bryant

Bug Description

Description:

1. Create 200 routers with 3 networks attached to each router. Under this load, the CPU usage of ovn-controller on the hypervisors becomes very high. The CPU usage is not affected by the number of VMs.

We tested with 2 VMs and with 40 VMs on the hypervisor to see how the number of VMs affects ovn-controller's CPU usage. In both cases ovn-controller stays at 100% CPU. Reducing the number of networks and routers, however, does bring the CPU usage of ovn-controller down.
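
For anyone trying to reproduce this, a loop along the following lines should recreate that kind of topology (a sketch only, using the legacy neutron CLI; the names, counts, and CIDRs are illustrative and not taken from the original environment):

    # Create 200 routers, each with 3 networks/subnets attached (illustrative).
    for r in $(seq 1 200); do
        neutron router-create "router-$r"
        for n in $(seq 1 3); do
            neutron net-create "net-$r-$n"
            neutron subnet-create --name "subnet-$r-$n" "net-$r-$n" "10.$r.$n.0/24"
            neutron router-interface-add "router-$r" "subnet-$r-$n"
        done
    done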

[root@test ~]# top
top - 11:50:38 up 2 days, 1:46, 2 users, load average: 3.01, 17.81, 29.49
Tasks: 639 total, 2 running, 637 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.6 us, 0.2 sy, 0.0 ni, 95.9 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 26386313+total, 25805459+free, 5011364 used, 797176 buff/cache
KiB Swap: 1048572 total, 1048572 free, 0 used. 25836371+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23095 root 20 0 849368 801748 2428 R 100.0 0.3 1346:55 ovn-controller
49774 qemu 20 0 4735692 1.413g 10156 S 100.0 0.6 259:27.55 qemu-kvm
49776 root 20 0 0 0 0 S 22.3 0.0 35:58.28 vhost-49774
312 root 20 0 0 0 0 S 3.0 0.0 0:07.25 ksoftirqd/38
22015 root 10 -10 2298800 382132 9412 S 1.0 0.1 82:13.25 ovs-vswitchd
115 root 20 0 0 0 0 S 0.3 0.0 0:12.19 rcuos/48
1862 root 20 0 285008 14616 6520 S 0.3 0.0 8:08.79 consul
1 root 20 0 192828 7796 2380 S 0.0 0.0 0:05.70 systemd
[root@test ~]# virsh list
Id Name State
----------------------------------------------------
6 instance-00016080 running

[root@test ~]#

[root@test ~]# ovs-vsctl show | grep Port | wc -l
950
[root@test ~]#

Tags: ovn-upstream
Revision history for this message
Kyle Mestery (mestery) wrote :

After instrumenting the OVN code, it appears this is an issue with how physical_run() is constructed. The fix is therefore likely in OVN itself, not in networking-ovn. I'll leave this bug open to track it here for now.

Changed in networking-ovn:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Matt Mulsow (mamulsow) wrote :

I've done some more testing on this and it seems to be directly related to the number of ports in use, not specifically to routers. I can reproduce the problem just by creating 1k networks/subnets with DHCP agents, or by creating 1k VMs. Basically, once 'neutron port-list' reaches around 1k ports, for whatever reason, ovn-controller starts spinning at 100% CPU usage.

I've added some debug logging to ovn-controller in this environment, and I see ovn-controller spending most of its time in the call to lflow_run().
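
For anyone wanting to confirm the same symptoms, something along these lines works (a sketch; it assumes perf is installed on the hypervisor and that only one ovn-controller process is running there):

    # Count Neutron ports, then watch where ovn-controller spends its cycles.
    neutron port-list -f value -c id | wc -l
    perf top -p "$(pidof ovn-controller)"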

Revision history for this message
Matt Mulsow (mamulsow) wrote :

I added more debugging in lflow_run(), and in our environment with 300 routers connected to 2 networks each I see it looping over about 25k flows.
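
The flow count can also be checked directly against the OVN southbound database, for example (a sketch; it assumes ovn-sbctl on this host can reach the SB DB):

    # Each record in the SB Logical_Flow table is one logical flow that
    # lflow_run() has to translate on every pass.
    ovn-sbctl list Logical_Flow | grep -c '^_uuid'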

Revision history for this message
Russell Bryant (russellb) wrote :

Thanks, Matt.

I assigned this to myself as I'm looking into it, but I'm happy to have help from others too!

Changed in networking-ovn:
assignee: nobody → Russell Bryant (russellb)
Revision history for this message
Mala Anand (manand) wrote :

I am observing ovn-controller pegged at 100% even when no new resources are being actively created. Once the number of existing ports reaches 500, ovn-controller spends most of its time in the lflow_run() routine.

The perf profiles taken during the test (while creating resources) and after the test completed, without deleting the resources, are mostly identical. ovn-controller utilization only comes down once we start deleting resources.
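
For reference, profiles like the ones below can be collected with something along these lines (a sketch; the 60-second sampling window is arbitrary):

    # Sample ovn-controller for a minute, then summarize by symbol.
    perf record -g -p "$(pidof ovn-controller)" -- sleep 60
    perf report -n --stdio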

 Perf during the test:
#
# Samples: 120K of event 'cycles'
# Event count (approx.): 59579583762
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ .............. ............................. ............................................................
#
    24.33% 29205 ovn-controller /usr/local/bin/ovn-controller 0x66959 B [.] bitwise_rscan
    13.95% 16737 ovn-controller /usr/local/bin/ovn-controller 0xb0ad B [.] create_patch_port.isra.2
    10.25% 12302 ovn-controller /usr/local/bin/ovn-controller 0x5c9f8 B [.] smap_find__
     6.84% 8203 ovn-controller /usr/lib64/libc-2.17.so 0x133d50 B [.] __strncmp_sse42
     5.23% 6277 ovn-controller /usr/lib64/libc-2.17.so 0x7e29a B [.] _int_malloc
     2.91% 3496 ovn-controller /usr/local/bin/ovn-controller 0x1a8cf B [.] hash_bytes
     2.33% 2794 ovn-controller /usr/local/bin/ovn-controller 0x8b4c0 B [.] flow_wildcards_hash
     2.32% 2787 ovn-controller /usr/local/bin/ovn-controller 0x1f257 B [.] match_hash
     2.26% 2717 ovn-controller /usr/lib64/libc-2.17.so 0x15d250 B [.] __memcmp_sse4_1
     2.20% 2641 ovn-controller /usr/local/bin/ovn-controller 0xa69f B [.] ofctrl_add_flow
     2.18% 2615 ovn-controller /usr/lib64/libc-2.17.so 0x7c92b B [.] _int_free
     1.51% 1810 ovn-controller /usr/local/bin/ovn-controller 0x5cf42 B [.] smap_get_node
     1.35% 1618 ovn-controller /usr/local/bin/ovn-controller 0x14ae4 B [.] lex_token_parse
     1.11% 1329 ovn-controller /usr/lib64/libc-2.17.so 0x13200e B [.] __strcmp_sse42

Perf after the test; the system is mostly idle except for ovn-controller using one hardware thread:

# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 111K of event 'cycles'
# Event count (approx.): 55343064284
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ .............. ............................. ..........................................................
#
    29.73% 33139 ovn-controller /usr/local/bin/ovn-controller 0x66959 B [.] bitwise_rscan
     9.39% 10468 ovn-controller /usr/local/bin/ovn-controller 0xb0af B [.] create_patch_port.isra.2
     7.16% 7987 ovn-controller /usr/local/bin/ovn-controller 0x5c9eb B [.] smap_find__
     6.26% 6983 ovn-controller /usr/lib64/libc-2.17.so 0x7d...


Revision history for this message
Nirapada (nghosh) wrote :

Hi Russell, would you please provide a bit of an update on this? I'd love to help in any way I can. Please let me know.

Revision history for this message
Russell Bryant (russellb) wrote :

I merged one minor optimization: https://github.com/openvswitch/ovs/commit/b1e04512f7150aa9d98a121b32f820c316522372

If your setup uses ports directly attached to a provider network, this patch makes a significant difference by drastically reducing the number of logical flows. There is a corresponding networking-ovn patch for this as well: https://patchwork.ozlabs.org/patch/582095/

Further optimizations are being worked on, but they're a bit further out.

tags: added: ovn-upstream
Revision history for this message
Nirapada (nghosh) wrote :

I ported the patch into our environment and ran the tests again; overall CPU usage did not change. Looking at the gprof output, bitwise_rscan() now consumes 37.27% of CPU time, whereas it was taking 49.04% before the patch. BTW, I do not have any VM instances in my setup, just routers, networks, subnets, and ports, and a lot of them. ovn-controller utilization stays at 99.8-100%.
Also, I am not using a provider network. I guess I may have to wait until the further optimizations come out.
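
For anyone who wants to reproduce this kind of gprof comparison, a profiled build along these lines should work (a sketch; the configure flags and in-tree binary path are assumptions about the OVS source layout of that era, not necessarily the exact setup used here):

    # Build OVS/OVN with gprof instrumentation, then run ovn-controller as usual.
    ./configure CFLAGS="-g -pg" LDFLAGS="-pg"
    make -j"$(nproc)"
    # gmon.out is written to ovn-controller's working directory when it exits
    # cleanly; analyze it with gprof afterwards.
    gprof ovn/controller/ovn-controller gmon.out | head -40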

Revision history for this message
Richard Theis (rtheis) wrote :

Is this bug still valid?

Revision history for this message
Han Zhou (zhouhan) wrote :

Yes.

With some optimizations (e.g. [1]), CPU usage is no longer a big issue for L2 & provider networks (flat/vlan). For L3, however, it is still a problem. The ongoing optimizations [2] & [3] are expected to solve it.

[1] https://github.com/openvswitch/ovs/commit/c4f3269632ba321bef50cf5f44165a54895f7cc0

[2] http://openvswitch.org/pipermail/dev/2016-June/073524.html

[3] http://openvswitch.org/pipermail/dev/2016-June/073813.html

Revision history for this message
Russell Bryant (russellb) wrote :

A *lot* of optimizations have been completed at this point. I think we should close this out and open new bugs as needed for new observations. Thanks!

Changed in networking-ovn:
status: Confirmed → Fix Released