epoll_wait busy loop in neutron-openvswitch-agent

Bug #1762341 reported by Jarkko Oranen on 2018-04-09
Affects: neutron
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm installing a demo OpenStack environment with TripleO Quickstart using the Queens release. After deploying the undercloud node, neutron-openvswitch-agent constantly consumes 100% CPU, apparently because it keeps calling epoll_wait with a timeout of 0.
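
One way to confirm the busy loop from outside the process is to attach strace to the agent and filter on the epoll_wait syscall (the PID below is a placeholder for the agent's actual PID); in this state it shows a continuous stream of zero-timeout calls:

strace -f -e trace=epoll_wait -p <agent-pid>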

Neutron's settings are just the defaults configured by TripleO Quickstart with

./quickstart.sh -R queens -E config/environments/mysetup.yml --tags all -N config/nodes/mysetup.yaml -t all -p quickstart.yml os-demo-1

and ./quickstart.sh -T none -I -R queens -E config/environments/mysetup.yml --tags all -N config/nodes/mysetup.yaml -t all -p quickstart-extras-undercloud.yml os-demo-1

The host node is a freshly installed HP Gen8 blade server running an updated CentOS 7 with the default repositories, plus whatever the Quickstart Ansible scripts set up. The undercloud node is whatever CentOS 7 image is used by TripleO Quickstart commit 505a0c5df551c4518b769f77ddc3da09c4e6e2a1.

I have not configured any Neutron settings myself. This is 100% reproducible on my host if I delete all the virtual machines and run TripleO Quickstart again.

I do not know exactly what triggers this behaviour, but the wait(0) call is made from the run method in /usr/lib/python2.7/site-packages/eventlet/hubs/hub.py.

I added a line of code to throw an exception when wait(0) happens, and this is the stack trace I get:

2018-04-09 08:14:56.840 23151 CRITICAL neutron [-] Unhandled error: Exception: Eventlet waited for 0
2018-04-09 08:14:56.840 23151 ERROR neutron Traceback (most recent call last):
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/bin/neutron-openvswitch-agent", line 10, in <module>
2018-04-09 08:14:56.840 23151 ERROR neutron sys.exit(main())
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/cmd/eventlet/plugins/ovs_neutron_agent.py", line 20, in main
2018-04-09 08:14:56.840 23151 ERROR neutron agent_main.main()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/main.py", line 47, in main
2018-04-09 08:14:56.840 23151 ERROR neutron mod.main()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/main.py", line 35, in main
2018-04-09 08:14:56.840 23151 ERROR neutron 'neutron.plugins.ml2.drivers.openvswitch.agent.'
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 372, in run_apps
2018-04-09 08:14:56.840 23151 ERROR neutron app_mgr.close()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 549, in close
2018-04-09 08:14:56.840 23151 ERROR neutron self.uninstantiate(app_name)
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 533, in uninstantiate
2018-04-09 08:14:56.840 23151 ERROR neutron app.stop()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 185, in stop
2018-04-09 08:14:56.840 23151 ERROR neutron hub.joinall(self.threads)
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 103, in joinall
2018-04-09 08:14:56.840 23151 ERROR neutron t.wait()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
2018-04-09 08:14:56.840 23151 ERROR neutron return self._exit_event.wait()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
2018-04-09 08:14:56.840 23151 ERROR neutron return hubs.get_hub().switch()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2018-04-09 08:14:56.840 23151 ERROR neutron return self.greenlet.switch()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 348, in run
2018-04-09 08:14:56.840 23151 ERROR neutron raise Exception("Eventlet waited for 0")
2018-04-09 08:14:56.840 23151 ERROR neutron Exception: Eventlet waited for 0
2018-04-09 08:14:56.840 23151 ERROR neutron
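
(For clarity: the exact line I added isn't important. A roughly equivalent check, done by wrapping the hub's wait() from a small script instead of editing the installed hub.py, would look something like the sketch below; it is only illustrative.)

    import eventlet.hubs

    hub = eventlet.hubs.get_hub()
    _orig_wait = hub.wait

    def _wait_with_check(seconds=None):
        # Fail loudly whenever the hub is about to poll with a zero timeout,
        # which is what shows up as the epoll_wait(..., 0) busy loop.
        if seconds == 0:
            raise Exception("Eventlet waited for 0")
        return _orig_wait(seconds)

    hub.wait = _wait_with_check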

Just changing this wait to be non-zero drops CPU usage to roughly nothing, though I can't tell whether this impacts functionality in any way. It doesn't seem to.

The Neutron package versions are as follows:

[stack@undercloud hubs]$ rpm -qa | grep neutron
python2-ironic-neutron-agent-1.0.0-0.20180220161644.deb466b.el7.centos.noarch
openstack-neutron-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
openstack-neutron-linuxbridge-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
openstack-neutron-ml2-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
openstack-neutron-lbaas-12.0.1-0.20180328075810.268cc42.el7.centos.noarch
python2-neutron-lib-1.13.0-0.20180211233639.dcf96cd.el7.centos.noarch
puppet-neutron-12.4.0-0.20180329040645.502d290.el7.centos.noarch
openstack-neutron-common-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
python-neutron-lbaas-12.0.1-0.20180328075810.268cc42.el7.centos.noarch
openstack-neutron-l2gw-agent-12.0.2-0.20180302213951.b064078.el7.centos.noarch
openstack-neutron-metering-agent-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
python-neutron-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
openstack-neutron-sriov-nic-agent-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
python2-neutronclient-6.7.0-0.20180211221651.95d64ce.el7.centos.noarch
openstack-neutron-openvswitch-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch

Also:
python2-eventlet-0.20.1-2.el7.noarch
python2-ryu-4.15-1.el7.noarch
openvswitch-2.9.0-3.el7.x86_64
python2-openvswitch-2.9.0-3.el7.noarch

Brian Haley (brian-haley) wrote :

Can you see if you have this change:

  https://review.openstack.org/#/c/545612/

but not this one:

  https://review.openstack.org/#/c/554258/

Having one but not the other could cause the agent to consume 100% of a CPU.

Changed in neutron:
status: New → Incomplete
Jarkko Oranen (oranenj) wrote :

It seems like I am indeed missing the second fix.

The python-neutron package comes from the TripleO "delorean" repository set up by the Quickstart Ansible scripts, and its version is 12.0.1-0.20180328231751.7e1d5b6.el7.centos, so I suppose the issue might be fixed by a more up-to-date package.

The repository baseurl seems to be https://trunk.rdoproject.org/centos7-queens/64/24/64246f7baa0fc00f1eb5e34732d2a0a7cba697c0_035e37fd/ which looks like it won't just update by itself.

I am not very familiar with TripleO yet, so I'll have to dig around a bit to figure out how to point it at a more recent repository.
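
If I understand the RDO trunk layout correctly (I haven't verified this, and the repo file name is an assumption), the fix would be something like editing the baseurl in /etc/yum.repos.d/delorean.repo to point at a newer hash or one of the rolling symlinks under https://trunk.rdoproject.org/centos7-queens/, and then updating the packages:

sudo yum clean metadata
sudo yum update 'openstack-neutron*' python-neutron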

Brian Haley (brian-haley) wrote :

Thanks for the response.

Yeah, I probably should have tied those two changes closer together; unfortunately they merged almost a week apart, and the repo is probably pulled nightly.

I'm not sure of the quickest way to update it to point at a newer version either.

anush shetty (anush3d) wrote :

Hello,

Did you manage to fix it somehow?

I am facing the same 100% CPU usage for the same reason. I tried some manual changes to the ip_conntrack file, but that failed with the following error and I had to revert the file:

2018-04-13 09:26:24.704 4684 ERROR neutron 'neutron.plugins.ml2.drivers.openvswitch.agent.'
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 375, in run_apps
2018-04-13 09:26:24.704 4684 ERROR neutron hub.joinall(services)
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 103, in joinall
2018-04-13 09:26:24.704 4684 ERROR neutron t.wait()
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
2018-04-13 09:26:24.704 4684 ERROR neutron return self._exit_event.wait()
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 125, in wait
2018-04-13 09:26:24.704 4684 ERROR neutron current.throw(*self._exc)
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
2018-04-13 09:26:24.704 4684 ERROR neutron result = function(*args, **kwargs)
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 65, in _launch
2018-04-13 09:26:24.704 4684 ERROR neutron raise e
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_conntrack.py", line 44
2018-04-13 09:26:24.704 4684 ERROR neutron def __repr__(self):

Any help on how to update this would be much appreciated.

Brian Haley (brian-haley) wrote :

Both changes have merged to the stable repository, so the problem should be fixed.
