epoll_wait busy loop in neutron-openvswitch-agent

Bug #1762341 reported by Jarkko Oranen
Affects: neutron
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm installing a demo OpenStack environment with TripleO Quickstart on the Queens release. After deploying the undercloud node, neutron-openvswitch-agent constantly consumes 100% CPU, apparently because it keeps calling epoll_wait with a timeout of 0.
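
For illustration, a zero timeout turns epoll into pure polling. A minimal standalone Python sketch (hypothetical, independent of eventlet) reproduces the same 100%-CPU pattern:

    import select

    ep = select.epoll()
    while True:
        # timeout=0 returns immediately whether or not any fds are
        # ready, so this loop never blocks and pins a CPU core
        ep.poll(0)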

Neutron's settings are whatever defaults TripleO Quickstart configures with

./quickstart.sh -R queens -E config/environments/mysetup.yml --tags all -N config/nodes/mysetup.yaml -t all -p quickstart.yml os-demo-1

and ./quickstart.sh -T none -I -R queens -E config/environments/mysetup.yml --tags all -N config/nodes/mysetup.yaml -t all -p quickstart-extras-undercloud.yml os-demo-1

The host node is a freshly installed HP Gen8 blade server running an up-to-date CentOS 7 with default repositories, plus whatever the quickstart Ansible scripts set up. The undercloud node is whatever CentOS 7 image is used by TripleO Quickstart commit 505a0c5df551c4518b769f77ddc3da09c4e6e2a1.

I have not configured any Neutron settings myself. This is 100% reproducible on my host if I delete all the virtual machines and run TripleO Quickstart again.

I do not know what exactly triggers this behaviour, but the wait(0) call is in the run method in /usr/lib/python2.7/site-packages/eventlet/hubs/hub.py.
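
The relevant part of that loop looks roughly like this (paraphrased from eventlet 0.20; the method names are real, but the surrounding code is simplified):

    self.prepare_timers()
    wakeup_when = self.sleep_until()
    if wakeup_when is None:
        sleep_time = self.default_sleep()
    else:
        sleep_time = wakeup_when - self.clock()
    if sleep_time > 0:
        self.wait(sleep_time)
    else:
        # the next timer is already due, so poll the fds without
        # blocking; if something keeps scheduling an immediately-due
        # timer, this branch runs on every iteration and the hub
        # never sleeps
        self.wait(0)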

I added a line of code to throw an exception when wait(0) happens, and this is the stack trace I get:

2018-04-09 08:14:56.840 23151 CRITICAL neutron [-] Unhandled error: Exception: Eventlet waited for 0
2018-04-09 08:14:56.840 23151 ERROR neutron Traceback (most recent call last):
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/bin/neutron-openvswitch-agent", line 10, in <module>
2018-04-09 08:14:56.840 23151 ERROR neutron sys.exit(main())
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/cmd/eventlet/plugins/ovs_neutron_agent.py", line 20, in main
2018-04-09 08:14:56.840 23151 ERROR neutron agent_main.main()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/main.py", line 47, in main
2018-04-09 08:14:56.840 23151 ERROR neutron mod.main()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/main.py", line 35, in main
2018-04-09 08:14:56.840 23151 ERROR neutron 'neutron.plugins.ml2.drivers.openvswitch.agent.'
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 372, in run_apps
2018-04-09 08:14:56.840 23151 ERROR neutron app_mgr.close()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 549, in close
2018-04-09 08:14:56.840 23151 ERROR neutron self.uninstantiate(app_name)
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 533, in uninstantiate
2018-04-09 08:14:56.840 23151 ERROR neutron app.stop()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 185, in stop
2018-04-09 08:14:56.840 23151 ERROR neutron hub.joinall(self.threads)
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 103, in joinall
2018-04-09 08:14:56.840 23151 ERROR neutron t.wait()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
2018-04-09 08:14:56.840 23151 ERROR neutron return self._exit_event.wait()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
2018-04-09 08:14:56.840 23151 ERROR neutron return hubs.get_hub().switch()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2018-04-09 08:14:56.840 23151 ERROR neutron return self.greenlet.switch()
2018-04-09 08:14:56.840 23151 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 348, in run
2018-04-09 08:14:56.840 23151 ERROR neutron raise Exception("Eventlet waited for 0")
2018-04-09 08:14:56.840 23151 ERROR neutron Exception: Eventlet waited for 0
2018-04-09 08:14:56.840 23151 ERROR neutron

Just changing this wait to be non-zero drops CPU usage to almost nothing, though I can't tell whether this impacts functionality in any way. It doesn't seem to.
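
For the record, the workaround amounts to something like this (a sketch of the idea, not a proper fix; the 0.05 s value is arbitrary):

    if sleep_time > 0:
        self.wait(sleep_time)
    else:
        # workaround: never poll with a zero timeout; this trades the
        # busy loop for up to 50 ms of extra latency on timers that
        # are already due
        self.wait(0.05)

The likely cost is that already-due timers fire slightly later, which is presumably why this is only a stopgap rather than a real fix.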

Neutron package versions are as follows:

[stack@undercloud hubs]$ rpm -qa | grep neutron
python2-ironic-neutron-agent-1.0.0-0.20180220161644.deb466b.el7.centos.noarch
openstack-neutron-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
openstack-neutron-linuxbridge-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
openstack-neutron-ml2-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
openstack-neutron-lbaas-12.0.1-0.20180328075810.268cc42.el7.centos.noarch
python2-neutron-lib-1.13.0-0.20180211233639.dcf96cd.el7.centos.noarch
puppet-neutron-12.4.0-0.20180329040645.502d290.el7.centos.noarch
openstack-neutron-common-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
python-neutron-lbaas-12.0.1-0.20180328075810.268cc42.el7.centos.noarch
openstack-neutron-l2gw-agent-12.0.2-0.20180302213951.b064078.el7.centos.noarch
openstack-neutron-metering-agent-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
python-neutron-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
openstack-neutron-sriov-nic-agent-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch
python2-neutronclient-6.7.0-0.20180211221651.95d64ce.el7.centos.noarch
openstack-neutron-openvswitch-12.0.1-0.20180328231751.7e1d5b6.el7.centos.noarch

Other relevant package versions:
python2-eventlet-0.20.1-2.el7.noarch
python2-ryu-4.15-1.el7.noarch
openvswitch-2.9.0-3.el7.x86_64
python2-openvswitch-2.9.0-3.el7.noarch

Brian Haley (brian-haley) wrote:

Can you see if you have this change:

  https://review.openstack.org/#/c/545612/

but not this one:

  https://review.openstack.org/#/c/554258/

Having one but not the other could cause the agent to consume 100% of a CPU.

Changed in neutron:
status: New → Incomplete
Jarkko Oranen (oranenj) wrote:

It seems like I am indeed missing the second fix.

The python-neutron package comes from the TripleO "delorean" repository set up by the quickstart Ansible scripts, and its version is 12.0.1.0.20180328231751.7e1d5b6.el7.centos, so I suppose the issue might be fixed by a more up-to-date package.

The repository baseurl seems to be https://trunk.rdoproject.org/centos7-queens/64/24/64246f7baa0fc00f1eb5e34732d2a0a7cba697c0_035e37fd/, which looks like it won't just update by itself.

I am not very familiar with TripleO yet, so I'll have to dig a bit to figure out how to point it at a more recent repository.

Brian Haley (brian-haley) wrote:

Thanks for the response.

Yeah, I probably should have tied those two changes closer together; unfortunately they merged almost a week apart, and the repo is probably pulled nightly.

I'm not sure of the quickest way to update it to point at a newer version either.

anush shetty (anush3d) wrote:

Hello,

Did you manage to fix it somehow?

I am facing the same 100% CPU usage for the same reason. I tried some manual changes to the ip_conntrack.py file, but that failed with the following error and I had to revert the file:

2018-04-13 09:26:24.704 4684 ERROR neutron 'neutron.plugins.ml2.drivers.openvswitch.agent.'
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/base/app_manager.py", line 375, in run_apps
2018-04-13 09:26:24.704 4684 ERROR neutron hub.joinall(services)
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 103, in joinall
2018-04-13 09:26:24.704 4684 ERROR neutron t.wait()
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
2018-04-13 09:26:24.704 4684 ERROR neutron return self._exit_event.wait()
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 125, in wait
2018-04-13 09:26:24.704 4684 ERROR neutron current.throw(*self._exc)
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
2018-04-13 09:26:24.704 4684 ERROR neutron result = function(*args, **kwargs)
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 65, in _launch
2018-04-13 09:26:24.704 4684 ERROR neutron raise e
2018-04-13 09:26:24.704 4684 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_conntrack.py", line 44
2018-04-13 09:26:24.704 4684 ERROR neutron def __repr__(self):

Any help on how to update this would be much appreciated.

Brian Haley (brian-haley) wrote:

Both changes have been merged to the stable repository, so the problem should be fixed.
