Eventlet fails when starting network agents

Bug #1965140 reported by Nicolas Melot
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
neutron
Incomplete
Undecided
Unassigned

Bug Description

I have a two nodes openstack setup, where one node (w3) runs all controller services (keystone, glance, placement, nova, neutron, horizon, cinder) as well as nova-compute and cinder-volume and the second (w6) runs nova-compute and linuxbridge agent.

All network agents on w3 are dead
[root@w3 ~]# openstack network agent list
+--------------------------------------+--------------------+---------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+---------------+-------------------+-------+-------+---------------------------+
| 330269b7-b73c-4207-abc7-21f1a2972b7b | Linux bridge agent | w6.int.lunarc | None | :-) | UP | neutron-linuxbridge-agent |
| 83d16241-8a3a-42b0-beda-87246d945dc1 | L3 agent | w3.int.lunarc | nova | XXX | UP | neutron-l3-agent |
| a52ab60f-d893-491d-a43e-823a0d482810 | Linux bridge agent | w3.int.lunarc | None | XXX | UP | neutron-linuxbridge-agent |
| abd75644-d895-41ae-94fa-6c4351cbc4bf | Metadata agent | w3.int.lunarc | None | XXX | UP | neutron-metadata-agent |
| c05c65bc-779e-4fe5-a19e-350c44900be4 | DHCP agent | w3.int.lunarc | nova | XXX | UP | neutron-dhcp-agent |
+--------------------------------------+--------------------+---------------+-------------------+-------+-------+---------------------------+

, and I cannot start them anymore. I tried restarting said agent alone, restarting all openstack daemon on w3 and even restarting the whole node but nothing seems to help and I always have teh same issue and the same trace as show below.

I could not find any useful info in the logs, but systemd does report an issue with eventlet/greenlet:
[root@w3 ~]# journalctl -fu neutron-linuxbridge-agent
-- Logs begin at Wed 2022-03-16 04:32:31 EDT. --
Mar 16 09:14:09 w3.int.lunarc sudo[37085]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/neutron-rootwrap /etc/neutron/rootwrap.conf privsep-helper --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/linuxbridge_agent.ini --config-dir /etc/neutron/conf.d/neutron-linuxbridge-agent --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp93tzwqg3/privsep.sock
Mar 16 09:14:12 w3.int.lunarc sudo[37107]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/neutron-rootwrap /etc/neutron/rootwrap.conf privsep-helper --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/linuxbridge_agent.ini --config-dir /etc/neutron/conf.d/neutron-linuxbridge-agent --privsep_context neutron.privileged.link_cmd --privsep_sock_path /tmp/tmp81iy5eni/privsep.sock
Mar 16 09:14:15 w3.int.lunarc neutron-linuxbridge-agent[37073]: Traceback (most recent call last):
Mar 16 09:14:15 w3.int.lunarc neutron-linuxbridge-agent[37073]: File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Mar 16 09:14:15 w3.int.lunarc neutron-linuxbridge-agent[37073]: timer()
Mar 16 09:14:15 w3.int.lunarc neutron-linuxbridge-agent[37073]: File "/usr/lib/python3.6/site-packages/eventlet/hubs/timer.py", line 59, in __call__
Mar 16 09:14:15 w3.int.lunarc neutron-linuxbridge-agent[37073]: cb(*args, **kw)
Mar 16 09:14:15 w3.int.lunarc neutron-linuxbridge-agent[37073]: File "/usr/lib/python3.6/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
Mar 16 09:14:15 w3.int.lunarc neutron-linuxbridge-agent[37073]: waiter.switch()
Mar 16 09:14:15 w3.int.lunarc neutron-linuxbridge-agent[37073]: greenlet.error: cannot switch to a different thread

I am running OpenStack Xena on CentOS Stream 8 freshly installed. Here are other details:
[root@w3 ~]# uname -a
Linux w3.int.lunarc 4.18.0-365.el8.x86_64 #1 SMP Thu Feb 10 16:11:23 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Any clue on how I can find out what makes this happen, or just how I can get past this crippling greenlet/eventlet error, and get these agents to run again?

Revision history for this message
Nicolas Melot (nicmel) wrote :
Revision history for this message
Miguel Lavalle (minsel) wrote :
Download full text (5.2 KiB)

Based on the information at hand, I don't think this is a bug. The upstream Neutron team tests the linuxbridge agent with every patch that is submitted for the stable/xena branch: https://review.opendev.org/q/project:openstack%252Fneutron+branch:stable%252Fxena+status:open . One example is https://review.opendev.org/c/openstack/neutron/+/833857. Note that we have a job named neutron-tempest-plugin-scenario-linuxbridge-xena. Here's the successful execution of that job for the aforementioned patch https://zuul.opendev.org/t/openstack/build/1ee2736adf46480ea9d48f125b5cd229 and the corresponding linuxbridge agent log https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1ee/833857/1/check/neutron-tempest-plugin-scenario-linuxbridge-xena/1ee2736/controller/logs/screen-q-agt.txt . As you can see, that agent started running today with no problems: "Logs begin at Wed 2022-03-16 10:05:56 UTC".

As far of thoughts on what may be causing your problem, I lean towards an installation / dependencies issue:

1) Is this a fresh new install or did it run successfully before and has started failing recently? If the latter, what changed?

2) Your logging seem a little odd. Note that the third line in the upstream log is: Mar 16 10:19:43.410735 nested-virt-ubuntu-focal-vexxhost-ca-ymq-1-0028853094 neutron-linuxbridge-agent[71074]: INFO neutron.common.config [-] Logging enabled!. In your case, logging doesn't seem to be enabled at all. The second thing the agent does is setup logging: https://github.com/openstack/neutron/blob/2f4661c87681567bb08d7733c723c2b0c31ed6c8/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L1017

3) Next, the agent sets up privsep (to be able to execute commands with privileges): https://github.com/openstack/neutron/blob/2f4661c87681567bb08d7733c723c2b0c31ed6c8/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L1018. It seems this is where your agent fails, because we see in your log twice:

Mar 16 09:14:09 w3.int.lunarc sudo[37085]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/neutron-rootwrap /etc/neutron/rootwrap.conf privsep-helper --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/linuxbridge_agent.ini --config-dir /etc/neutron/conf.d/neutron-linuxbridge-agent --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp93tzwqg3/privsep.sock

and then the traceback. Comparing with the upstream agent, we see that the privsep daemon starts running successfully:

Mar 16 10:19:43.421615 nested-virt-ubuntu-focal-vexxhost-ca-ymq-1-0028853094 sudo[71765]: stack : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/neutron-rootwrap /etc/neutron/rootwrap.conf privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmppedn6sui/privsep.sock
Mar 16 10:19:43.421948 nested-virt-ubuntu-focal-vexxhost-ca-ymq-1-0028853094 sudo[71765]: pam_unix(sudo:session): session opened for user root by (uid=0)
Mar 16 10:19:43.898488 nested-virt-...

Read more...

Changed in neutron:
status: New → Invalid
status: Invalid → Incomplete
Revision history for this message
Miguel Lavalle (minsel) wrote :
Revision history for this message
Nicolas Melot (nicmel) wrote :
Download full text (3.9 KiB)

Thank you for your help and input.

This is a new installation, aiming at training myself to openstack. I follow as closely as possible the minimal deployment guides on https://docs.openstack.org/install-guide/openstack-services.html, for xena. I occasionally miss setting one parameter in the (long process) and sometime finding out what is missing, setting it right back is a tad of a challenge.

Everything worked fine until I tried to setup a second compute node. I could not deploy new VM on this second node if they are connected to the one provider network ("Binding failed for port xxx, please check neutron logs for more information."). I found out the reason was that both nodes had a different provider network name configured in /etc/neutron/plugins/ml2/linuxbridge_agent.ini\[DEFAULT]\physical_interface_mappings. Turns out correct name was on the newly installed node, so I deleted the provider network (and subnet, port and floating ips), fixed /etc/neutron/plugins/ml2/linuxbridge_agent.ini and /etc/neutron/plugins/ml2/ml2_conf.ini then I created the provider network again. Only then no network agent on the node I fixed could start, and the logs showed the content I gave in the first post.

Now, following your instructions, I attempt to enable logs for linuxbridge-agent. To do that I uncomment the line /etc/neutron/plugins/ml2/linuxbridge_agent.ini\[DEFAULT]\default_log_levels

default_log_levels = amqp=WARN,amqplib=WARN,boto=WARN,qpid=WARN,sqlalchemy=WARN,suds=INFO,oslo.messaging=INFO,oslo_messaging=INFO,iso8601=WARN,requests.packages.urllib3.connectionpool=WARN,urllib3.connectionpool=WARN,websocket=WARN,requests.packages.urllib3.util.retry=WARN,urllib3.util.retry=WARN,keystonemiddleware=WARN,routes.middleware=WARN,stevedore=WARN,taskflow=WARN,keystoneauth=WARN,oslo.cache=INFO,oslo_policy=INFO,dogpile.core.dogpile=INFO

in /etc/neutron/plugins/ml2/linuxbridge_agent.ini without touching the actual content as installed by the centos-stream-8 package (full line below), and I restart the linuxbridge-agent.

The log trace now mentions logs are enabled the the log trace look more like the reference you give. In fact, there is not trace of any error, and linuxbridge-agent seems to run normally:

[root@w3 ~]# openstack network agent list
+--------------------------------------+--------------------+---------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+---------------+-------------------+-------+-------+---------------------------+
| 330269b7-b73c-4207-abc7-21f1a2972b7b | Linux bridge agent | w6.int.lunarc | None | :-) | UP | neutron-linuxbridge-agent |
| 83d16241-8a3a-42b0-beda-87246d945dc1 | L3 agent | w3.int.lunarc | nova | XXX | UP | neutron-l3-agent |
| a52ab60f-d893-491d-a43e-823a0d482810 | Linux bridge agent | w3.int.lunarc | None | :-) | UP | neutron-linuxbridge-agent |
| abd75644-d895-41ae-94fa-6c4351cbc4bf | Metadata agent | w3.int.lunarc ...

Read more...

Revision history for this message
Nicolas Melot (nicmel) wrote (last edit ):
Download full text (3.6 KiB)

I have the same symptoms with l3-agent. I set /etc/neutron/l3_agent.ini\[DEFAULT]\default_log_levels

default_log_levels = amqp=DEBUG,amqplib=DEBUG,boto=DEBUG,qpid=DEBUG,sqlalchemy=DEBUG,suds=DEBUG,oslo.messaging=DEBUG,oslo_messaging=DEBUG,iso8601=DEBUG,requests.packages.urllib3.connectionpool=DEBUG,urllib3.connectionpool=DEBUG,websocket=DEBUG,requests.packages.urllib3.util.retry=DEBUG,urllib3.util.retry=DEBUG,keystonemiddleware=DEBUG,routes.middleware=DEBUG,stevedore=DEBUG,taskflow=DEBUG,keystoneauth=DEBUG,oslo.cache=DEBUG,oslo_policy=DEBUG,dogpile.core.dogpile=DEBUG

and I joined the content of /var/log/neutron/l3-agent.log to this post. I did not find anything helpful the l3-agent.log, except perhaps "'cluster_name': '<email address hidden>'" in line 765 where, assuming this is how to connect to rabbitmq, the username is wrong and the config refers to the correct host's A dns record (w3.int.lunarc) instead of its CNAME (osrabbitmq.int.lunarc) I usually use; I don't know where I can set the correct values. Finally, see below for the output of journalctl -fu neutron-l3-agent:
[root@w3 ~]# journalctl -fu neutron-l3-agent
Mar 18 08:05:13 w3.int.lunarc systemd[1]: Stopping OpenStack Neutron Layer 3 Agent...
Mar 18 08:05:13 w3.int.lunarc systemd[1]: neutron-l3-agent.service: Succeeded.
Mar 18 08:05:13 w3.int.lunarc systemd[1]: Stopped OpenStack Neutron Layer 3 Agent.
Mar 18 08:05:13 w3.int.lunarc systemd[1]: Started OpenStack Neutron Layer 3 Agent.
Mar 18 08:05:18 w3.int.lunarc neutron-l3-agent[259430]: Traceback (most recent call last):
Mar 18 08:05:18 w3.int.lunarc neutron-l3-agent[259430]: File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Mar 18 08:05:18 w3.int.lunarc neutron-l3-agent[259430]: timer()
Mar 18 08:05:18 w3.int.lunarc neutron-l3-agent[259430]: File "/usr/lib/python3.6/site-packages/eventlet/hubs/timer.py", line 59, in __call__
Mar 18 08:05:18 w3.int.lunarc neutron-l3-agent[259430]: cb(*args, **kw)
Mar 18 08:05:18 w3.int.lunarc neutron-l3-agent[259430]: File "/usr/lib/python3.6/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
Mar 18 08:05:18 w3.int.lunarc neutron-l3-agent[259430]: waiter.switch()
Mar 18 08:05:18 w3.int.lunarc neutron-l3-agent[259430]: greenlet.error: cannot switch to a different thread

and the status of network agents
[root@w3 ~]# openstack network agent list
+--------------------------------------+--------------------+---------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+---------------+-------------------+-------+-------+---------------------------+
| 330269b7-b73c-4207-abc7-21f1a2972b7b | Linux bridge agent | w6.int.lunarc | None | :-) | UP | neutron-linuxbridge-agent |
| 83d16241-8a3a-42b0-beda-87246d945dc1 | L3 agent | w3.int.lunarc | nova | XXX | UP | neutron-l3-agent |
| a52ab60f-d893-491d-a43e-823a0d482810 | Linux bridge agent | w3.int.lunarc | None ...

Read more...

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Nicolas:

You are maybe hitting [1]. Try to upgrade your eventlet version. Or if this is not possible, downgrade it to a safe value. If one node is working fine, use the same versions in the second one.

Regards.

[1]https://github.com/eventlet/eventlet/issues/662

Revision history for this message
Nicolas Melot (nicmel) wrote :

I redeployed from (almost) scratch: new machine, fresh Centos stream 8 OS install but reuse of old config files to redeploy openstack Xena. I hit the exact same problem. Looks like either there is something wrong with my config files, either there is a serious trouble with eventlet or one of its dependencies.
In any case, rollbacking to pre-0.29 eventlet as the issue mentions breaks dependencies, unless I drop Xena for Ussuri. Not sure what are reasonable options for me at this point.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

This is a duplicate of [1] (also check [2] and [3]). For non-wsgi services, the "oslo_messaging_rabit.heartbeat_in_pthread" option should be "False".

[1]https://bugs.launchpad.net/oslo.messaging/+bug/1934937
[2]https://bugs.launchpad.net/tripleo/+bug/1984076
[3]https://bugzilla.redhat.com/show_bug.cgi?id=2115383

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.