Shutting down RabbitMQ causes nova-compute.service to be reported down

Bug #2054502 reported by Chuan Li


Affects                    Status       Importance   Assigned to   Milestone
OpenStack Compute (nova)   Invalid      Undecided    Unassigned
oslo.messaging             Incomplete   Undecided    Unassigned

Bug Description

Description
===========
We have an OpenStack deployment with a 3-node RabbitMQ cluster and dozens of nova-compute nodes.
When we shut down 1 of the 3 RabbitMQ nodes, Nagios alerted that nova-compute.service was down on 2 of the nova-compute nodes.

Upon checking, we found that nova-compute.service was in fact still running:

nova-compute.service - OpenStack Compute
     Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-02-16 00:42:47 UTC; 4 days ago
   Main PID: 10130 (nova-compute)
      Tasks: 32 (limit: 463517)
     Memory: 248.2M
        CPU: 55min 5.217s
     CGroup: /system.slice/nova-compute.service
             ├─10130 /usr/bin/python3 /usr/bin/nova-compute --config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf --log-file=/var/log/nova/nova-compute.log
             ├─11527 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpc0sosqey/privsep.sock
             └─11702 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context nova.privsep.sys_admin_pctxt --privsep_sock_path /tmp/tmp2ik7rchu/privsep.sock

Feb 16 00:42:53 node002 sudo[11540]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64060)
Feb 16 00:42:54 node002 sudo[11540]: pam_unix(sudo:session): session closed for user root
Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 20 04:55:31 node002 nova-compute[10130]: timer()
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 20 04:55:31 node002 nova-compute[10130]: cb(*args, **kw)
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 20 04:55:31 node002 nova-compute[10130]: waiter.switch()
Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread

My guess is that when a RabbitMQ node is shut down, nova-compute runs into contention or inconsistent state while handling connection recovery.
Restarting nova-compute.service resolves the problem.
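
For reference, the recovery on an affected hypervisor is simply:

systemctl restart nova-compute.service

and the service state can then be re-checked from a controller, for example with:

openstack compute service list --host node002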

Logs & Configs
==============
The nova-compute.log:

2024-02-20 04:55:28.675 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2024-02-20 04:55:29.677 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
2024-02-20 04:55:30.682 10130 INFO oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] Reconnected to AMQP server on 10.10.10.52:5672 via [amqp] client with port 35346.
2024-02-20 04:55:31.361 10130 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
Then systemctl status nova-compute shows:
Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 20 04:55:31 node002 nova-compute[10130]: timer()
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 20 04:55:31 node002 nova-compute[10130]: cb(*args, **kw)
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 20 04:55:31 node002 nova-compute[10130]: waiter.switch()
Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread

Ubuntu Jammy + nova-compute (3:25.2.0-0ubuntu1) + rabbitmq-server (3.9)

nova.conf:

[oslo_messaging_rabbit]

[oslo_messaging_notifications]
driver = messagingv2
transport_url = *********

[notifications]
notification_format = unversioned

Tags: sts
Sylvain Bauza (sylvain-bauza) wrote:

This isn't a Nova bug, maybe an oslo.messaging problem. In any case, while the nova-compute service is reported down, the servicegroup API keeps the scheduler from selecting that host, so this shouldn't be a problem.
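
(For triage: a quick way to see what the scheduler sees, independently of the systemd unit state, is something like

openstack compute service list --service nova-compute

whose State column reflects the servicegroup liveness reports from each compute host.)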

Changed in nova:
status: New → Invalid
Takashi Kajinami (kajinamit) wrote:

Could you please share the version of the oslo.messaging package? I suspect this is the issue caused by heartbeat_in_pthread, which we attempted to enable by default in an old release.
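
(For reference, the option in question is heartbeat_in_pthread under [oslo_messaging_rabbit]. If it turns out to be the cause, a minimal sketch of pinning it explicitly in nova.conf while testing would be:

[oslo_messaging_rabbit]
# when true, the AMQP heartbeat runs in a native thread instead of a green thread;
# the default changed between releases, so set it explicitly while testing
heartbeat_in_pthread = false

but please confirm the default of your installed oslo.messaging first.)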

Changed in oslo.messaging:
status: New → Incomplete
Chuan Li (lccn) wrote:

@kajinamit I am sorry for the late reply.

Here is the package version:

ii python3-oslo.messaging 12.13.0-0ubuntu1.1 all oslo messaging library - Python 3.x
