Heartbeat in pthreads in nova-wallaby crashes with greenlet error

Bug #1934937 reported by Adam Harwell
66
This bug affects 12 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
New
Undecided
Unassigned
oslo.messaging
Fix Released
Undecided
Unassigned

Bug Description

When performing a heartbeat to rabbit (inside a nova-compute process), there is a greenlet error which causes a hard crash.

I'm not exactly sure what details are relevant, but can provide more info if there's something that will be useful!

This is on RHEL7 (essentially... somewhat custom image based on it)

Log snippet:

```
2021-07-07 19:34:52,686 DEBUG [oslo.messaging._drivers.impl_rabbit] /opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_rabbit.py:__init__:608 [279fc413-9d7c-4fad-89e8-8de308658947] Connecting to AMQP server on 127.0.0.1:5671
2021-07-07 19:34:52,699 DEBUG [amqp.connection.Connection.heartbeat_tick] /opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/amqp/connection.py:heartbeat_tick:726 heartbeat_tick : for connection 79f7cf4331b34cb0a2e3608281076773
2021-07-07 19:34:52,699 DEBUG [amqp.connection.Connection.heartbeat_tick] /opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/amqp/connection.py:heartbeat_tick:740 heartbeat_tick : Prev sent/recv: None/None, now - 6/6, monotonic - 9634.717472491, last_heartbeat_sent - 9634.717470288, heartbeat int. - 60 for connection 79f7cf4331b34cb0a2e3608281076773
2021-07-07 19:34:52,700 DEBUG [amqp.connection.Connection.heartbeat_tick] /opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/amqp/connection.py:heartbeat_tick:726 heartbeat_tick : for connection 79f7cf4331b34cb0a2e3608281076773
2021-07-07 19:34:52,701 DEBUG [amqp.connection.Connection.heartbeat_tick] /opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/amqp/connection.py:heartbeat_tick:740 heartbeat_tick : Prev sent/recv: 6/6, now - 6/6, monotonic - 9634.719438155, last_heartbeat_sent - 9634.717470288, heartbeat int. - 60 for connection 79f7cf4331b34cb0a2e3608281076773
2021-07-07 19:34:52,718 DEBUG [amqp] /opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/amqp/connection.py:_on_start:382 Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': '<email address hidden>', 'copyright': 'Copyright (c) 2007-2020 VMware, Inc. or its affiliates.', 'information': 'Licensed under the MPL 1.1. Website: https://rabbitmq.com', 'platform': 'Erlang/OTP 23.0.2', 'product': 'RabbitMQ', 'version': '3.8.5'}, mechanisms: [b'PLAIN', b'AMQPLAIN', b'EXTERNAL'], locales: ['en_US']
2021-07-07 19:34:52,719 DEBUG [amqp] /opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/amqp/channel.py:__init__:104 using channel_id: 1
2021-07-07 19:34:52,720 DEBUG [amqp] /opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/amqp/channel.py:_on_open_ok:444 Channel open
2021-07-07 19:34:52,721 DEBUG [amqp.connection.Connection.heartbeat_tick] /opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/amqp/connection.py:heartbeat_tick:726 heartbeat_tick : for connection c0299792d20e42a2b0a17d037d7d3058
Traceback (most recent call last):
File "/opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
timer()
File "/opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/eventlet/hubs/timer.py", line 59, in __call__
cb(*args, **kw)
File "/opt/openstack/venv/nova-23.0.20031033070/lib/python3.8/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
waiter.switch()
greenlet.error: cannot switch to a different thread
```

Versions:

```
oslo.messaging==12.7.1
nova==23.0.2 (packaged locally from stable/wallaby as of July 3, 2021)
```

Revision history for this message
Herve Beraud (herveberaud) wrote :

Hello,

Thanks for reported this bug.

Here are some notes related to the releases of the "heartbeat in pthread" patches.

The heartbeat in a python thread is the default value since oslo.messaging 12.6.0 [1]. This version was released during Wallaby (9 months ago - October 2020) [2].

Please, can you tell us if this behavior is systematic or not?

Can you reproduce it all the time, or does this an isolated case?

[1] Tags who contains add5ab4
```
$ git tag --contains add5ab4
12.6.0
12.6.1
12.7.0
12.7.1
12.8.0
```
[2] https://releases.openstack.org/victoria/index.html#wallaby-oslo-messaging

Revision history for this message
Adam Harwell (adam-harwell) wrote :

Yes, I can reproduce it consistently. The only way I can get the nova-compute service to start without crashing nearly instantly is by setting the config flag to disable the heartbeat-in-pthread functionality.

Revision history for this message
Herve Beraud (herveberaud) wrote :

Sean said setting `[oslo_messaging_rabbit]/heartbeat_in_pthread=false` apprently fixed the issue.

However, for now this parameter is deprecated, so we should reconsider the current depreaction of this one and the future removing of this option, especially if it help to solve the problem.

Indeed, native threads and green threads are mixed there.

This option can help us to restore the green thread and avoid us to mixing different types of threads, however, the problem is that the heartbeat won't work with green thread.

Indeed, with a heartbeat in a green thread, with mod_wsgi the python interpreter in the worker process does not run except when actively processing a request so:
(1) request comes in to httpd,
(2) python interpreter is woken up and handed the request,
(3) python interpreter processes request
and returns result back to httpd,
(4) httpd parks the interpreter until the next request comes in now.

What happens when there is a long delay between (4) and then next request at (1)?

The interpreter doesn't run, so the heartbeat thread doesn't run either, and thus the client connection doesn't heartbeat and then you get rabbitmq closing the connection on the other end meanwhile, the next time a request *does* come in, python wakes up and tries to use the old connection but it gets ECONNRESET or whatever from the os at that point in time so it logs an error and then reconnects.

oslo.messaging suffer from the this architecture (mod_wsgi, gevent), I'm not sure that we could solve this problem on the oslo.messaging side. Either we will suffer from the deconnections or from the mixed types of threads and their side effects (this described bug).

Do you have any idea?

Revision history for this message
Adam Harwell (adam-harwell) wrote :

I am woefully unfamiliar with the threading model concerns here, so I don't really have any ideas or insight to give. I can say that the only place I've really run into this problem so far is *nova-compute* service, which isn't even running under wsgi AFAIK, it's just a backend daemon. I'm not even sure why any of this has to apply there. Maybe that's why it is good to be able to disable this feature for it?

Revision history for this message
Adam Harwell (adam-harwell) wrote :

I don't seem to be able to edit my previous comment, but a quick note:

It's *possible* that I've only run into it in this one place because nova is the first service I've actually tried to run on wallaby (and thus, the affected versions of oslo_messaging). Once I start trying out other services, it's possible I will encounter this again. I will provide further feedback when I know if other services do or don't seem to be affected.

Revision history for this message
Herve Beraud (herveberaud) wrote :

Hello,

Thanks Adam for your reply.

So if nova-compute isn't even running under wsgi then running the heartbeat into a native python thread isn't mandatory either. So, yes we should allow to disable this feature and so we should reconsider our current deprecation.

Let us know if you see similar problem outside nova on Wallaby.

For now I'll propose a patch to drop the deprecation of this option.

Lee Yarwood (lyarwood)
no longer affects: nova
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)
Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
Herve Beraud (herveberaud) wrote :

This patch will [1] remove the "deprecation" of the option that allowed us to switch between green threads and native python threads. We will keep it.

[1] https://review.opendev.org/c/openstack/oslo.messaging/+/800621

Revision history for this message
Adam Harwell (adam-harwell) wrote :

Thanks for that!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/800621
Committed: https://opendev.org/openstack/oslo.messaging/commit/d24edef117a356a473247db35a020a1f2c68f9ae
Submitter: "Zuul (22348)"
Branch: master

commit d24edef117a356a473247db35a020a1f2c68f9ae
Author: Hervé Beraud <email address hidden>
Date: Tue Jul 13 12:50:48 2021 +0200

    Remove deprecation of heartbeat_in_pthread

    In some circumstances services can be executed outside of mod_wsgi and
    in a monkey patched environment. In this context we need to leave the
    possibility to users to execute the heartbeat in a green thread.

    The heartbeat_in_pthread was tagged as depreacted few months and planned
    for a future removal. These changes drop this deprecation to allow to
    enable green threads if needed.

    Closes-Bug: #1934937
    Change-Id: Iee2e5a6f7d71acba70bbc857f0bd7d83e32a7b8c

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.11.0

This issue was fixed in the openstack/oslo.messaging 12.11.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/852251

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/852251
Committed: https://opendev.org/openstack/oslo.messaging/commit/e44f286ebca0fbde5eae2f7eb9a21ba55ba2a549
Submitter: "Zuul (22348)"
Branch: master

commit e44f286ebca0fbde5eae2f7eb9a21ba55ba2a549
Author: Slawek Kaplonski <email address hidden>
Date: Fri Aug 5 12:40:40 2022 +0200

    Change default value of "heartbeat_in_pthread" to False

    As was reported in the related bug some time ago, setting that
    option to True for nova-compute can break it as it's non-wsgi service.
    We also noticed same problems with randomly stucked non-wsgi services
    like e.g. neutron agents and probably the same issue can happen with
    any other non-wsgi service.

    To avoid that this patch changes default value of that config option
    to be False.
    Together with [1] it effectively reverts change done in [2] some time
    ago.

    [1] https://review.opendev.org/c/openstack/oslo.messaging/+/800621
    [2] https://review.opendev.org/c/openstack/oslo.messaging/+/747395

    Related-Bug: #1934937
    Closes-Bug: #1961402

    Change-Id: I85f5b9d1b5d15ad61a9fcd6e25925b7eeb8bf6e7

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/854904

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

We just stumbled into this issue in our openstack environment (wallaby).

Are there any plans to backport the default change back to wallaby and xena and also merge the proposed yoga MR?
Sure we set heartbeat_in_pthread to False for our environment, but more users that will upgrade to wallaby, xena and yoga will have a somewhat broken behaviour on default values when upgrading.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/876318

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/876318
Committed: https://opendev.org/openstack/oslo.messaging/commit/fd2381c723fe805b17aca1f80bfff4738fbe9628
Submitter: "Zuul (22348)"
Branch: master

commit fd2381c723fe805b17aca1f80bfff4738fbe9628
Author: Arnaud Morin <email address hidden>
Date: Fri Mar 3 11:16:56 2023 +0100

    Disable greenthreads for RabbitDriver "listen" connections

    When enabling heartbeat_in_pthread, we were restoring the "threading"
    python library from eventlet to original one in RabbitDriver but we
    forgot to do the same in AMQPDriverBase (RabbitDriver is subclass of
    AMQPDriverBase).

    We also need to use the original "queue" so that queues are not going to
    use greenthreads as well.

    Related-bug: #1961402
    Related-bug: #1934937
    Closes-bug: #2009138

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: I34ea0d1381e934297df2f793e0d2594ef8254f00

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/880187

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/880188

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/880189

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/880187
Committed: https://opendev.org/openstack/oslo.messaging/commit/3645839d162a989da46ba1c0fcbe7f32b48b9fb1
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 3645839d162a989da46ba1c0fcbe7f32b48b9fb1
Author: Arnaud Morin <email address hidden>
Date: Fri Mar 3 11:16:56 2023 +0100

    Disable greenthreads for RabbitDriver "listen" connections

    When enabling heartbeat_in_pthread, we were restoring the "threading"
    python library from eventlet to original one in RabbitDriver but we
    forgot to do the same in AMQPDriverBase (RabbitDriver is subclass of
    AMQPDriverBase).

    We also need to use the original "queue" so that queues are not going to
    use greenthreads as well.

    Related-bug: #1961402
    Related-bug: #1934937
    Closes-bug: #2009138

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: I34ea0d1381e934297df2f793e0d2594ef8254f00
    (cherry picked from commit fd2381c723fe805b17aca1f80bfff4738fbe9628)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/880188
Committed: https://opendev.org/openstack/oslo.messaging/commit/4d15b7c4fe0c14e285484d23c15fe5531e952679
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 4d15b7c4fe0c14e285484d23c15fe5531e952679
Author: Arnaud Morin <email address hidden>
Date: Fri Mar 3 11:16:56 2023 +0100

    Disable greenthreads for RabbitDriver "listen" connections

    When enabling heartbeat_in_pthread, we were restoring the "threading"
    python library from eventlet to original one in RabbitDriver but we
    forgot to do the same in AMQPDriverBase (RabbitDriver is subclass of
    AMQPDriverBase).

    We also need to use the original "queue" so that queues are not going to
    use greenthreads as well.

    Related-bug: #1961402
    Related-bug: #1934937
    Closes-bug: #2009138

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: I34ea0d1381e934297df2f793e0d2594ef8254f00
    (cherry picked from commit 3645839d162a989da46ba1c0fcbe7f32b48b9fb1)

tags: added: in-stable-zed
Revision history for this message
Alan Baghumian (alanbach) wrote :

Seems like the stable/yoga fix was never finalized. I just encountered this today on my Yoga cloud:

:~$ sudo systemctl status nova-compute.service
● nova-compute.service - OpenStack Compute
     Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2023-11-06 16:56:26 PST; 4 days ago
   Main PID: 1621 (nova-compute)
      Tasks: 30 (limit: 314572)
     Memory: 240.3M
     CGroup: /system.slice/nova-compute.service
             ├─ 1621 /usr/bin/python3 /usr/bin/nova-compute --config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf --log-file=/var/log/nova/nova-compute.log
             ├─ 2299 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpfkxvuql0/privsep.sock
             └─103858 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context nova.privsep.sys_admin_pctxt --privsep_sock_path /tmp/tmp499vcc31/privsep.sock

Nov 11 06:17:08 os-vm-6 nova-compute[1621]: waiter.switch()
Nov 11 06:17:08 os-vm-6 nova-compute[1621]: greenlet.error: cannot switch to a different thread
Nov 11 06:18:24 os-vm-6 nova-compute[1621]: Traceback (most recent call last):
Nov 11 06:18:24 os-vm-6 nova-compute[1621]: File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Nov 11 06:18:24 os-vm-6 nova-compute[1621]: timer()
Nov 11 06:18:24 os-vm-6 nova-compute[1621]: File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Nov 11 06:18:24 os-vm-6 nova-compute[1621]: cb(*args, **kw)
Nov 11 06:18:24 os-vm-6 nova-compute[1621]: File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Nov 11 06:18:24 os-vm-6 nova-compute[1621]: waiter.switch()
Nov 11 06:18:24 os-vm-6 nova-compute[1621]: greenlet.error: cannot switch to a different thread

Is it possible to finalize it?

Thanks much,
Alan

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/854904
Committed: https://opendev.org/openstack/oslo.messaging/commit/44b3427bc9efea9f341edfb8ea7aea38f25d1a5a
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 44b3427bc9efea9f341edfb8ea7aea38f25d1a5a
Author: Slawek Kaplonski <email address hidden>
Date: Fri Aug 5 12:40:40 2022 +0200

    Change default value of "heartbeat_in_pthread" to False

    As was reported in the related bug some time ago, setting that
    option to True for nova-compute can break it as it's non-wsgi service.
    We also noticed same problems with randomly stucked non-wsgi services
    like e.g. neutron agents and probably the same issue can happen with
    any other non-wsgi service.

    To avoid that this patch changes default value of that config option
    to be False.
    Together with [1] it effectively reverts change done in [2] some time
    ago.

    [1] https://review.opendev.org/c/openstack/oslo.messaging/+/800621
    [2] https://review.opendev.org/c/openstack/oslo.messaging/+/747395

    Related-Bug: #1934937
    Closes-Bug: #1961402

    Change-Id: I85f5b9d1b5d15ad61a9fcd6e25925b7eeb8bf6e7

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/880189
Committed: https://opendev.org/openstack/oslo.messaging/commit/15779aa0733f3c9bd1f85fa8aea25e3bd8915a1c
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 15779aa0733f3c9bd1f85fa8aea25e3bd8915a1c
Author: Arnaud Morin <email address hidden>
Date: Fri Mar 3 11:16:56 2023 +0100

    Disable greenthreads for RabbitDriver "listen" connections

    When enabling heartbeat_in_pthread, we were restoring the "threading"
    python library from eventlet to original one in RabbitDriver but we
    forgot to do the same in AMQPDriverBase (RabbitDriver is subclass of
    AMQPDriverBase).

    We also need to use the original "queue" so that queues are not going to
    use greenthreads as well.

    Related-bug: #1961402
    Related-bug: #1934937
    Closes-bug: #2009138

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: I34ea0d1381e934297df2f793e0d2594ef8254f00
    (cherry picked from commit 4d15b7c4fe0c14e285484d23c15fe5531e952679)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.