Not enough heartbeat frames sent in heartbeat threshold interval

Bug #2008734 reported by Arnaud Morin
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
oslo.messaging
Fix Released
High
Arnaud Morin

Bug Description

The current rabbitmq driver is not sending enough heartbeat frames in the interval specified in the config.

The default config is:
heartbeat_threshold_timeout = 60
heartbeat_rate = 2

As per heartbeat AMQP documentation [1], if two frames are missing in a interval, then the connection should be closed by the broker.

Having heartbeat_threshold_timeout = 60 means the connection timeout on rabbitmq side will be set to 60secs.
Rabbitmq expects then to receive at least 2 heartbeats in these 60secs (see [2]).
If the two heartbeats are missing, rabbitmq will close the connection.

The current implementation will send a heartbeat frame only every 75 seconds, so the connection will be closed.

Here is why it's every 75 secs:

First, the kombu connection is created with a heartbeat set to:
kombu.connection.Connection(
  ...
  heartbeat=self.heartbeat_timeout_threshold,
  ...
)

So, kombu will send heartbeat frames only every 60 secs.

Secondly, to let kombu send the heartbeats correctly, the heartbeat_check needs to be executed every seconds. However, the current implementation is saying:
# NOTE(sileht): kombu recommend to run heartbeat_check every
# seconds, but we use a lock around the kombu connection
# so, to not lock to much this lock to most of the time do nothing
# expected waiting the events drain, we start heartbeat_check and
# retrieve the server heartbeat packet only two times more than
# the minimum required for the heartbeat works
# (heartbeat_timeout/heartbeat_rate/2.0, default kombu
# heartbeat_rate is 2)

So it will wait an extra 60/2/2 = 15 secs before actually sending the frame.

60 + 15 = 75.

[1] https://www.amqp.org/specification/0-9-1/amqp-org-download
[2] https://www.rabbitmq.com/heartbeats.html#heartbeats-interval

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)
Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/875615

Revision history for this message
Arnaud Morin (arnaud-morin) wrote :

Actually, the problem is not only in oslo_messaging, but in amqp library itself.
Here is the proposal fix:

https://github.com/celery/py-amqp/pull/405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (master)

Change abandoned by "Arnaud Morin <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/875614
Reason: fix is done is amqp https://github.com/celery/py-amqp/pull/405

Changed in oslo.messaging:
assignee: nobody → Arnaud Morin (arnaud-morin)
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/875615
Committed: https://opendev.org/openstack/oslo.messaging/commit/36fb5bceabe08a982ebd52e4a8f005cd26fdf6b8
Submitter: "Zuul (22348)"
Branch: master

commit 36fb5bceabe08a982ebd52e4a8f005cd26fdf6b8
Author: Arnaud Morin <email address hidden>
Date: Tue Feb 28 07:25:09 2023 +0100

    Set default heartbeat_rate to 3

    Kombu recommend to run heartbeat_check every seconds but we use a lock
    around the kombu connection so, to not lock to much this lock to most of
    the time do nothing except waiting the events drain, we start
    heartbeat_check and retrieve the server heartbeat packet only two times
    more than the minimum required for the heartbeat works:
        heartbeat_timeout / heartbeat_rate / 2.0

    Because of this, we are not sending the heartbeat frames at correct
    intervals. E.G.

    If heartbeat_timeout=60 and rate=2, AMQP protocol expects to send a
    frame
    every 30sec.

    With the current heartbeat_check implementation, heartbeat_check will be
    called every:
        heartbeat_timeout / heartbeat_rate / 2.0 = 60 / 2 / 2.0 = 15
    Which will result in the following frame flow:
        T+0 --> do nothing (60/2 > 0)
        T+15 --> do nothing (60/2 > 15)
        T+30 --> do nothing (60/2 > 30)
        T+45 --> send a frame (60/2 < 45)
        ...

    With heartbeat_rate=3, the heartbeat_check will be executed more often:
        heartbeat_timeout / heartbeat_rate / 2.0 = 60 / 3 / 2.0 = 10
    Frame flow:
        T+0 --> do nothing (60/3 > 0)
        T+10 --> do nothing (60/3 > 10)
        T+20 --> do nothing (60/3 > 20)
        T+30 --> send a frame (60/3 < 30)
        ...

    Now we are sending the frame with correct intervals

    Closes-bug: #2008734

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: Ie646d254faf5e45ba46948212f4c9baf1ba7a1a8

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 14.5.0

This issue was fixed in the openstack/oslo.messaging 14.5.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.