Excessive number of ConnectionForced: Too many heartbeats missed in logs

Bug #1883038 reported by norman shen
26
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Invalid
Undecided
Unassigned
Stein
Fix Released
Undecided
Unassigned
Train
Fix Released
Undecided
Unassigned
oslo.messaging
Fix Released
Undecided
norman shen

Bug Description

We are using Openstack Rocky as well as rabbitmq 3.7.4 in our production.

Occasionally I saw many following lines in log

2020-06-11 02:03:06.753 3877409 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2020-06-11 02:03:21.754 3877409 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2020-06-11 02:03:36.755 3877409 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2020-06-11 02:03:51.756 3877409 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2020-06-11 02:04:06.757 3877409 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2020-06-11 02:04:21.757 3877409 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2020-06-11 02:04:36.758 3877409 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2020-06-11 02:04:51.759 3877409 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed

heartbeart interval is 60s and rate is 2. Although it is screaming for missing hearbeats seems rabbitmq server is running fine and messages are received and processed successfully.

***************************************************

SRU Details
-----------

[Impact]
AMQP messages are dropped sometimes resulted in resource creation errors (happened on an environment twice in a week).
Catching the ConnectionForced AMQP connection and reestablish the connection immediately will remediate the issue.

[Test Case]
Reproducing the issue is trickysome. Here are the steps that might help in reproducing the issue.

1. Deploy OpenStack
    (If stsstack-bundles project is used, run command ./generate-bundle.sh -s bionic -r stein -n ddmi:stsstack --run)
2. Change heartbeat_timeout_threshold to 20s in nova.conf and restart nova-api
On nova-cloud-controller,

[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 20

systemctl restart apache2.service

3. Create and delete instances continuously

./tools/instance_launch.sh 10 cirros # command on stsstack-bundles
openstack server list -c ID -f value | xargs openstack server delete

4. On rabbitmq server, drop packets from nova-api -> rabbitmq and allow them randomly
sudo iptables -A INPUT -p tcp --dport 5672 -s 10.5.1.55 -j DROP
sudo iptables -D INPUT 1

5. Perform steps 3,4 until you see the following message in nova-api log
WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: amqp.exceptions.ConnectionForced: Too many heartbeats missed

6. Install the fixed python-oslo.messaging package on nova-cloud-controller
   And restart apache service.

7. Perform steps 3,4 and verify nova-api log for the following INFO message.
INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Too many heartbeats missed

As the above test case is random in nature to reproduce, as additional measure, continuous integration tests for nova-cloud-controller will be run against the packages that are in -proposed.

[Regression Potential]
I do not foresee any regression potential as the patch just adds a new exception and reconnects to AMQP server immediately.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.opendev.org/738538

Changed in oslo.messaging:
assignee: nobody → norman shen (jshen28)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/738538
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=26e8bb4cc9097fddf3c2e34599b131e3f5d73867
Submitter: Zuul
Branch: master

commit 26e8bb4cc9097fddf3c2e34599b131e3f5d73867
Author: shenjiatong <email address hidden>
Date: Tue Jun 30 08:56:26 2020 +0800

    Catch ConnectionForced Exception

    Occasionally I saw large number of connectionforced exceptions.
    such exceptions cannot be healed automatically and requires a reboot.
    catch it and ensure connection established.

    Change-Id: I5c468d10230f05438df780e045f1e92662edf070
    Closes-Bug: #1883038

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/743647

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/ussuri)

Reviewed: https://review.opendev.org/743647
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=360deb89a7e95d332267564b486219e389a793f7
Submitter: Zuul
Branch: stable/ussuri

commit 360deb89a7e95d332267564b486219e389a793f7
Author: shenjiatong <email address hidden>
Date: Tue Jun 30 08:56:26 2020 +0800

    Catch ConnectionForced Exception

    Occasionally I saw large number of connectionforced exceptions.
    such exceptions cannot be healed automatically and requires a reboot.
    catch it and ensure connection established.

    Change-Id: I5c468d10230f05438df780e045f1e92662edf070
    Closes-Bug: #1883038
    (cherry picked from commit 26e8bb4cc9097fddf3c2e34599b131e3f5d73867)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/744262

Revision history for this message
Carlos Augusto da Silva Martins (carlos.martins) wrote :

This affect branch Stable/Stein too

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/train)

Change abandoned by norman shen (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/744262

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/757242

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/train)

Change abandoned by "Hervé Beraud <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/757242
Reason: rocky is now unmaintained

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "norman shen <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/757242

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/train)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/744262
Committed: https://opendev.org/openstack/oslo.messaging/commit/54ced5361b49fb9b0cf53ff2e6f5825ba7ae513a
Submitter: "Zuul (22348)"
Branch: stable/train

commit 54ced5361b49fb9b0cf53ff2e6f5825ba7ae513a
Author: shenjiatong <email address hidden>
Date: Tue Jun 30 08:56:26 2020 +0800

    Catch ConnectionForced Exception

    Occasionally I saw large number of connectionforced exceptions.
    such exceptions cannot be healed automatically and requires a reboot.
    catch it and ensure connection established.

    Changes:
      oslo_messaging/_drivers/impl_rabbit.py

    NOTE(stephenfin): We need to add the 'absolute_import' future to disable
    the legacy import flow. Without this, the import of 'amqp.exceptions' in
    'oslo_messaging._drivers.impl_rabbit' attempts to import 'exceptions'
    from the sibling module, 'oslo_messaging._drivers.amqp', instead of the
    top-level 'amqp' module. Thank God for Python 3.x, eh? More information
    is available at [1], for those curious enough.

    [1] https://stackoverflow.com/q/33743880/

    Change-Id: I5c468d10230f05438df780e045f1e92662edf070
    Closes-Bug: #1883038
    (cherry picked from commit 26e8bb4cc9097fddf3c2e34599b131e3f5d73867)
    (cherry picked from commit 360deb89a7e95d332267564b486219e389a793f7)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/807240

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/807691

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/807693

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/stein)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/807240
Committed: https://opendev.org/openstack/oslo.messaging/commit/e05bb37ae2b91df0a8391fa1f3c6d91de24aa9f1
Submitter: "Zuul (22348)"
Branch: stable/stein

commit e05bb37ae2b91df0a8391fa1f3c6d91de24aa9f1
Author: shenjiatong <email address hidden>
Date: Tue Jun 30 08:56:26 2020 +0800

    Catch ConnectionForced Exception

    Occasionally I saw large number of connectionforced exceptions.
    such exceptions cannot be healed automatically and requires a reboot.
    catch it and ensure connection established.

    Changes:
      oslo_messaging/_drivers/impl_rabbit.py

    NOTE(stephenfin): We need to add the 'absolute_import' future to disable
    the legacy import flow. Without this, the import of 'amqp.exceptions' in
    'oslo_messaging._drivers.impl_rabbit' attempts to import 'exceptions'
    from the sibling module, 'oslo_messaging._drivers.amqp', instead of the
    top-level 'amqp' module. Thank God for Python 3.x, eh? More information
    is available at [1], for those curious enough.

    NOTE(hberaud): conflicts within the cherry-pick on stable/stein due to
    removed log translation and i18n on stable/train [2]. I decided to ignore this
    patch [2] for stable/stein first to prepare cherry-picking on previous stable
    branches (rocky, queens) and secondly because I don't think we want it
    on older branches.

    [1] https://stackoverflow.com/q/33743880/
    [2]
    https://opendev.org/openstack/oslo.messaging/commit/4f385720d7d43aadfaff7c6f41b0534312b23ba8

    Change-Id: I5c468d10230f05438df780e045f1e92662edf070
    Closes-Bug: #1883038
    (cherry picked from commit 26e8bb4cc9097fddf3c2e34599b131e3f5d73867)
    (cherry picked from commit 360deb89a7e95d332267564b486219e389a793f7)
    (cherry picked from commit 54ced5361b49fb9b0cf53ff2e6f5825ba7ae513a)

tags: added: in-stable-stein
tags: added: sts
description: updated
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :
no longer affects: oslo.messaging (Ubuntu)
no longer affects: oslo.messaging (Ubuntu Bionic)
no longer affects: cloud-archive/queens
no longer affects: cloud-archive/rocky
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

SRU team,
Debdiff for UCA Train/Stein attached

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello norman, or anyone else affected,

Accepted python-oslo.messaging into train-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:train-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-train-needed to verification-train-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-train-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-train-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello norman, or anyone else affected,

Accepted python-oslo.messaging into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-archive:
status: New → Invalid
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

Verified on train-proposed and stein-proposed and test case is successful. oslo.messaging retries the connection after too many heartbeats.

Verification on Train:
$ sudo apt list --installed | grep oslo.messaging

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

python3-oslo.messaging/bionic-proposed,now 9.7.1-0ubuntu3~cloud2 all [installed,automatic]

$ sudo grep -inr "Too many heartbeats" /var/log/nova/
/var/log/nova/nova-api-wsgi.log:8833:2022-01-20 05:06:10.767 31051 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Too many heartbeats missed

Verification on Stein:
$ sudo apt list --installed | grep oslo.messaging

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

python3-oslo.messaging/bionic-proposed,now 9.5.0-0ubuntu1~cloud3 all [installed,automatic]

$ sudo grep -inr "Too many heartbeats missed" /var/log/nova/
/var/log/nova/nova-api-wsgi.log:4978:2022-01-20 04:44:28.501 2907 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Too many heartbeats missed

tags: added: verification-stein-done verification-train-done
removed: verification-train-needed
tags: added: verification-done
Revision history for this message
Corey Bryant (corey.bryant) wrote : Update Released

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package python-oslo.messaging - 9.7.1-0ubuntu3~cloud2
---------------

 python-oslo.messaging (9.7.1-0ubuntu3~cloud2) bionic-train; urgency=medium
 .
   * Catch ConnectionForced Exception (LP: #1883038)
     - d/p/0001-Catch-ConnectionForced-Exception.patch

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package python-oslo.messaging - 9.5.0-0ubuntu1~cloud3
---------------

 python-oslo.messaging (9.5.0-0ubuntu1~cloud3) bionic-stein; urgency=medium
 .
   * Catch ConnectionForced Exception (LP: #1883038)
     - d/p/0001-Catch-ConnectionForced-Exception.patch

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/queens)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/queens
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/807693
Reason: This branch of this project has transitioned to End of Life, open patches need to be abandoned to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/rocky)

Change abandoned by "Stephen Finucane <email address hidden>" on branch: stable/rocky
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/807691

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging stein-eol

This issue was fixed in the openstack/oslo.messaging stein-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging train-eol

This issue was fixed in the openstack/oslo.messaging train-eol release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.