Bug #1917645 “Nova can't create instances if RabbitMQ notificati...” : Bugs : oslo.messaging

Belmiro Moreira (moreira-belmiro-email-lists) on 2021-03-03

description:

updated

Balazs Gibizer (balazs-gibizer) on 2021-03-04

Changed in nova:
assignee:	nobody → Balazs Gibizer (balazs-gibizer)
status:	New → Fix Committed
status:	Fix Committed → Confirmed
importance:	Undecided → Medium
tags:	added: notifications

Balazs Gibizer (balazs-gibizer) wrote on 2021-03-04:

#1

There are two use cases to support:
1) the notifications are used for something critical (e.g. billing based on usage), and therefore a failure to send a notification should also lead to the failure of the resource consumption (e.g. new VM creation).

2) the notifications are only used for non-mission-critical things, like monitoring / telemetry, so a failed notification should not lead to the failure of the resource allocation.

The solution needs to support both cases, so we need to make the error handling configurable. Let's add a new boolean config option [notifications]sending_failure_is_fatal. The default needs to be True to keep the current behavior, so that exceptions are raised up the stack and force the failure of the overall operation. If it is set to False, exceptions are caught and logged instead of propagated.

The solution needs to work for both versioned and unversioned notifications.
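
A minimal sketch of the proposed switch, assuming a hypothetical wrapper around whatever callable emits the notification. The names send_notification, emit and NotificationError are illustrative only and are not nova or oslo.messaging APIs; the flag mirrors the proposed [notifications]sending_failure_is_fatal option.

```python
import logging

LOG = logging.getLogger(__name__)


class NotificationError(Exception):
    """Hypothetical stand-in for whatever the notifier raises on failure."""


def send_notification(emit, payload, sending_failure_is_fatal=True):
    """Call emit(payload) and handle failures according to the flag.

    With sending_failure_is_fatal=True the exception propagates, so the
    overall operation fails (the current behavior). With False the error
    is logged and the caller continues.
    """
    try:
        emit(payload)
        return True
    except Exception:
        if sending_failure_is_fatal:
            raise
        LOG.exception("Failed to send notification; continuing")
        return False
```

With the flag left at True the caller sees the original exception; with False it only gets a False return value and a log entry.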

Balazs Gibizer (balazs-gibizer) wrote on 2021-03-04:

#2

We might need to tweak how we use oslo.messaging to avoid getting stuck in a long retry loop for notifications.

Linda Guo (lihuiguo) wrote on 2021-07-28:

#3

Hi,

Is there a workaround for this bug? We have a cloud where nova is stuck at:

2021-07-28 02:27:05.175 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:27:06.487 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 32.0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:27:38.967 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:27:42.039 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:27:43.351 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 32.0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:28:17.143 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:28:18.903 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:28:20.215 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 32.0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:28:25.327 23270 INFO os_vif [-] Loaded VIF plugins: linux_bridge, noop, ovs
2021-07-28 02:28:29.431 23270 ERROR oslo.messaging._drivers.impl_rabbit [req-7f4949cf-f36c-427c-955b-0ce7f14a2b83 - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH

We tried restarting the nova services and the rabbitmq service, but it didn't help.

Balazs Gibizer (balazs-gibizer) wrote on 2021-07-30:

#4

@Linda: if you don't need the notifications nova emits, you can turn them off by setting [oslo_messaging_notifications]driver to "noop" in the nova configuration file.

https://docs.openstack.org/nova/latest/configuration/config.html#oslo_messaging_notifications.driver
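
For illustration, the workaround as a configuration fragment. The section and option names come from the comment above. The stdlib configparser check around it is only a hypothetical way to confirm that the fragment parses; nova itself reads its configuration through oslo.config, not configparser.

```python
import configparser

# Fragment to place in the nova configuration file (typically nova.conf).
NOVA_CONF_FRAGMENT = """
[oslo_messaging_notifications]
driver = noop
"""


def notification_driver(conf_text):
    """Return the configured notification driver from an INI-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.get("oslo_messaging_notifications", "driver")
```

The nova services need to be restarted to pick up the change.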

Mohammed Naser (mnaser) wrote on 2021-11-23:

#5

I've also found that even if you set `retry` to `0`, which is supposed to mean never retry, it still gets stuck retrying the connection:

https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/notify/notifier.py#L55-L58

It seems to be stuck retrying the connection, and it doesn't treat EHOSTUNREACH as a failure it should stop retrying on, so everything is permanently blocked.

Mohammed Naser (mnaser) wrote on 2021-11-23 (last edit on 2021-11-23):

#6

I've grabbed a GMR:

https://paste.opendev.org/show/811229/

Balazs Gibizer (balazs-gibizer) wrote on 2021-11-23:

#7

Thanks for the GMR, it is helpful. I can confirm from the code that the retry is applied only to message sending, not to connection establishment. Connection establishment is done purely inside kombu[1], but oslo_messaging does not pass any timeout value[2], so kombu retries forever.

[1] https://github.com/celery/kombu/blob/be44a0401417c868a1ef59e44cc57c12c987cd50/kombu/connection.py#L438
[2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/_drivers/impl_rabbit.py#L731
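
The difference between an unbounded and a bounded connection retry can be sketched in plain Python. This is a simplified stand-in and not the kombu or oslo.messaging code; connect_with_retry and its parameters are hypothetical names.

```python
import time


def connect_with_retry(connect, max_retries=None, interval=0.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds or the retry budget is used up.

    max_retries=None retries forever, which blocks the caller for as long
    as the broker is unreachable. An integer bounds the number of retries
    made after the first attempt; 0 means a single attempt.
    """
    attempt = 0
    while True:
        try:
            return connect()
        except OSError:
            if max_retries is not None and attempt >= max_retries:
                raise
            attempt += 1
            sleep(interval)
```

With max_retries=0 an EHOSTUNREACH-style OSError is raised to the caller after the first attempt instead of looping.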

Mohammed Naser (mnaser) wrote on 2021-11-23:

#8

As per sean-k-mooney's advice, I've added oslo.messaging to this bug, since it's more of an issue there than in Nova.

Balazs Gibizer (balazs-gibizer) wrote on 2021-11-23:

#9

OK. I agree that this is a bug in oslo.messaging: the doc of the notifier says that connection establishment should also honor the retry parameter[1], but as I showed above it does not.

[1] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/notify/messaging.py#L34-L36

Balazs Gibizer (balazs-gibizer) wrote on 2021-11-24:

#10

I did more investigation while creating an oslo.messaging test case that reproduces the problem.

1) the rabbit driver uses kombu, and the kombu interface allows defining max_retries for the connection attempt too, but the rabbit driver does not use it. So the rabbit driver is fixable.

2) the kafka driver uses confluent_kafka, and the producer interface[1] of that module is defined as asynchronous. Today the kafka driver simply ignores the retry parameter[2] for message sending. So fixing the kafka driver is not that simple.

[1] https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.Producer.produce
[2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/_drivers/impl_kafka.py#L290
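
To illustrate why an asynchronous producer makes a retry count on the send call hard to honor, here is a small in-memory sketch. AsyncProducer is a hypothetical stand-in and not the confluent_kafka API. The only point is that produce() just queues the message, so a delivery failure surfaces later in a callback rather than as an exception at the call site.

```python
from collections import deque


class AsyncProducer:
    """Hypothetical in-memory stand-in for an asynchronous producer."""

    def __init__(self, transport):
        self._transport = transport  # callable that really sends; may raise
        self._pending = deque()

    def produce(self, message, on_delivery):
        # Only queues the message, so it neither blocks nor fails here.
        self._pending.append((message, on_delivery))

    def flush(self):
        # Delivery happens here; failures reach the callback, not produce().
        while self._pending:
            message, on_delivery = self._pending.popleft()
            try:
                self._transport(message)
            except OSError as exc:
                on_delivery(exc, message)
            else:
                on_delivery(None, message)
```

A caller that only invokes produce() returns immediately even when the bus is unreachable, which matches the kafka behavior described above.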

OpenStack Infra (hudson-openstack) wrote on 2021-11-24: Fix proposed to oslo.messaging (master)

#11

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/819119

Balazs Gibizer (balazs-gibizer) on 2021-11-24

Changed in oslo.messaging:
assignee:	nobody → Balazs Gibizer (balazs-gibizer)

OpenStack Infra (hudson-openstack) wrote on 2021-11-24:

#12

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/819142

Changed in oslo.messaging:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2021-12-21: Fix merged to oslo.messaging (master)

#13

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/819119
Committed: https://opendev.org/openstack/oslo.messaging/commit/1db6de63a86812742cbc37a0f5fe1fd7a095dd7f
Submitter: "Zuul (22348)"
Branch: master

commit 1db6de63a86812742cbc37a0f5fe1fd7a095dd7f
Author: Balazs Gibizer <email address hidden>
Date: Wed Nov 24 15:55:35 2021 +0100

    Reproduce bug 1917645

    The [oslo_messaging_notification]retry parameter is not applied during
    connecting to the message bus. But the documentation implies it should[1][2].
    The two possible drivers, rabbit and kafka, behaves differently.

    1) The rabbit driver will retry the connection forever, blocking the caller
       process.

    2) The kafka driver also ignores the retry configuration but the
       notifier call returns immediately even if the notification is not
       (cannot) be delivered.

    This patch adds test cases to show the wrong behavior.

    [1] https://docs.openstack.org/oslo.messaging/latest/configuration/opts.html#oslo_messaging_notifications.retry
    [2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/notify/messaging.py#L31-L36

    Related-Bug: #1917645

    Change-Id: Id8557050157aecd3abd75c9114d3fcaecdfc5dc9

OpenStack Infra (hudson-openstack) wrote on 2022-01-12:

#14

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/819142
Committed: https://opendev.org/openstack/oslo.messaging/commit/7b3968d9b012e873a9b393fcefa578c46fca18c6
Submitter: "Zuul (22348)"
Branch: master

commit 7b3968d9b012e873a9b393fcefa578c46fca18c6
Author: Balazs Gibizer <email address hidden>
Date: Tue Nov 23 16:58:05 2021 +0100

    [rabbit] use retry parameters during notification sending

    The rabbit backend now applies the [oslo_messaging_notifications]retry,
    [oslo_messaging_rabbit]rabbit_retry_interval, rabbit_retry_backoff and
    rabbit_interval_max configuration parameters when tries to establish the
    connection to the message bus during notification sending.

    This patch also clarifies the differences between the behavior
    of the kafka and the rabbit drivers in this regard.

    Closes-Bug: #1917645
    Change-Id: Id4ccafc95314c86ae918336e42cca64a6acd4d94
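
As a rough illustration of how a retry count, a starting interval, a backoff step and a maximum interval could combine into a wait schedule, here is a sketch that assumes a linear backoff capped at the maximum. Whether the rabbit driver computes its waits exactly this way is not confirmed by this thread, and retry_intervals is a hypothetical helper, not oslo.messaging code.

```python
def retry_intervals(retries, interval_start, interval_step, interval_max):
    """Return the wait (in seconds) before each of `retries` reconnects.

    Assumes a linear backoff: the n-th wait is
    interval_start + n * interval_step, capped at interval_max.
    """
    return [
        min(interval_start + n * interval_step, interval_max)
        for n in range(retries)
    ]


# Illustrative numbers only, not the library defaults.
print(retry_intervals(5, 1, 2, 6))  # [1, 3, 5, 6, 6]
```

Under this assumption the total time a caller can be blocked is bounded by the sum of the list plus the time each connection attempt itself takes.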

Changed in oslo.messaging:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2022-01-13: Fix proposed to oslo.messaging (stable/xena)

#15

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/824512

OpenStack Infra (hudson-openstack) wrote on 2022-01-13:

#16

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/824513

OpenStack Infra (hudson-openstack) wrote on 2022-02-03: Fix included in openstack/oslo.messaging 12.12.0

#17

This issue was fixed in the openstack/oslo.messaging 12.12.0 release.

OpenStack Infra (hudson-openstack) wrote on 2022-02-09: Fix merged to oslo.messaging (stable/xena)

#18

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/824512
Committed: https://opendev.org/openstack/oslo.messaging/commit/7390034e479c044d9067d97cd801f9f58c813e41
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 7390034e479c044d9067d97cd801f9f58c813e41
Author: Balazs Gibizer <email address hidden>
Date: Wed Nov 24 15:55:35 2021 +0100

    Reproduce bug 1917645

    The [oslo_messaging_notification]retry parameter is not applied during
    connecting to the message bus. But the documentation implies it should[1][2].
    The two possible drivers, rabbit and kafka, behaves differently.

    1) The rabbit driver will retry the connection forever, blocking the caller
       process.

    2) The kafka driver also ignores the retry configuration but the
       notifier call returns immediately even if the notification is not
       (cannot) be delivered.

    This patch adds test cases to show the wrong behavior.

    [1] https://docs.openstack.org/oslo.messaging/latest/configuration/opts.html#oslo_messaging_notifications.retry
    [2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/notify/messaging.py#L31-L36

    Related-Bug: #1917645

    Change-Id: Id8557050157aecd3abd75c9114d3fcaecdfc5dc9
    (cherry picked from commit 1db6de63a86812742cbc37a0f5fe1fd7a095dd7f)

tags:

added: in-stable-xena

OpenStack Infra (hudson-openstack) wrote on 2022-02-11:

#19

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/824513
Committed: https://opendev.org/openstack/oslo.messaging/commit/3b5a0543e97619ca8f8cf98193f6b6375d77cbf2
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 3b5a0543e97619ca8f8cf98193f6b6375d77cbf2
Author: Balazs Gibizer <email address hidden>
Date: Tue Nov 23 16:58:05 2021 +0100

    [rabbit] use retry parameters during notification sending

    The rabbit backend now applies the [oslo_messaging_notifications]retry,
    [oslo_messaging_rabbit]rabbit_retry_interval, rabbit_retry_backoff and
    rabbit_interval_max configuration parameters when tries to establish the
    connection to the message bus during notification sending.

    This patch also clarifies the differences between the behavior
    of the kafka and the rabbit drivers in this regard.

    Closes-Bug: #1917645
    Change-Id: Id4ccafc95314c86ae918336e42cca64a6acd4d94
    (cherry picked from commit 7b3968d9b012e873a9b393fcefa578c46fca18c6)

OpenStack Infra (hudson-openstack) wrote on 2022-02-11: Fix proposed to oslo.messaging (stable/wallaby)

#20

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/828868

OpenStack Infra (hudson-openstack) wrote on 2022-02-11:

#21

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/828869

OpenStack Infra (hudson-openstack) wrote on 2022-03-14: Fix included in openstack/oslo.messaging 12.9.3

#22

This issue was fixed in the openstack/oslo.messaging 12.9.3 release.

OpenStack Infra (hudson-openstack) wrote on 2022-04-05: Fix merged to oslo.messaging (stable/wallaby)

#23

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/828868
Committed: https://opendev.org/openstack/oslo.messaging/commit/d63173a31f500254277641a76bb721a8bf07ad9c
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit d63173a31f500254277641a76bb721a8bf07ad9c
Author: Balazs Gibizer <email address hidden>
Date: Wed Nov 24 15:55:35 2021 +0100

    Reproduce bug 1917645

    The [oslo_messaging_notification]retry parameter is not applied during
    connecting to the message bus. But the documentation implies it should[1][2].
    The two possible drivers, rabbit and kafka, behaves differently.

    1) The rabbit driver will retry the connection forever, blocking the caller
       process.

    2) The kafka driver also ignores the retry configuration but the
       notifier call returns immediately even if the notification is not
       (cannot) be delivered.

    This patch adds test cases to show the wrong behavior.

    [1] https://docs.openstack.org/oslo.messaging/latest/configuration/opts.html#oslo_messaging_notifications.retry
    [2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/notify/messaging.py#L31-L36

    Related-Bug: #1917645

    Change-Id: Id8557050157aecd3abd75c9114d3fcaecdfc5dc9
    (cherry picked from commit 1db6de63a86812742cbc37a0f5fe1fd7a095dd7f)
    (cherry picked from commit 7390034e479c044d9067d97cd801f9f58c813e41)

tags:

added: in-stable-wallaby

OpenStack Infra (hudson-openstack) wrote on 2022-04-05:

#24

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/828869
Committed: https://opendev.org/openstack/oslo.messaging/commit/5d6fd1a176a47ffdc55223b990c466917ded9449
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 5d6fd1a176a47ffdc55223b990c466917ded9449
Author: Balazs Gibizer <email address hidden>
Date: Tue Nov 23 16:58:05 2021 +0100

    [rabbit] use retry parameters during notification sending

    The rabbit backend now applies the [oslo_messaging_notifications]retry,
    [oslo_messaging_rabbit]rabbit_retry_interval, rabbit_retry_backoff and
    rabbit_interval_max configuration parameters when tries to establish the
    connection to the message bus during notification sending.

    This patch also clarifies the differences between the behavior
    of the kafka and the rabbit drivers in this regard.

    Closes-Bug: #1917645
    Change-Id: Id4ccafc95314c86ae918336e42cca64a6acd4d94
    (cherry picked from commit 7b3968d9b012e873a9b393fcefa578c46fca18c6)
    (cherry picked from commit 3b5a0543e97619ca8f8cf98193f6b6375d77cbf2)

OpenStack Infra (hudson-openstack) wrote on 2022-04-27: Fix included in openstack/oslo.messaging 12.7.3

#25

This issue was fixed in the openstack/oslo.messaging 12.7.3 release.

Balazs Gibizer (balazs-gibizer) wrote on 2022-04-27:

#26

I'm setting the nova part of this bug to Invalid, as this is fixed by an oslo.messaging patch.

Changed in nova:
status:	Confirmed → Invalid

oslo.messaging

Nova can't create instances if RabbitMQ notification cluster is down

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Invalid	Medium	Balazs Gibizer
	oslo.messaging	Fix Released	Undecided	Balazs Gibizer