Nova can't create instances if RabbitMQ notification cluster is down

Bug #1917645 reported by Belmiro Moreira
| Affects                  | Status       | Importance | Assigned to    | Milestone |
|--------------------------|--------------|------------|----------------|-----------|
| OpenStack Compute (nova) | Invalid      | Medium     | Balazs Gibizer |           |
| oslo.messaging           | Fix Released | Undecided  | Balazs Gibizer |           |

Bug Description

We use independent RabbitMQ clusters for each OpenStack project, Nova Cells and also for notifications. Recently, I noticed in our test infrastructure that if the RabbitMQ cluster for notifications has an outage, Nova can't create new instances. Possibly other operations will also hang.

Not being able to send a notification or connect to the notification RabbitMQ cluster shouldn't stop new instances from being created. (If this is actually a use case for some deployments, the operator should have the option to configure it.)

Tested against the master branch.

If the notification RabbitMQ is stopped, nova-scheduler gets stuck when creating an instance:

```
Mar 01 21:16:28 devstack nova-scheduler[18384]: DEBUG nova.scheduler.request_filter [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Request filter 'accelerators_filter' took 0.0 seconds {{(pid=18384) wrapper /opt/stack/nova/nova/scheduler/request_filter.py:46}}
Mar 01 21:16:32 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 2.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:16:35 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 4.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:16:42 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 6.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:16:51 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 8.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:17:02 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 10.0 seconds): OSError: [Errno 113] EHOSTUNREACH
(...)
```

Because the notification RabbitMQ cluster is down, Nova gets stuck in:

https://github.com/openstack/nova/blob/5b66caab870558b8a7f7b662c01587b959ad3d41/nova/scheduler/filter_scheduler.py#L85

because oslo.messaging never gives up:

https://github.com/openstack/oslo.messaging/blob/5aa645b38b4c1cf08b00e687eb6c7c4b8a0211fc/oslo_messaging/_drivers/impl_rabbit.py#L736

description: updated
Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
status: New → Fix Committed
status: Fix Committed → Confirmed
importance: Undecided → Medium
tags: added: notifications
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

There are two use cases to support:
1) The notifications are used for something critical (e.g. billing based on usage), so a failure to send a notification should also cause the resource-consuming operation (e.g. new VM creation) to fail.

2) The notifications are only used for non-mission-critical things, like monitoring / telemetry, so a failed notification send should not cause the resource allocation to fail.

The solution needs to support both cases, so we need to make the error handling configurable. So let's add a new boolean config option, [notifications]sending_failure_is_fatal. Its default needs to be True to keep the current behavior: exceptions are raised up the stack and fail the overall operation. If it is set to False, exceptions are caught and logged instead of propagated.

The solution needs to work for both versioned and unversioned notifications.
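The proposed error handling could be sketched roughly as below. This is a hypothetical illustration of the comment above, not actual nova code; the option name `sending_failure_is_fatal` and the `emit_notification` helper are assumptions.

```python
# Hypothetical sketch of the proposed [notifications]sending_failure_is_fatal
# behavior; emit_notification and the option name are illustrations, not
# actual nova code.
import logging

LOG = logging.getLogger(__name__)


def emit_notification(send, payload, failure_is_fatal=True):
    """Send one notification, honoring the proposed fatality switch."""
    try:
        send(payload)
    except Exception:
        if failure_is_fatal:
            # Default: propagate so the overall operation (e.g. VM creation)
            # fails, matching today's behavior.
            raise
        # Non-critical deployments: log the failure and move on.
        LOG.exception("Failed to send notification; continuing")
```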

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

We might need to tweak how we use oslo.messaging to avoid getting stuck in a long retry loop for notifications.

Revision history for this message
Linda Guo (lihuiguo) wrote :

Hi,

Is there a workaround for this bug? We have a cloud where nova is stuck at:

```
2021-07-28 02:27:05.175 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:27:06.487 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 32.0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:27:38.967 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:27:42.039 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:27:43.351 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 32.0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:28:17.143 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:28:18.903 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:28:20.215 51611 ERROR oslo.messaging._drivers.impl_rabbit [req-e9fee75b-eceb-4bdb-93f9-20cbc9a3483e - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 32.0 seconds): OSError: [Errno 113] EHOSTUNREACH
2021-07-28 02:28:25.327 23270 INFO os_vif [-] Loaded VIF plugins: linux_bridge, noop, ovs
2021-07-28 02:28:29.431 23270 ERROR oslo.messaging._drivers.impl_rabbit [req-7f4949cf-f36c-427c-955b-0ce7f14a2b83 - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 0 seconds): OSError: [Errno 113] EHOSTUNREACH
```

We tried restarting the nova services and the rabbitmq service, but it didn't help.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Linda: if you don't need the notifications nova emits, you can turn them off by setting [oslo_messaging_notifications]driver to "noop" in the nova configuration file.

https://docs.openstack.org/nova/latest/configuration/config.html#oslo_messaging_notifications.driver
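For reference, that workaround corresponds to the following nova.conf fragment (option name and value from the linked documentation):

```ini
[oslo_messaging_notifications]
# Disable notification emission entirely; nova will not try to reach
# the notification RabbitMQ cluster at all.
driver = noop
```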

Revision history for this message
Mohammed Naser (mnaser) wrote :

I've also found that even if you set `retry` to `0`, which is supposed to mean never retry, it stays stuck retrying the connection:

https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/notify/notifier.py#L55-L58

It seems it is stuck retrying the connection: it doesn't consider EHOSTUNREACH a failure that should stop the retries, so everything is blocked indefinitely.

Revision history for this message
Mohammed Naser (mnaser) wrote (last edit ):
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Thanks for the GMR, it is helpful. I can confirm from the code that the retry is only applied to message sending, not to connection establishment. The connection establishment is done purely inside kombu[1], but oslo.messaging does not pass any timeout value[2], so kombu retries forever.

[1] https://github.com/celery/kombu/blob/be44a0401417c868a1ef59e44cc57c12c987cd50/kombu/connection.py#L438
[2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/_drivers/impl_rabbit.py#L731
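The retry semantics described above can be illustrated with a simplified stand-in; this is generic code, not the actual kombu/oslo.messaging implementation. With `max_retries=None`, which is effectively what oslo.messaging requested, the loop never gives up, which is the hang seen in the logs.

```python
# Simplified stand-in for a kombu-style connection retry loop (illustration
# only, not the real implementation). max_retries=None retries forever.
import time


def ensure_connection(connect, max_retries=None, interval_start=1.0,
                      interval_step=2.0, interval_max=30.0):
    """Call `connect` until it succeeds or max_retries is exhausted."""
    interval = interval_start
    failures = 0
    while True:
        try:
            return connect()
        except OSError:
            failures += 1
            if max_retries is not None and failures > max_retries:
                raise  # bounded retries: give up eventually
            # Unbounded case (max_retries=None): we loop here forever,
            # blocking the caller, e.g. nova-scheduler.
            time.sleep(interval)
            interval = min(interval + interval_step, interval_max)
```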

Revision history for this message
Mohammed Naser (mnaser) wrote :

As per sean-k-mooney's advice, I've added oslo.messaging to this bug, since it's more of an issue there than in Nova.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

OK, I agree that this is a bug in oslo.messaging: the notifier documentation says that connection establishment should also honor the retry parameter[1], but as I showed above it does not.

[1] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/notify/messaging.py#L34-L36

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I did more investigation while creating an oslo.messaging test case that reproduces the problem.

1) The rabbit driver uses kombu, and the kombu interface allows defining max_retries for the connection attempt too, but the rabbit driver does not use it. So the rabbit driver is fixable.

2) The kafka driver uses confluent_kafka, and the producer interface[1] of that module is defined as asynchronous. Today the kafka driver simply ignores the retry parameter[2] for message sending. So fixing the kafka driver is not that simple.

[1] https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.Producer.produce
[2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/_drivers/impl_kafka.py#L290

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)
Changed in oslo.messaging:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :
Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/819119
Committed: https://opendev.org/openstack/oslo.messaging/commit/1db6de63a86812742cbc37a0f5fe1fd7a095dd7f
Submitter: "Zuul (22348)"
Branch: master

commit 1db6de63a86812742cbc37a0f5fe1fd7a095dd7f
Author: Balazs Gibizer <email address hidden>
Date: Wed Nov 24 15:55:35 2021 +0100

    Reproduce bug 1917645

    The [oslo_messaging_notification]retry parameter is not applied during
    connecting to the message bus. But the documentation implies it should[1][2].
    The two possible drivers, rabbit and kafka, behaves differently.

    1) The rabbit driver will retry the connection forever, blocking the caller
       process.

    2) The kafka driver also ignores the retry configuration but the
       notifier call returns immediately even if the notification is not
       (cannot) be delivered.

    This patch adds test cases to show the wrong behavior.

    [1] https://docs.openstack.org/oslo.messaging/latest/configuration/opts.html#oslo_messaging_notifications.retry
    [2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/notify/messaging.py#L31-L36

    Related-Bug: #1917645

    Change-Id: Id8557050157aecd3abd75c9114d3fcaecdfc5dc9

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/819142
Committed: https://opendev.org/openstack/oslo.messaging/commit/7b3968d9b012e873a9b393fcefa578c46fca18c6
Submitter: "Zuul (22348)"
Branch: master

commit 7b3968d9b012e873a9b393fcefa578c46fca18c6
Author: Balazs Gibizer <email address hidden>
Date: Tue Nov 23 16:58:05 2021 +0100

    [rabbit] use retry parameters during notification sending

    The rabbit backend now applies the [oslo_messaging_notifications]retry,
    [oslo_messaging_rabbit]rabbit_retry_interval, rabbit_retry_backoff and
    rabbit_interval_max configuration parameters when tries to establish the
    connection to the message bus during notification sending.

    This patch also clarifies the differences between the behavior
    of the kafka and the rabbit drivers in this regard.

    Closes-Bug: #1917645
    Change-Id: Id4ccafc95314c86ae918336e42cca64a6acd4d94

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/824512

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/824513

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.12.0

This issue was fixed in the openstack/oslo.messaging 12.12.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/824512
Committed: https://opendev.org/openstack/oslo.messaging/commit/7390034e479c044d9067d97cd801f9f58c813e41
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 7390034e479c044d9067d97cd801f9f58c813e41
Author: Balazs Gibizer <email address hidden>
Date: Wed Nov 24 15:55:35 2021 +0100

    Reproduce bug 1917645

    The [oslo_messaging_notification]retry parameter is not applied during
    connecting to the message bus. But the documentation implies it should[1][2].
    The two possible drivers, rabbit and kafka, behaves differently.

    1) The rabbit driver will retry the connection forever, blocking the caller
       process.

    2) The kafka driver also ignores the retry configuration but the
       notifier call returns immediately even if the notification is not
       (cannot) be delivered.

    This patch adds test cases to show the wrong behavior.

    [1] https://docs.openstack.org/oslo.messaging/latest/configuration/opts.html#oslo_messaging_notifications.retry
    [2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/notify/messaging.py#L31-L36

    Related-Bug: #1917645

    Change-Id: Id8557050157aecd3abd75c9114d3fcaecdfc5dc9
    (cherry picked from commit 1db6de63a86812742cbc37a0f5fe1fd7a095dd7f)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/824513
Committed: https://opendev.org/openstack/oslo.messaging/commit/3b5a0543e97619ca8f8cf98193f6b6375d77cbf2
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 3b5a0543e97619ca8f8cf98193f6b6375d77cbf2
Author: Balazs Gibizer <email address hidden>
Date: Tue Nov 23 16:58:05 2021 +0100

    [rabbit] use retry parameters during notification sending

    The rabbit backend now applies the [oslo_messaging_notifications]retry,
    [oslo_messaging_rabbit]rabbit_retry_interval, rabbit_retry_backoff and
    rabbit_interval_max configuration parameters when tries to establish the
    connection to the message bus during notification sending.

    This patch also clarifies the differences between the behavior
    of the kafka and the rabbit drivers in this regard.

    Closes-Bug: #1917645
    Change-Id: Id4ccafc95314c86ae918336e42cca64a6acd4d94
    (cherry picked from commit 7b3968d9b012e873a9b393fcefa578c46fca18c6)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/828868

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/828869

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.9.3

This issue was fixed in the openstack/oslo.messaging 12.9.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/828868
Committed: https://opendev.org/openstack/oslo.messaging/commit/d63173a31f500254277641a76bb721a8bf07ad9c
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit d63173a31f500254277641a76bb721a8bf07ad9c
Author: Balazs Gibizer <email address hidden>
Date: Wed Nov 24 15:55:35 2021 +0100

    Reproduce bug 1917645

    The [oslo_messaging_notification]retry parameter is not applied during
    connecting to the message bus. But the documentation implies it should[1][2].
    The two possible drivers, rabbit and kafka, behaves differently.

    1) The rabbit driver will retry the connection forever, blocking the caller
       process.

    2) The kafka driver also ignores the retry configuration but the
       notifier call returns immediately even if the notification is not
       (cannot) be delivered.

    This patch adds test cases to show the wrong behavior.

    [1] https://docs.openstack.org/oslo.messaging/latest/configuration/opts.html#oslo_messaging_notifications.retry
    [2] https://github.com/openstack/oslo.messaging/blob/feb72de7b81e3919dedc697f9fb5484a92f85ad8/oslo_messaging/notify/messaging.py#L31-L36

    Related-Bug: #1917645

    Change-Id: Id8557050157aecd3abd75c9114d3fcaecdfc5dc9
    (cherry picked from commit 1db6de63a86812742cbc37a0f5fe1fd7a095dd7f)
    (cherry picked from commit 7390034e479c044d9067d97cd801f9f58c813e41)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/828869
Committed: https://opendev.org/openstack/oslo.messaging/commit/5d6fd1a176a47ffdc55223b990c466917ded9449
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 5d6fd1a176a47ffdc55223b990c466917ded9449
Author: Balazs Gibizer <email address hidden>
Date: Tue Nov 23 16:58:05 2021 +0100

    [rabbit] use retry parameters during notification sending

    The rabbit backend now applies the [oslo_messaging_notifications]retry,
    [oslo_messaging_rabbit]rabbit_retry_interval, rabbit_retry_backoff and
    rabbit_interval_max configuration parameters when tries to establish the
    connection to the message bus during notification sending.

    This patch also clarifies the differences between the behavior
    of the kafka and the rabbit drivers in this regard.

    Closes-Bug: #1917645
    Change-Id: Id4ccafc95314c86ae918336e42cca64a6acd4d94
    (cherry picked from commit 7b3968d9b012e873a9b393fcefa578c46fca18c6)
    (cherry picked from commit 3b5a0543e97619ca8f8cf98193f6b6375d77cbf2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.7.3

This issue was fixed in the openstack/oslo.messaging 12.7.3 release.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I'm setting the nova part of this bug to Invalid, as this is fixed by an oslo.messaging patch.

Changed in nova:
status: Confirmed → Invalid