Nova can't create instances if RabbitMQ notification cluster is down
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Invalid | Medium | Balazs Gibizer |
oslo.messaging | Fix Released | Undecided | Balazs Gibizer |
Bug Description
We use independent RabbitMQ clusters for each OpenStack project, Nova Cells and also for notifications. Recently, I noticed in our test infrastructure that if the RabbitMQ cluster for notifications has an outage, Nova can't create new instances. Possibly other operations will also hang.
Not being able to send a notification/
Tested against the master branch.
If the notification RabbitMQ is stopped while an instance is being created, nova-scheduler gets stuck with:
```
Mar 01 21:16:28 devstack nova-scheduler[
Mar 01 21:16:32 devstack nova-scheduler[
Mar 01 21:16:35 devstack nova-scheduler[
Mar 01 21:16:42 devstack nova-scheduler[
Mar 01 21:16:51 devstack nova-scheduler[
Mar 01 21:17:02 devstack nova-scheduler[
(...)
```
Because the notification RabbitMQ cluster is down, Nova gets stuck in:
because oslo messaging never gives up:
description: | updated |
Changed in nova: | |
assignee: | nobody → Balazs Gibizer (balazs-gibizer) |
status: | New → Fix Committed |
status: | Fix Committed → Confirmed |
importance: | Undecided → Medium |
tags: | added: notifications |
Changed in oslo.messaging: | |
assignee: | nobody → Balazs Gibizer (balazs-gibizer) |
There are two use cases to support:
1) the notifications are used for something critical (e.g. billing based on usage), and therefore a failure to send a notification should also lead to the failure of the resource consumption (e.g. new VM creation)
2) the notifications are only used for non-mission-critical things, like monitoring / telemetry, so a failed notification should not lead to the failure of the resource allocation.
The solution needs to support both cases, so the error handling has to be configurable. Let's add a new boolean config option, [notifications] sending_failure_is_fatal. It needs to default to True to preserve the current behavior, in which exceptions are raised up the stack and force the failure of the overall operation. If it is set to False, exceptions are caught and logged instead of propagated.
The solution needs to work for both versioned and unversioned notifications.
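The proposed behavior could be sketched roughly as follows. This is only an illustration under assumptions, not Nova's actual code: `NotificationConfig`, `send_notification` and `broken_emit` are hypothetical names, and the config object is a plain stand-in for the proposed [notifications] sending_failure_is_fatal option. The wrapper takes any emit callable, so the same handling would apply to both versioned and unversioned notifications.

```python
import logging

LOG = logging.getLogger(__name__)


class NotificationConfig:
    """Hypothetical stand-in for the proposed
    [notifications] sending_failure_is_fatal option."""

    def __init__(self, sending_failure_is_fatal=True):
        self.sending_failure_is_fatal = sending_failure_is_fatal


def send_notification(emit, payload, conf):
    """Call emit(payload) and return True on success.

    If sending fails and the option is True, re-raise so the caller's
    operation fails as well. If it is False, log the error and return
    False so the caller can carry on.
    """
    try:
        emit(payload)
    except Exception:
        if conf.sending_failure_is_fatal:
            raise
        LOG.exception("Failed to send notification; continuing")
        return False
    return True


def broken_emit(payload):
    # Hypothetical emitter simulating an unreachable notification bus.
    raise ConnectionError("notification bus unreachable")
```

With the option set to False, `send_notification(broken_emit, {}, NotificationConfig(sending_failure_is_fatal=False))` logs the error and returns False, while the default configuration re-raises the `ConnectionError`. Note that a wrapper like this only helps if the send call eventually raises; if the messaging layer retries indefinitely, as described in the bug, the call never returns and there is nothing to catch.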