Compute node deletes itself if rebooted without DNS

Bug #1939920 reported by Rodrigo Barbieri
Affects: OpenStack Nova Compute Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Reproduced on: bionic-queens, focal-wallaby

A normally-running nova-compute service with instances can have its DB records drastically damaged when its FQDN changes due to external factors that may be beyond control and always have some chance of happening, such as a network outage or a DNS server issue.

What happens is that the code at [0] deletes the compute node entry in the nova.compute_nodes table because the FQDN is "different" when such an external problem happens. In fact, the FQDN changes from:

"juju-b93c20-bq-6.maas" to "juju-b93c20-bq-6", whereas "juju-b93c20-bq-6" itself is unchanged and is stored in the "host" field of the nova.compute_nodes table. I believe this field could be used to prevent the issue.

So, because the FQDN is different, the nova-compute service believes it is a different service and that the previously registered one is an orphan, and a cascading series of mistakes follows:

1) Deletes itself from the nova.compute_nodes table
*2) Deletes the allocations from the old resource provider in nova_api/placement.allocations
*3) Deletes the resource provider in nova_api/placement.resource_providers
4) Registers a new compute node in nova.compute_nodes
5) Registers a new empty resource provider in nova_api/placement.resource_providers

* In queens, my compute service was able to perform these steps successfully, but in wallaby I got the following errors under the same circumstances:

2021-08-13 19:37:08.636 3300 DEBUG nova.scheduler.client.report [req-c36eee09-c105-4877-b33b-76944f7ace89 - - - - -] Cannot delete allocation for ['581fdcc1-0a47-4dc4-8598-a6ae4fb13a9f'] consumer in placement as consumer does not exist delete_allocation_for_instance /usr/lib/python3/dist-packages/nova/scheduler/client/report.py:2100
2021-08-13 19:37:08.685 3300 ERROR nova.scheduler.client.report [req-c36eee09-c105-4877-b33b-76944f7ace89 - - - - -] [req-a52e1950-e3a3-4985-bc61-9080ba41afcb] Failed to delete resource provider with UUID fcbe200d-bf36-49d4-822a-0f11be3cc392 from the placement API. Got 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to delete resource provider fcbe200d-bf36-49d4-822a-0f11be3cc392: Resource provider has allocations. ", "request_id": "req-a52e1950-e3a3-4985-bc61-9080ba41afcb"}]}.

The series of cascading issues continues: after step (5) above the node behaves "normally", so the customer creates more instances. Later, when the node is restarted, it reverts to its old FQDN and the problem repeats, although a bit differently in queens and wallaby:

wallaby: It fails to re-create the resource provider, since it had not successfully deleted the old one (nova.exception.ResourceProviderCreationFailed: Failed to create resource provider juju-f61af6-fw-8.maas). At this point it is no longer able to create instances on this node, because Placement will no longer report it as a candidate (it is not registered with its new compute_node UUID).

queens: It repeats steps 1-5, so new VMs get their allocations deleted as well, and the node is functional after another restart with its FQDN restored.

So in queens the node is usable after the FQDN is restored, while in wallaby it is not, and in both cases DB surgery is needed to fix all the inconsistencies.

In the end, this issue is very annoying: it causes a lot of inconsistencies in the DB that need to be repaired through DB surgery, all triggered by an external problem that is sometimes beyond control and always has some chance of happening.

I've seen this happen many times with customers but had not been able to pinpoint the root cause, because I would only notice a lot of allocation issues (more specifically, instances running without allocations) long after the FQDN problem had happened. By then the customer had already performed many different changes to restore functionality, unaware that the allocations were inconsistent, which later raises other problems, such as not being able to properly create instances, as a consequence of the missing entries in the nova_api/placement.allocations DB table.
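
One way to spot this state is to ask Placement directly whether an instance still has allocations. A minimal sketch, assuming admin credentials exported in the usual OS_* environment variables and reusing the consumer UUID from the log excerpt above (illustrative only, not part of Nova):

import os
from keystoneauth1 import session
from keystoneauth1.identity import v3

# Build a session from the usual OS_* environment variables
# (assumption: admin credentials are exported in the shell).
auth = v3.Password(
    auth_url=os.environ['OS_AUTH_URL'],
    username=os.environ['OS_USERNAME'],
    password=os.environ['OS_PASSWORD'],
    project_name=os.environ['OS_PROJECT_NAME'],
    user_domain_name=os.environ.get('OS_USER_DOMAIN_NAME', 'Default'),
    project_domain_name=os.environ.get('OS_PROJECT_DOMAIN_NAME', 'Default'),
)
sess = session.Session(auth=auth)

# Consumer UUID taken from the log excerpt above.
instance_uuid = '581fdcc1-0a47-4dc4-8598-a6ae4fb13a9f'
resp = sess.get('/allocations/' + instance_uuid,
                endpoint_filter={'service_type': 'placement'})
# A running instance that has lost its allocations comes back with
# {"allocations": {}} even though the VM is still up on the hypervisor.
print(resp.json())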

[0] https://github.com/openstack/nova/blob/b0099aa8a28a79f46cfc79708dcd95f07c1e685f/nova/compute/manager.py#L9997
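
For reference, the deletion path at [0] boils down to roughly the following (a paraphrase of ComputeManager.update_available_resource() at that commit, not a verbatim excerpt):

# "nodenames" is the set of node names currently reported by the virt driver
# (for libvirt, the hypervisor hostname). Any compute_nodes row whose
# hypervisor_hostname is not in that set is treated as an orphan.
for cn in compute_nodes_in_db:
    if cn.hypervisor_hostname not in nodenames:
        LOG.info("Deleting orphan compute node %s (hypervisor host %s)",
                 cn.id, cn.hypervisor_hostname)
        cn.destroy()                                 # step 1: nova.compute_nodes row
        self.reportclient.delete_resource_provider(  # steps 2 and 3: allocations and RP
            context, cn, cascade=True)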

Steps to reproduce:
===================

Variation 1
~~~~~~~~~~~
- edit /etc/hosts
- add your IP, FQDN and hostname, similar to the example below

10.5.0.134 juju-b93c20-bq-6.maas5 juju-b93c20-bq-6

Edit the FQDN to make it slightly different (in this example the correct domain was .maas; I changed it to .maas5)

- restart nova-compute service

Variation 2
~~~~~~~~~~~
- edit your network configuration to change DHCP to a static IP; make sure not to include DNS or a gateway, just the IP and subnet mask
- reboot node

Tags: sts
Rodrigo Barbieri (rodrigo-barbieri2010) wrote (last edit ):

Discussed this in the nova meeting [0]. The meeting conclusion was that the code logic should not change to attempt to address this. The current behavior is a design decision, and the node should be configured in a way that prevents the problem. It was suggested that the following approaches be attempted to prevent the problem:

1) use "host" config option in nova.conf

2) set up the hostname in /etc/hosts
2a) use a different canonical hostname that is not the FQDN, so it isn't prone to this problem
2b) set the FQDN there, to prevent the hostname from changing if there is a DNS outage

3) set up a fixed domain in /etc/domainname
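
For example, option (2b) amounts to pinning the MAAS-style entry so the FQDN keeps resolving locally even during a DNS outage (names taken from the reproduction example above, not a general recommendation):

10.5.0.134 juju-b93c20-bq-6.maas juju-b93c20-bq-6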

I have tried option (1) above, but it does not solve the problem. The value set in the config gets overridden by the system FQDN. As mentioned by Sean Mooney in the meeting, that value comes from libvirt, which apparently reads it from the system, not from the config file.
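
For context, the hostname Nova records for a libvirt compute node ultimately comes from libvirt's own hostname lookup on the hypervisor, which is why the nova.conf value does not take effect. A small sketch using libvirt-python to show where the name actually comes from (illustrative only):

import libvirt

# This is the value virGetHostname() reports to the libvirt driver. When DNS
# is unavailable and /etc/hosts has no FQDN entry, it returns the short
# hostname instead of the FQDN, which is what triggers the orphan handling
# described in this bug.
conn = libvirt.open('qemu:///system')
print(conn.getHostname())
conn.close()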

I have yet to explore options #2 and #3 above.

[0] https://meetings.opendev.org/meetings/nova/2021/nova.2021-08-17-16.01.log.html#l-12

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Given the meeting discussion and the fact that solution (2) above achieves the intended effect, I marked the bug as Invalid for the Nova project and added charm-nova-compute as affected, where the fix would have to be implemented.

Changed in nova:
status: New → Invalid
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Supposedly MAAS is expected to add an entry to /etc/hosts with the proper FQDN to prevent this issue. The value is injected through cloud-init, as long as the cloud-init data is not being overridden.

- Need to check what MaaS versions do this.
- Need to check whether cloudinit is being overridden.

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

MAAS started managing the contents of /etc/hosts via cloud-init in version 2.2.0.

(this comment was posted for a while in the wrong LP)

no longer affects: nova
Pedro Victor Lourenço Fragola (pedrovlf) wrote (last edit ):

The customer reported this issue after rebooting a node, and I noticed that the node had been deployed with an old version of MAAS that did not add hostname.fqdn to /etc/hosts. The customer currently has MAAS 2.8.6, and new hosts already have hostname.fqdn in /etc/hosts.

Note: the compute node deployed with the old version of MAAS needed hostname.fqdn added to /etc/hosts to avoid the hostname change.
