Redeploying an existing hypervisor with the same hostname results in nova-compute startup error

Bug #2051011 reported by Pierre Riteau
This bug affects 8 people
Affects: kolla-ansible (status tracked in Epoxy)

Series     Status       Importance  Assigned to  Milestone
Antelope   New          Undecided   Unassigned
Bobcat     New          Undecided   Unassigned
Caracal    New          Undecided   Unassigned
Dalmatian  Confirmed    Undecided   Unassigned
Epoxy      In Progress  Undecided   Unassigned

Bug Description

Starting with 2023.1, there were changes to the Nova compute node identification process: https://docs.openstack.org/nova/latest/admin/compute-node-identification.html

If a hypervisor is redeployed after wiping its nova_compute Docker volume and the hostname is kept the same, the nova_compute container will fail to start with the following error:

    nova.exception.InvalidConfiguration: No local node identity found, but this is not our first startup on this host. Refusing to start after potentially having lost that state!

Kolla Ansible needs support for injecting the compute node UUID into the /var/lib/nova/compute_id file.

Matt Crees (mattcrees) wrote :

Workaround for now:

Find the id from the database:

``select uuid from compute_nodes where hypervisor_hostname='<hostname>';``

Write this UUID to the file ``/mnt/nova/compute_id``, where ``/mnt/nova/`` is mounted as ``/var/lib/nova`` in the nova-compute container.
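A minimal sketch of the file-writing half of this workaround, in Python. The helper name ``write_compute_id`` is hypothetical and not part of nova or kolla-ansible; ``state_dir`` stands for whatever host directory is mounted as ``/var/lib/nova`` (``/mnt/nova`` in the comment above). It validates the UUID and refuses to replace an existing file that holds a different value, on the assumption that silently overwriting an identity file is never what you want.

```python
import uuid
from pathlib import Path


def write_compute_id(state_dir: str, node_uuid: str) -> Path:
    """Write node_uuid to <state_dir>/compute_id (hypothetical helper).

    state_dir is the host directory mounted as /var/lib/nova in the
    nova-compute container; this is an assumption of the sketch.
    """
    # uuid.UUID raises ValueError on malformed input and lets us
    # normalise to the canonical lowercase hyphenated form.
    canonical = str(uuid.UUID(node_uuid))
    target = Path(state_dir) / "compute_id"

    if target.exists():
        existing = target.read_text().strip()
        if existing != canonical:
            # Never silently replace an existing identity.
            raise RuntimeError(
                f"{target} already contains {existing!r}; refusing to overwrite"
            )
        return target

    target.write_text(canonical + "\n")
    return target
```

Usage would look like ``write_compute_id("/mnt/nova", "<uuid from the query above>")``, run with enough privileges to write to that directory, followed by a restart of the nova_compute container.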

sean mooney (sean-k-mooney) wrote :

So the issue here is really an impedance mismatch between kolla's concept of redeploying and what nova requires.

Basically, if you want to redeploy while keeping the same hostname, you either need to delete the compute service recorded for that compute node before redeploying (if there are no instances related to that host),
or you need to preserve /var/lib/nova/compute_id to tell nova that this host is replacing the previously deployed host.

Context:
/var/lib/nova/compute_id is intended to prevent accidental renames of a compute node.
If a compute service already exists in the cell DB with the same hostname and host value as used by the current nova-compute binary, nova will check the compute service version and check whether /var/lib/nova/compute_id exists and contains the compute node UUID.

If the compute service version is new enough, nova will refuse to start unless the file exists with the
correct content.

Nova should only ever write that file once, on the first boot of a freshly installed host.
At that point no compute node or compute service record should exist for this host, so nova will
create new records and write the file once and only once.
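The decision described above can be sketched as a small Python function. This is a simplified model of the behaviour as explained in this comment, not nova's actual code: the function name, its parameters, the returned labels, and the local ``InvalidConfiguration`` class are all made up for illustration, and the mismatch message is the sketch's own wording.

```python
from typing import Optional


class InvalidConfiguration(Exception):
    """Stand-in for nova.exception.InvalidConfiguration (sketch only)."""


def check_node_identity(
    local_uuid: Optional[str],
    service_exists: bool,
    version_new_enough: bool,
    db_node_uuid: Optional[str],
) -> str:
    """Model the startup check; all names here are hypothetical."""
    if not service_exists:
        # Fresh install: no compute service record for this host yet,
        # so records are created and compute_id is written once.
        return "first-start"

    if not version_new_enough:
        # Older service versions do not enforce the identity file.
        return "skip-check"

    if local_uuid is None:
        # The situation in this bug: the volume was wiped but the
        # database still remembers the host.
        raise InvalidConfiguration(
            "No local node identity found, but this is not our first "
            "startup on this host."
        )

    if local_uuid != db_node_uuid:
        raise InvalidConfiguration("Local node identity does not match the database.")

    return "use-local"
```

In this model, redeploying with a wiped volume corresponds to ``check_node_identity(None, True, True, "<uuid>")``, which raises, while restoring the file beforehand corresponds to passing the same UUID as both ``local_uuid`` and ``db_node_uuid``.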

As such, if you want to replace a physical server with a new one and keep the hostname of a previous server that was a nova compute node, you need to either properly remove the compute node and compute service before redeploying, or provide /var/lib/nova/compute_id with the correct UUID to signal that the new server is replacing the previous one.

Note that there is a minor bug related to deleting compute services/compute nodes when the host has pci_devices, which prevents the host from being properly cleaned up in the nova DB.
It's possible that this is the underlying problem here, if kolla already correctly deletes the compute service via the API after removing all nova containers from the host.
https://bugs.launchpad.net/nova/+bug/2077070

Pierre Riteau (priteau) wrote :

Kolla does not support deleting hypervisors. This bug should be resolved when Kolla supports redeploying Nova with the compute_id file.

Patrick (4zive4iozy4wu) wrote :

I ran into this in an Antelope test environment, after destroying then deploying the same hypervisor node, i.e. not following the kolla-ansible documentation for removing hosts correctly. The nova_compute container on the hypervisor ended up being in an infinite boot-loop while giving the same error as Pierre mentioned.

Using 'openstack hypervisor list' I took the UUID that the hypervisor had before.

Then, on the affected hypervisor node, I entered the Docker container during its first few seconds, before the error is triggered:
docker exec -it nova_compute /bin/bash

I quickly wrote the UUID to the compute_id file:
echo "xxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxx" > /var/lib/nova/compute_id

Finally I exited the Docker container and saw it wasn't stuck in a boot loop anymore, then restarted the Docker container to be 100% sure.
Did a kolla-ansible deploy of the hypervisor node again to be 200% sure.
Did a reboot of the hypervisor node to be 300% sure.
Moved instances to it to be 400% sure, and finally checked the instances and saw they were doing fine.

Sven Kieske (s-kieske)
Changed in kolla-ansible:
status: New → Confirmed
bjolo (bjorn-lofdahl) wrote :

Just a comment about the workaround: copy the compute_id file over to the nova_compute volume and restart the container. There is no need to docker exec into it and create the file there.

I (AI) wrote a short playbook for it. Make sure the openstack client is installed and the admin_openrc file is sourced first so that authentication works.

- name: Ensure compute_id consistency across nodes
  hosts: all
  gather_facts: no
  tasks:
    - name: Check if compute_id file exists on the target node
      stat:
        path: /var/lib/docker/volumes/nova_compute/_data/compute_id
      register: compute_id_file

    - name: Retrieve the compute node's UUID from OpenStack
      # Compare the hostname column exactly, so that e.g. "compute1" does not also match "compute10".
      shell: openstack hypervisor list -f value -c ID -c 'Hypervisor Hostname' | awk -v h="{{ inventory_hostname }}" '$2 == h {print $1}'
      register: compute_id
      delegate_to: localhost
      when: not compute_id_file.stat.exists

    - name: Ensure the target directory exists
      file:
        path: /var/lib/docker/volumes/nova_compute/_data
        state: directory
        owner: root
        group: root
        mode: '0755'
      when: not compute_id_file.stat.exists

    - name: Create compute_id file with the retrieved UUID
      copy:
        content: "{{ compute_id.stdout }}"
        dest: /var/lib/docker/volumes/nova_compute/_data/compute_id
        owner: root
        group: root
        mode: '0644'
      when: not compute_id_file.stat.exists

Pierre Riteau (priteau)
no longer affects: kolla-ansible/dalmatian
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)
Changed in kolla-ansible:
status: Confirmed → In Progress