Ironic hypervisor disappears once the hash ring is rebuilt

Bug #1825876 reported by Nikolay Fedotov on 2019-04-22
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Nikolay Fedotov

Bug Description

Steps to reproduce
==================
Precondition: A fresh OpenStack deployment is needed. The database tables nova.compute_nodes and nova_api.host_mappings must be empty; in other words, no baremetal nodes have been added to the ironic database yet.
It is an HA deployment: at least two ironic-conductors must be running on different servers.

Steps:
1. Create a baremetal node: "openstack baremetal node create ..."
2. Change the node's state to manageable
3. After some time, "nova hypervisor-list" should list a hypervisor with the same UUID as the baremetal node.
3.1 The database should look like below:
MariaDB [(none)]> select uuid, host, mapped from nova.compute_nodes;
+--------------------------------------+-------------+--------+
| uuid | host | mapped |
+--------------------------------------+-------------+--------+
| d394aa91-3544-417c-acab-916a22e5a5b5 | ironic.aio1 | 1 |
+--------------------------------------+-------------+--------+
MariaDB [(none)]> select * from nova_api.host_mappings;
+---------------------+------------+----+---------+-------------+
| created_at | updated_at | id | cell_id | host |
+---------------------+------------+----+---------+-------------+
| 2019-04-22 09:14:23 | NULL | 22 | 7 | ironic.aio1 |
+---------------------+------------+----+---------+-------------+

4. Call "nova hypervisor-show <hypervisor UUID>" to find out which server the owning ironic-conductor is running on. Log into that server and stop ironic-conductor; this forces the hash ring to rebuild its state. Wait about five minutes.
5. Check the output of "nova hypervisor-list". The hypervisor is absent.
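The reassignment in step 4 can be sketched with a toy model of conductor selection. This is only an illustration of the mechanism, not ironic's actual hash ring implementation (which lives in the tooz/ironic hash ring code); the `pick_conductor` helper is hypothetical:

```python
import hashlib


def pick_conductor(node_uuid, conductors):
    """Toy stand-in for ironic's hash ring: deterministically map a
    node to one of the currently live conductors."""
    ring = sorted(conductors,
                  key=lambda c: hashlib.md5((c + node_uuid).encode()).hexdigest())
    return ring[0]


conductors = ["ironic.aio1", "ironic.aio2", "ironic.aio3"]
node = "d394aa91-3544-417c-acab-916a22e5a5b5"

owner = pick_conductor(node, conductors)
# Stopping the owning conductor removes it from the ring; when the
# ring is rebuilt, the node is handed to a different conductor.
new_owner = pick_conductor(node, [c for c in conductors if c != owner])
assert new_owner != owner
```

The key point is that the new owner reports the compute node under its own hostname, which is what triggers the `host` change seen in the result below.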

Result
==================
Look inside the database (see below). ironic.aio3 took over the baremetal node, so nova changed the 'host' field of the compute node (d394aa91-3544-417c-acab-916a22e5a5b5) to 'ironic.aio3'.
Because mapped = 1, 'nova-manage cell_v2 discover_hosts' (run periodically, see https://bugs.launchpad.net/nova/+bug/1715646) does not try to create a host mapping.
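The skip can be modeled in a few lines. This is a simplified sketch of the discover_hosts behavior described above, not nova's actual code; the dict-based records and `cell_id` value are illustrative:

```python
def discover_hosts(compute_nodes, host_mappings):
    """Simplified model of 'nova-manage cell_v2 discover_hosts':
    only unmapped (mapped == 0) compute nodes get a host mapping."""
    for cn in compute_nodes:
        if cn["mapped"]:
            continue  # flagged as mapped -> skipped, even if cn["host"] changed
        host_mappings[cn["host"]] = {"cell_id": 7}
        cn["mapped"] = 1


# The bug scenario: the node moved to ironic.aio3, but mapped is still 1,
# so a mapping for ironic.aio3 is never created.
nodes = [{"uuid": "d394aa91-3544-417c-acab-916a22e5a5b5",
          "host": "ironic.aio3", "mapped": 1}]
mappings = {"ironic.aio1": {"cell_id": 7}}
discover_hosts(nodes, mappings)
assert "ironic.aio3" not in mappings  # stale mapping is never repaired
```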

MariaDB [(none)]> select uuid, host, mapped from nova.compute_nodes;
+--------------------------------------+-------------+--------+
| uuid | host | mapped |
+--------------------------------------+-------------+--------+
| d394aa91-3544-417c-acab-916a22e5a5b5 | ironic.aio3 | 1 |
+--------------------------------------+-------------+--------+
MariaDB [(none)]> select * from nova_api.host_mappings;
+---------------------+------------+----+---------+-------------+
| created_at | updated_at | id | cell_id | host |
+---------------------+------------+----+---------+-------------+
| 2019-04-22 09:14:23 | NULL | 22 | 7 | ironic.aio1 |
+---------------------+------------+----+---------+-------------+

2019-04-22 19:54:00.813 8 WARNING nova.compute.resource_tracker [req-1ded2c35-d0e4-4719-a15d-3a83594bab1c - - - - -] No compute node record for ironic.aio3:5f9c2619-30bb-40d2-8b62-8923f04d90f2: ComputeHostNotFound_Remote: Compute host ironic.aio3 could not be found.
2019-04-22 19:54:00.831 8 INFO nova.compute.resource_tracker [req-1ded2c35-d0e4-4719-a15d-3a83594bab1c - - - - -] ComputeNode 5f9c2619-30bb-40d2-8b62-8923f04d90f2 moving from ironic.aio1 to ironic.aio3
2019-04-22 19:54:00.891 8 DEBUG nova.virt.ironic.driver [req-1ded2c35-d0e4-4719-a15d-3a83594bab1c - - - - -] Using cache for node 5f9c2619-30bb-40d2-8b62-8923f04d90f2, age: 0.0979330539703 _node_from_cache /usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py:860

The missing record in the host_mappings table causes nova to log an "Unable to find service" DEBUG message (see below), and the compute node becomes 'invisible'.
See the source code in nova/api/openstack/compute/hypervisors.py, HypervisorsController._get_hypervisors:

def _get_hypervisors(self, req, detail=False, limit=None, marker=None,
                     links=False):
    """Get hypervisors for the given request.

    :param req: nova.api.openstack.wsgi.Request for the GET request
...
    hypervisors_list = []
    for hyp in compute_nodes:
        try:
            instances = None
            if with_servers:
                instances = self.host_api.instance_get_all_by_host(
                    context, hyp.host)
            service = self.host_api.service_get_by_compute_host(
                context, hyp.host)
            hypervisors_list.append(
                self._view_hypervisor(
                    hyp, service, detail, req, servers=instances))
        except (exception.ComputeHostNotFound,
                exception.HostMappingNotFound):
            # The compute service could be deleted which doesn't delete
            # the compute node record, that has to be manually removed
            # from the database so we just ignore it when listing nodes.
            LOG.debug('Unable to find service for compute node %s. The '
                      'service may be deleted and compute nodes need to '
                      'be manually cleaned up.', hyp.host)
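The net effect of the excerpt above can be shown with a minimal model: a compute node whose host has no entry in host_mappings is silently dropped from the listing. This is a hypothetical simplification of the real code path, using the host values from the tables above:

```python
class HostMappingNotFound(Exception):
    """Stand-in for nova.exception.HostMappingNotFound."""


def list_hypervisors(compute_nodes, host_mappings):
    """Simplified model of _get_hypervisors: nodes whose host has no
    host mapping are skipped, with only a DEBUG message logged."""
    result = []
    for cn in compute_nodes:
        try:
            if cn["host"] not in host_mappings:
                raise HostMappingNotFound(cn["host"])
            result.append(cn["uuid"])
        except HostMappingNotFound:
            pass  # "Unable to find service for compute node ..."
    return result


# Stale mapping (ironic.aio1) vs. current host (ironic.aio3):
nodes = [{"uuid": "d394aa91-3544-417c-acab-916a22e5a5b5",
          "host": "ironic.aio3"}]
stale = {"ironic.aio1": 7}
assert list_hypervisors(nodes, stale) == []  # hypervisor is 'invisible'
```

With a correct mapping for ironic.aio3 in place, the same call would return the node's UUID, which is why creating the missing host mapping restores visibility.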

Fix proposed to branch: master
Review: https://review.opendev.org/654584

Changed in nova:
assignee: nobody → Nikolay Fedotov (nfedotov)
status: New → In Progress

Change abandoned by Nikolay Fedotov (<email address hidden>) on branch: master
Review: https://review.opendev.org/654584
