Ironic node rebalance race can lead to missing compute nodes in DB

Bug #1853009 reported by Mark Goddard on 2019-11-18
This bug affects 6 people
Affects                    Importance   Assigned to
OpenStack Compute (nova)   High         Mark Goddard
Ocata                      Undecided    Unassigned
Pike                       Undecided    Unassigned
Queens                     Undecided    Unassigned
Rocky                      Undecided    Unassigned
Stein                      Undecided    Unassigned
Train                      Undecided    Unassigned
Ussuri                     High         Mark Goddard

Bug Description

There is a race condition in nova-compute with the ironic virt driver as nodes get rebalanced. It can lead to compute nodes being removed in the DB and not repopulated. Ultimately this prevents these nodes from being scheduled to.

Steps to reproduce
==================

* Deploy nova with multiple nova-compute services managing ironic.
* Create some bare metal nodes in ironic, and make them 'available' (the reproduction does not work if they are 'active')
* Stop all nova-compute services
* Wait for all nova-compute services to be DOWN in 'openstack compute service list'
* Simultaneously start all nova-compute services
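
For reference, the wait-and-restart steps above can be scripted roughly as follows. This is only a sketch: the host names and the 'openstack-nova-compute' systemd unit name are assumptions about the deployment, and the services are started over ssh so they come up at roughly the same time.

    #!/usr/bin/env python3
    # Sketch of the reproduction steps (hypothetical hosts and unit name).
    import json
    import subprocess
    import time

    HOSTS = ["c1", "c2", "c3"]            # hypothetical nova-compute hosts
    UNIT = "openstack-nova-compute"       # assumed systemd unit name

    def compute_service_states():
        # 'openstack compute service list -f json' reports State as up/down.
        out = subprocess.check_output(
            ["openstack", "compute", "service", "list",
             "--service", "nova-compute", "-f", "json"])
        return [row["State"] for row in json.loads(out)]

    # Stop every nova-compute and wait until they all report as down.
    for host in HOSTS:
        subprocess.check_call(["ssh", host, "systemctl", "stop", UNIT])
    while any(state != "down" for state in compute_service_states()):
        time.sleep(10)

    # Start them all at (roughly) the same time to hit the rebalance race.
    starts = [subprocess.Popen(["ssh", host, "systemctl", "start", UNIT])
              for host in HOSTS]
    for proc in starts:
        proc.wait()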

Expected results
================

All ironic nodes appear as hypervisors in 'openstack hypervisor list'

Actual results
==============

One or more nodes may be missing from 'openstack hypervisor list'. This is most easily checked via 'openstack hypervisor list | wc -l'
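
A quick way to script that check (a sketch only; it assumes every ironic node should appear as a nova hypervisor, and uses the CLI's JSON output rather than counting lines):

    #!/usr/bin/env python3
    # Compare the number of ironic nodes with the number of nova hypervisors.
    import json
    import subprocess

    def count(*cmd):
        out = subprocess.check_output(list(cmd) + ["-f", "json"])
        return len(json.loads(out))

    ironic_nodes = count("openstack", "baremetal", "node", "list")
    hypervisors = count("openstack", "hypervisor", "list")
    print(f"ironic nodes: {ironic_nodes}, hypervisors: {hypervisors}")
    if hypervisors < ironic_nodes:
        print("one or more ironic nodes are missing from the hypervisor list")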

Environment
===========

OS: CentOS 7.6
Hypervisor: ironic
Nova: 18.2.0, plus a handful of backported patches

Logs
====

I grabbed some relevant logs from one incident of this issue. They are split between two compute services, and I have tried to make that clear, including a summary of what happened at each point.

http://paste.openstack.org/show/786272/

tl;dr

c3: 19:14:55 Finds no compute record in RT. Tries to create one (_init_compute_node). Shows traceback with SQL rollback but seems to succeed
c1: 19:14:56 Finds no compute record in RT, ‘moves’ existing node from c3
c1: 19:15:54 Begins periodic update, queries compute nodes for this host, finds the node
c3: 19:15:54 Finds no compute record in RT, ‘moves’ existing node from c1
c1: 19:15:55 Deletes orphan compute node (which now belongs to c3)
c3: 19:16:56 Creates resource provider
c3: 19:17:56 Uses existing resource provider

There are two major problems here:

* c1 deletes the orphan node after c3 has taken ownership of it

* c3 assumes that another compute service will not delete its nodes. Once a node is in rt.compute_nodes, it is not removed again unless the node is orphaned
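
To make the interleaving concrete, here is a heavily simplified, hypothetical model of the behaviour described above (it is not Nova's actual ResourceTracker code; 'db' stands in for the compute_nodes table as a mapping of node uuid to owning host, and the method names are only illustrative):

    # Toy model of the race: each service caches nodes in memory, 'moves' DB
    # records to itself when the hash ring assigns it a node, and deletes
    # "orphan" records based on a snapshot taken at the start of its periodic
    # update, without re-checking who owns them by the time it deletes.
    class FakeTracker:
        def __init__(self, host, db):
            self.host = host
            self.db = db                  # node uuid -> owning host
            self.compute_nodes = set()    # in-memory cache, only trimmed for orphans

        def init_compute_node(self, node_uuid):
            if node_uuid in self.compute_nodes:
                return                    # problem 2: cached, never re-created
            # Creates the record, or 'moves' an existing record from another host.
            self.db[node_uuid] = self.host
            self.compute_nodes.add(node_uuid)

        def snapshot_my_records(self):
            # Start of the periodic update: query the DB for this host's records.
            return {uuid for uuid, owner in self.db.items() if owner == self.host}

        def delete_orphans(self, snapshot, driver_node_uuids):
            # End of the periodic update: drop records the driver no longer
            # reports, without re-checking ownership (problem 1).
            for node_uuid in snapshot - set(driver_node_uuids):
                self.db.pop(node_uuid, None)
                self.compute_nodes.discard(node_uuid)

    # The interleaving from the log above, one node rebalancing c3 -> c1 -> c3:
    db = {"node-X": "c3"}                  # created by c3 at 19:14:55
    c1, c3 = FakeTracker("c1", db), FakeTracker("c3", db)
    c1.init_compute_node("node-X")         # 19:14:56 c1 moves the node from c3
    snapshot = c1.snapshot_my_records()    # 19:15:54 c1 queries its compute nodes
    c3.init_compute_node("node-X")         # 19:15:54 c3 moves the node from c1
    c1.delete_orphans(snapshot, [])        # 19:15:55 c1 deletes the orphan, now c3's
    assert "node-X" not in db              # the compute node record is gone
    assert "node-X" in c3.compute_nodes    # and c3's cache means it is never re-created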

Mark Goddard (mgoddard) on 2019-11-18
Changed in nova:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
Matt Riedemann (mriedem) on 2019-11-18
tags: added: ironic resource-tracker
Mark Goddard (mgoddard) wrote :

I removed the duplicate association to bug 1841481. While the symptoms are similar, I think the underlying cause is different.

Related fix proposed to branch: master
Review: https://review.opendev.org/695189

Vladyslav Drok (vdrok) wrote :

Here is what I encountered on Queens during a network partition (edited for easier reading):

nova-compute bmt01

17:17:40,099 Final resource view: name=597d7aac-10c3-498b-8311-a3a802feb8ac
17:17:40,337 Final resource view: name=a689ac47-8cdb-4162-ab74-5b94f2b22144
17:17:40,526 Final resource view: name=feb38a0f-5299-423c-8f46-78ee120f14ee
17:18:58,783 compute can not report its status to conductor in nova.servicegroup.drivers.db (on object.Service.save, pymysql.err.InternalError)
17:19:07,437 compute fails to perform periodic update_available_resource (on objects.ComputeNode._db_compute_node_get_all_by_host, pymysql.err.InternalError)
17:19:37,444 compute fails to perform periodic _sync_scheduler_instance_info (MessagingTimeout in conductor RPCAPI object_class_action_versions)

<instances start moving>

17:19:45,638 No compute node record for bmt01:3baefd99-dbd6-40e3-88a4-dadff5ca4bb8
17:19:45,865 ComputeNode 3baefd99-dbd6-40e3-88a4-dadff5ca4bb8 moving from bmt03 to bmt01
17:19:51,450 No compute node record for bmt01:1ddc0947-541c-47e5-a77a-3dab82205c21
17:19:51,488 ComputeNode 1ddc0947-541c-47e5-a77a-3dab82205c21 moving from bmt03 to bmt01
17:19:57,374 No compute node record for bmt01:25934ddf-808f-4bb9-b0f9-55a3e3184cb3
17:19:57,491 ComputeNode 25934ddf-808f-4bb9-b0f9-55a3e3184cb3 moving from bmt03 to bmt01
17:19:59,425 nova.servicegroup.drivers.db Recovered from being unable to report status.
17:20:01,313 No compute node record for bmt01:cf9dd25d-0db0-410b-a91d-58b226126f01
17:20:01,568 ComputeNode cf9dd25d-0db0-410b-a91d-58b226126f01 moving from bmt03 to bmt01
17:20:03,513 No compute node record for bmt01:812fb0ba-2415-4303-a32f-1dcd6ae591d5
17:20:03,599 ComputeNode 812fb0ba-2415-4303-a32f-1dcd6ae591d5 moving from bmt02 to bmt01
17:20:04,717 No compute node record for bmt01:db58f55e-1a60-4d20-9eea-5354e2c87bc4
17:20:04,756 ComputeNode db58f55e-1a60-4d20-9eea-5354e2c87bc4 moving from bmt03 to bmt01
17:20:06,005 No compute node record for bmt01:75ae6252-74e1-4d94-b379-8b1fd3665c57
17:20:06,046 ComputeNode 75ae6252-74e1-4d94-b379-8b1fd3665c57 moving from bmt02 to bmt01
17:20:07,153 No compute node record for bmt01:787f2ff1-6146-4f6f-aba8-5b37bdb23b25
17:20:07,188 ComputeNode 787f2ff1-6146-4f6f-aba8-5b37bdb23b25 moving from bmt03 to bmt01
17:20:08,171 No compute node record for bmt01:79c76025-da4f-43ac-a544-0eb5bac76bd8
17:20:08,209 ComputeNode 79c76025-da4f-43ac-a544-0eb5bac76bd8 moving from bmt02 to bmt01
17:20:09,178 No compute node record for bmt01:ffb3dd3b-f8f9-448f-9cb0-e1e22b996f5e
17:20:09,226 ComputeNode ffb3dd3b-f8f9-448f-9cb0-e1e22b996f5e moving from bmt02 to bmt01
17:20:10,411 No compute node record for bmt01:50aec742-41fc-46eb-9cf6-6e908ee5040b
17:20:10,428 ComputeNode 50aec742-41fc-46eb-9cf6-6e908ee5040b moving from bmt02 to bmt01
17:20:12,168 No compute node record for bmt01:83de1b40-5db6-4ecf-9c18-1a83356890ae
17:20:12,195 ComputeNode 83de1b40-5db6-4ecf-9c18-1a83356890ae moving from bmt03 to bmt01
...
17:20:48,502 Final resource view: name=597d7aac-10c3-498b-8311-a3a802feb8ac
17:20:48,677 Final resour...

Sylvain Bauza (sylvain-bauza) wrote :

Adding the 'api' tag as there is an API impact when operators want to delete the service: they end up with an exception because the ComputeNode record is gone.

Marking https://bugs.launchpad.net/nova/+bug/1860312 as a duplicate of this one, as I think the root cause resolution will involve fixing the virt driver rather than working around the API code.
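
To illustrate the API impact in a couple of lines (a hypothetical sketch, not Nova's actual service-delete code; it reuses the node-uuid-to-host mapping idea from the sketch in the description):

    # Deleting a compute service expects its ComputeNode records to exist;
    # once the race has removed the record, the lookup fails and the API
    # call returns an error instead of cleaning up the service.
    class ComputeHostNotFound(Exception):
        pass

    def delete_service(db, host):
        node_uuids = [uuid for uuid, owner in db.items() if owner == host]
        if not node_uuids:
            raise ComputeHostNotFound(host)   # what the operator ends up seeing
        for node_uuid in node_uuids:
            db.pop(node_uuid)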

Changed in nova:
importance: Undecided → High
tags: added: api