[HA] an update--update conflict between two racing inspector instances

Bug #1719627 reported by milan k
This bug affects 1 person
Affects: Ironic Inspector
Status: Fix Released
Importance: High
Assigned to: milan k

Bug Description

Testing of the future HA setup reveals a race condition between two concurrent inspector instances updating the same node_info record. The reproducer is as follows:

I1:

* a request to start introspection of node-0 arrives at inspector I1
* this deletes the node-0 record from the DB and creates a new one with version_id == None[1]
* inspector I1 proceeds with locking the BMC[2]

I2:

* a (concurrent) request to introspect node-0 arrives at inspector I2
* because node-0 is locked in Ironic, inspector I2 fails to reboot the node[3]
* I2 performs a legitimate state update of the node-0 node_info record in the database
* since node_info.version_id is None, I2 generates a UUID for this field
* the finished_at timestamp and the error message are saved in the DB

I1:

* inspector I1 returns from locking the BMC
* I1 proceeds with a legitimate state update starting -> waiting[4]
* holding the (now stale) node_info record of node-0 with version_id == None, it checks that field against the DB value before committing the state update[5][6]
* since the version_id in the stale record was None, no check against the DB value was performed[5]
* the record state is updated to waiting even though both finished_at and the error message have already been set by I2 (see the sketch after this list)
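
For illustration, here is a minimal, self-contained sketch of the update pattern described above. It is not the actual node_cache code; the NodeInfo-style dict, the save_state() helper and the sqlite schema are simplified assumptions, used only to show how a compare-and-set keyed on version_id is silently skipped when the in-memory value is None:

    # Minimal sketch of the race, NOT the real ironic-inspector code.
    # The schema and the save_state() helper below are simplified assumptions.
    import sqlite3
    import uuid

    class ConflictError(Exception):
        """Raised when another writer has already changed the row."""

    def save_state(db, node, state, extra=None):
        """Compare-and-set update keyed on version_id.

        When node['version_id'] is None (a freshly re-created record, as
        in [1]), the stale-record check is skipped -- this is the hole
        the reproducer exercises.
        """
        new_version = uuid.uuid4().hex
        fields = dict(extra or {}, state=state, version_id=new_version)
        sets = ', '.join('%s = ?' % k for k in fields)
        params = list(fields.values()) + [node['uuid']]

        if node['version_id'] is None:
            # No check at all: blindly overwrite whatever is in the DB.
            db.execute('UPDATE nodes SET %s WHERE uuid = ?' % sets, params)
        else:
            # Optimistic lock: update only if nobody bumped version_id
            # since this copy of the record was read.
            cur = db.execute(
                'UPDATE nodes SET %s WHERE uuid = ? AND version_id = ?' % sets,
                params + [node['version_id']])
            if cur.rowcount == 0:
                raise ConflictError('stale node_info record')
        node['version_id'] = new_version

    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE nodes (uuid TEXT PRIMARY KEY, state TEXT,'
               ' error TEXT, version_id TEXT)')

    # I1 re-creates node-0 with version_id == None, as in [1].
    db.execute("INSERT INTO nodes (uuid, state) VALUES ('node-0', 'starting')")
    i1_copy = {'uuid': 'node-0', 'version_id': None}
    i2_copy = {'uuid': 'node-0', 'version_id': None}

    # I2 fails to reboot the node and records the error first.
    save_state(db, i2_copy, 'error', {'error': 'Failed to power on the node'})

    # I1 still holds version_id == None, so no conflict is detected and the
    # error state is overwritten with waiting.
    save_state(db, i1_copy, 'waiting')
    print(db.execute('SELECT state, error FROM nodes').fetchone())
    # -> ('waiting', 'Failed to power on the node')

The second save_state() call is exactly I1's starting -> waiting transition from the list above: it succeeds even though I2 has already finished the record, which matches the waiting state combined with finished_at and error in the status output below.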

Thu Sep 21 21:42:23 CEST 2017
+-------------+----------------------------------------------------------------------------+
| Field | Value |
+-------------+----------------------------------------------------------------------------+
| error | Failed to power on the node, check it's power management configuration: %s |
| finished | True |
| finished_at | 2017-09-21T19:42:21 |
| started_at | 2017-09-21T19:29:02 |
| state | waiting |
| uuid | 28c1ea1d-f03e-4603-95e6-0ece84d914e5 |
+-------------+----------------------------------------------------------------------------+
Thu Sep 21 21:42:27 CEST 2017
+-------------+----------------------------------------------------------------------------+
| Field | Value |
+-------------+----------------------------------------------------------------------------+
| error | Failed to power on the node, check it's power management configuration: %s |
| finished | True |
| finished_at | 2017-09-21T19:42:32 |
| started_at | 2017-09-21T19:29:02 |
| state | error |
| uuid | 28c1ea1d-f03e-4603-95e6-0ece84d914e5 |
+-------------+----------------------------------------------------------------------------+

Please note this was reproduced in a specific environment:

* all the patches in the rebasing_filter topic[7] were applied
* a devstack env was spun up with this modified inspector, using the dnsmasq PXE filter driver (configured not to erase the dhcp hosts directory upon start)
* the devstack inspector process was killed and replaced with two concurrent inspector processes, each listening on a different port
* the haproxy service was configured to use both inspector processes as backends with the same probability[8]
* two looping and racing introspect start node-0 and introspect abort node-0 calls were set up through the openstack baremetal introspection client (in a shell); a rough sketch of these loops follows this list
* a looping status call was polling the node-0 introspection status
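
For concreteness, a rough sketch of those racing client loops, collapsed into a single Python loop here; the exact shell commands, credentials and timing from the original environment are not preserved:

    # Rough sketch of the racing reproducer; the original used separate
    # shell loops for the start, abort and status calls against the
    # haproxy frontend.
    import subprocess
    import time

    def run(*cmd):
        # Failures are expected (e.g. aborting an already-finished run),
        # so don't raise on a non-zero exit code.
        subprocess.run(cmd, check=False)

    while True:
        # The two "racing" calls; haproxy spreads them over both inspectors.
        run('openstack', 'baremetal', 'introspection', 'start', 'node-0')
        run('openstack', 'baremetal', 'introspection', 'abort', 'node-0')
        # Poll the status so the conflicting state/finished_at values show up.
        run('openstack', 'baremetal', 'introspection', 'status', 'node-0')
        time.sleep(1)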

Please note also that this haproxy configuration is quite extreme; usually a deployer wouldn't disable session stickiness. On the other hand, it simulates what might happen during a session fail-over.

[1] https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/node_cache.py#L688,#L692
[2] https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/introspect.py#L78
[3] https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/introspect.py#L123
[4] https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/node_cache.py#L208
[5] https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/node_cache.py#L161
[6] https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/node_cache.py#L142
[7] https://review.openstack.org/#/q/topic:rebasing_filter+(status:open+OR+status:merged)
[8] http://paste.openstack.org/show/621940/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-inspector (master)

Fix proposed to branch: master
Review: https://review.openstack.org/507559

Changed in ironic-inspector:
assignee: nobody → milan k (vetrisko)
status: New → In Progress
Dmitry Tantsur (divius)
Changed in ironic-inspector:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic-inspector (master)

Reviewed: https://review.openstack.org/507559
Committed: https://git.openstack.org/cgit/openstack/ironic-inspector/commit/?id=82000e48ecdaa5738f6a7e69d94386977e714493
Submitter: Zuul
Branch: master

commit 82000e48ecdaa5738f6a7e69d94386977e714493
Author: dparalen <email address hidden>
Date: Tue Sep 26 16:09:13 2017 +0200

    Generate version_id upon add_node

    The version_id isn't set during the add_node() call. This function is
    called when introspection starts for both "new" and existing node_info
    records. As a result, race conditions can appear in an HA inspector
    deployment (see the referred bug).

    This patch makes sure a version_id is generated during the add_node() call
    so stale record updates can be detected through the version_id mismatch
    between the inspector memory and the DB record.

    Change-Id: I422473e888e5e49abb3e598fc2cf2f330620bdcd
    Closes-Bug: #1719627
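
In spirit, the fix does something like the following (a simplified sketch against the toy schema from the earlier example, not the real node_cache.add_node()): assign a version_id the moment the record is (re)created, so every later compare-and-set has a concrete value to check:

    # Simplified sketch of the idea behind the fix; not the real add_node().
    import uuid

    def add_node(db, node_uuid, state='starting'):
        # Assign a version_id right away, so that any concurrent writer
        # still holding an older copy of the record fails the
        # compare-and-set instead of silently overwriting the row.
        version_id = uuid.uuid4().hex
        db.execute('DELETE FROM nodes WHERE uuid = ?', (node_uuid,))
        db.execute(
            'INSERT INTO nodes (uuid, state, version_id) VALUES (?, ?, ?)',
            (node_uuid, state, version_id))
        return {'uuid': node_uuid, 'state': state, 'version_id': version_id}

With this in place, I1's in-memory copy carries the version_id that I2's update has since replaced, so I1's waiting update is detected as a conflict instead of clobbering the error state.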

Changed in ironic-inspector:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic-inspector 6.1.0

This issue was fixed in the openstack/ironic-inspector 6.1.0 release.
