[HA] an update--update conflict between two racing inspector instances
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Ironic Inspector |
Fix Released
|
High
|
milan k |
Bug Description
Testing for the future HA setup reveals a race condition between two concurrent inspector instances updating the same node_info record. The reproducer is as follows:
I1:
* a start node-0 introspection request arrives at the inspector I1
* this deletes node-0 from the DB and creates a new node-0 record with version_id == None[1]
* inspector I1 proceeds with locking the BMC[2]
I2:
* a (concurrent) request to introspect node-0 arrives at the inspector I2
* due to the node-0 being locked in Ironic, inspector I2 fails rebooting the node[3]
* I2 performs a legitimate state update of the node-0 node_info record in the database
* node_info.
* the finished_at and the error code are saved in the DB
I1:
* inspector I1 wakes up from the locking of the BMC
* I1 proceeds with a legitimate state update starting -> waiting[4]
* holding the (now stale) node-info record of node-0 with version_id == None it polls the DB for that field before committing the node-info state update into the DB [5][6]
* since the version_id was None in the stale record, no check against the DB value was performed[5]
* the record state is updated to waiting while both the finished_at and error message are already set by I2
Thu Sep 21 21:42:23 CEST 2017
+------
| Field | Value |
+------
| error | Failed to power on the node, check it's power management configuration: %s |
| finished | True |
| finished_at | 2017-09-21T19:42:21 |
| started_at | 2017-09-21T19:29:02 |
| state | waiting |
| uuid | 28c1ea1d-
+------
Thu Sep 21 21:42:27 CEST 2017
+------
| Field | Value |
+------
| error | Failed to power on the node, check it's power management configuration: %s |
| finished | True |
| finished_at | 2017-09-21T19:42:32 |
| started_at | 2017-09-21T19:29:02 |
| state | error |
| uuid | 28c1ea1d-
+------
Please note this was achieved in a specific environment:
* all the patches in the rebasing_filter topic[7] were applied
* a devstack env was spun up with this modified inspector, using the dnsmasq PXE filter driver (configured not to erase the dhcp hosts directory upon start)
* the devstack inspector process was killed and replaced with two concurrent inspector processes, each listening on a different port
* the haproxy service was configured to use both the inspector processes as backends with same probability[8]
* two looping&racing introspect start node-0 and introspect abort node-0 calls were set up through the openstack baremetal introspection client (in shell)
* a looping status call polling node-0 introspection status
Please note also this haproxy configuration is quite extreme, usually the deployer wouldn't disable the session stickiness. On the other hand it simulates what might happen during a session fail-over.
[1] https:/
[2] https:/
[3] https:/
[4] https:/
[5] https:/
[6] https:/
[7] https:/
[8] http://
Changed in ironic-inspector: | |
importance: | Undecided → High |
Fix proposed to branch: master /review. openstack. org/507559
Review: https:/