Race condition in designate central update_zone - different zone objects returned in memory
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Designate | Fix Released | Undecided | Arjun Baindur |
Bug Description
There is a race condition between adding a new recordset to a zone and central processing the update_status from a previous recordset/zone operation. In the subsequent recordset operation, central fails to update the zone's action/status in the DB to UPDATE/PENDING. It remains NONE/ACTIVE, and the worker errors out.
I have root-caused it to the fact that each operation retrieves a different in-memory copy of the Zone object (when it calls self.storage.
The create_recordset operation in central is not atomic - it can get interrupted by an update_status from worker in between when it retrieves the zone from storage, and when it attempts to write the new zone serial/
1. Zone initially ACTIVE and in stable state (just an SOA and NS record)
2. 1st Recordset is created, this places zone in UPDATE/PENDING in the DB, update_zone RPC sent to worker
3. Worker performs UPDATE Zone action, waits for a bit, polls the nameservers:
https:/
4. Around this time, 2nd recordset is created. We retrieve a copy of the Object.Zone from DB here, at start of create_recordset: https:/
We can see this returns a unique copy in memory of zone in DB, as it calls cls() method on type "objects.Zone": https:/
5. Right around this time, the worker finishes polling for the 1st recordset and returns the status as SUCCESS back to central:
here: https:/
and here's handler on central side: https:/
You can see this also calls self.storage.
We are now working with 2 copies of the Zone obj - one that was retrieved as a result of update_status from polling after 1st recordset, and one that was retrieved at start of 2nd recordset creation.
6. The update_status for 1st recordset/polling result invokes self.storage.
7. Immediately after, 2nd recordset creation increments the zone serial, and also calls self.storage.
But it is working with a different copy of the Zone obj. It is not aware that Step #6 ever occurred, so the action/status change made in that step is not reflected in this copy. The table.update never writes the zone action/status, and the zone remains NONE/ACTIVE.
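The sequence in steps 1-7 can be reproduced in miniature. The following is only a sketch, not Designate's code: `TrackedZone`, `get_zone`, `update_zone`, and `run_race` are hypothetical stand-ins, the "table" is a plain dict, and it assumes that assigning a field the value it already holds is not recorded as a change and that an update writes only the recorded changes.

```python
# Hypothetical stand-ins for illustration; none of these names are
# Designate's real API.


class TrackedZone:
    """In-memory copy of a zone row that remembers which fields changed."""

    def __init__(self, zone_id, row):
        self.zone_id = zone_id
        self._values = dict(row)   # private copy, detached from the table
        self._changes = set()

    def set(self, field, value):
        # Assumption: setting a field to the value this copy already
        # holds is not tracked as a change.
        if self._values[field] != value:
            self._values[field] = value
            self._changes.add(field)

    def changes(self):
        return {field: self._values[field] for field in self._changes}


def get_zone(table, zone_id):
    """Return a fresh, independent copy of the row on every call."""
    return TrackedZone(zone_id, table[zone_id])


def update_zone(table, zone):
    """Write only the fields this copy believes it changed."""
    written = zone.changes()
    table[zone.zone_id].update(written)
    zone._changes.clear()
    return written


def run_race():
    # Step 1: zone is stable.
    table = {"zone-1": {"action": "NONE", "status": "ACTIVE", "serial": 1}}

    # Step 2: 1st recordset puts the zone in UPDATE/PENDING.
    first = get_zone(table, "zone-1")
    first.set("action", "UPDATE")
    first.set("status", "PENDING")
    first.set("serial", 2)
    update_zone(table, first)

    # Step 4: 2nd create_recordset loads its copy (UPDATE/PENDING).
    second = get_zone(table, "zone-1")

    # Steps 5-6: update_status loads its own copy and marks NONE/ACTIVE.
    status_copy = get_zone(table, "zone-1")
    status_copy.set("action", "NONE")
    status_copy.set("status", "ACTIVE")
    update_zone(table, status_copy)

    # Step 7: the stale copy already holds UPDATE/PENDING, so these two
    # assignments are not tracked; only the serial is written.
    second.set("action", "UPDATE")
    second.set("status", "PENDING")
    second.set("serial", 3)
    written = update_zone(table, second)
    return written, dict(table["zone-1"])
```

Calling `run_race()` returns the fields written by the last update together with the final row. Under the assumptions above, the last update carries only the new serial, and the row keeps the NONE/ACTIVE written in step 6.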
I have added a log right after this line, to print the result of obj.obj_
https:/
2017-08-29 01:01:11.761 24398 INFO designate.
2017-08-29 01:01:11.763 24398 DEBUG designate.
2017-08-29 01:01:11.765 24398 INFO designate.
2017-08-29 01:01:11.786 24398 INFO designate.
2017-08-29 01:01:11.788 24398 INFO designate.
As you can see, obj here was never aware that the update_status from the worker changed the status to NONE/ACTIVE. It still thinks it's in state UPDATE/PENDING from when the zone was retrieved. As a result, the table is never updated.
Yet at the end, we select the row from the table to retrieve the updated zone, and this is the Zone sent to the worker.
Changed in designate:
assignee: nobody → Arjun Baindur (xagent-9)
status: New → In Progress
Any progress on this?
I think I am seeing the effects of this when making 2 updates concurrently to the same zone. The end result matches the description above: one of the records never converges on the actual nameservers.