Race condition in designate central update_zone - different zone objects returned in memory

Bug #1713875 reported by Arjun Baindur on 2017-08-29
This bug affects 3 people
Affects: Designate
Status: Fix Released
Importance: Undecided
Assigned to: Arjun Baindur

Bug Description

There is a race condition between adding a new recordset to a zone and central processing the update_status from a previous recordset/zone operation. In the subsequent recordset operation, central fails to update the zone's action/status in the DB to UPDATE/PENDING. It remains NONE/ACTIVE, and the worker errors out.

I have root-caused it to the fact that each operation retrieves its own in-memory copy of the Zone object (when it calls self.storage.get_zone). As a result, the change tracking (obj_changes) on the latter operation's copy never records the action/status transition, so the DB update does not write those values.

The create_recordset operation in central is not atomic: it can be interrupted by an update_status from the worker between the point where it retrieves the zone from storage and the point where it attempts to write the new zone serial/action/status.

1. The zone is initially ACTIVE and in a stable state (just SOA and NS records)

2. The 1st recordset is created; this places the zone in UPDATE/PENDING in the DB, and an update_zone RPC is sent to the worker

3. The worker performs the UPDATE zone action, waits for a bit, then polls the nameservers:
https://github.com/openstack/designate/blob/master/designate/worker/tasks/zone.py#L230

4. Around this time, the 2nd recordset is created. At the start of create_recordset, central retrieves a copy of the Zone object from the DB: https://github.com/openstack/designate/blob/master/designate/central/service.py#L1231

We can see this returns a fresh, unique in-memory copy of the zone, as it calls cls() on type "objects.Zone": https://github.com/openstack/designate/blob/master/designate/sqlalchemy/base.py#L250

5. Right around this time, the worker finishes polling for the 1st recordset and reports SUCCESS back to central:

here: https://github.com/openstack/designate/blob/master/designate/worker/tasks/zone.py#L445

and here is the handler on the central side: https://github.com/openstack/designate/blob/master/designate/central/service.py#L2308

You can see this also calls self.storage.get_zone(context, zone_id), just as create_recordset did in step #4, which returns yet another Zone object: https://github.com/openstack/designate/blob/master/designate/central/service.py#L2316

We are now working with 2 copies of the Zone object: one retrieved as a result of the update_status triggered by polling after the 1st recordset, and one retrieved at the start of the 2nd recordset's creation.

6. The update_status for the 1st recordset's polling result invokes self.storage.update_zone(), updating the zone in central's DB to action/status = NONE/ACTIVE: https://github.com/openstack/designate/blob/master/designate/central/service.py#L2324

7. Immediately after, the 2nd recordset creation increments the zone serial and also calls self.storage.update_zone() with zone.action/status = UPDATE/PENDING to persist the new state.

But it is working with a different copy of the Zone object and is not aware that step #6 ever occurred. Its copy already held UPDATE/PENDING when it was fetched in step #4, so assigning those same values records no change. The table UPDATE therefore never writes the zone action/status, and the zone remains NONE/ACTIVE.

I have added a log right after this line to print the result of obj.obj_get_changes(), and it shows only the serial as changed:

https://github.com/openstack/designate/blob/master/designate/sqlalchemy/base.py#L533

2017-08-29 01:01:11.761 24398 INFO designate.central.service [req-0e461b95-6e1a-4ce9-b4e5-692e04da8b88 c8f6998ddc6d4a9abbb3b417f0c90291 1912bc4ba68a4dadaf7ae23ad0a10434 - - -] Updating zone serial for zone net1.test-du-neutron-vx-c7-39087.platform9.horse. to serial 1503968471
2017-08-29 01:01:11.763 24398 DEBUG designate.storage.impl_sqlalchemy [req-0e461b95-6e1a-4ce9-b4e5-692e04da8b88 c8f6998ddc6d4a9abbb3b417f0c90291 1912bc4ba68a4dadaf7ae23ad0a10434 - - -] Updating zone <Zone id:'8599a077-76ab-459a-be87-a15cf2927caf' type:'PRIMARY' name:'net1.test-du-neutron-vx-c7-39087.platform9.horse.' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842' serial:'1503968471' action:'UPDATE' status:'PENDING'> update_zone /opt/pf9/designate/lib/python2.7/site-packages/designate/storage/impl_sqlalchemy/__init__.py:309
2017-08-29 01:01:11.765 24398 INFO designate.sqlalchemy.base [req-0e461b95-6e1a-4ce9-b4e5-692e04da8b88 c8f6998ddc6d4a9abbb3b417f0c90291 1912bc4ba68a4dadaf7ae23ad0a10434 - - -] sqalchemy/base.py: _update: changed values = {'serial': 1503968471}

2017-08-29 01:01:11.786 24398 INFO designate.sqlalchemy.base [req-0e461b95-6e1a-4ce9-b4e5-692e04da8b88 c8f6998ddc6d4a9abbb3b417f0c90291 1912bc4ba68a4dadaf7ae23ad0a10434 - - -] _update: obj = <Zone id:'8599a077-76ab-459a-be87-a15cf2927caf' type:'PRIMARY' name:'net1.test-du-neutron-vx-c7-39087.platform9.horse.' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842' serial:'1503968471' action:'UPDATE' status:'PENDING'>
2017-08-29 01:01:11.788 24398 INFO designate.sqlalchemy.base [req-0e461b95-6e1a-4ce9-b4e5-692e04da8b88 c8f6998ddc6d4a9abbb3b417f0c90291 1912bc4ba68a4dadaf7ae23ad0a10434 - - -] resultproxy2 = ('8599a077-76ab-459a-be87-a15cf2927caf', 6, datetime.datetime(2017, 8, 29, 0, 50, 4), datetime.datetime(2017, 8, 29, 1, 1, 12), u'0', None, 2137, u'1912bc4ba68a4dadaf7ae23ad0a10434', u'net1.test-du-neutron-vx-c7-39087.platform9.horse.', <email address hidden>', u'sub domain for networknet1', u'PRIMARY', None, 3600, 1503968471, 3524, 600, 86400, 3600, u'ACTIVE', None, u'NONE', '794ccc2c-d751-44fe-b57f-8894c9f5c842', u'.esroh.9mroftalp.78093-7c-xv-nortuen-ud-tset.1ten', False)

As you can see, the object here was never aware that the worker's update_status changed the zone to NONE/ACTIVE. It still thinks it is in state UPDATE/PENDING from when the zone was retrieved. As a result, those columns are never written.

Yet at the end, we select the row from the table to retrieve the updated zone, and this (still NONE/ACTIVE) is the Zone sent to the worker.
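The failure mode can be reproduced with a minimal, self-contained model of per-copy change tracking. The class and helpers below are illustrative stand-ins, not Designate's actual objects; the only assumption carried over from the log above is that a no-op assignment (setting a field to the value it was loaded with) records no change:

```python
class Zone:
    """Toy model of an object that records which fields changed since it
    was loaded, loosely mimicking obj_get_changes(). Illustrative only."""

    def __init__(self, **fields):
        object.__setattr__(self, '_fields', dict(fields))
        object.__setattr__(self, '_changes', {})

    def __setattr__(self, name, value):
        # A no-op assignment (same value as loaded) records no change.
        if self._fields.get(name) != value:
            self._changes[name] = value
        self._fields[name] = value

    def __getattr__(self, name):
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name)

    def obj_get_changes(self):
        return dict(self._changes)


# Simulated zone row; each get_zone() hands back a *fresh* copy.
db = {'serial': 1, 'action': 'UPDATE', 'status': 'PENDING'}

def get_zone():
    return Zone(**db)

def update_zone(zone):
    db.update(zone.obj_get_changes())  # only changed fields are written

# Step 4: create_recordset fetches its copy while the zone is UPDATE/PENDING.
zone_a = get_zone()

# Steps 5-6: update_status fetches another copy and flips it to NONE/ACTIVE.
zone_b = get_zone()
zone_b.action = 'NONE'
zone_b.status = 'ACTIVE'
update_zone(zone_b)

# Step 7: create_recordset bumps the serial and "sets" UPDATE/PENDING,
# but from zone_a's point of view those two fields never changed.
zone_a.serial = 2
zone_a.action = 'UPDATE'
zone_a.status = 'PENDING'
update_zone(zone_a)

print(zone_a.obj_get_changes())  # {'serial': 2} -- matches the log above
print(db)                        # zone is left NONE/ACTIVE
```

Only the serial reaches the DB, exactly as in the `changed values = {'serial': 1503968471}` log line.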

Arjun Baindur (abaindur) on 2017-09-01
Changed in designate:
assignee: nobody → Arjun Baindur (xagent-9)
status: New → In Progress
Tomasz Setkowski (s-tom-a) wrote :

Any progress on this?
I think I am seeing the effects of this when making 2 updates concurrently to the same zone. The end result matches the description above: one of the records never converges on the actual nameservers.

Salman (salmankh) wrote :

I am facing this issue while adding records to multiple zones concurrently. A few went through, and a few just stay in the PENDING state until zone notifies are issued on timeout. However, adding concurrent records to a single zone works fine.

Reviewed: https://review.openstack.org/587642
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=12cd654b143e215dbe01cc2688a7d6e6cf3ce62f
Submitter: Zuul
Branch: master

commit 12cd654b143e215dbe01cc2688a7d6e6cf3ce62f
Author: Arjun Baindur <email address hidden>
Date: Tue Jul 31 14:36:32 2018 -0700

    Make update_status use synchronized_zone

    There is a race condition when adding a new recordset to a zone
    and when central is processing the update_status from a previous
    recordset/zone operation. In the subsequent recordset operation,
    central fails to update the zone's action/status in DB to UPDATE/PENDING.
    It remains as NONE/ACTIVE, and worker error's out.

    Each operation retrieves a different copy in memory of the Zone object
    (when it calls self.storage.get_zone). As a result this does not update
    the obj_changes field for the zone object for the latter operation.
    Therefore the DB update doesn't update any values.

    update_status should be synchronized as it updates zone DB

    Change-Id: Ib4e0bc567ba8eb54c61c3921d165521bf2f12206
    Closes-Bug: #1713875

Changed in designate:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/606888
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=af1bb8a36a704bb1a226fe5154f828e152ef23e1
Submitter: Zuul
Branch: stable/rocky

commit af1bb8a36a704bb1a226fe5154f828e152ef23e1
Author: Arjun Baindur <email address hidden>
Date: Tue Jul 31 14:36:32 2018 -0700

    Make update_status use synchronized_zone

    There is a race condition when adding a new recordset to a zone
    and when central is processing the update_status from a previous
    recordset/zone operation. In the subsequent recordset operation,
    central fails to update the zone's action/status in DB to UPDATE/PENDING.
    It remains as NONE/ACTIVE, and worker error's out.

    Each operation retrieves a different copy in memory of the Zone object
    (when it calls self.storage.get_zone). As a result this does not update
    the obj_changes field for the zone object for the latter operation.
    Therefore the DB update doesn't update any values.

    update_status should be synchronized as it updates zone DB

    Change-Id: Ib4e0bc567ba8eb54c61c3921d165521bf2f12206
    Closes-Bug: #1713875
    (cherry picked from commit 12cd654b143e215dbe01cc2688a7d6e6cf3ce62f)

tags: added: in-stable-rocky

Reviewed: https://review.openstack.org/606889
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=59c5c664654b88772a6309fbb319e446423317c0
Submitter: Zuul
Branch: stable/queens

commit 59c5c664654b88772a6309fbb319e446423317c0
Author: Arjun Baindur <email address hidden>
Date: Tue Jul 31 14:36:32 2018 -0700

    Make update_status use synchronized_zone

    There is a race condition when adding a new recordset to a zone
    and when central is processing the update_status from a previous
    recordset/zone operation. In the subsequent recordset operation,
    central fails to update the zone's action/status in DB to UPDATE/PENDING.
    It remains as NONE/ACTIVE, and worker error's out.

    Each operation retrieves a different copy in memory of the Zone object
    (when it calls self.storage.get_zone). As a result this does not update
    the obj_changes field for the zone object for the latter operation.
    Therefore the DB update doesn't update any values.

    update_status should be synchronized as it updates zone DB

    Change-Id: Ib4e0bc567ba8eb54c61c3921d165521bf2f12206
    Closes-Bug: #1713875
    (cherry picked from commit 12cd654b143e215dbe01cc2688a7d6e6cf3ce62f)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/604612
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=7c2a3be3e42cc72b6b03c9b6cb67393d423e5e21
Submitter: Zuul
Branch: stable/pike

commit 7c2a3be3e42cc72b6b03c9b6cb67393d423e5e21
Author: Arjun Baindur <email address hidden>
Date: Tue Jul 31 14:36:32 2018 -0700

    Make update_status use synchronized_zone

    There is a race condition when adding a new recordset to a zone
    and when central is processing the update_status from a previous
    recordset/zone operation. In the subsequent recordset operation,
    central fails to update the zone's action/status in DB to UPDATE/PENDING.
    It remains as NONE/ACTIVE, and worker error's out.

    Each operation retrieves a different copy in memory of the Zone object
    (when it calls self.storage.get_zone). As a result this does not update
    the obj_changes field for the zone object for the latter operation.
    Therefore the DB update doesn't update any values.

    update_status should be synchronized as it updates zone DB

    Change-Id: Ib4e0bc567ba8eb54c61c3921d165521bf2f12206
    Closes-Bug: #1713875
    (cherry picked from commit 12cd654b143e215dbe01cc2688a7d6e6cf3ce62f)

tags: added: in-stable-pike

This issue was fixed in the openstack/designate 8.0.0.0rc1 release candidate.
