Delete zone fails to propagate to all (Bind) nameservers in a pool depending on threshold_percentage

Bug #1406414 reported by Paul Glass
Affects: Designate
Status: Fix Released
Importance: Critical
Assigned to: Ron Rickard
Milestone: 2015.1.0

Bug Description

Reproduction: I was testing this with two Bind servers in a single pool.
- In designate.conf, I set `threshold_percentage = 49`. (This value is just below 50%, which means the API should mark zones as ACTIVE once zone changes have propagated to at least one out of the two nameservers.)
- I created a zone 'example.com.' through Designate's API and ensured that the zone propagated to both nameservers (i.e. that I could dig both nameservers successfully for that zone).
- I then killed only one Bind server (command line: `service bind9 stop`).
- I deleted the zone 'example.com.' through Designate's API. I saw that the zone was properly deleted from the only running Bind server.
- (At this point, getting the zone through the API correctly returns a 404. The zone was deleted on 1 of 2 nameservers, which is over the threshold_percentage of 49.)
- I restarted the Bind server I previously killed.

(At this point, I have sometimes seen the pool manager throw a DomainNotFound exception and sometimes not; restarting the pool manager reliably produces the exception. If I wait a bit, MiniDNS (mdns) seems to throw a DomainNotFound as well.)

In any case, this produces an inconsistency which is never resolved. One nameserver has deleted the zone, while the other nameserver never deletes the zone.

Additional info:
I notice this problem does not occur when setting `threshold_percentage = 51`. In that case, when the zone is deleted while one of the two nameservers is offline, the zone is marked ERROR in the domains table. When I bring the second nameserver back online, the pool manager eventually runs its sync procedure, the zone is deleted from the second nameserver, and the zone is then marked DELETED in the domains table. (The threshold arithmetic behind both settings is sketched below.)
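
The pool manager decides whether a change has propagated by comparing the fraction of nameservers that acknowledged it against threshold_percentage. Here is a minimal Python sketch of that arithmetic to show why 49 and 51 behave differently in this reproduction (the helper name and its exact shape are my illustration, not Designate's actual code):

    from decimal import Decimal

    def consensus_reached(success_count, server_count, threshold_percentage):
        # Share of nameservers that confirmed the change, compared to the threshold.
        ratio = Decimal(success_count) / Decimal(server_count) * 100
        return ratio >= Decimal(threshold_percentage)

    # Two nameservers in the pool, one offline, so only one delete succeeds.
    print(consensus_reached(1, 2, 49))  # True  -> zone reported DELETED, backends now inconsistent
    print(consensus_reached(1, 2, 51))  # False -> zone marked ERROR, periodic sync repairs it later

With `threshold_percentage = 49`, one confirmation out of two (50%) is enough for the API to report the delete as done, and nothing afterwards forces the returning nameserver back into sync, which matches the stuck state described above.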

Paul Glass (pnglass)
description: updated
Ron Rickard (rjrjr)
Changed in designate:
milestone: none → kilo-2
assignee: nobody → Ron Rickard (rjrjr)
Kiall Mac Innes (kiall)
Changed in designate:
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix proposed to designate (master)

Fix proposed to branch: master
Review: https://review.openstack.org/146246

Changed in designate:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix merged to designate (master)

Reviewed: https://review.openstack.org/146246
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=6f4ff36bff156b004aa8fd0b4dadc3aeb47c5ad2
Submitter: Jenkins
Branch: master

commit 6f4ff36bff156b004aa8fd0b4dadc3aeb47c5ad2
Author: rjrjr <email address hidden>
Date: Fri Jan 9 18:14:28 2015 -0700

    Ensure Pool Manager Works for Multiple Backend Servers

    A few bugs were discovered when working with multiple backend servers.

    - Update status for a domain/server is only created after the domain has
      been successfully added to the server. This ensures MDNS is only called
      for servers with created domains.
    - When calculating consensus, the calculation is based on the number of
      servers, not the number of Pool Manager status objects.
    - Consensus calculations are done in Decimals, not floats.
    - The same code is reused to create domains on failure as is used to create
      domains the first time.
    - The same code is reused to delete domains on failure as is used to delete
      domains the first time.
    - Unhandled exceptions are trapped if periodic recovery or periodic sync
      encounter them to prevent the threads from dying. This is logged.

    Change-Id: I2ae4a7fdc556ce2d9efd5fd91f73dc005c5a8d00
    Closes-Bug: 1406414
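
The consensus-related items above are the core of the fix for this report: the denominator should be the number of servers in the pool, not the number of Pool Manager status objects that happen to exist, and the arithmetic should stay in Decimal rather than binary floats. A hedged sketch of the difference (names and values are illustrative, not Designate's actual code):

    from decimal import Decimal

    def meets_threshold(success_count, total, threshold_percentage):
        # Illustrative consensus check; Decimal avoids binary-float rounding
        # surprises when comparing against percentage thresholds.
        return Decimal(success_count) / Decimal(total) * 100 >= Decimal(threshold_percentage)

    # Two nameservers in the pool, one of them down, so a status object was
    # only ever recorded for the server that answered.
    status_objects = 1   # status rows that exist
    pool_servers = 2     # servers actually in the pool
    successes = 1
    threshold = 100      # a strict threshold that should require every server

    print(meets_threshold(successes, status_objects, threshold))  # True  (status-object denominator hides the down server)
    print(meets_threshold(successes, pool_servers, threshold))    # False (server-count denominator reports the miss)

With a status-object denominator, a server that never produced a status row silently drops out of the calculation, so a change could be declared fully propagated while one backend still served (or still lacked) the zone.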

Changed in designate:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in designate:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in designate:
milestone: kilo-2 → 2015.1.0