Changes do not propagate to downed/recovered nameservers, depending on threshold percentage

Bug #1617454 reported by Paul Glass
Affects: Designate | Status: Opinion | Importance: Undecided | Assigned to: Unassigned

Bug Description

With the new worker model code (and probably before the worker model), there are situations where if Designate is running with a threshold percentage less than 100, nameservers that have missed changes will never be brought to a fully-consistent state.

To reproduce:

- Configure Designate with one pool and two nameservers in the pool
- Configure Designate with a `threshold_percentage = 49` in the `[service:worker]` section. This means a zone will go to ACTIVE if it propagates to one nameserver.
- Kill one nameserver
- Create a zone
- Wait for the zone to go to ACTIVE
- Query the live nameserver to validate that zone has propagated
- Restart the nameserver
- The zone will *never* propagate to the nameserver that was killed (provided no changes are made to the zone, and provided periodic sync is disabled)
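
For reference, the threshold setting from the reproduction steps would look something like the fragment below (section and option names as described above; the pool and nameserver configuration is elided):

```ini
[service:worker]
# With two nameservers in the pool, 49% means a single successful
# update is enough for the zone to be marked ACTIVE.
threshold_percentage = 49
```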

Because the zone is ACTIVE, periodic recovery takes no action to propagate it (periodic recovery only works on zones in ERROR or PENDING).
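A minimal sketch of the threshold logic described above (this is a hypothetical simplification, not Designate's actual code; the function name and the exact comparison are assumptions):

```python
def zone_status(success_count, total_nameservers, threshold_percentage):
    """Return the zone status based on how many nameservers accepted
    the change. Hypothetical simplification of the threshold check."""
    if total_nameservers == 0:
        return "ERROR"
    success_pct = 100 * success_count / total_nameservers
    return "ACTIVE" if success_pct > threshold_percentage else "ERROR"

# With threshold_percentage = 49 and two nameservers, one success (50%)
# clears the threshold, so the zone goes ACTIVE even though the other
# nameserver never received the change -- and periodic recovery will
# then skip it, because it only looks at ERROR and PENDING zones.
print(zone_status(1, 2, 49))
```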

Tim Simmons (timsim)
Changed in designate:
status: New → Opinion
Revision history for this message
Tim Simmons (timsim) wrote :

Once periodic sync is implemented in `designate-producer`, this bug will be different. But it's worth discussing.

Things that don't propagate to nameservers under the ACTIVE threshold will get cleaned up the next time periodic sync runs, unless a nameserver is down for longer than `periodic_sync_seconds`.

There's an argument to be had here for adding another state that accurately reflects zones that didn't completely propagate.
- An operator would be able to quickly triage what zones could be having problems, which is helpful for troubleshooting.
- Periodic recovery could recover these zones quicker.
- An operator doesn't have to manually bump `periodic_sync_seconds` when a nameserver outage lasts longer than that window, or otherwise reseed the nameserver.

On the flip side:
- You'd probably want to display this state to API consumers, to avoid gross translation logic. This is potentially confusing to end users.
- Periodic sync should eventually clean things up, and if you've had a long outage, manual intervention probably isn't as big a deal.

Revision history for this message
Paul Glass (pnglass) wrote :

"You'd probably want to display this state to API consumers, to avoid gross translation logic. This is potentially confusing to end users."

Rather than creating a new state, why not create a new status field?

If the "true status" was tracked in a new database column, it would look like:

- If the "true status" is ERROR, then the "display status" is either ERROR or ACTIVE, depending on the threshold percentage
- Otherwise the "display status" matches the "true status" (and this uses existing logic)

Deciding whether to show the new "true status" to users is a different matter, but the visibility of that new status field could be configurable.
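The two rules above can be sketched as a small mapping function (a hypothetical illustration of the proposal; the function name, status strings, and comparison are assumptions, not Designate code):

```python
def display_status(true_status, success_pct, threshold_pct):
    """Map a stored 'true status' to the status shown in the API,
    following the proposed rules: ERROR is softened to ACTIVE when
    enough nameservers have the change; everything else passes through."""
    if true_status == "ERROR" and success_pct > threshold_pct:
        # Enough nameservers accepted the change to satisfy the
        # threshold, so users see ACTIVE even though propagation
        # is incomplete -- the database still records ERROR.
        return "ACTIVE"
    return true_status
```

With this split, periodic recovery could key off the true status and still find the zones that missed a nameserver, while the API keeps its existing threshold-based behavior.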
