Changes do not propagate to downed/recovered nameservers, depending on threshold percentage
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Designate | Opinion | Undecided | Unassigned |
Bug Description
With the new worker model code (and probably before the worker model), there are situations where if Designate is running with a threshold percentage less than 100, nameservers that have missed changes will never be brought to a fully-consistent state.
To reproduce:
- Configure Designate with one pool and two nameservers in the pool
- Configure Designate with a `threshold_
- Kill one nameserver
- Create a zone
- Wait for the zone to go to ACTIVE
- Query the live nameserver to validate that zone has propagated
- Restart the nameserver
- The zone will *never* propagate to the nameserver that was killed (provided no changes are made to the zone, and provided periodic sync is disabled)
Because the zone is ACTIVE, periodic recovery takes no action to propagate it (periodic recovery only works on zones in ERROR or PENDING).
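The interaction between the threshold and periodic recovery can be sketched roughly as follows. This is a simplified model, not Designate's actual code: the function names `zone_status` and `needs_recovery` are hypothetical, and the `>=` comparison against the threshold is an assumption.

```python
def zone_status(ns_has_zone, threshold_percentage):
    """Return the zone status given which nameservers have the change.

    ns_has_zone maps a nameserver name to True if it serves the
    current version of the zone. Assumption: the zone becomes ACTIVE
    once the percentage of up-to-date nameservers meets the threshold.
    """
    up_to_date = sum(1 for ok in ns_has_zone.values() if ok)
    percent = 100.0 * up_to_date / len(ns_has_zone)
    return "ACTIVE" if percent >= threshold_percentage else "PENDING"


def needs_recovery(status):
    """Periodic recovery only acts on zones in ERROR or PENDING."""
    return status in ("ERROR", "PENDING")


# Two nameservers, one of them down when the zone was created.
pool = {"ns1": True, "ns2": False}

status_at_50 = zone_status(pool, threshold_percentage=50)
status_at_100 = zone_status(pool, threshold_percentage=100)
```

With a threshold of 50 the zone is reported ACTIVE even though `ns2` never received it, so `needs_recovery` returns False and nothing retries the missing nameserver. With a threshold of 100 the zone stays PENDING and remains eligible for recovery.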
Changed in designate: | |
status: | New → Opinion |
Once periodic sync is implemented in `designate-producer`, this bug will be different. But it's worth discussing.
Things that don't propagate to nameservers under the ACTIVE threshold will get cleaned up the next time periodic sync runs, unless a nameserver is down for longer than `periodic_sync_seconds`.
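As a rough illustration of that window, assuming periodic sync only re-sends zones that changed within the last `periodic_sync_seconds` (the helper name and the example values below are hypothetical):

```python
def missed_change_is_resynced(change_age_seconds, periodic_sync_seconds):
    """True if a change is still recent enough for periodic sync to resend it.

    change_age_seconds is how long ago the zone changed, measured at the
    first sync run after the nameserver came back. Assumption: periodic
    sync only considers zones changed within the last
    periodic_sync_seconds.
    """
    return change_age_seconds <= periodic_sync_seconds


SIX_HOURS = 6 * 60 * 60  # example window, not a documented default

short_outage = missed_change_is_resynced(60 * 60, SIX_HOURS)
long_outage = missed_change_is_resynced(2 * 24 * 60 * 60, SIX_HOURS)
```

A nameserver that was down for an hour gets the missed change on the next sync run; one that was down for two days does not, because the change has aged out of the window.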
There's an argument here to be had around having another state that accurately reflects zones that didn't completely propagate.
- An operator would be able to quickly triage what zones could be having problems, which is helpful for troubleshooting.
- Periodic recovery could recover these zones quicker.
- An operator doesn't have to manually bop "periodic_sync_seconds" if an outage for a nameserver was longer than that, or otherwise reseed the nameserver.
On the flip side:
- You'd probably want to display this state to API consumers, to avoid gross translation logic. This is potentially confusing to end users.
- Periodic sync should eventually clean things up; if you've had a long outage, manual intervention probably isn't as big a deal.
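To make the trade-off concrete, here is a minimal sketch of what such an extra state might look like. The state name `PARTIAL` and the functions `classify` and `api_status` are made up for illustration; nothing like this exists in Designate as far as this report shows.

```python
# Hypothetical: statuses that periodic recovery would act on if a
# PARTIAL state were added alongside ERROR and PENDING.
RECOVERABLE = {"ERROR", "PENDING", "PARTIAL"}


def classify(up_to_date, total, threshold_percentage):
    """Classify a zone by how many nameservers have the latest change."""
    percent = 100.0 * up_to_date / total
    if percent < threshold_percentage:
        return "PENDING"
    # Threshold met: distinguish full propagation from partial.
    return "ACTIVE" if up_to_date == total else "PARTIAL"


def api_status(internal_status):
    """The translation logic needed if PARTIAL is hidden from API users."""
    return "ACTIVE" if internal_status == "PARTIAL" else internal_status


internal = classify(up_to_date=1, total=2, threshold_percentage=50)
```

Here `internal` is `PARTIAL`, so operators and periodic recovery can find the zone, while `api_status` shows the kind of mapping that would be required to keep end users seeing ACTIVE.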