Zones stuck in PENDING or ERROR after record changes on 2023.2

Bug #2060024 reported by Andrew Bonney
This bug affects 1 person
Affects: Designate | Status: Invalid | Importance: Undecided | Assigned to: Unassigned | Milestone: (none)

Bug Description

Having upgraded from 2023.1 to 2023.2 (2c817b3d7f01de44023f195c6e8de8853683a54a), we are seeing issues with all record changes to zones. The problem appears to relate in some way to how Designate sets or checks zone serial numbers.

Upon adding a new record to a zone (for example an A record), the change is quickly reflected in both of our user-facing DNS servers, but Designate keeps the record and the SOA in a PENDING state. Some time later these records move into an ERROR state. An explicit 'designate-manage pool update' does not improve the situation.

On the user-facing DNS servers we have regular messages such as:

Apr 02 10:06:26 ns1-mydomain.com named[1696326]: received control channel command 'showzone new.vm.mydomain.com '
Apr 02 10:06:26 ns1-mydomain.com named[1696326]: loading NZD config from '_default.nzd' for zone 'new.vm.mydomain.com'
Apr 02 10:06:26 ns1-mydomain.com named[1696326]: received control channel command 'modzone new.vm.mydomain.com { type slave; masters { 10.48.243.83 port 5354; 10.48.241.212 port 5354; 10.48.241.111 port 5354;}; file "slave.new.vm.mydomain.com.07644d3b-b142-4c6e-982e-da6d8dd0f982"; };'
Apr 02 10:06:26 ns1-mydomain.com named[1696326]: updated zone new.vm.mydomain.com in view _default via modzone
Apr 02 10:06:26 ns1-mydomain.com named[1696326]: client @0x7efc1c0013b8 10.48.243.83#41650: received notify for zone 'new.vm.mydomain.com'
Apr 02 10:06:26 ns1-mydomain.com named[1696326]: zone new.vm.mydomain.com/IN: notify from 10.48.243.83#41650: no serial

From Designate we see the following, with serial number checks returning an unexpected result:

Apr 02 10:24:46 infra3-designate-container-1dd51fe2 designate-worker[291935]: 2024-04-02 10:24:46.937 291935 DEBUG designate.storage.sqlalchemy [None req-0c581b93-b07a-4a25-b473-c884ff293c2b - - - - - -] Fetched zone <Zone id:'07644d3b-b142-4c6e-982e-da6d8dd0f982' type:'PRIMARY' name:'new.vm.mydomain.com.' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842' serial:'1712050926' action:'UPDATE' status:'ERROR'> _find_zones /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/storage/sqlalchemy/__init__.py:406
Apr 02 10:24:46 infra3-designate-container-1dd51fe2 designate-worker[291935]: 2024-04-02 10:24:46.938 291935 DEBUG designate.worker.tasks.zone [None req-0c581b93-b07a-4a25-b473-c884ff293c2b - - - - - -] Polling serial=1712050926 for zone_name=new.vm.mydomain.com. zone_id=07644d3b-b142-4c6e-982e-da6d8dd0f982 action=UPDATE on ns=<PoolNameserver id:'af9ae98e-a7ea-4eb2-9487-e76c31cdad9f' host:'10.52.245.37' port:'53' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842'> __call__ /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/worker/tasks/zone.py:391
Apr 02 10:24:46 infra3-designate-container-1dd51fe2 designate-worker[291935]: 2024-04-02 10:24:46.939 291935 DEBUG designate.worker.tasks.zone [None req-0c581b93-b07a-4a25-b473-c884ff293c2b - - - - - -] Polling serial=1712050926 for zone_name=new.vm.mydomain.com. zone_id=07644d3b-b142-4c6e-982e-da6d8dd0f982 action=UPDATE on ns=<PoolNameserver id:'d4cd165d-5962-4e96-a8d7-464667a96f51' host:'10.52.245.36' port:'53' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842'> __call__ /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/worker/tasks/zone.py:391
Apr 02 10:24:46 infra3-designate-container-1dd51fe2 designate-worker[291935]: 2024-04-02 10:24:46.940 291935 DEBUG designate.worker.tasks.zone [None req-0c581b93-b07a-4a25-b473-c884ff293c2b - - - - - -] Found serial=1712050149 for zone_name=new.vm.mydomain.com. zone_id=07644d3b-b142-4c6e-982e-da6d8dd0f982 action=UPDATE on ns=<PoolNameserver id:'af9ae98e-a7ea-4eb2-9487-e76c31cdad9f' host:'10.52.245.37' port:'53' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842'> __call__ /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/worker/tasks/zone.py:406
Apr 02 10:24:46 infra3-designate-container-1dd51fe2 designate-worker[291935]: 2024-04-02 10:24:46.942 291935 DEBUG designate.worker.tasks.zone [None req-0c581b93-b07a-4a25-b473-c884ff293c2b - - - - - -] Found serial=1712050149 for zone_name=new.vm.mydomain.com. zone_id=07644d3b-b142-4c6e-982e-da6d8dd0f982 action=UPDATE on ns=<PoolNameserver id:'d4cd165d-5962-4e96-a8d7-464667a96f51' host:'10.52.245.36' port:'53' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842'> __call__ /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/worker/tasks/zone.py:406
Apr 02 10:24:46 infra3-designate-container-1dd51fe2 designate-worker[291935]: 2024-04-02 10:24:46.942 291935 DEBUG designate.worker.processing [None req-0c581b93-b07a-4a25-b473-c884ff293c2b - - - - - -] Finished Task(s): PollForZone, PollForZone in 0.004154s run /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/worker/processing.py:77
Apr 02 10:24:46 infra3-designate-container-1dd51fe2 designate-worker[291935]: 2024-04-02 10:24:46.942 291935 DEBUG designate.worker.tasks.zone [None req-0c581b93-b07a-4a25-b473-c884ff293c2b - - - - - -] Results for polling zone_name=new.vm.mydomain.com. zone_id=07644d3b-b142-4c6e-982e-da6d8dd0f982 action=UPDATE serial=1712050926 query=DNSQueryResult(positives=0, no_zones=0, consensus_serial=0, results=[1712050149, 1712050149]) parse_query_results /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/worker/tasks/zone.py:355
Apr 02 10:24:46 infra3-designate-container-1dd51fe2 designate-worker[291935]: 2024-04-02 10:24:46.943 291935 DEBUG designate.worker.tasks.zone [None req-0c581b93-b07a-4a25-b473-c884ff293c2b - - - - - -] Unsuccessful poll for zone_name=new.vm.mydomain.com. zone_id=07644d3b-b142-4c6e-982e-da6d8dd0f982 action=UPDATE on attempt=2 _do_poll /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/worker/tasks/zone.py:501
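
For reference, the check the worker is performing can be reproduced by hand. The following is a minimal sketch using dnspython (a library Designate itself uses), with the zone name, expected serial and nameserver addresses taken from the log excerpt above; it is an illustration, not Designate's own code:

import dns.message
import dns.query
import dns.rdatatype

# Values below are taken from the worker log excerpt above
ZONE = "new.vm.mydomain.com."
EXPECTED_SERIAL = 1712050926
NAMESERVERS = ["10.52.245.37", "10.52.245.36"]

for ns in NAMESERVERS:
    # Ask each pool nameserver for the zone's SOA record
    query = dns.message.make_query(ZONE, dns.rdatatype.SOA)
    response = dns.query.udp(query, ns, timeout=5, port=53)
    serials = [
        rdata.serial
        for rrset in response.answer
        if rrset.rdtype == dns.rdatatype.SOA
        for rdata in rrset
    ]
    found = serials[0] if serials else None
    state = "in sync" if found == EXPECTED_SERIAL else "stale"
    print(f"{ns}: expected={EXPECTED_SERIAL} found={found} ({state})")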

After a long period of time the zone does eventually move to ACTIVE, but this can take perhaps half an hour. I'm not certain what triggers it, but I assume it's a periodic sync task:

Apr 02 10:29:02 infra2-designate-container-e370b9d5 designate-worker[289929]: 2024-04-02 10:29:02.285 289929 DEBUG designate.storage.sqlalchemy [None req-c47aae2a-e029-4ad3-a184-c455722740f6 - - - - - -] Fetched zone <Zone id:'07644d3b-b142-4c6e-982e-da6d8dd0f982' type:'PRIMARY' name:'new.vm.mydomain.com.' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842' serial:'1712050926' action:'UPDATE' status:'ERROR'> _find_zones /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/storage/sqlalchemy/__init__.py:406
Apr 02 10:29:02 infra2-designate-container-e370b9d5 designate-worker[289929]: 2024-04-02 10:29:02.287 289929 INFO designate.worker.tasks.zone [None req-c47aae2a-e029-4ad3-a184-c455722740f6 - - - - - -] Could not find serial=1712050926 for zone_name=new.vm.mydomain.com. zone_id=07644d3b-b142-4c6e-982e-da6d8dd0f982 action=UPDATE on enough nameservers
Apr 02 10:29:02 infra2-designate-container-e370b9d5 designate-worker[289929]: 2024-04-02 10:29:02.289 289929 DEBUG designate.worker.tasks.zone [None req-c47aae2a-e029-4ad3-a184-c455722740f6 - - - - - -] Updating status for zone_name=new.vm.mydomain.com. zone_id=07644d3b-b142-4c6e-982e-da6d8dd0f982 to action=UPDATE serial=1712050926 __call__ /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/worker/tasks/zone.py:753
Apr 02 10:29:02 infra2-designate-container-e370b9d5 designate-worker[289929]: 2024-04-02 10:29:02.296 289929 DEBUG designate.worker.processing [None req-c47aae2a-e029-4ad3-a184-c455722740f6 - - - - - -] Finished Task(s): ZoneAction-Update in 156.227155s run /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/worker/processing.py:77

I'd be happy to look through the logs for any other useful debug information. We're running bind9 1:9.18.18-0ubuntu0.22.04.2 and these are all Ubuntu Jammy systems.

Revision history for this message
Erik Olof Gunnar Andersson (eandersson) wrote :

This likely means that your designate-producer isn't configured properly. Is it running? Does it have all tasks enabled? Can you share some of the designate-producer logs?

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

We have three producer services running, all of which appear to be configured with defaults, so as I understand it that means all tasks are enabled.

When creating a new A record we get a message like the following from one of the producer instances:

Apr 02 11:55:52 infra1-designate-container-289d0f6e designate-producer[277764]: 2024-04-02 11:55:52.499 277764 DEBUG designate.producer.tasks [None req-7d7f14c1-4fcd-4d5a-a920-9cc760a7e154 - - - - - -] Incremented serial for 07644d3b-b142-4c6e-982e-da6d8dd0f982 to 1712058952 __call__ /openstack/venvs/designate-28.1.0/lib/python3.10/site-packages/designate/producer/tasks.py:303

Followed by periodic messages from all services similar to:

Apr 02 11:56:25 infra1-designate-container-289d0f6e designate-producer[277764]: 2024-04-02 11:56:25.993 277764 INFO designate.producer.tasks [None req-a432f661-4514-4afd-b149-b492d95fa260 - - - - - -] Recovering zones for shards 0 to 1365

Having just tried to create a couple of A records on test zones I was previously having issues with, these did correctly transition to ACTIVE, but a further attempt then revealed the issue again. This suggests that whatever the cause is, it is intermittent.
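
To exercise this repeatedly, something along the lines of the sketch below could be used (it relies on openstacksdk; the cloud name and test zone are placeholders, and the exact SDK calls may need adjusting for your version). It creates a probe A record and watches whether the recordset leaves PENDING within a few minutes:

import time
import openstack

conn = openstack.connect(cloud="mycloud")   # placeholder clouds.yaml entry
ZONE_NAME = "test.vm.mydomain.com."         # placeholder test zone

zone = next(z for z in conn.dns.zones() if z.name == ZONE_NAME)

for attempt in range(5):
    name = f"probe{attempt}.{ZONE_NAME}"
    rs = conn.dns.create_recordset(
        zone, name=name, type="A", records=["192.0.2.10"], ttl=300)
    deadline = time.time() + 300
    while time.time() < deadline:
        rs = conn.dns.get_recordset(rs, zone)
        if rs.status in ("ACTIVE", "ERROR"):
            break
        time.sleep(10)
    print(f"{name}: final status {rs.status}")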

Revision history for this message
Erik Olof Gunnar Andersson (eandersson) wrote :

It's difficult to see what is going on based on these logs. Would you be able to share more worker logs? Maybe using https://paste.openstack.org/?

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

Sure. The following paste has all Designate service logs in debug mode from our 3x HA nodes. In this case it captures a DELETE operation against an A record. By the end of these logs the zone is still stuck in PENDING: the record has been deleted, but the SOA change is still pending. I've stripped out oslo.policy deprecation warnings from one node as these made the output too large for the paste.

https://paste.openstack.org/show/823718/ <- Delete operation on line 136

If you'd prefer me to focus a paste on any one service, or disable services on certain nodes (assuming that won't have other adverse effects) I'd be happy to do so.

Revision history for this message
Erik Olof Gunnar Andersson (eandersson) wrote :

Is this happening with all changes, or only a subset? I tried to recreate something similar and wasn't able to reproduce it. Are you seeing any errors in the logs or similar?

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

I've done some more diagnosis and I think this is an issue on our side, but I'll describe what I've found anyway, as we may not be the only ones to encounter it. I'm not certain whether a change in Designate's behaviour made this more visible or whether it's a change in behaviour elsewhere that I've only picked up on at upgrade time, but either way the latter clearly causes the problem.

Our Designate pool config uses different addresses for the 'nameservers' and the 'targets'. It appears we have ended up with some unexpected caching between the two, so whilst the 'targets' always receive updates correctly, if a recent query has been made against the 'nameservers' then we have to wait for the SOA's TTL to expire before the uncached record becomes visible and Designate can confirm that the change was successful.
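
The mismatch is easy to demonstrate by querying the SOA serial on a 'target' and on a 'nameserver' directly, for example with a dnspython sketch like the one below (the target address is a placeholder for one of the hosts Designate pushes updates to; the nameserver address is one of ours from the logs above):

import dns.message
import dns.query
import dns.rdatatype

ZONE = "new.vm.mydomain.com."
TARGET = "198.51.100.1"      # placeholder: a 'targets' host from the pool config
NAMESERVER = "10.52.245.37"  # a 'nameservers' host Designate polls (from the logs)

def soa_serial(server):
    query = dns.message.make_query(ZONE, dns.rdatatype.SOA)
    response = dns.query.udp(query, server, timeout=5)
    for rrset in response.answer:
        if rrset.rdtype == dns.rdatatype.SOA:
            return rrset[0].serial
    return None

print("target:    ", soa_serial(TARGET))
print("nameserver:", soa_serial(NAMESERVER))
# If the 'nameserver' keeps returning the old serial until the SOA TTL expires,
# a caching layer sits between the two roles and Designate's poll will keep
# failing in the meantime.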

Thanks for your time looking into it.

Changed in designate:
status: New → Invalid
Revision history for this message
Erik Olof Gunnar Andersson (eandersson) wrote :

Thanks for the update.

I don't know the cause, but we made a significant change to how notifications are sent in this version of Designate. In the previous version it would notify on each change; now we batch all changes and only notify every 5 seconds per zone. Depending on usage patterns the upstream DNS servers should see exponentially fewer notifies from Designate, and maybe that caused the behaviour you were seeing, as previously it would effectively have been spamming your BIND servers.
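
Roughly speaking, the batching works along these lines (a toy illustration of the idea only, not Designate's actual code): changes are coalesced per zone and a NOTIFY is sent at most once per interval.

import time
from collections import defaultdict

NOTIFY_INTERVAL = 5  # seconds per zone, as described above

class BatchingNotifier:
    def __init__(self, send_notify):
        self.send_notify = send_notify       # callable taking a zone name
        self.pending = set()                 # zones with unsent changes
        self.last_sent = defaultdict(float)  # zone name -> last notify time

    def record_changed(self, zone_name):
        # Called for every record change; only marks the zone as dirty
        self.pending.add(zone_name)

    def flush(self):
        # Called periodically; notifies each dirty zone at most once per interval
        now = time.time()
        for zone_name in list(self.pending):
            if now - self.last_sent[zone_name] >= NOTIFY_INTERVAL:
                self.send_notify(zone_name)
                self.last_sent[zone_name] = now
                self.pending.discard(zone_name)

notifier = BatchingNotifier(lambda zone: print(f"NOTIFY {zone}"))
notifier.record_changed("new.vm.mydomain.com.")
notifier.record_changed("new.vm.mydomain.com.")  # coalesced with the first change
notifier.flush()  # sends a single NOTIFY for the zone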
