Unhandled exception in periodic_sync/recovery halts the process

Bug #1523691 reported by Tim Simmons on 2015-12-07
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Designate
Critical
Federico Ceratto
Liberty
Undecided
Federico Ceratto

Bug Description

As it stands today:
https://github.com/openstack/designate/blob/5333caa09c41a071aa62a29c19a30584f54ad4a8/designate/pool_manager/service.py#L170-L238

Any unhandled exception that bubbles up (a MessagingTimeout for instance) will cause the periodic process (running in a single greenthread (?)) to halt, and not complete.

If there were a systemic issue with a certain ERROR'd zone that raised an unhandled exception, this would ensure that zones that were in a fixable ERROR state might never recover.

The condition I observed was a great number of ERROR'd zones on a system under load, that caused a messagingtimeout. If this were to run in many threads, the problem might be less impactful, but that might be a separate bug report.

Tim Simmons (timsim) on 2015-12-09
Changed in designate:
status: New → Triaged
importance: Undecided → Critical
milestone: none → mitaka-2
Tim Simmons (timsim) wrote :

I'm wondering if the right way to do this might be for periodic sync to grab a connection to the Pool Manager's rpcapi, and call crud_zone's that way.

That should enable any other pool manager worker processes locally or elsewhere to take up the jobs that are being created due to the periodic task, not relying on just the one process, and any exceptions that come up won't be catastrophic to the periodic process.

Thoughts @Kiall / @mugsie ?

Changed in designate:
assignee: nobody → Federico Ceratto (federico-ceratto)

If we were to flag a zone in ERROR status after an unhandled exception and then move on to the next one we might incur in failure mode where a transient external error (e.g. a short network outage) leads to periodic_sync aggressively flagging many zones.
Periodic_recovery could kick in and restore the zones only after a relatively long time (1h).

We can implement a simple retry logic in periodic_sync to go through the failing zones N times (default 3) after a sleep interval (default 30s)

Tim Simmons (timsim) wrote :

There are cases where the exception that bubbles won't be the fault of the zone (say, a MessagingException), and you don't want to mark the zone in ERROR.

If the load for the CRUD operations is spread over the various processes, the errors that bubble up will be caught and handled just like a normal operation, either during the actual create/delete process, or by the change not getting out when MiniDNS polls for it.

Fix proposed to branch: master
Review: https://review.openstack.org/260997

Changed in designate:
status: Triaged → In Progress

Change abandoned by Federico Ceratto (<email address hidden>) on branch: master
Review: https://review.openstack.org/260997

Backport to stable/liberty https://review.openstack.org/264891

Reviewed: https://review.openstack.org/263295
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=8b761de321401e89d78ba45e3dba801e2b3bfd96
Submitter: Jenkins
Branch: master

commit 8b761de321401e89d78ba45e3dba801e2b3bfd96
Author: Federico Ceratto <email address hidden>
Date: Wed Dec 23 14:20:33 2015 +0000

    Add retry logic on periodic_sync

    Change-Id: I15dadb0d90c7cbdd5fcead9e64398c75be3304b5
    Closes-Bug: #1523691

Changed in designate:
status: In Progress → Fix Released
Kiall Mac Innes (kiall) on 2016-01-18
Changed in designate:
milestone: mitaka-2 → mitaka-3
milestone: mitaka-3 → mitaka-2

Reviewed: https://review.openstack.org/264891
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=9b5f4c1f3d9d827252ae6b8f253fc572df827cea
Submitter: Jenkins
Branch: stable/liberty

commit 9b5f4c1f3d9d827252ae6b8f253fc572df827cea
Author: Federico Ceratto <email address hidden>
Date: Wed Dec 23 14:20:33 2015 +0000

    Add retry logic on periodic_sync to stable/liberty

    Change-Id: I15dadb0d90c7cbdd5fcead9e64398c75be3304b5
    Closes-Bug: #1523691

This issue was fixed in the openstack/designate 2.0.0.0b3 development milestone.

This issue was fixed in the openstack/designate 1.0.2 release.

This issue was fixed in the openstack/designate 1.0.2 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers