Designate

Unhandled exception in periodic_sync/recovery halts the process

Series liberty
Bug #1523691

Bug #1523691 reported by Tim Simmons on 2015-12-07

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Designate	Fix Released	Critical	Federico Ceratto	Designate mitaka-2 "m2"
	Liberty	Fix Committed	Undecided	Federico Ceratto

Bug Description

As it stands today:
https://github.com/openstack/designate/blob/5333caa09c41a071aa62a29c19a30584f54ad4a8/designate/pool_manager/service.py#L170-L238

Any unhandled exception that bubbles up (a MessagingTimeout for instance) will cause the periodic process (running in a single greenthread (?)) to halt, and not complete.

If there were a systemic issue with a certain ERROR'd zone that raised an unhandled exception, this would ensure that zones that were in a fixable ERROR state might never recover.

The condition I observed was a great number of ERROR'd zones on a system under load, that caused a messagingtimeout. If this were to run in many threads, the problem might be less impactful, but that might be a separate bug report.

Tim Simmons (timsim) on 2015-12-09

Changed in designate:
status:	New → Triaged
importance:	Undecided → Critical
milestone:	none → mitaka-2

Revision history for this message

Tim Simmons (timsim) wrote on 2015-12-09:

I'm wondering if the right way to do this might be for periodic sync to grab a connection to the Pool Manager's rpcapi, and call crud_zone's that way.

That should enable any other pool manager worker processes locally or elsewhere to take up the jobs that are being created due to the periodic task, not relying on just the one process, and any exceptions that come up won't be catastrophic to the periodic process.

Thoughts @Kiall / @mugsie ?

Federico Ceratto (federico-ceratto) on 2015-12-11

Changed in designate:
assignee:	nobody → Federico Ceratto (federico-ceratto)

Revision history for this message

Federico Ceratto (federico-ceratto) wrote on 2015-12-14:

If we were to flag a zone in ERROR status after an unhandled exception and then move on to the next one we might incur in failure mode where a transient external error (e.g. a short network outage) leads to periodic_sync aggressively flagging many zones.
Periodic_recovery could kick in and restore the zones only after a relatively long time (1h).

We can implement a simple retry logic in periodic_sync to go through the failing zones N times (default 3) after a sleep interval (default 30s)

Revision history for this message

Tim Simmons (timsim) wrote on 2015-12-14:

There are cases where the exception that bubbles won't be the fault of the zone (say, a MessagingException), and you don't want to mark the zone in ERROR.

If the load for the CRUD operations is spread over the various processes, the errors that bubble up will be caught and handled just like a normal operation, either during the actual create/delete process, or by the change not getting out when MiniDNS polls for it.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-12-23: Fix proposed to designate (master)

Fix proposed to branch: master
Review: https://review.openstack.org/260997

Changed in designate:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-01-04: Change abandoned on designate (master)

Change abandoned by Federico Ceratto (<email address hidden>) on branch: master
Review: https://review.openstack.org/260997

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-01-04: Fix proposed to designate (master)

Fix proposed to branch: master
Review: https://review.openstack.org/263295

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-01-07: Fix proposed to designate (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/264891

Revision history for this message

Federico Ceratto (federico-ceratto) wrote on 2016-01-07:

Backport to stable/liberty https://review.openstack.org/264891

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-01-18: Fix merged to designate (master)

Reviewed: https://review.openstack.org/263295
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=8b761de321401e89d78ba45e3dba801e2b3bfd96
Submitter: Jenkins
Branch: master

commit 8b761de321401e89d78ba45e3dba801e2b3bfd96
Author: Federico Ceratto <email address hidden>
Date: Wed Dec 23 14:20:33 2015 +0000

Add retry logic on periodic_sync

Change-Id: I15dadb0d90c7cbdd5fcead9e64398c75be3304b5
Closes-Bug: #1523691

Changed in designate:
status:	In Progress → Fix Released

Kiall Mac Innes (kiall) on 2016-01-18

Changed in designate:
milestone:	mitaka-2 → mitaka-3
milestone:	mitaka-3 → mitaka-2

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-01-18: Fix merged to designate (stable/liberty)

#10

Reviewed: https://review.openstack.org/264891
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=9b5f4c1f3d9d827252ae6b8f253fc572df827cea
Submitter: Jenkins
Branch: stable/liberty

commit 9b5f4c1f3d9d827252ae6b8f253fc572df827cea
Author: Federico Ceratto <email address hidden>
Date: Wed Dec 23 14:20:33 2015 +0000

Add retry logic on periodic_sync to stable/liberty

Change-Id: I15dadb0d90c7cbdd5fcead9e64398c75be3304b5
Closes-Bug: #1523691

Revision history for this message

Thierry Carrez (ttx) wrote on 2016-03-03: Fix included in openstack/designate 2.0.0.0b3

#11

This issue was fixed in the openstack/designate 2.0.0.0b3 development milestone.

Revision history for this message

Davanum Srinivas (DIMS) (dims-v) wrote on 2016-03-22: Fix included in openstack/designate 1.0.2

#12

This issue was fixed in the openstack/designate 1.0.2 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-11-10:

#13

This issue was fixed in the openstack/designate 1.0.2 release.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.