Unhandled exception in periodic_sync/recovery halts the process
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Designate | Fix Released | Critical | Federico Ceratto |
Liberty | Fix Committed | Undecided | Federico Ceratto |
Bug Description
As it stands today:
https:/
Any unhandled exception that bubbles up (a MessagingTimeout, for instance) causes the periodic process (running in a single greenthread?) to halt without completing.
If a systemic issue with one ERROR'd zone raises an unhandled exception, other zones that are in a fixable ERROR state might never recover.
The condition I observed was a large number of ERROR'd zones on a system under load, which caused a MessagingTimeout. If this ran in many threads, the problem might be less impactful, but that might be a separate bug report.
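To illustrate the failure mode described above, here is a minimal sketch of a recovery loop that isolates failures per zone, so one exception does not halt the whole pass. It is not Designate's actual code: `periodic_recovery`, `recover_zone`, and the local `MessagingTimeout` class are hypothetical stand-ins used only for illustration.

```python
import logging

LOG = logging.getLogger(__name__)


class MessagingTimeout(Exception):
    """Hypothetical stand-in for the messaging timeout seen in the report."""


def periodic_recovery(zones, recover_zone):
    """Attempt recovery for every zone; one failure must not stop the loop.

    `recover_zone` is a hypothetical callable taking a single zone.
    Returns the list of zones whose recovery raised an exception.
    """
    failed = []
    for zone in zones:
        try:
            recover_zone(zone)
        except Exception:
            # Log the traceback and continue with the next zone instead
            # of letting the exception bubble up and end the periodic run.
            LOG.exception("Recovery failed for zone %s", zone)
            failed.append(zone)
    return failed
```

With this shape, a zone that always times out is reported on every pass, but the zones after it still get their recovery attempt.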
Changed in designate:
status: New → Triaged
importance: Undecided → Critical
milestone: none → mitaka-2
Changed in designate:
assignee: nobody → Federico Ceratto (federico-ceratto)
Changed in designate:
milestone: mitaka-2 → mitaka-3
milestone: mitaka-3 → mitaka-2
I'm wondering if the right way to do this might be for periodic sync to grab a connection to the Pool Manager's rpcapi and make the crud_zone calls that way.
That should let any other pool manager worker process, local or remote, pick up the jobs created by the periodic task instead of relying on a single process, and any exceptions that come up won't be catastrophic to the periodic process.
Thoughts @Kiall / @mugsie ?
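A rough sketch of the idea in this comment, assuming a local thread pool as a stand-in for handing jobs to other pool manager workers over RPC (the real proposal would go through the Pool Manager's rpcapi, which is not shown here). `dispatch_zones` and `update_zone` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def dispatch_zones(zones, update_zone, max_workers=4):
    """Submit one job per zone and collect failures instead of raising.

    `update_zone` is a hypothetical callable standing in for the remote
    call; the dispatcher itself never dies because of a job's exception.
    Returns a dict mapping each failed zone to the exception it raised.
    """
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(update_zone, zone): zone for zone in zones}
        for future in as_completed(futures):
            zone = futures[future]
            exc = future.exception()
            if exc is not None:
                # Record the failure; the remaining jobs keep running.
                errors[zone] = exc
    return errors
```

The point of the design is that the periodic task only creates jobs and records outcomes, so an exception in any single job stays contained in that job.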