Comment 10 for bug 1747764

Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: [2.3, ha] rack controller HA fails during a network partition

IMHO, based on the logs from rackd.log it seems that after the rack is unable to connect, there are a lot of tracebacks due to unhandled errors. These unhandled errors could be blocking the code that stops dhcpd.

So, I see various improvements or different bugs. rackd shouldn't traceback on unhandled errors, in return, it should recognize it cannot do what it needs to do and stop all services, and that includes:

1. spitting out a message that because of the connection it cannot update ntp:
2018-02-06 21:15:33 provisioningserver.rackdservices.ntp: [critical] Failed to update NTP configuration.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1933:cmd=GetTimeConfiguration:ask=58f3]')

2. The neighbours discovery should do the same as above (and in fact, it could be this isse the one that's preventing the rack on stopping dhcpd)

Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5976]')
2018-02-06 21:17:01 provisioningserver.rpc.clusterservice: [info] Region not available: User timeout caused connection failure. (While requesting RPC info at b'http://[::ffff:10.245.32.102]/MAAS/rpc/').
2018-02-06 21:17:09 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:09 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=ReportNeighbours:ask=5914]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5977]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5978]')

3. Same for image download service:

2018-02-06 21:17:33 provisioningserver.rackdservices.image_download_service: [critical] Downloading images failed.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=GetProxies:ask=5916]')