MAAS

Bug #1747764
Comment #10

Andres Rodriguez (andreserl) wrote on 2018-02-06: Re: [2.3, ha] rack controller HA fails during a network partition

IMHO, based on rackd.log, it seems that once the rack is unable to connect, there are a lot of tracebacks due to unhandled errors. These unhandled errors could be blocking the code that stops dhcpd.

So, I see various improvements or different bugs. rackd shouldn't traceback on unhandled errors. Instead, it should recognize that it cannot do what it needs to do and stop all services. That includes:

1. Logging a message that it cannot update NTP because of the connection failure:
2018-02-06 21:15:33 provisioningserver.rackdservices.ntp: [critical] Failed to update NTP configuration.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1933:cmd=GetTimeConfiguration:ask=58f3]')

2. Neighbours discovery should do the same as above (in fact, this could be the issue that is preventing the rack from stopping dhcpd):

Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5976]')
2018-02-06 21:17:01 provisioningserver.rpc.clusterservice: [info] Region not available: User timeout caused connection failure. (While requesting RPC info at b'http://[::ffff:10.245.32.102]/MAAS/rpc/').
2018-02-06 21:17:09 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:09 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=ReportNeighbours:ask=5914]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5977]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]

Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5978]')

3. Same for the image download service:

2018-02-06 21:17:33 provisioningserver.rackdservices.image_download_service: [critical] Downloading images failed.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=GetProxies:ask=5916]')
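
As a rough illustration of the behaviour suggested in the three points above, here is a minimal sketch in plain Python. It is not MAAS or Twisted code: RegionUnavailable, Service and run_periodic_task are hypothetical names, and the exception only stands in for an RPC failure such as the UnhandledCommand errors in the logs. The idea is that each periodic task catches the failure, logs one readable line instead of a traceback, and stops the managed services.

```python
import logging

log = logging.getLogger("rackd-sketch")


class RegionUnavailable(Exception):
    """Hypothetical stand-in for an RPC failure (e.g. UnhandledCommand)."""


class Service:
    """Hypothetical managed service, such as dhcpd, with a running flag."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def stop(self):
        self.running = False


def run_periodic_task(name, task, services):
    """Run one periodic task; on RPC failure, log once and stop services.

    Returns True if the task succeeded, False if services were stopped.
    """
    try:
        task()
    except RegionUnavailable as error:
        # One readable log line instead of an unhandled traceback.
        log.error("%s: region unreachable (%s); stopping services", name, error)
        for service in services:
            service.stop()
        return False
    return True


def update_ntp():
    # Simulates the failed GetTimeConfiguration call during the partition.
    raise RegionUnavailable("cmd=GetTimeConfiguration")


dhcpd = Service("dhcpd")
ok = run_periodic_task("ntp", update_ntp, [dhcpd])
```

In this sketch the failed NTP update leaves dhcpd stopped and the caller gets False back, so a failure in one task cannot silently block the shutdown path. In rackd itself the equivalent would presumably be an errback on the relevant Deferred, but that is an assumption about where the fix would go.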
