IMHO, based on rackd.log, it seems that once the rack is unable to connect to the region there are a lot of tracebacks due to unhandled errors. These unhandled errors could be blocking the code that stops dhcpd.
So, I see various improvements, or possibly separate bugs. rackd shouldn't traceback on unhandled errors; instead, it should recognize that it cannot do what it needs to do and stop all services. That includes:
1. Logging a message that it cannot update NTP because of the connection, rather than this:
2018-02-06 21:15:33 provisioningserver.rackdservices.ntp: [critical] Failed to update NTP configuration.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1933:cmd=GetTimeConfiguration:ask=58f3]')
2. The neighbour discovery should do the same as above (in fact, this could be the issue that is preventing the rack from stopping dhcpd):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5976]')
2018-02-06 21:17:01 provisioningserver.rpc.clusterservice: [info] Region not available: User timeout caused connection failure. (While requesting RPC info at b'http://[::ffff:10.245.32.102]/MAAS/rpc/').
2018-02-06 21:17:09 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:09 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=ReportNeighbours:ask=5914]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5977]')
2018-02-06 21:17:10 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-02-06 21:17:10 twisted.internet.defer: [critical]
Traceback (most recent call last):
Failure: twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1944:cmd=ReportNeighbours:ask=5978]')
3. Same for image download service:
2018-02-06 21:17:33 provisioningserver.rackdservices.image_download_service: [critical] Downloading images failed.
[...]
twisted.protocols.amp.UnhandledCommand: (b'UNHANDLED', 'Unknown Error [everitt:pid=1946:cmd=GetProxies:ask=5916]')
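
To illustrate the behaviour I'm suggesting for the three cases above, here is a rough sketch in plain Python (standard library only, not Twisted, and not the actual MAAS code). Everything in it is made up for illustration: `RegionUnavailable`, `RackServices`, `run_guarded`, and the service names are hypothetical. The idea is only that a failed RPC call produces a single log line and triggers stopping the local services, instead of leaving an unhandled error behind. In rackd itself this would presumably be done with an errback on the Deferred rather than a try/except.

```python
import logging

logger = logging.getLogger("rackd-sketch")


class RegionUnavailable(Exception):
    """Hypothetical stand-in for an RPC failure such as amp.UnhandledCommand."""


class RackServices:
    """Hypothetical supervisor that tracks which local services are running."""

    def __init__(self, names):
        self.running = set(names)

    def stop_all(self):
        # Return the names that were stopped so the caller can log them.
        stopped = sorted(self.running)
        self.running.clear()
        return stopped


def run_guarded(task_name, rpc_call, services):
    """Run one periodic task; on RPC failure, log one line and stop services.

    Returns True if the task succeeded and False if it was skipped.
    """
    try:
        rpc_call()
    except RegionUnavailable as exc:
        logger.warning("%s skipped: region unreachable (%s)", task_name, exc)
        stopped = services.stop_all()
        if stopped:
            logger.warning("stopped services: %s", ", ".join(stopped))
        return False
    return True


def failing_call():
    # Simulates the region rejecting or never answering the command.
    raise RegionUnavailable("cmd=ReportNeighbours")


services = RackServices(["dhcpd", "ntp"])  # hypothetical service names
ok = run_guarded("neighbour discovery", failing_call, services)
```

With this shape, the NTP update, neighbour discovery, and image download tasks would all go through the same guard, so none of them could leave an unhandled error that gets in the way of stopping dhcpd.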