Further investigation and working with Adam indicate that my assessment in comment 14 is probably correct.
Looking at cloud-init logs, we see things like:
Jul 22 14:46:53 ubuntu [CLOUDINIT] handlers.py[DEBUG]: start: init-network/config-ubuntu-init-switch: running config-ubuntu-init-switch with frequency once-per-instance
Jul 22 14:46:53 ubuntu [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://192.168.10.114:5240/MAAS/metadata/status/4y3h84' with {'allow_redirects': True, 'url': 'http://192.168.10.114:5240/MAAS/metadata/status/4y3h84', 'headers': {'Authorization': 'OAuth oauth_nonce="107803251774003364951469198813", oauth_timestamp="1469198813", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="KCpd3JJTTbtFMjKz7D", oauth_token="9kJS9LRLK7ZaQELPCv", oauth_signature="%26JXB592yRU6RaWGa7GTLegfqWN8H6B9ZE"'}, 'method': 'POST'} configuration
Jul 22 14:47:00 ubuntu [CLOUDINIT] url_helper.py[DEBUG]: Read from http://192.168.10.114:5240/MAAS/metadata/status/4y3h84 (200, 2b) after 1 attempts
Jul 22 14:47:00 ubuntu [CLOUDINIT] util.py[DEBUG]: Writing to /var/lib/cloud/instances/4y3h84/sem/config_ubuntu_init_switch - wb: [420] 25 bytes
Jul 22 14:47:00 ubuntu [CLOUDINIT] helpers.py[DEBUG]: Running co
There, the event post to MAAS took 7 seconds.
So what happened was: curtin finished, cloud-init was in the middle of posting an event, and curtin's reboot fired. cloud-init got the kill signal from systemd and reported a failure.
We can definitely look to improve how curtin interacts with cloud-init around rebooting so that this doesn't happen, but it's not good for an event to take 7 seconds to post. Say a post took 2 seconds, and a deploy-install-boot reported 120 events: that'd be 4 minutes of wall clock spent waiting for API responses.
I don't have any good ideas on how we could handle that. We could background/batch the status posts and move on, but we should probably still wait for them to come back and check their results at some point rather than just ignoring a possible post failure.
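As a rough illustration of the background-then-check idea, here is a minimal sketch using a thread pool: posts are submitted without blocking on each response, and the futures are drained before exit (i.e. before a reboot would fire) so failures aren't silently dropped. `post_event()` and the event names are stand-ins, not real curtin or cloud-init APIs.

```python
# Hypothetical sketch: fire status posts off on a worker pool, keep the
# futures, and check all results before exiting instead of blocking on
# each POST serially.
from concurrent.futures import ThreadPoolExecutor


def post_event(event):
    # Placeholder for the real HTTP POST to the MAAS metadata endpoint.
    # Here it just pretends every post succeeded.
    return (event, True)


def post_events_in_background(events, max_workers=4):
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all posts without waiting on each response.
        futures = [pool.submit(post_event, ev) for ev in events]
        # Before exiting (e.g. before curtin reboots), gather results so
        # failed posts are surfaced rather than ignored.
        for fut in futures:
            event, ok = fut.result()
            if not ok:
                failures.append(event)
    return failures


if __name__ == "__main__":
    events = ["deploy-install-boot/%d" % i for i in range(10)]
    print("failed posts:", post_events_in_background(events))
```

With 4 workers, 120 posts at 2 seconds each would take roughly a minute instead of four, at the cost of having to decide what to do with failures discovered only at the end.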