Comment 16 for bug 1604962

Scott Moser (smoser) wrote :

Further investigation and work with Adam indicate that my assessment in comment 14 is probably correct.
Looking at the cloud-init logs, we see things like:

Jul 22 14:46:53 ubuntu [CLOUDINIT] handlers.py[DEBUG]: start: init-network/config-ubuntu-init-switch: running config-ubuntu-init-switch with frequency once-per-instance
Jul 22 14:46:53 ubuntu [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://192.168.10.114:5240/MAAS/metadata/status/4y3h84' with {'allow_redirects': True, 'url': 'http://192.168.10.114:5240/MAAS/metadata/status/4y3h84', 'headers': {'Authorization': 'OAuth oauth_nonce="107803251774003364951469198813", oauth_timestamp="1469198813", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="KCpd3JJTTbtFMjKz7D", oauth_token="9kJS9LRLK7ZaQELPCv", oauth_signature="%26JXB592yRU6RaWGa7GTLegfqWN8H6B9ZE"'}, 'method': 'POST'} configuration
Jul 22 14:47:00 ubuntu [CLOUDINIT] url_helper.py[DEBUG]: Read from http://192.168.10.114:5240/MAAS/metadata/status/4y3h84 (200, 2b) after 1 attempts
Jul 22 14:47:00 ubuntu [CLOUDINIT] util.py[DEBUG]: Writing to /var/lib/cloud/instances/4y3h84/sem/config_ubuntu_init_switch - wb: [420] 25 bytes
Jul 22 14:47:00 ubuntu [CLOUDINIT] helpers.py[DEBUG]: Running co

There, the event post to MAAS took 7 seconds (opened at 14:46:53, response read at 14:47:00).
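For anyone who wants to check the same thing on other logs, here is a minimal sketch of pulling the post duration out of the two url_helper.py timestamps. The helper name `post_duration` is made up for illustration, and the year 2016 is an assumption taken from the oauth_timestamp (1469198813) in the request line, since syslog timestamps carry no year.

```python
from datetime import datetime

# Syslog timestamps have no year; 2016 is assumed from the oauth_timestamp
# (1469198813) seen in the request line.
ASSUMED_YEAR = "2016"

def post_duration(open_stamp, read_stamp):
    """Return seconds between the 'open' and 'Read from' log lines.

    Hypothetical helper: takes the leading 'Mon DD HH:MM:SS' part of each line.
    """
    fmt = "%Y %b %d %H:%M:%S"
    opened = datetime.strptime(ASSUMED_YEAR + " " + open_stamp, fmt)
    read = datetime.strptime(ASSUMED_YEAR + " " + read_stamp, fmt)
    return (read - opened).total_seconds()

seconds = post_duration("Jul 22 14:46:53", "Jul 22 14:47:00")
print(seconds)
```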

So what happened was curtin finished, cloud-init was in the middle of posting an event, and curtin's reboot fired. cloud-init got the kill signal from systemd and reported its failure.

We can definitely look to improve how curtin interacts with cloud-init for rebooting so that this doesn't happen, but it's not good if an event takes 7 seconds to post. Say a post event took 2 seconds, and a deploy-install-boot reported 120 events. That would be 4 minutes of wall clock spent waiting for API responses.
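The arithmetic, as a quick sketch. The 120 events and 2 seconds are the hypothetical figures from above, and `blocking_wait_seconds` is just an illustrative name; this assumes every post blocks and runs one after another.

```python
def blocking_wait_seconds(events, seconds_per_post):
    """Total wall-clock time spent waiting if every post is synchronous."""
    return events * seconds_per_post

# Hypothetical figures from the comment: 120 events at 2 seconds each.
typical = blocking_wait_seconds(events=120, seconds_per_post=2)
print("%d s = %.0f min" % (typical, typical / 60))

# Same event count at the 7 seconds observed in the log above.
observed = blocking_wait_seconds(events=120, seconds_per_post=7)
print("%d s = %.0f min" % (observed, observed / 60))
```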

I don't have any good ideas on how we could handle that. We could background or batch the status posts and carry on, but we should probably still wait for them to come back and check their results at some point, rather than just ignoring a possible post failure.
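A rough sketch of what "background the posts, check the results later" could look like, using only the Python standard library. This is not cloud-init or curtin code: `BackgroundReporter` and `post_event` are hypothetical names, and `post_event` is a stand-in that sleeps instead of making the real HTTP POST.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def post_event(event):
    """Hypothetical stand-in for the blocking POST to the status URL."""
    time.sleep(0.05)  # simulate a slow API response
    if event.get("fail"):
        raise IOError("post failed: %s" % event["name"])
    return 200

class BackgroundReporter:
    """Queue status posts on a worker thread and check the results later."""

    def __init__(self, poster, workers=1):
        # A single worker keeps posts in the order they were submitted.
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._poster = poster
        self._pending = []

    def report(self, event):
        # Returns immediately; the post itself runs on the worker thread.
        self._pending.append((event, self._pool.submit(self._poster, event)))

    def flush(self, timeout=None):
        """Wait for outstanding posts; return a list of (event, error)."""
        _, not_done = wait([f for _, f in self._pending], timeout=timeout)
        failures = []
        for event, fut in self._pending:
            if fut in not_done:
                failures.append((event, TimeoutError("post still pending")))
            elif fut.exception() is not None:
                failures.append((event, fut.exception()))
        # Keep only the posts that have not finished yet.
        self._pending = [(e, f) for e, f in self._pending if f in not_done]
        return failures

    def close(self):
        self._pool.shutdown(wait=True)

reporter = BackgroundReporter(post_event)
start = time.monotonic()
for i in range(5):
    reporter.report({"name": "event-%d" % i, "fail": i == 3})
submit_time = time.monotonic() - start  # time the caller spent blocked

failures = reporter.flush(timeout=5)  # e.g. call this before a reboot
reporter.close()
for event, error in failures:
    print("failed to post", event["name"], error)
```

The caller only pays for queueing each event, and the one place that must block is `flush()`. Something like that would have to run before the reboot is triggered, otherwise in-flight posts get killed just as they do today.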