cloud-config-url ignored, install fails

Bug #1402861 reported by Jason Hobbs
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
MAAS | Invalid | Undecided | Unassigned |

Bug Description

I have a node that's booting up for installation, but apparently it doesn't successfully retrieve cloud config from the cloud-config-url provided as a kernel command line parameter. This manifests as a 'Failed Deployment': the node is left powered on, but I can't ssh into it because no keys are installed. I had to capture console traffic to figure out what was happening. This happens when I try to install many nodes at once, and it causes many of them to fail installation.

[ 0.000000] Command line: nomodeset iscsi_target_name=iqn.2004-05.com.ubuntu:maas:ephemeral-ubuntu-amd64-generic-trusty-daily iscsi_target_ip=10.245.0.10 iscsi_target_port=3260 iscsi_initiator=hayward-18 ip=::::hayward-18:BOOTIF ro root=/dev/disk/by-path/ip-10.245.0.10:3260-iscsi-iqn.2004-05.com.ubuntu:maas:ephemeral-ubuntu-amd64-generic-trusty-daily-lun-1 overlayroot=tmpfs cloud-config-url=http://10.245.0.10/MAAS/metadata/latest/by-id/node-9d4e6a5a-c4cd-11e3-824b-00163efc5068/?op=get_preseed log_host=10.245.0.10 log_port=514 -- console=ttyS0,9600n8 initrd=ubuntu/amd64/generic/trusty/daily/boot-initrd BOOT_IMAGE=ubuntu/amd64/generic/trusty/daily/boot-kernel BOOTIF=01-00-22-99-e0-01-36
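To check which cloud-config-url a node was actually booted with, the command line above can be split into key/value pairs. This is only a minimal sketch: `parse_cmdline` is a hypothetical helper, not part of cloud-init or MAAS, and the sample string is a shortened copy of the command line from this report.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into a {key: value} dict.

    Hypothetical helper for illustration only. Tokens without '='
    (such as 'ro' or 'nomodeset') map to None; the bare '--'
    separator is skipped.
    """
    params = {}
    for token in cmdline.split():
        if token == "--":
            continue
        # Split on the first '=' only, so URLs with '?op=...' stay intact.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Shortened copy of the command line from the console log.
sample = (
    "nomodeset ip=::::hayward-18:BOOTIF ro overlayroot=tmpfs "
    "cloud-config-url=http://10.245.0.10/MAAS/metadata/latest/by-id/"
    "node-9d4e6a5a-c4cd-11e3-824b-00163efc5068/?op=get_preseed "
    "log_host=10.245.0.10 log_port=514 -- console=ttyS0,9600n8"
)

params = parse_cmdline(sample)
print(params["cloud-config-url"])
```

On a running node the same function could be fed the contents of /proc/cmdline.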

2014-12-15 23:07:53,602 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [50/120s]: request error [(<urllib3.connectionpool.HTTPConnectionPool object at 0x7f7012eb2d90>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)')]
2014-12-15 23:08:44,655 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [101/120s]: request error [(<urllib3.connectionpool.HTTPConnectionPool object at 0x7f7012eb2e10>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)')]
2014-12-15 23:09:02,674 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [119/120s]: request error [(<urllib3.connectionpool.HTTPConnectionPool object at 0x7f7012eb2350>, 'Connection to 169.254.169.254 timed out. (connect timeout=17.0)')]
2014-12-15 23:09:03,675 - DataSourceEc2.py[CRITICAL]: Giving up on md from ['http://169.254.169.254/2009-04-04/meta-data/instance-id'] after 120 seconds
2014-12-15 23:09:53,768 - url_helper.py[WARNING]: Calling 'http://10.245.0.1//latest/meta-data/instance-id' failed [50/120s]: request error [(<urllib3.connectionpool.HTTPConnectionPool object at 0x7f7012eb24d0>, 'Connection to 10.245.0.1 timed out. (connect timeout=50.0)')]
2014-12-15 23:10:44,789 - url_helper.py[WARNING]: Calling 'http://10.245.0.1//latest/meta-data/instance-id' failed [101/120s]: request error [(<urllib3.connectionpool.HTTPConnectionPool object at 0x7f7012eb2450>, 'Connection to 10.245.0.1 timed out. (connect timeout=50.0)')]
2014-12-15 23:11:02,809 - url_helper.py[WARNING]: Calling 'http://10.245.0.1//latest/meta-data/instance-id' failed [119/120s]: request error [(<urllib3.connectionpool.HTTPConnectionPool object at 0x7f7012eb29d0>, 'Connection to 10.245.0.1 timed out. (connect timeout=17.0)')]
2014-12-15 23:11:03,810 - DataSourceCloudStack.py[CRITICAL]: Giving up on waiting for the metadata from ['http://10.245.0.1//latest/meta-data/instance-id'] after 120 seconds

Full console log:
http://paste.ubuntu.com/9533934/

Tags: oil
Jason Hobbs (jason-hobbs) wrote :

I think the eth1 thing is a red herring. I booted the system again, this time by itself (the previous boot was at the same time as many other systems). It booted from eth1 again, but succeeded. I don't see any info about retrieving the cloud-config data. Maybe the issue is that the earlier retrieval attempt timed out without logging an error message?

http://paste.ubuntu.com/9534353/

description: updated
Christian Reis (kiko)
Changed in maas:
milestone: none → next
Blake Rouse (blake-rouse) wrote :

Have you been able to diagnose this issue any further?

It seems like a weird failure, and more related to the network the nodes are coming up on than to MAAS.

Jason Hobbs (jason-hobbs) wrote :

No, I haven't. My assumption is that it's a timeout in cloud-init, because MAAS is busy at the time (it may have a fully utilized link and might be dropping packets as a result). I spent some time digging in the cloud-init code but couldn't nail down where the request to the cloud-config URL should be happening. I'd really like to find that and understand whether it has a timeout associated with it, and why I'm not seeing any log messages indicating an attempt or failure to reach it. But it's apparently not happening under normal conditions on OIL, only when we do load testing, so it's dropped off my priority list for now.
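For comparison, here is a rough sketch of what a fetch with an explicit timeout and a log line per attempt could look like. This is not cloud-init's code, and it makes no claim about how cloud-init actually handles cloud-config-url. The function name `fetch_cloud_config`, its parameters, and the default values are all assumptions made for illustration.

```python
import logging
import time
import urllib.request

log = logging.getLogger("cloud_config_fetch")


def fetch_cloud_config(url, timeout=10.0, attempts=3, delay=1.0,
                       opener=urllib.request.urlopen):
    """Fetch `url`, logging every failed attempt.

    Hypothetical helper, not cloud-init's implementation. Returns the
    response body as bytes, or None if every attempt fails. `opener`
    is injectable so the retry logic can be exercised without a network.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:
            # URLError, HTTPError and socket timeouts all derive from OSError.
            log.warning("fetching %s failed (attempt %d/%d): %s",
                        url, attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(delay)
    log.error("giving up on %s after %d attempts", url, attempts)
    return None
```

With something like this, a busy server would leave a warning for each failed attempt on the console instead of failing silently, which is the kind of evidence missing from the logs in this report.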

Changed in maas:
status: New → Invalid
Matt Rae (mattrae) wrote :

I experienced this issue with MAAS 2.0.0~beta3. The issue appeared on some machines enlisting into MAAS. I could see that cloud-config-url in the kernel parameters pointed at the MAAS server, but when the node ran cloud-init, it attempted to connect to 169.254.169.254 and then to the gateway IP before giving up.

After restarting all MAAS services on the rack-controller and region-controller, the issue went away.

Changed in maas:
milestone: next → none