[SRU] cloud-init Azure Datasource does not detect new instance

Bug #1269626 reported by Ben Howard on 2014-01-15
Affects                  Importance   Assigned to
cloud-init (Ubuntu)      Medium       Unassigned
  Precise                Undecided    Unassigned
  Saucy                  Undecided    Unassigned
  Trusty                 Medium       Unassigned
  (Declined for Quantal by Ben Howard)
walinuxagent (Ubuntu)    Undecided    Unassigned
  Precise                Undecided    Unassigned
  Saucy                  Undecided    Unassigned
  Trusty                 Undecided    Unassigned
  (Declined for Quantal by Ben Howard)

Bug Description

SRU Justification

[IMPACT] Capturing a Windows Azure Linux instance controlled by Cloud-init causes any future instance based on it to fail to boot.

Since booting on Windows Azure for Ubuntu is controlled via Cloud-init and WALinuxAgent, there are two causes.

[WALinuxAgent Issue]: Sends old hostname on captured instances
In version 1.4.2, WALinuxAgent changed the dhclient configuration to send the right hostname via DHCP to the fabric. The current version of WALinuxAgent configures DHCP to send the old hostname. Since the fabric never receives a request for the new hostname, the VIP for the new instance is never opened. Users will see "Connection refused" when attempting to connect to the instance.

[WALinuxAgent Regression Potential]: Low. This change simply changes "send host-name <hostname>" to "send host-name = gethostname();" on Windows Azure instances.
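
For illustration only, the dhclient change described above can be sketched in Python as a text rewrite. This is not the WALinuxAgent patch itself: the function name use_dynamic_hostname and the regular expression are assumptions made for this example, and the sketch only operates on configuration text passed in as a string.

```python
import re

# Hypothetical pattern: a "send host-name ...;" statement with a fixed value.
# The negative lookahead skips lines already using "= gethostname();".
_STATIC_HOSTNAME = re.compile(
    r'^(\s*)send\s+host-name\s+(?!=)[^;]*;', re.MULTILINE
)


def use_dynamic_hostname(conf_text):
    """Return conf_text with static host-name statements made dynamic.

    Each 'send host-name <hostname>;' line becomes
    'send host-name = gethostname();' so that dhclient sends the
    hostname the machine has at request time.
    """
    return _STATIC_HOSTNAME.sub(r'\1send host-name = gethostname();', conf_text)


if __name__ == "__main__":
    sample = 'timeout 300;\nsend host-name "utl-0115-s1";\n'
    print(use_dynamic_hostname(sample))
```

Running the rewrite a second time leaves the text unchanged, which matters on a captured image where the configuration may already have been touched.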

[Cloud-init Issue]: Race condition on captured instances
Cloud-init configures Windows Azure by triggering WALinuxAgent's waagent daemon. The daemon goes off and gathers some files from the fabric, which takes a few seconds. Cloud-init, after firing off the agent, waits for the file SharedConfig.xml to land on disk. In the event of an instance being captured, Cloud-init fires off the waagent daemon but then immediately finds the old SharedConfig.xml, resulting in Cloud-init seeing the old instance XML. Thus, the new instance is not provisioned.

[Cloud-init Regression Potential]: Low. The change here detects if the machine is a new instance by the presence of the ovf-env.xml file; if the file exists, then SharedConfig.xml is removed. WALinuxAgent will fetch a new one. Further, this file is not required unless the instance is new.
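
A minimal sketch of the approach described above, under stated assumptions: this is not the code from the cloud-init patch. The function name wait_for_fresh_shared_config, its parameters, and the start_agent callback are hypothetical, and the paths are passed in by the caller rather than assumed.

```python
import os
import time


def wait_for_fresh_shared_config(ovf_env_path, shared_config_path,
                                 start_agent, timeout=60.0, poll=0.5):
    """Start the agent and wait for SharedConfig.xml, avoiding stale data.

    If ovf-env.xml is present, the instance is treated as new and any
    SharedConfig.xml left over from the captured image is removed first,
    so the wait loop can only succeed on a freshly fetched file.
    """
    if os.path.exists(ovf_env_path) and os.path.exists(shared_config_path):
        os.unlink(shared_config_path)  # drop the stale copy

    start_agent()  # hypothetical callback that triggers the waagent daemon

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(shared_config_path):
            return shared_config_path
        time.sleep(poll)
    raise TimeoutError("SharedConfig.xml did not appear in time")
```

Removing the stale file before starting the agent, rather than after, is what closes the race: the existence check can no longer be satisfied by the old file.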

[Test case]
1. Install new walinuxagent
2. Install new cloud-init
3. Shutdown instance
4. Capture instance
5. Launch new instance from captured instance
6. The instance should respond via the network
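
Step 6 can be checked by hand with ssh. As a rough sketch, the same check could be scripted in Python with a plain TCP connect; the helper name responds_on_port is hypothetical, and port 22 is assumed because the instances in this report expose an ssh endpoint.

```python
import socket


def responds_on_port(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    A refused connection, timeout, or name resolution failure all
    count as "not responding".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This only confirms that the endpoint accepts connections; it does not confirm that provisioning completed or that login works.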

[ORIGINAL REPORT]

Upon 'capturing' an Ubuntu VM on Windows Azure, instances do not come up. It appears that the cause is that Cloud-init is not looking at any datasources.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: cloud-init 0.7.5~bzr902-0ubuntu1
ProcVersionSignature: Ubuntu 3.13.0-2.17-generic 3.13.0-rc7
Uname: Linux 3.13.0-2-generic x86_64
ApportVersion: 2.13.1-0ubuntu1
Architecture: amd64
Date: Wed Jan 15 22:29:49 2014
PackageArchitecture: all
SourcePackage: cloud-init
UpgradeStatus: No upgrade log present (probably fresh install)

Changed in cloud-init (Ubuntu):
assignee: nobody → Ben Howard (utlemming)
importance: Undecided → Medium
status: New → In Progress

Log snip showing end of first boot and second boot.

fig_power_state_change - wb: [420] 20 bytes
Jan 15 20:09:31 utl-0115-lp1268050 [CLOUDINIT] helpers.py[DEBUG]: Running config-power-state-change using lock (<FileLock using file '/var/lib/cloud/instances/9becc63a498847de99d6961c7ae0a95f/sem/config_power_state_change'>)
Jan 15 20:09:31 utl-0115-lp1268050 [CLOUDINIT] cc_power_state_change.py[DEBUG]: no power_state provided. doing nothing
Jan 15 20:09:31 utl-0115-lp1268050 [CLOUDINIT] cloud-init[DEBUG]: Ran 10 modules with 0 failures
Jan 15 20:09:31 utl-0115-lp1268050 [CLOUDINIT] util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Jan 15 20:09:31 utl-0115-lp1268050 [CLOUDINIT] util.py[DEBUG]: Read 11 bytes from /proc/uptime
Jan 15 20:09:31 utl-0115-lp1268050 [CLOUDINIT] util.py[DEBUG]: cloud-init mode 'modules' took 0.407 seconds (0.41)
2014-01-15 21:52:19,552 - util.py[DEBUG]: Cloud-init v. 0.7.5 running 'init-local' at Wed, 15 Jan 2014 21:52:19 +0000. Up 22.39 seconds.
2014-01-15 21:52:19,668 - util.py[DEBUG]: Writing to /var/log/cloud-init.log - ab: [420] 0 bytes
2014-01-15 21:52:19,669 - util.py[DEBUG]: Changing the ownership of /var/log/cloud-init.log to 101:4
2014-01-15 21:52:19,669 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/instance/boot-finished
2014-01-15 21:52:19,670 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/instance
2014-01-15 21:52:19,670 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/data/no-net
2014-01-15 21:52:19,670 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2014-01-15 21:52:19,672 - importer.py[DEBUG]: Looking for modules ['ubuntu', 'cloudinit.distros.ubuntu'] that have attributes ['Distro']
2014-01-15 21:52:19,672 - importer.py[DEBUG]: Failed at attempted import of 'ubuntu' due to: No module named ubuntu
2014-01-15 21:52:19,720 - importer.py[DEBUG]: Found ubuntu with attributes ['Distro'] in ['cloudinit.distros.ubuntu']
2014-01-15 21:52:19,721 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>
2014-01-15 21:52:19,721 - __init__.py[DEBUG]: Looking for for data source in: ['Azure'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']
2014-01-15 21:52:19,721 - importer.py[DEBUG]: Looking for modules ['DataSourceAzure', 'cloudinit.sources.DataSourceAzure'] that have attributes ['get_datasource_list']
2014-01-15 21:52:19,721 - importer.py[DEBUG]: Failed at attempted import of 'DataSourceAzure' due to: No module named DataSourceAzure
2014-01-15 21:52:19,965 - importer.py[DEBUG]: Found DataSourceAzure with attributes ['get_datasource_list'] in ['cloudinit.sources.DataSourceAzure']
2014-01-15 21:52:19,965 - __init__.py[DEBUG]: Searching for data source in: []
2014-01-15 21:52:19,965 - cloud-init[DEBUG]: No local datasource found
2014-01-15 21:52:19,966 - util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
2014-01-15 21:52:19,966 - util.py[DEBUG]: Read 11 bytes from /proc/uptime
2014-01-15 21:52:19,966 - util.py[DEBUG]: cloud-init mode 'init' took 0.594 seconds (0.59)

description: updated
summary: - cloud-init uses incorrect GUID for instance ID
+ cloud-init Azure Datasource does not detect new instance

Confirmed that this is happening in Saucy too.

$ ssh <email address hidden>
ssh: connect to host utl-0115-utl-0115-s1-A1.cloudapp.net port 22: Connection refused
ssh: connect to host utl-0115-utl-0115-s1-A1.cloudapp.net port 22: Connection refused

From Saucy instance, here is walinuxagent's /var/lib/waagent/waagent.log

Cloud-init is using ba5ce29e7b4d435f8d70ed5b7c471875 as the instance ID, but that instance ID is not unique when a machine is captured.

2014/01/12 08:12:30 Configured SSH client probing to keep connections alive.
2014/01/15 22:44:47 Windows Azure Linux Agent Version: WALinuxAgent-1.3.2
2014/01/15 22:44:47 Linux Distribution Detected : Ubuntu
2014/01/15 22:44:47 IPv4 address: 100.92.36.21
2014/01/15 22:44:47 MAC address: 00:15:5D:86:BE:4D
2014/01/15 22:44:47 Probing for Windows Azure environment.
2014/01/15 22:44:48 DoDhcpWork: Setting socket.timeout=10, entering recv
2014/01/15 22:44:48 Discovered Windows Azure endpoint: 100.92.36.60
2014/01/15 22:44:48 WARNING:Newer wire protocol version detected. Please consider updating waagent.
2014/01/15 22:44:48 Negotiated wire protocol version: 2011-12-31
2014/01/15 22:44:48 Retrieved GoalState from Windows Azure Fabric.
2014/01/15 22:44:48 ExpectedState: Started
2014/01/15 22:44:48 ContainerId: fc217d55-31e7-485e-a82f-75b2a90a1a0a
2014/01/15 22:44:48 RoleInstanceId: ba5ce29e7b4d435f8d70ed5b7c471875.utl-0115-s1
2014/01/15 22:44:49 Public cert with thumbprint: D3BCD6F2904D5E4B5E8155ED1E0A698C7B14F007 was retrieved.
2014/01/15 22:46:09 Configured SSH client probing to keep connections alive.
2014/01/15 22:46:09 Windows Azure Linux Agent Version: WALinuxAgent-1.3.2
2014/01/15 22:46:09 Linux Distribution Detected : Ubuntu
2014/01/15 22:46:09 IPv4 address: 100.92.36.21
2014/01/15 22:46:09 MAC address: 00:15:5D:86:BE:4D
2014/01/15 22:46:09 Probing for Windows Azure environment.
2014/01/15 22:46:09 DoDhcpWork: Setting socket.timeout=10, entering recv
2014/01/15 22:46:09 Discovered Windows Azure endpoint: 100.92.36.60
2014/01/15 22:46:09 WARNING:Newer wire protocol version detected. Please consider updating waagent.
2014/01/15 22:46:09 Negotiated wire protocol version: 2011-12-31
2014/01/15 22:46:09 Retrieved GoalState from Windows Azure Fabric.
2014/01/15 22:46:09 ExpectedState: Started
2014/01/15 22:46:09 ContainerId: fc217d55-31e7-485e-a82f-75b2a90a1a0a
2014/01/15 22:46:09 RoleInstanceId: ba5ce29e7b4d435f8d70ed5b7c471875.utl-0115-s1
2014/01/15 22:46:09 Public cert with thumbprint: D3BCD6F2904D5E4B5E8155ED1E0A698C7B14F007 was retrieved.
2014/01/15 22:54:32 Windows Azure Linux Agent Version: WALinuxAgent-1.3.2
2014/01/15 22:54:32 Linux Distribution Detected : Ubuntu
2014/01/15 22:54:32 IPv4 address: 100.92.38.42
2014/01/15 22:54:32 MAC address: 00:15:5D:86:BC:79
2014/01/15 22:54:32 Probing for Windows Azure environment.
2014/01/15 22:54:33 DoDhcpWork: Setting socket.timeout=10, entering recv
2014/01/15 22:54:33 Discovered Windows Azure endpoint: 100.92.38.150
2014/01/15 22:54:33 WARNING:Newer wire protocol version detected. Please consider updating waagent.
2014/01/15 22:54:33 Negotiated wire protocol version: 2011-12-31
2014/01/15 22:54:33 Retrieved GoalState from Windows Azure Fabric.
2014/01/15 22:54:33 ExpectedState: Started
2014/01/15 22:54:33 ContainerId: 6a3ea5a9-6f7c-422f-a026-e6b5eb8f0fe5
2014/01/15 22:54:33 RoleInstanceId: fd9cfa887ac24eeab7efc8e5b48d799a.utl-0115-utl-0115-s1-A1
2014/01/15 22:54:33 Public c...
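
To compare the RoleInstanceId values in a log like the one above without reading every line, a small Python sketch such as the following could be used. The function name role_instance_ids is hypothetical; the only assumption about the log format is the "RoleInstanceId: <value>" text visible in the excerpt.

```python
import re

# Matches the "RoleInstanceId: <value>" part of a waagent.log line.
_ROLE_ID = re.compile(r"RoleInstanceId:\s*(\S+)")


def role_instance_ids(log_text):
    """Return the distinct RoleInstanceId values, in order of first appearance."""
    seen = []
    for match in _ROLE_ID.finditer(log_text):
        value = match.group(1)
        if value not in seen:
            seen.append(value)
    return seen
```

For the excerpt above this yields two values: the one beginning ba5ce29e... from the original instance, and the one beginning fd9cfa88... from the instance launched from the capture.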

Here is a diff between the SharedConfig.xml, with cloud-init.log as an attachment

--- waagent.pre/SharedConfig.xml 2014-01-15 22:46:09.619537000 +0000
+++ waagent/SharedConfig.xml 2014-01-15 22:54:33.606734299 +0000
@@ -1,32 +1,32 @@
 <?xml version="1.0" encoding="utf-8"?>
 <SharedConfig version="1.0.0.0" goalStateIncarnation="1">
- <Deployment name="ba5ce29e7b4d435f8d70ed5b7c471875" guid="{71d37c98-32e5-4627-a9c7-7498c603ed50}" incarnation="0">
- <Service name="utl-0115-s1" guid="{00000000-0000-0000-0000-000000000000}" />
- <ServiceInstance name="ba5ce29e7b4d435f8d70ed5b7c471875.0" guid="{c63183ea-1284-426b-a288-6af3f7397a36}" />
+ <Deployment name="fd9cfa887ac24eeab7efc8e5b48d799a" guid="{5fcccddd-25c5-47c1-9ed0-2c37ab5a8612}" incarnation="0">
+ <Service name="utl-0115-utl-0115-s1-A1" guid="{00000000-0000-0000-0000-000000000000}" />
+ <ServiceInstance name="fd9cfa887ac24eeab7efc8e5b48d799a.0" guid="{a610f152-aa8f-492e-82bc-053384ea88dd}" />
   </Deployment>
- <Incarnation number="1" instance="utl-0115-s1" guid="{07230d59-14a8-446a-9e2c-893250c47407}" />
- <Role guid="{692f906f-115d-e08d-2178-90db611dc71b}" name="utl-0115-s1" settleTimeSeconds="0" />
+ <Incarnation number="1" instance="utl-0115-utl-0115-s1-A1" guid="{7ef9d281-0865-459c-a1aa-4acf1f10123a}" />
+ <Role guid="{04c0d2f4-1503-977c-d3f7-408268609c71}" name="utl-0115-utl-0115-s1-A1" settleTimeSeconds="0" />
   <LoadBalancerSettings timeoutSeconds="0" waitLoadBalancerProbeCount="8">
     <Probes>
       <Probe name="D41D8CD98F00B204E9800998ECF8427E" />
- <Probe name="AEEC72A7AAC0423C26FCB170FD1035DA" />
+ <Probe name="B295630DB75FB8FCBC300D7E1500940F" />
     </Probes>
   </LoadBalancerSettings>
   <OutputEndpoints>
- <Endpoint name="utl-0115-s1:openInternalEndpoint" type="SFS">
- <Target instance="utl-0115-s1" endpoint="openInternalEndpoint" />
+ <Endpoint name="utl-0115-utl-0115-s1-A1:openInternalEndpoint" type="SFS">
+ <Target instance="utl-0115-utl-0115-s1-A1" endpoint="openInternalEndpoint" />
     </Endpoint>
   </OutputEndpoints>
   <Instances>
- <Instance id="utl-0115-s1" address="100.92.36.21">
+ <Instance id="utl-0115-utl-0115-s1-A1" address="100.92.38.42">
       <FaultDomains randomId="0" updateId="0" updateCount="0" />
       <InputEndpoints>
- <Endpoint name="openInternalEndpoint" address="100.92.36.21" protocol="any" isPublic="false" enableDirectServerReturn="false" isDirectAddress="false" disableStealthMode="false">
+ <Endpoint name="openInternalEndpoint" address="100.92.38.42" protocol="any" isPublic="false" enableDirectServerReturn="false" isDirectAddress="false" disableStealthMode="false">
           <LocalPorts>
             <LocalPortSelfManaged />
           </LocalPorts>
         </Endpoint>
- <Endpoint name="ssh" address="100.92.36.21:22" protocol="tcp" hostName="utl-0115-s1ContractContract" isPublic="true" loadBalancedPublicAddress="191.235.134.61:22" enableDirectServerReturn="false" isDirectAddress="false" disableStealthMode="false">
+ <Endpoint name="ssh" address="100.92.38.42:22" protocol="tcp" hostName="utl-0115-utl-0115-s1-A1ContractContract" isPublic="true" loadBalancedPublic...

Cloud-init log for above comment

Okay, I figured it out....we have a race condition.

What is happening is that the Windows Azure datasource starts WALinuxAgent. WALinuxAgent normally goes off and fetches a bunch of files, and the datasource normally waits for those files to show up. In the case of a capture, the files from the original instance remain on disk unless the user removes them. The end result is that by the time WALinuxAgent fetches the new files, cloud-init has already consumed the previous ones.

Submitted an MP against cloud-init; the MP is pending. I intend to SRU this for 12.04 and 13.10. Since 12.10 does not use cloud-init provisioning, there is no reason to SRU against 12.10.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.5~bzr950-0ubuntu1

---------------
cloud-init (0.7.5~bzr950-0ubuntu1) trusty; urgency=medium

  * New upstream snapshot.
    * support for vendor-data in NoCloud
    * fix in is_ipv4 to accept IP addresses with a '0' in them.
    * Azure: fix issue when stale data in /var/lib/waagent (LP: #1269626)
    * skip config_modules that declare themselves only verified on a set of
      distros. Add them to 'unverified_modules' list to run anyway.
    * Add CloudSigma datasource [Kiril Vladimiroff]
    * Add initial support for Gentoo and Arch distributions [Nate House]
    * Add GCE datasource [Vaidas Jablonskis]
    * Add native Openstack datasource which reads openstack metadata
      rather than relying on EC2 data in openstack metadata service.
 -- Scott Moser <email address hidden> Fri, 14 Feb 2014 14:39:56 -0500

Changed in cloud-init (Ubuntu):
status: In Progress → Fix Released
summary: - cloud-init Azure Datasource does not detect new instance
+ [SRU] cloud-init Azure Datasource does not detect new instance
Changed in walinuxagent (Ubuntu Trusty):
status: New → Fix Released
description: updated

Hello Ben, or anyone else affected,

Accepted cloud-init into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.6.3-0ubuntu1.12 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Precise):
status: New → Fix Committed
tags: added: verification-needed
Changed in cloud-init (Ubuntu Saucy):
status: New → Fix Committed
Dave Walker (davewalker) wrote :

Hello Ben, or anyone else affected,

Accepted cloud-init into saucy-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.7.3-0ubuntu2.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Performed thorough validation testing on multiple instances. Marking as verification done.

tags: added: verification-done
removed: trusty verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.6.3-0ubuntu1.12

---------------
cloud-init (0.6.3-0ubuntu1.12) precise-proposed; urgency=low

  * debian/patches/lp-1269626-azure_new_instance.patch: fix handling of new
    instances on Windows Azure. Backport of fix from 14.04 (LP: #1269626).
  * debian/patches/lp-1292648-azure-format-ephemeral-new.patch: Azure,
    re-format ephemeral disk if necessary (LP: #1292648).
 -- Ben Howard <email address hidden> Tue, 18 Mar 2014 10:58:12 -0600

Changed in cloud-init (Ubuntu Precise):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.3-0ubuntu2.2

---------------
cloud-init (0.7.3-0ubuntu2.2) saucy-proposed; urgency=low

  * debian/patches/lp-1269626-azure_new_instance.patch:
    fix handling of new instances on Windows Azure (LP: #1269626).
  * debian/patches/lp-1292648-azure-format-ephemeral-new.patch:
    re-format ephemeral disk if necessary (LP: #1292648).
 -- Ben Howard <email address hidden> Wed, 19 Mar 2014 16:31:51 -0600

Changed in cloud-init (Ubuntu Saucy):
status: Fix Committed → Fix Released
Rolf Leggewie (r0lf) wrote :

saucy has seen the end of its life and is no longer receiving any updates. Marking the saucy task for this ticket as "Won't Fix".

Changed in walinuxagent (Ubuntu Saucy):
status: New → Won't Fix