race condition / transient failure to provision

Bug #1195524 reported by Scott Moser on 2013-06-28
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
walinuxagent (Ubuntu)
High
Ben Howard
Precise
Medium
Ben Howard
Raring
Medium
Ben Howard

Bug Description

[Impact]:
WALinuxAgent currently handles the provisioning of Ubuntu. This bug is fixed by Bug #1037723, which causes provisioning to be handled by Cloud-init.

[Regression] : By moving provisioning functions to cloud-init, the regression potential is low. Further, since this handled by a configuration change in WALinuxAgent, current users will not be affected.

[Test Case]: Boot on Windows Azure and make sure that the system provisioned.

[ORIGINAL REPORT]

when starting instances on azure, there is a fair chance (seems more likely for some users thanothers) that the instance will fail to reach provisioned state.

I do not have a a good guess as to why this is.

ProblemType: Bug
DistroRelease: Ubuntu 13.04
Package: walinuxagent 1.3.2-0ubuntu1
ProcVersionSignature: Ubuntu 3.8.0-25.37-generic 3.8.13
Uname: Linux 3.8.0-25-generic x86_64
ApportVersion: 2.9.2-0ubuntu8.1
Architecture: amd64
Date: Fri Jun 28 01:29:25 2013
MarkForUpload: True
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: walinuxagent
UpgradeStatus: No upgrade log present (probably fresh install)

Scott Moser (smoser) wrote :
Scott Moser (smoser) wrote :

as I'm trying to debug this instance that failed for me, and then came up after a 'vm restart', there is no indication that / was ever mounted RW the first time. Ie, there is no evidence in /var/log of *anything* having run.
cloud-init starts on mounted MOUNTPOINT=/
and logs pretty much immediately to /var/log/cloud-init.log
but there is no timestamps other than this boot.

Scott Moser (smoser) wrote :

given my previous comment, i would have suspected that there was disk or kernel failure.
however, ssh-keyscan seemed to indicate ssh was running:
$ ssh-keyscan us-west-1.cloudapp.net
# us-west-1.cloudapp.net SSH-2.0-OpenSSH_6.1p1 Debian-4
Connection closed by 137.135.115.232

You can get kdump working in Azure (not that you really need it for this)
with the attached patch.

Ref: http://support.microsoft.com/kb/2858695

On Thu, Jun 27, 2013 at 8:01 PM, Scott Moser <email address hidden> wrote:

> given my previous comment, i would have suspected that there was disk or
> kernel failure.
> however, ssh-keyscan seemed to indicate ssh was running:
> $ ssh-keyscan us-west-1.cloudapp.net
> # us-west-1.cloudapp.net SSH-2.0-OpenSSH_6.1p1 Debian-4
> Connection closed by 137.135.115.232
>
> --
> You received this bug notification because you are a member of Canonical
> Microsoft Azure Collaboration, which is subscribed to walinuxagent in
> Ubuntu.
> Matching subscriptions: walinxuagnet bugs
> https://bugs.launchpad.net/bugs/1195524
>
> Title:
> race condition / transient failure to provision
>
> To manage notifications about this bug go to:
>
> https://bugs.launchpad.net/ubuntu/+source/walinuxagent/+bug/1195524/+subscriptions
>

--
-dave

Scott Moser (smoser) wrote :
Download full text (5.6 KiB)

ok. so i have 2 other systems that are showing this failure now.
I was able to ssh into them, though.
walinux-agent had provisioned the user, populated ssh keys and then also started sshd (which it actually should not do).

it shouldn't start sshd because it is possibly doing that before sshd has the required facilities up (sshd starts on 'filesystem or runlevel [2345]).

that wouldn't seem to be the problem here, and actually has allowed us into the instance to debug.

$ ls -tr --full-time /var/log/upstart/*.log
-rw-r----- 1 root root 46 2013-06-28 15:07:55.843772000 +0000 /var/log/upstart/container-detect.log
-rw-r----- 1 root root 95 2013-06-28 15:07:56.095772000 +0000 /var/log/upstart/console-setup.log
-rw-r----- 1 root root 282 2013-06-28 15:07:56.183772000 +0000 /var/log/upstart/procps-virtual-filesystems.log
-rw-r----- 1 root root 118 2013-06-28 15:07:56.311772000 +0000 /var/log/upstart/module-init-tools.log
-rw-r----- 1 root root 282 2013-06-28 15:07:58.310376600 +0000 /var/log/upstart/procps-static-network-up.log
-rw-r----- 1 root root 110 2013-06-28 15:08:02.993943800 +0000 /var/log/upstart/udev-fallback-graphics.log
-rw-r----- 1 root root 158 2013-06-28 15:08:09.876561300 +0000 /var/log/upstart/ureadahead-other.log
-rw-r----- 1 root root 64 2013-06-28 15:09:30.346411301 +0000 /var/log/upstart/rsyslog.log
-rw-r----- 1 root root 64 2013-06-28 15:09:30.370411301 +0000 /var/log/upstart/dbus.log

$ cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=335336k,nr_inodes=83834,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,relatime,size=137672k,mode=755 0 0
/dev/disk/by-uuid/65a0705a-7afe-482f-917d-c59e75cf0c52 / ext4 rw,relatime,user_xattr,barrier=1,data=ordered,discard 0 0
none /sys/fs/fuse/connections fusectl rw,relatime 0 0
none /sys/kernel/debug debugfs rw,relatime 0 0
none /sys/kernel/security securityfs rw,relatime 0 0
none /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
none /run/shm tmpfs rw,nosuid,nodev,relatime 0 0
/dev/sdb1 /mnt/resource ext4 rw,relatime,user_xattr,barrier=1,data=ordered 0 0

$ cat /etc/fstab
UUID=65a0705a-7afe-482f-917d-c59e75cf0c52 / ext4 defaults,discard 0 0

mountall is not running.
$ sudo status mountall
mountall stop/waiting

$ ls -altr /var/run/landscape
ls: cannot access /var/run/landscape: No such file or directory

$ runlevel
N 2

$ ps axw
..
root 389 1 0 15:07 ? 00:00:00 upstart-udev-bridge --daemon
root 391 1 0 15:07 ? 00:00:00 /sbin/udevd --daemon
root 508 1 0 15:07 ? 00:00:00 /usr/bin/python /usr/sbin/waagent -daemo
root 574 391 0 15:07 ? 00:00:00 /sbin/udevd --daemon
root 577 391 0 15:07 ? 00:00:00 /sbin/udevd --daemon
root 598 2 0 15:07 ? 00:00:00 [kpsmoused]
root 633 1 0 15:07 ? 00:00:00 upstart-socket-bridge --daemon
root 906 2 0 15:08 ? 00:00:00 [jbd2/sdb1-8]
root 907 2 0 15:08 ? 00:00:00 [ext4-dio...

Read more...

Scott Moser (smoser) wrote :
David Medberry (med) wrote :

I noticed while doing the kdump debugging that you can ssh into a
cloud-image instance very early (well before the console allows login) on a
kvm based cloud-image instance. I was surprised that service was available
before console login was available. Not sure it is related to what your are
seeing on azure though.

(During kdump there is about a 40 second delay while the dump is
post-processed before console login is available--but ssh login is already
available and you can 'watch' the kdump post processing.) This means (at
least to me) that ssh is "on" fairly early.

On Fri, Jun 28, 2013 at 10:06 AM, Scott Moser <email address hidden> wrote:

> ** Attachment added: "waagent.log and waagent dir."
>
> https://bugs.launchpad.net/ubuntu/+source/walinuxagent/+bug/1195524/+attachment/3717115/+files/waagent-info.tar
>
> --
> You received this bug notification because you are a member of Canonical
> Microsoft Azure Collaboration, which is subscribed to walinuxagent in
> Ubuntu.
> Matching subscriptions: walinxuagnet bugs
> https://bugs.launchpad.net/bugs/1195524
>
> Title:
> race condition / transient failure to provision
>
> To manage notifications about this bug go to:
>
> https://bugs.launchpad.net/ubuntu/+source/walinuxagent/+bug/1195524/+subscriptions
>

--
-dave

Scott Moser (smoser) wrote :

ok... i think ihave a reasonable description of what caused this specific hang.
The last entry in this /var/log/waagent.log was:
2013/06/28 14:50:12 Provisioning image using OVF settings in the DVD.
2013/06/28 14:50:12 Resource disk (/dev/sdb1) is mounted at /mnt/resource with fstype ext4
2013/06/28 14:50:14 Created user account: test
2013/06/28 14:50:15 EnvMonitor: Detected host name change: ubuntu -> gwaclhost24lm4crqqc7kqfl9hya2ieuvxz9fvdqpn6ipfgdi4gh1p7snuwjezcw
2013/06/28 14:50:15 Setting host name: gwaclhost24lm4crqqc7kqfl9hya2ieuvxz9fvdqpn6ipfgdi4gh1p7snuwjezcw

In a *good* run, the entry after Created user account should be somethin glike:

2013/06/28 16:38:40 Provisioning image using OVF settings in the DVD.
2013/06/28 16:38:40 Disabled SSH password-based authentication methods.
2013/06/28 16:38:41 Created user account: smoser
2013/06/28 16:38:42 EnvMonitor: Detected host name change: ubuntu -> smoser0628p
2013/06/28 16:38:42 Setting host name: smoser0628p
2013/06/28 16:38:55 ERROR:CalledProcessError. Error Code: 1
2013/06/28 16:38:55 ERROR:CalledProcessError. Command string: "service ssh status | grep running"
2013/06/28 16:38:55 ERROR:CalledProcessError. Command result: ""
2013/06/28 16:38:55 Posted Role Properties. CertificateThumbprint=075533b9075b4130651b1f74c451ca9b
2013/06/28 16:38:55 Root password deleted.

Note, the ERROR information is coming from a differen thread in the process. you can just ignore it.
The key is that it should be Posting Role Properties, but this instnace didn't get there.

an 'strace' of the pid, shows:
$ sudo strace -p 498
Process 498 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 321648}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}

the other key bit of info here is that 'Setting host name' comes from 'UpdateAndPublishHostName', which , amoung other questionable operations, does:

    for ethernetInterface in PossibleEthernetInterfaces:
        Run("ifdown " + ethernetInterface + " && ifup " + ethernetInterface,chk_err=False) # We supress error logging on error.
    self.RestoreRoutes()

So what I think happened here, is that the attempt to ReportRoleProperties ("Posting Role Properties") ends up blocking/hanging in httplib.HTTPConnection(self.Endpoint), where no 'timeout' is specified.

One good thing is that this can be disabled via:
 Provisioning.MonitorHostName=n

Changed in walinuxagent (Ubuntu):
importance: Undecided → High
status: New → Confirmed
Long Li (longli) wrote :

Can you also share the kernel logs, and all the files in /var/lib/waagent?

In the waagent.log, the new host name seems uncommon:

2013/06/28 15:08:12 EnvMonitor: Detected host name change: ubuntu -> gwaclhostblkhljy4re3yp9swkdwp63kswkss9bqhn0zm3f3gunipzu5vwdr8qzw
2013/06/28 15:08:12 Setting host name: gwaclhostblkhljy4re3yp9swkdwp63kswkss9bqhn0zm3f3gunipzu5vwdr8qzw

Is it expected?

Stephen A. Zarkos (stevez) wrote :

Aside from the kernel logs, please also include /var/log/syslog and any other relevant logs. Long, I believe the contents of /var/lib/waagent were posted here:
https://bugs.launchpad.net/ubuntu/+source/walinuxagent/+bug/1195524/+attachment/3717115/+files/waagent-info.tar

There is retry logic in httplib, and there would be logs if we encountered issues/timeouts from ReportRoleProperties(), so I'm not sure if the VM even got that far. I see we also have a defunct 'sh' process, possibly something suffered earlier when calling ifup.

Raphaël Badin (rvb) wrote :

> 2013/06/28 15:08:12 EnvMonitor: Detected host name change: ubuntu -> gwaclhostblkhljy4re3yp9swkdwp63kswkss9bqhn0zm3f3gunipzu5vwdr8qzw
> 2013/06/28 15:08:12 Setting host name: gwaclhostblkhljy4re3yp9swkdwp63kswkss9bqhn0zm3f3gunipzu5vwdr8qzw
> Is it expected?

Yes, this machine was created using (a slightly modified version of) this script: http://bazaar.launchpad.net/~gwacl-hackers/gwacl/trunk/view/head:/example/management/run.go

The name is randomly generated. It is a bit long (64 characters) but I guess this cannot be the cause of our problem.

Raphaël Badin (rvb) wrote :

The doc (http://msdn.microsoft.com/en-us/library/windowsazure/jj157194.aspx) explicitly says the hostname can get 64 characters long: "HostName: Required. Specifies the host name for the VM. Host names are ASCII character strings 1 to 64 characters in length. Used with the LinuxProvisioningConfigurationSet configuration set."

Stephen A. Zarkos (stevez) wrote :

Hello!

Thank you for confirming that. I think the hostname just looked odd so just wanted to confirm as we don't have access to this environment.

If the instance is still available we really need more logs to figure out what went wrong. This will help determine if there's something wrong on the VM (disk or network issue) or a timing issue that we may be able work around.

We know the timeout set on the socket by httplib is the system default timeout for sockets, so if the following http request from ReportRoleProperties() was 'blocked' it would throw an exception on a timeout or other IO error. However, clearly the EnvMonitor was not able to complete in this case. Most likely the ifup call could not complete for some reason. If this is not a reliable operation then perhaps disabling the HostMonitor a good solution, of course the VMs other services may not be aware of a hostname change.

Raphaël Badin (rvb) wrote :

I did some testing and it turns out the hostname size seems to be the problem! I started a bunch of machines with a hostname of 64 characters and a bunch of machines with a hostname of 25 characters (with the script I linked above): all the machines with a hostname of 25 ended up in the state "Running" and all the others were stuck "Commissioning". Just to be sure, I had someone else confirm that behavior.

I did some more testing and I can now say that Azure is ok with hostnames up to 55 characters. A machine with a hostname of size 56 or more won't get out of the commissioning state.

Raphaël Badin (rvb) wrote :

s/Commissioning/Provisioning/ sorry about that.

Scott Moser (smoser) wrote :

I'm fairly certain there is still a race condition that I described comment 8.
Please raise the hostname issue in another bug.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package walinuxagent - 1.3.2-0ubuntu4

---------------
walinuxagent (1.3.2-0ubuntu4) saucy; urgency=low

  * debian/patches/shadow_permissions.patch: apply the appropriate
    permissions to /etc/shadow (LP: #1188820).
  * debian/patches/verbose_logging.patch: use the appropriate log
    faculty when using verbose logging (LP: #1193404).
  * Mark bugs fixed in 1.3.2-0ubuntu3:
    debian/patches/config_for_cloud-init.patch:
    - fix for race condition between cloud-init and waagent (LP: #1195524)
    - mount resource disk on /mnt (LP: #1193380)
    - move walinuxagent init functionality to cloud-init (LP: #1037723)
 -- Ben Howard <email address hidden> Tue, 23 Jul 2013 09:43:40 -0600

Changed in walinuxagent (Ubuntu):
status: Confirmed → Fix Released
Ben Howard (utlemming) on 2013-08-07
description: updated

Hello Scott, or anyone else affected,

Accepted walinuxagent into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/walinuxagent/1.3.2-0ubuntu4~12.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Adam Conrad (adconrad) wrote :

Hello Scott, or anyone else affected,

Accepted walinuxagent into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/walinuxagent/1.3.2-0ubuntu4~12.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Adam Conrad (adconrad) wrote :

Hello Scott, or anyone else affected,

Accepted walinuxagent into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/walinuxagent/1.3.2-0ubuntu4~12.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Ben Howard (utlemming) wrote :

Makring verification-done as part of SRU testing.

tags: added: verification-done
Stéphane Graber (stgraber) wrote :

Hello Scott, or anyone else affected,

Accepted walinuxagent into raring-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/walinuxagent/1.3.2-0ubuntu2~13.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in walinuxagent (Ubuntu Raring):
status: New → Fix Committed
Ben Howard (utlemming) on 2013-10-23
Changed in walinuxagent (Ubuntu Precise):
status: New → Fix Committed
importance: Undecided → Medium
Changed in walinuxagent (Ubuntu Raring):
importance: Undecided → Medium
Changed in walinuxagent (Ubuntu):
assignee: nobody → Ben Howard (utlemming)
Changed in walinuxagent (Ubuntu Precise):
assignee: nobody → Ben Howard (utlemming)
Changed in walinuxagent (Ubuntu Raring):
assignee: nobody → Ben Howard (utlemming)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package walinuxagent - 1.3.2-0ubuntu2~13.04.1

---------------
walinuxagent (1.3.2-0ubuntu2~13.04.1) raring-proposed; urgency=low

  * Backport of 1.3.2-0ubuntu5 from 13.10
    * disable ephemeral disk formating by default (LP: #1231490)
    * debian/patches/shadow_permissions.patch: apply the appropriate
      permissions to /etc/shadow (LP: #1188820).
    * debian/patches/verbose_logging.patch: use the appropriate log
      faculty when using verbose logging (LP: #1193404).
    * Mark bugs fixed in 1.3.2-0ubuntu3:
      debian/patches/config_for_cloud-init.patch:
      - fix for race condition between cloud-init and waagent (LP: #1195524)
      - mount resource disk on /mnt (LP: #1193380)
      - move walinuxagent init functionality to cloud-init (LP: #1037723)
  * Add requirement of cloud-init (LP: #1037723).
 -- Ben Howard <email address hidden> Thu, 10 Oct 2013 09:24:46 -0600

Changed in walinuxagent (Ubuntu Raring):
status: Fix Committed → Fix Released
Ben Howard (utlemming) on 2013-11-21
Changed in walinuxagent (Ubuntu Precise):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers