raring instance failed to find EC2 datasource

Bug #1078926 reported by Scott Moser on 2012-11-14
26
This bug affects 4 people
Affects Status Importance Assigned to Milestone
mountall (Ubuntu)
High
Steve Langasek
Precise
Undecided
Unassigned
Quantal
High
Steve Langasek
Raring
High
Steve Langasek

Bug Description

[Impact]
The previous SRU, while fixing the problem it was intended to fix, partially reintroduced the problem from before 2.41 where some filesystem events would end up blocking on one another when they shouldn't. This is particularly noticeable for filesystems that have been mounted by the initramfs and there are jobs started on the 'mounted' event for one of these. In the particular case of cloud-init, the jobs that start on mounted MOUNTPOINT=/ block waiting for the network to come up, which needs the 'virtual-filesystems' event which isn't happening because the 'mounted MOUNTPOINT=/run' event hasn't finished processing - it's behind the 'mounted /' in the event queue. This intermittently causes cloud boot times in excess of 5 minutes.

[Test case]
1. Boot the current quantal daily cloud images.
2. Confirm that at least sometimes, the images take 5 minutes to boot.
3. Upgrade mountall to the quantal-proposed version.
4. Confirm that the images now boot without hitting the "wait for network" timeout.

[Regression potential]
Minimal, as this is correcting a regression from the previous SRU.

This seems sporadic failure at best. I had seen it on openstack fail similarly.
Today I launched 7 instances of us-east-1 t1.micro from each of :
 ami-be70f5d7 ebs/ubuntu-raring-daily-amd64-server-20121109
 ami-de27a2b7 ebs/ubuntu-raring-daily-amd64-server-20121110
 ami-f6c94d9f ebs/ubuntu-raring-daily-amd64-server-20121111
 ami-cc8307a5 ebs/ubuntu-raring-daily-amd64-server-20121112
 ami-5c43c735 ebs/ubuntu-raring-daily-amd64-server-20121113
and then another one of 20121109 and 20121113.

One of the 2 20121109 failed to boot, and I will attach its console log and cloud-init.log. The log was obtained from stopping the instance, attaching to another system, and getting it. Then I restarted the instance, and it booted fine.

Things that were of interest:
 * from cloud-init's perspective, the thing that failed was finding the ec2 metadata service.
 * In the failure case, there were no messages to the console log after initramfs, while
cloud-init's log shows it WARNing, which should go to console.

Related bugs:
 * mountall: bug 1059471 2.41 fails to mount root partition
 * plymouth: bug 1086072 some output to /dev/console does not reach /dev/console

ProblemType: Bug
DistroRelease: Ubuntu 13.04
Package: cloud-init 0.7.0-0ubuntu2
ProcVersionSignature: User Name 3.5.0-17.28-generic 3.5.5
Uname: Linux 3.5.0-17-generic x86_64
ApportVersion: 2.6.2-0ubuntu3
Architecture: amd64
Date: Wed Nov 14 21:30:05 2012
Ec2AMI: ami-be70f5d7
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1b
Ec2InstanceType: t1.micro
Ec2Kernel: aki-825ea7eb
Ec2Ramdisk: unavailable
MarkForUpload: True
PackageArchitecture: all
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: cloud-init
UpgradeStatus: No upgrade log present (probably fresh install)

Scott Moser (smoser) wrote :
Scott Moser (smoser) wrote :
Scott Moser (smoser) wrote :
affects: cloud-init (Ubuntu) → ubuntu
Changed in ubuntu:
status: New → Confirmed
importance: Undecided → High
Scott Moser (smoser) wrote :

saw this on t1.micro on
   us-east-1 ami-3f70f756 canonical ebs/ubuntu-raring-daily-amd64-server-20121119
same symptoms
  * no console output after initramfs
  * could not ssh in as None datasource is selected

Here, a reboot fixed the issue.

I'll attach /var/log/boot.log which had some interesting information. Notably, it seems to show that networking did not come up.
The stuff here generally *should* have gone to /dev/console.

Also, interesting was that networking after the reboot was not very reliable.
'apt-get install pastebinit' would work, but 'pastebinit /var/log/boot.log' did not work.
Also did not work:
  ssh $hostname cat /var/log/boot.log > /tmp/boot.log
  scp user@$hostname:/var/log/boot.log /tmp/boot.log

the ssh and scp connections would basically block.

Scott Moser (smoser) wrote :

Ok, more interesting information... in /var/log/syslog.
Note, 'failsafe' fired, *then* dhcp seemed to be happy to get an address. Strange. as if some ordering blocked ifup of eth0 until after failsafe.

Nov 19 14:46:06 ubuntu kernel: [ 6.503401] EXT4-fs (xvda1): re-mounted. Opts: (null)
Nov 19 14:46:06 ubuntu kernel: [ 269.934672] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Nov 19 14:46:06 ubuntu kernel: [ 270.455365] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
Nov 19 14:46:06 ubuntu kernel: [ 270.584212] microcode: CPU0 sig=0x1067a, pf=0x40, revision=0xa07
Nov 19 14:46:06 ubuntu kernel: [ 270.748120] microcode: Microcode Update Driver: v2.00 <email address hidden>, Peter Oruba
Nov 19 14:46:07 ubuntu kernel: [ 271.620667] type=1400 audit(1353336367.149:2): apparmor="STATUS" operation="profile_load" name="/sbin/dhclient" pid=545 comm="apparmor_parser"
Nov 19 14:46:07 ubuntu kernel: [ 271.620843] type=1400 audit(1353336367.149:3): apparmor="STATUS" operation="profile_load" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=545 comm="apparmor_parser"
Nov 19 14:46:07 ubuntu kernel: [ 271.620913] type=1400 audit(1353336367.149:4): apparmor="STATUS" operation="profile_load" name="/usr/lib/connman/scripts/dhclient-script" pid=545 comm="apparmor_parser"
Nov 19 14:46:07 ubuntu kernel: [ 271.641049] type=1400 audit(1353336367.169:5): apparmor="STATUS" operation="profile_replace" name="/sbin/dhclient" pid=559 comm="apparmor_parser"
Nov 19 14:46:07 ubuntu kernel: [ 271.641161] type=1400 audit(1353336367.169:6): apparmor="STATUS" operation="profile_replace" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=559 comm="apparmor_parser"
Nov 19 14:46:07 ubuntu kernel: [ 271.641234] type=1400 audit(1353336367.169:7): apparmor="STATUS" operation="profile_replace" name="/usr/lib/connman/scripts/dhclient-script" pid=559 comm="apparmor_parser"
Nov 19 14:46:08 ubuntu failsafe: Failsafe of 120 seconds reached.
Nov 19 14:46:08 ubuntu dhclient: Internet Systems Consortium DHCP Client 4.2.4
Nov 19 14:46:08 ubuntu dhclient: Copyright 2004-2012 Internet Systems Consortium.
Nov 19 14:46:08 ubuntu dhclient: All rights reserved.
Nov 19 14:46:08 ubuntu dhclient: For info, please visit https://www.isc.org/software/dhcp/
Nov 19 14:46:08 ubuntu dhclient:
Nov 19 14:46:08 ubuntu dhclient: Listening on LPF/eth0/12:31:3b:04:a5:65
Nov 19 14:46:08 ubuntu dhclient: Sending on LPF/eth0/12:31:3b:04:a5:65
Nov 19 14:46:08 ubuntu dhclient: Sending on Socket/fallback
Nov 19 14:46:08 ubuntu dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Nov 19 14:46:08 ubuntu dhclient: DHCPREQUEST of 10.212.170.147 on eth0 to 255.255.255.255 port 67
Nov 19 14:46:08 ubuntu dhclient: DHCPOFFER of 10.212.170.147 from 169.254.1.0
Nov 19 14:46:08 ubuntu dhclient: DHCPACK of 10.212.170.147 from 169.254.1.0
Nov 19 14:46:08 ubuntu ntpdate[721]: Can't find host ntp.ubuntu.com: Name or service not known (-2)
Nov 19 14:46:08 ubuntu ntpdate[721]: no servers can be used, exiting
Nov 19 14:46:08 ubuntu dhclient: bound to 10.212.170.147 -- renewal in 33565 seconds.

Stefan Bader (smb) wrote :

There seems to be a generic issue in v3.7 on how much tx traffic out of the PVM can be handled. I actually can see something similar with Xen 4.2 with dom0 running a Quantal kernel (I know it is a daring combination). But I think that points either to net in general or to the netfront driver. I would believe we would have noticed a generic net problem by now, so I would check on netfront first.

Stefan Bader (smb) wrote :

Found the following thread which starts off with something that affects the netback side but later on people have tx issues like reported here. There does not seem to be a solution for that, yet (http://markmail.org/message/kce5jybhfo2x5lxn).
From what I understand it is a problem where generic network code changed to create bigger fragments but the netfront driver can only handle page sized objects.

Stefan Bader (smb) wrote :

This may only be partial as the outcome of some discussion about addition error/safety handling is pending, but it seems to be ok enough as is to at least allow sensible networking. For the time being I have test kernel package at:

http://people.canonical.com/~smb/lp1078926

tags: added: patch
Robie Basak (racb) wrote :

Here's a failure I had that I think is the same issue: http://paste.ubuntu.com/1374464/. On a Canonical internal OpenStack deployment.

Scott Moser (smoser) wrote :

It seems to me that there are 3 issues at play here:
 a.) kernel: the device driver issue that stefan pointed to above.
 clearly a broken network device driver can result in failure to find a network resource

 b.) plymouth: many of our boot messages get "eaten".
Robie's paste there shows the issue as to many of mine. I'll also attach a good full console log. Note, that the "error: unexpectedly disconnected from boot status daemon" message occurs in both good and bad paths. In the bad path, we lose console output of the upstart jobs cloud-init and cloud-init-local. The 'cloud-init' job would normally output warnings to the console about inability to find the datasource. In some cases we see failed datasource access and still get this, but often we lose it.

 c.) networking bringup race
   It seems there must be something more to the issue, because 'b' shouldn't be fatal, and 'a' should not occur on kvm. However, as Robbie showed this also occurs on openstack nodes with kvm.

Scott Moser (smoser) wrote :

This seems to have been SRU'd to quantal under bug 1059471 see comment at https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/1059471/comments/14

description: updated
Scott Moser (smoser) wrote :

I've split the plymouth / loss-of-console-output bug into bug 1086072.

description: updated
Steve Langasek (vorlon) on 2012-12-03
affects: ubuntu → mountall (Ubuntu)
Steve Langasek (vorlon) wrote :

Seems the problem is that / and /run have both been mounted in the first pass, and mountall is stuck waiting (in try_mounts()) for the 'mounted MOUNTPOINT=/' event to return before it considers the 'mounted MOUNTPOINT=/run' event, which has already triggered. Since the former *depends* on the latter as a result of cloud-init and virtual-filesystems, we need to not block while waiting for it.

Changed in mountall (Ubuntu):
assignee: nobody → Steve Langasek (vorlon)
status: Confirmed → In Progress
Steve Langasek (vorlon) on 2012-12-03
Changed in mountall (Ubuntu Quantal):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Steve Langasek (vorlon)
Steve Langasek (vorlon) on 2012-12-03
description: updated
Scott Moser (smoser) wrote :

Using 2.45 + this patch I've tested ~ 20 instances and gotten no failures.
failure rates previously were ~ 50%.

Hello Scott, or anyone else affected,

Accepted mountall into quantal-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/mountall/2.42ubuntu0.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in mountall (Ubuntu Quantal):
status: In Progress → Fix Committed
tags: added: verification-needed
Adam Conrad (adconrad) wrote :

Hello Scott, or anyone else affected,

Accepted mountall into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/mountall/2.36.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in mountall (Ubuntu Precise):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mountall - 2.46

---------------
mountall (2.46) unstable; urgency=low

  * Fix a further remaining issue from 2.41: by fixing the missing 'mounted'
    events in 2.43, we have again introduced a case where mountall may block
    waiting for 'mounted MOUNTPOINT=/' to return before handling 'mounted
    MOUNTPOINT=/run' and emitting virtual-filesystems, thus causing
    dependency loops and long boot timeouts. LP: #1078926.

 -- Steve Langasek <email address hidden> Mon, 03 Dec 2012 23:44:16 +0000

Changed in mountall (Ubuntu Raring):
status: In Progress → Fix Released
Scott Moser (smoser) wrote :

The output at [1] compared to the output at [2] pretty much verifies this in raring as "fix released". Yellow actually means "passed".

[1] https://jenkins.qa.ubuntu.com/view/Raring/view/Smoke%20Testing/job/raring-server-ec2-daily/8/
[2] https://jenkins.qa.ubuntu.com/view/Raring/view/Smoke%20Testing/job/raring-server-ec2-daily/7/

Steve Langasek (vorlon) wrote :

In practice since this bug was only present in precise-proposed, this is now fixed for precise. Further verification of the SRU will be handled on the other existing bug reports.

Changed in mountall (Ubuntu Precise):
status: Fix Committed → Fix Released
Steve Langasek (vorlon) wrote :

Marking verified based on comment #18.

tags: added: verification-done
removed: verification-needed
Scott Moser (smoser) wrote :

I've verified mountall in:
 quantal: 2.42ubuntu0.2
 precise: 2.36.2
with respect to problem that originally opened this bug.

At this point I've launched 20+ instances of each and have not seen similar failure failure. In raring, failure rates were ~50% in the same openstack cluster.

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mountall - 2.42ubuntu0.2

---------------
mountall (2.42ubuntu0.2) quantal-proposed; urgency=low

  * Fix a further remaining issue from 2.41: by fixing the missing 'mounted'
    events in 2.42ubuntu0.1, we have again introduced a case where mountall
    may block waiting for 'mounted MOUNTPOINT=/' to return before handling
    'mounted MOUNTPOINT=/run' and emitting virtual-filesystems, thus causing
    dependency loops and long boot timeouts. LP: #1078926.
 -- Steve Langasek <email address hidden> Mon, 03 Dec 2012 23:44:16 +0000

Changed in mountall (Ubuntu Quantal):
status: Fix Committed → Fix Released
Margarita Manterola (marga-9) wrote :

Hi,

I'm encountering either this problem or a very similar one (mountall gets stuck, even if all the fs are already mounted). The plymouth bug makes it very hard to debug this. How did you end up debugging the problem?

Steve Langasek (vorlon) wrote :

There was a lot of visual inspection of the mountall --verbose output and upstart output, to examine the ordering / timing of the mounting and mounted events. I was heavily reliant on Scott's help to produce useful logs.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers