[SRU] CloudSigma DS causes hangs when serial console present

Bug #1316475 reported by Robert Collins on 2014-05-06
This bug affects 2 people
Affects              Importance  Assigned to
cloud-init           High        Unassigned
diskimage-builder    Critical    Adam Gandelman
tripleo              Critical    Adam Gandelman
cloud-init (Ubuntu)  Medium      Unassigned
  Trusty             High        Unassigned

Bug Description

SRU Justification

Impact: The CloudSigma datasource reads from and writes to /dev/ttyS1 if present, and it has no timeout. On non-CloudSigma clouds or systems with /dev/ttyS1, cloud-init will block pending a response that may never come. Further, it is dangerous for a default datasource to write blindly to a serial console, as other control-plane software and clouds use /dev/ttyS1 for communication.

Fix: The patch queries the BIOS to see if the instance is running on CloudSigma before querying /dev/ttyS1.

Verification: On both a CloudSigma instance and a non-CloudSigma instance with /dev/ttyS1:
1. Install new cloud-init
2. Purge existing cloud-init data (rm -rf /var/lib/cloud)
3. Run "cloud-init --debug init"
4. Confirm that the CloudSigma instance was provisioned and that the CloudSigma datasource was skipped on the non-CloudSigma instance

Regression: The risk is low, as this change further restricts where the CloudSigma datasource can run.

[Original Report]
DHCPDISCOVER on eth2 to 255.255.255.255 port 67 interval 3 (xid=0x7e777c23)
DHCPREQUEST of 10.22.157.186 on eth2 to 255.255.255.255 port 67 (xid=0x7e777c23)
DHCPOFFER of 10.22.157.186 from 10.22.157.149
DHCPACK of 10.22.157.186 from 10.22.157.149
bound to 10.22.157.186 -- renewal in 39589 seconds.
 * Starting Mount network filesystems [ OK ]
 * Starting configure network device [ OK ]
 * Stopping Mount network filesystems [ OK ]
 * Stopping DHCP any connected, but unconfigured network interfaces [ OK ]
 * Starting configure network device [ OK ]
 * Stopping DHCP any connected, but unconfigured network interfaces [ OK ]
 * Starting configure network device [ OK ]

And it stops there.

I see this on about 10% of deploys.

Changed in tripleo:
assignee: nobody → Gregory Haynes (greghaynes)
Robert Collins (lifeless) wrote :

+ HAS_LINK=0
+ '[' 0 == 1 ']'
+ sleep 1
+ TRIES=1
+ '[' 0 == 0 -a 1 -gt 0 ']'
++ get_if_link eth0
++ cat /sys/class/net/eth0/carrier
+ HAS_LINK=0
+ '[' 0 == 1 ']'
+ sleep 1
+ TRIES=0
+ '[' 0 == 0 -a 0 -gt 0 ']'
+ '[' 0 == 1 ']'
+ disable_interface eth0
+ local interface=eth0
+ serialize_me
+ '[' eni == eni ']'
+ FLOCKED=true
+ '[' -z true ']'
+ '[' eni == netscripts ']'
+ echo 'No link detected, skipping'
No link detected, skipping
 * Stopping DHCP any connected, but unconfigured network interfaces [ OK ]
 * Starting configure network device [ OK ]

Robert Collins (lifeless) wrote :

The job that things hang on:
description "configure network device"

emits net-device-up
emits net-device-down
emits static-network-up

start on net-device-added
stop on net-device-removed INTERFACE=$INTERFACE

instance $INTERFACE
export INTERFACE

pre-start script
    if [ "$INTERFACE" = lo ]; then
        # bring this up even if /etc/network/interfaces is broken
        ifconfig lo 127.0.0.1 up || true
        initctl emit -n net-device-up \
            IFACE=lo LOGICAL=lo ADDRFAM=inet METHOD=loopback || true
    fi
    mkdir -p /run/network
    exec ifup --allow auto $INTERFACE
end script

post-stop exec ifdown --force --allow auto $INTERFACE

Adam Gandelman (gandelman-a) wrote :

FWIW, getting a shell on an instance stuck in this state shows that cloud-init is still running:

root 894 0.0 0.0 15260 636 ? S May22 0:00 upstart-socket-bridge --daemon
root 1051 0.0 0.0 86100 20936 ? Ss May22 0:00 /usr/bin/python /usr/bin/cloud-init init
root 1060 0.0 0.0 4444 648 ? S May22 0:00 /bin/sh -c tee -a /var/log/cloud-init-output.log
root 1061 0.0 0.0 4348 584 ? S May22 0:00 tee -a /var/log/cloud-init-output.log
root 1263 0.0 0.0 10224 2408 ? Ss May22 0:00 dhclient -1 -v -pf /run/dhclient.eth2.pid -lf /var/lib/dhcp/dhclient.eth2.leases eth2
ntp 1395 0.0 0.0 31444 2012 ? Ss May22 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 107:112
root 1417 0.0 0.0 25108 1056 ? S May22 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 107:112

At this point /etc/network/interfaces has the correct entry (eth2 in this case) and it has dhcp'd its address.

Adam Gandelman (gandelman-a) wrote :

gdb'ing the stuck cloud-init process shows it stuck in a select() caused by http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/view/head:/cloudinit/cs_utils.py#L81. It looks like a new datasource (DataSourceCloudSigma) was added to cloud-init since saucy. It attempts to read/write from /dev/ttyS1, hangs and blocks boot. Killing the process gets boot going (albeit incomplete WRT cloud-init). As a workaround, updating the image and simply deleting usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceCloudSigma.py fixes the issue.

Gregory Haynes (greghaynes) wrote :

I was able to reproduce this in a VM reliably by simply adding a second serial device and booting a cloud image with no cloud-init datasources.

Epic find Adam!

Scott Moser (smoser) on 2014-05-23
Changed in cloud-init:
importance: Undecided → High
status: New → Confirmed
Ben Howard (utlemming) wrote :

The culprit here is that there is no timeout on the serial console read/write.

From cloudinit/cs_utils.py (lines 73-81):

    def __init__(self, request):
        self.request = request
        self.raw_result = self._execute()
        self.result = self._marshal(self.raw_result)

    def _execute(self):
        connection = serial.Serial(SERIAL_PORT)
        connection.write(self.request)
        return connection.readline().strip('\x04\n')

Further, since we are blocking on the serial port, I have to question whether or not this should be a default-enabled source. The other serial terminal DS is SmartOS, which is disabled by default. There are a lot of good reasons why people attach serial consoles, but assuming that it is safe for cloud-init to read/write to a serial console seems like a great way to break infrastructure or control planes.

IMHO, the fix should be twofold: 1) disable this DS by default; 2) enforce a reasonable timeout. I've attached a rough patch of what I am thinking here.

We should get CloudSigma to clarify what the timeout should be before we enforce the timeout.

That said, I think that an SRU that disables the DS is warranted.
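
For illustration only, here is a minimal sketch of the "enforce a reasonable timeout" idea. It is not the attached patch. It uses only the standard library (select on a file descriptor) so it can run without serial hardware; with pyserial, passing a timeout argument to serial.Serial would be the usual route. The function name query_with_timeout and the 2-second default are assumptions, since the right timeout value still has to come from CloudSigma.

```python
import os
import select
import time


def query_with_timeout(fd, request, timeout=2.0):
    """Write `request` to `fd` and wait at most `timeout` seconds for a line.

    Hypothetical helper: returns the reply with the trailing EOT/newline
    stripped, or None if no complete line arrives in time.
    """
    os.write(fd, request)
    deadline = time.monotonic() + timeout
    buf = b""
    while not buf.endswith(b"\n"):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None  # gave up: nobody answered in time
        ready, _, _ = select.select([fd], [], [], remaining)
        if not ready:
            return None  # select() timed out instead of blocking forever
        chunk = os.read(fd, 1024)
        if not chunk:
            return None  # peer closed the connection
        buf += chunk
    return buf.strip(b"\x04\n")
```

The point of the design is that a missing peer costs a bounded delay rather than a hung boot; the caller can treat None as "datasource not available" and move on.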

tags: added: patch
Robert Collins (lifeless) wrote :

+1 on disabling cloudsigma by default.

Ben Howard (utlemming) on 2014-05-27
summary: - trusty hang on first boot post deploy
+ [SRU] CloudSigma DS for causes hangs when serial console present
Ben Howard (utlemming) on 2014-05-27
description: updated
Scott Moser (smoser) wrote :

We're looking at this. The general rule in cloud-init should be "enabled by default if and only if there are no negative side effects". The one exception is the EC2 metadata service (it polls and has very annoying timeouts). However, it's generally configured to be last, so all others have failed by that point.

We'll see if there is some way we can determine that we're running on CloudSigma and if so, then block on ttyS1. If not, go on quickly.
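
A rough sketch of that "detect first, then touch the serial port" gate, for illustration. It is not the committed change. The helper names are hypothetical, and reading the sysfs product_name file is an assumption about where a 'cloudsigma' marker might appear in DMI data.

```python
DMI_PRODUCT_NAME = "/sys/class/dmi/id/product_name"  # assumed DMI field


def is_cloudsigma(dmi_text):
    """Return True if the DMI text mentions CloudSigma (case-insensitive)."""
    return "cloudsigma" in dmi_text.lower()


def read_dmi(path=DMI_PRODUCT_NAME):
    """Read a DMI value from sysfs; return '' if it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return ""


def should_query_serial(dmi_text=None):
    """Only allow the serial query when the platform looks like CloudSigma."""
    if dmi_text is None:
        dmi_text = read_dmi()
    return is_cloudsigma(dmi_text)
```

On any other platform should_query_serial() returns False immediately, so the datasource can be skipped without ever opening ttyS1.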

Changed in diskimage-builder:
status: New → In Progress
assignee: nobody → Adam Gandelman (gandelman-a)
Viktor Petersson (vpetersson) wrote :

@scott We're looking at this internally now and hope to shortly have a fix that addresses this by adding some unique variables, as suggested.

Steve Kowalik (stevenk) on 2014-05-30
Changed in diskimage-builder:
importance: Undecided → Critical
Scott Moser (smoser) on 2014-05-30
Changed in cloud-init (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Changed in tripleo:
assignee: Gregory Haynes (greghaynes) → Adam Gandelman (gandelman-a)
Adam Gandelman (gandelman-a) wrote :

Proposed DIB fix here: https://review.openstack.org/95598

Scott Moser (smoser) on 2014-06-03
Changed in cloud-init:
status: Confirmed → Fix Committed
Changed in cloud-init (Ubuntu Trusty):
importance: Undecided → High
status: New → Triaged
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.6~bzr976-0ubuntu1

---------------
cloud-init (0.7.6~bzr976-0ubuntu1) utopic; urgency=medium

  * debian/cloud-init.templates: fix choices so dpkg-reconfigure works as
    expected (LP: #1325746)
  * New upstream snapshot.
    * tests: SmartOS test not depend on /dev/ttyS1 device node (LP: #1316597)
    * poll ttyS1 only after check for 'cloudsigma' in dmidecode (LP: #1316475)
    * cloudsigma: support vendor-data (LP: #1303986)
 -- Scott Moser <email address hidden> Tue, 03 Jun 2014 16:41:07 -0400

Changed in cloud-init (Ubuntu):
status: Triaged → Fix Released

Reviewed: https://review.openstack.org/95598
Committed: https://git.openstack.org/cgit/openstack/diskimage-builder/commit/?id=f645287ec45ef49eaee9a04f5d18e2a9c7d928db
Submitter: Jenkins
Branch: master

commit f645287ec45ef49eaee9a04f5d18e2a9c7d928db
Author: Adam Gandelman <email address hidden>
Date: Mon May 26 14:35:57 2014 -0700

    Add new cloud-init-datasources element

    This moves cloud-init data source configuration to a general purpose
    cloud-init-datasources element that can be used to explicitly configure
    the list of cloud-init sources that will be queried on first boot.

    cloud-init-nocloud now depends on this new element to configure the
    datasource_list while continuing to prep the image for a nocloud first boot.

    Change-Id: Ibcc3b86d6ca567a23f89b7a1a36bc713e444ef68
    Closes-bug: #1316475

Changed in diskimage-builder:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/97634
Committed: https://git.openstack.org/cgit/openstack/diskimage-builder/commit/?id=f61c1acf81dc73aaa3ed80ff734dbe0a6817b284
Submitter: Jenkins
Branch: master

commit f61c1acf81dc73aaa3ed80ff734dbe0a6817b284
Author: Adam Gandelman <email address hidden>
Date: Tue Jun 3 14:54:22 2014 -0700

    Only use Ec2 cloud-init data source for Ubuntu

    Default to only having cloud-init query Ec2 on first boot for Ubuntu,
    until cloud-init has been SRU'd to fix the CloudSigma data source issue
    that causes Trusty boots to hang.

    Change-Id: Icb3734d5ae78f4a0a6c0fae1af4a2ce3c809308c
    Partial-bug: #1316475
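
For readers unfamiliar with the datasource_list setting these two commits configure, the following sketch renders the kind of one-line config fragment involved. It is not the element's actual code: the function name is hypothetical, and the target location (a file under /etc/cloud/cloud.cfg.d/) is an assumption about where such a fragment would be dropped in the image.

```python
def render_datasource_cfg(sources):
    """Render a cloud-init config fragment restricting the datasource list.

    Hypothetical helper; `sources` is a list of datasource names such as
    ["Ec2"]. An empty list is rejected because it would leave cloud-init
    with nothing to query.
    """
    if not sources:
        raise ValueError("at least one datasource is required")
    return "datasource_list: [ %s ]\n" % ", ".join(sources)


if __name__ == "__main__":
    # Restrict first-boot probing to Ec2 only, as in the second commit.
    print(render_datasource_cfg(["Ec2"]), end="")
```

With the list restricted this way, the CloudSigma datasource is never probed, which sidesteps the hang until the fixed cloud-init is in the image.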

Changed in diskimage-builder:
status: Fix Committed → Fix Released
Changed in tripleo:
status: Triaged → Invalid
Ben Howard (utlemming) wrote :

Proposing backported CloudSigma DS from 14.10 as fixing this issue for SRU.

description: updated

Hello Robert, or anyone else affected,

Accepted cloud-init into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.7.5-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Trusty):
status: Triaged → Fix Committed
tags: added: verification-needed
Adam Gandelman (gandelman-a) wrote :

Was able to test the proposed package and verify the issue is resolved. Test:

1)
- Boot a fresh trusty VM using libvirt
- Install cloud-init
- Reboot
SUCCESS

2)
- Shutdown
- Using libvirt, add a serial device / pty to the domain
- Boot same VM
FAIL (boot hangs)

- Shutdown
- Remove serial device
- Boot same VM, boot succeeds
- Install proposed cloud-init 0.7.5-0ubuntu1.1
- Repeat boot with and without serial port attached; boot succeeds in both cases.

Thanks for the fix.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.5-0ubuntu1.1

---------------
cloud-init (0.7.5-0ubuntu1.1) trusty-proposed; urgency=medium

  [ Ben Howard ]
  * debian/patches/lp1316475-1303986-cloudsigma.patch: Backport of
    CloudSigma Datasource from 14.10
    - [FFe] Support VendorData for CloudSigma (LP: #1303986).
    - Only query /dev/ttys1 when CloudSigma is detected (LP: #1316475).

  [ Scott Moser ]
  * debian/cloud-init.templates: fix choices so dpkg-reconfigure works as
    expected (LP: #1325746)
 -- Scott Moser <email address hidden> Fri, 20 Jun 2014 13:29:29 -0400

Changed in cloud-init (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Ben Howard (utlemming) on 2014-08-25
Changed in cloud-init:
status: Fix Committed → Fix Released
Scott Moser (smoser) on 2014-08-25
Changed in cloud-init:
status: Fix Released → Fix Committed
Scott Moser (smoser) wrote :

fixed in 0.7.6

Changed in cloud-init:
status: Fix Committed → Fix Released