Cloud images fail to boot when a serial port is not available

Bug #1573095 reported by zero on 2016-04-21
This bug affects 35 people
Affects                     Status         Importance   Assigned to            Milestone
cloud-images                Invalid        Undecided    Unassigned             none
cloud-init                  Invalid        Undecided    Unassigned             none
Ubuntu                      Invalid        Undecided    Unassigned             none
initramfs-tools (Ubuntu)    Fix Released   Medium       Guilherme G. Piccoli   none

Bug Description

I tried to launch an Ubuntu 16.04 cloud image under KVM.
The image does not boot; it hangs at

"Btrfs loaded"

Hypervisor env is Proxmox 4.1

[racb: see comment 40 for minimal steps to reproduce using Ubuntu-provided tooling only]

Related bugs:
 * bug 1016695: add console=tty1 to cloud-image kernel boot parameters
 * bug 1123220: cloud-image VM causes kernel panic if image is resized
 * bug 1061977: Machine fails to commission when console=ttyS0 is present on kernel opts
 * bug 1573095: Cloud images fail to boot when a serial port is not available
 * bug 1122245: booting from a cloud image hangs until virsh console is used

zero (x-rbuntu-z) on 2016-04-21
affects: livecd-rootfs (Ubuntu) → ubuntu
summary: - Cloud image hangs at first boot
+ 16.04 cloud image hangs at first boot
zero (x-rbuntu-z) on 2016-04-24
tags: added: xenial

Can confirm this bug, attached is a screenshot. The VM will hang and have a CPU load of 100%, but the boot will never continue.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu:
status: New → Confirmed
Kenneth Østrup (kennetho) wrote :

I am also seeing this issue, with the same results as screenshot submitted by Nick Douma.

Dan Watkins (oddbloke) wrote :

Hi zero, Kenneth, Nick,

Thanks for reporting and confirming this bug! Could one of you include a list of instructions to reliably reproduce this, please? That will make it much easier for someone investigating the bug to be sure that they are hitting the same issue that you are. :)

Thanks,

Dan

affects: ubuntu → cloud-images
Changed in cloud-images:
status: Confirmed → Incomplete
zero (x-rbuntu-z) wrote :

Hello,

Here are the steps I followed to reproduce the bug on Proxmox 4.1:

1. Download current cloud image (cow image version) at : https://uec-images.ubuntu.com/xenial/current/xenial-server-cloudimg-amd64-disk1.img
2. Put image inside image storage location on proxmox
3. Run qemu-img resize $image_path 20G (might be optional to reproduce the issue)
4. Launch VM with the following command :

pvesh create /nodes/$hostname/qemu -name $hostname -bootdisk virtio0 -vmid $vmid -memory 1024 -sockets 1 -cores 1 -net0 virtio,bridge=vmbr0 -virtio0=local:$vmid/$image_path

5. Start the created VM and display the console
6. Boot will hang at "Btrfs loaded"

Rodrigo Bahiense (rodbzro) wrote :

I'm also having this issue.

Tried with the .img and .vmdk distributions of "xenial-server-cloudimg-amd64-disk1".

Using VirtualBox 5.0.16r105871 on Windows 10 Pro x64 Build 10586

The boot freezes at the same point demonstrated in the #1 comment screenshot: https://bugs.launchpad.net/cloud-images/+bug/1573095/+attachment/4645921/+files/xenial-boot-freeze.png

John Petrini (john-d-petrini) wrote :

I'm experiencing this bug also. Running KVM on a 16.04 host. Hangs at Btrfs loaded.

John Petrini (john-d-petrini) wrote :

I should add that the cloud image does work in our OpenStack environment which is running KVM on 14.04 qemu-kvm version 1:2.5+dfsg-5ubuntu10. It does not work on 16.04 with qemu-kvm version 1:2.5+dfsg-5ubuntu10.

John Petrini (john-d-petrini) wrote :

Sorry copy paste mistake. OpenStack is running qemu-kvm version 2.0.0+dfsg-2ubuntu1.22.

zero (x-rbuntu-z) wrote :

Hello,

I tried again with the build 20160502 and have the same issue.

zero (x-rbuntu-z) wrote :

Hello,

Does anyone have an idea of what might be the root cause of this issue ?

I'm happy to help but don't really know where to look/investigate

Changed in cloud-images:
status: Incomplete → New
milestone: none → y-2016-06-02
Scott Moser (smoser) wrote :

I suspect the issue is related to cloud-init writing networking configuration data.
Could you please shut down the system and then mount it (mount-image-callback will mount easily enough) and copy out /var/log/cloud-init.log ?

The other possibility is related to bug 1577844 .

In both cases there should eventually be timeouts (maybe at the 5 minute mark) after which boot continues, but likely without networking.

I can confirm there is no timeout, it hangs forever (at least, I left it overnight).

Shutdown doesn't work either, I needed a hard stop. After mounting the image, there's no cloud-init.log.

Here is a workaround (or rather two) I am using, after converting the image to raw, to get it to work in Proxmox:

sudo kpartx -a xenial-server-cloudimg-amd64-disk1.raw
sudo mkdir -p /tmp/foo && sudo mount /dev/mapper/loop0p1 /tmp/foo

In the following two files, replace console=ttyS0 with net.ifnames=0:

/tmp/foo/boot/grub/grub.cfg
/tmp/foo/etc/default/grub

sudo umount /tmp/foo
sudo kpartx -d xenial-server-cloudimg-amd64-disk1.raw

Dan Watkins (oddbloke) wrote :

Julian, Fryderyk, or someone else who's affected,

If you aren't seeing a cloud-init.log on affected instances, could you instead tar up all of /var/log and put it somewhere we can examine?

Thanks,

Dan

I am experiencing same issue when booting on bare metal server with Ironic.

I am now seeing the same with the 12.04 images currently up
 20160607/ 07-Jun-2016 06:49 -
 20160610.1/ 11-Jun-2016 05:13 -
 20160610/ 10-Jun-2016 12:13 -

On Tuesday, 31 May 2016 16:18:49 AEST you wrote:
> Julian, Fryderyk, or someone else who's affected,
>
> If you aren't seeing a cloud-init.log on affected instances, could you
> instead tar up all of /var/log and put it somewhere we can examine?
>

The problem is that the disk image doesn't get flushed at any point, so
there's nothing in the logs at all - it's the original qcow. And because I
have to hard kill the VM, it will never flush.

root@proxmox15:/var/lib/vz/images/204# qemu-nbd --connect=/dev/nbd0 vm-disk-0.qcow2
root@proxmox15:/var/lib/vz/images/204# mount /dev/nbd0p1 /mnt/tmp
root@proxmox15:/var/lib/vz/images/204# ls /mnt/tmp/var/log
apt btmp dist-upgrade fsck landscape lastlog unattended-upgrades upstart wtmp

15.10 images seem to work, however.

I can confirm the workaround above, removing console=ttyS0 from the kernel parameters stops it from hanging.

Is a permanent resolution imminent on this? This bug renders the cloud image useless on various platforms.

Dan Watkins (oddbloke) wrote :

Hi Julian,

It's still not 100% clear to me what is actually causing the problem, and what workaround fixed it. Can you describe precisely what workaround you used to get a booting image?

Thanks,

Dan

Hi - I just loop mounted the image and removed the console=ttyS0 from the kernel args in the grub config, and it boots fine.
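
For anyone wanting to script that edit, here is a minimal sketch. It assumes the image's root filesystem is already mounted read-write (for example via kpartx or qemu-nbd as shown in earlier comments) and that the serial console appears as a plain console=ttyS0 token. The function name strip_serial_console is made up for illustration, and the list of files is simply the set mentioned in this bug.

```shell
#!/bin/sh
# Remove every "console=ttyS0" token (including a suffix such as ",115200n8")
# from the GRUB files under an already-mounted image root.
# Assumption: $1 is the mount point of the image's root filesystem.
strip_serial_console() {
    root=$1
    for f in "$root/boot/grub/grub.cfg" \
             "$root/etc/default/grub" \
             "$root/etc/default/grub.d/50-cloudimg-settings.cfg"; do
        # Skip files that this particular image does not have
        [ -f "$f" ] || continue
        tmp=$(mktemp) || return 1
        # Drop the token together with one leading space, if there is one;
        # write back with cat so the file keeps its owner and mode
        sed -e 's/ \{0,1\}console=ttyS0[^ "]*//g' "$f" >"$tmp" && cat "$tmp" >"$f"
        rm -f "$tmp"
    done
}
```

Usage would be something like `sudo sh -c '. ./strip.sh; strip_serial_console /tmp/foo'` before unmounting. Editing grub.cfg directly only lasts until the next update-grub, so the file under /etc/default/grub.d needs the same change.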

Dan Watkins (oddbloke) wrote :

Hi Julian,

Does enabling serial consoles in proxmox[0] fix the issue for you?

Dan

[0] https://pve.proxmox.com/wiki/Serial_Terminal

Raju (rajustha2000) on 2016-07-12
Changed in cloud-init:
status: New → Fix Released
Mark - Syminet (mark-syminet) wrote :

Most recent image as of today also hard-locked, ttyS0 fix described above worked.

MaxZhang (maxzhangx) wrote :

Hi,

I think the problem may be that the ttyS0 parameter is incomplete: the speed is not set.
Changing it from:
console=ttyS0
to:
console=ttyS0,115200n8

would fix it.

KingJ (kj-kingj) wrote :

I can confirm that I am affected by this, running on ESXi 6.5.

I took a slightly different approach to fixing it - adding a virtual serial port to the VM's hardware allowed it to boot successfully.

Sebastian (sebek-h) on 2017-01-24
no longer affects: tuxlab
Sebastian (sebek-h) wrote :

This problem affected my environment with MAAS and the 16.04/16.10/17.04 images.
On some servers we use the console on ttyS0, on others on ttyS1.
Removing console=ttyS1,115200n8 from Global Kernel Parameters in MAAS resolves the problem (partly).
The problem does not occur on 14.04.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu:
status: New → Confirmed
Mathieu Mitchell (mat128) wrote :

Any update on when the updated kernel parameters can make it to official cloud images?

Also worth noting, OpenStack image docs [1] also indicate ttyS0 at 115200n8.

1: https://docs.openstack.org/image-guide/openstack-images.html#ensure-image-writes-boot-log-to-console

Evan Felix (karcaw) wrote :

I am seeing this issue when booting 16.04 images under oVirt. If I add a serial console to the VM, it boots fine.

Evan Felix (karcaw) wrote :

I can also confirm that this issue happens in the cloud images for xenial, zesty, artful, and current bionic

Dan Watkins (oddbloke) on 2017-12-21
Changed in cloud-images:
milestone: y-2016-06-02 → none
Andrew Paxson (paxsonsa) wrote :

I am not sure if this is relevant to your inquiry, but I also found that I had to add an isa-serial device (in virt-manager that's "Serial PPTY") to the machine; it then went past that section.

Keenan Verbrugge (keenanv) wrote :

Same issue here. Using ubuntu 16.04

Adding a console for qemu/kvm was able to get me past this:

virsh edit vmname

add:

<console type='pty'>
  <target port='0'/>
</console>

ironstorm (ironstorm-gmail) wrote :

The same problem exists on VirtualBox using the Apr 2 nightly of bionic cloud image... :(

Workaround on Virtualbox is to add a disconnected serial port to allow booting to continue using the following:

VBoxManage modifyvm "${VM}" --uart1 0x3f8 4 --uartmode1 disconnected

This problem is also reported at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/356

If someone who has seen this has a workaround specifically for OpenStack, I'd appreciate hearing about it.

Jose Phillips (jose-phillips) wrote :

Hi Everyone

Just add a serial port and that will fix the issue.
Cloud images try to log the boot to serial port 1.

Robie Basak (racb) wrote :

Here are full steps to reproduce this issue using tooling from Ubuntu only:

uvt-simplestreams-libvirt sync release=bionic arch=amd64 label=release
uvt-kvm create --no-start lp1573095 release=bionic arch=amd64 label=release
virsh edit lp1573095 # delete <serial/> and <console/> blocks
virsh start lp1573095
uvt-kvm wait lp1573095

Expected behaviour: succeeds when the VM is available
Actual behaviour: hangs and eventually times out

Additionally you can examine the screen with virt-manager. On that screen, I
expect a login prompt. Instead I see nothing beyond the normal kernel messages
(nothing from userspace).

If you skip the serial/console definition deletion in the steps above, you'll
see that the VM works. In other words, the VM stops working if a serial port is
not available.
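
A quick way to check whether an existing libvirt guest is exposed to this is to look for serial or console devices in its domain XML. The following is only a sketch: the helper name has_serial_device is made up, and it does a plain text match on the XML rather than parsing it.

```shell
#!/bin/sh
# Succeeds if the libvirt domain XML read from stdin defines a <serial>
# or <console> device. This is a text match, not a real XML parse.
has_serial_device() {
    grep -Eq '<(serial|console)([[:space:]/>])'
}
```

For example: `virsh dumpxml lp1573095 | has_serial_device || echo "no serial port defined"`.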

Workaround: remove console=ttyS0 from GRUB_CMDLINE_LINUX_DEFAULT in
/etc/default/grub.d/50-cloudimg-settings.cfg, leaving only console=tty1, and
then run "sudo update-grub". However this must either be done on a system with
a serial port, or you have to jump through the appropriate hoops to get the
result of "update-grub" applied without having booted the system. Note
that editing /etc/default/grub is insufficient since
/etc/default/grub.d/50-cloudimg-settings.cfg overrides it (see bug 1812752).

summary: - 16.04 cloud image hangs at first boot
+ Cloud images fail to boot when a serial port is not available
Changed in ubuntu:
status: Confirmed → Invalid
Changed in cloud-images:
status: New → Confirmed
Changed in cloud-init:
status: Fix Released → Invalid
description: updated
description: updated
Jeremy Busk (busk) wrote :

While you can work around the issue with

```
sudo sed -i 's/ console=ttyS0//g' /etc/default/grub.d/50-cloudimg-settings.cfg
sudo update-grub
```

you still need ttyS0 in the GRUB configuration in order to interact with the VM guest using

```
virsh console <vm-name>
```

I filed a bug against VirtualBox, as it could be a compound issue or an issue with how they handle ttyS0 from the OS: https://www.virtualbox.org/ticket/18463

David (davidjaquier) wrote :

Have the same trouble when I try to deploy cloud images based templates in a cloudstack managed environment on top of esxi 6.5 (GTT VDC).

Is there a way to remove that without deploying a virtual machine? I tried to tar -x the ova, modify the vmdk via guestmount on ubuntu 18 or via fuse for osx, without success.

If someone can tell me an efficient and short way to remove this setting from the .ova, it could be really great.

Scott Moser (smoser) wrote :

I just added a bunch of other bugs that really are dups of this.
The goal of doing so is just to give whoever might be looking at making a change more context on the unfortunate complexity of doing so.

Related bugs:
 * bug 1016695: add console=tty1 to cloud-image kernel boot parameters
 * bug 1123220: cloud-image VM causes kernel panic if image is resized
 * bug 1061977: Machine fails to commission when console=ttyS0 is present on kernel opts
 * bug 1573095: Cloud images fail to boot when a serial port is not available
 * bug 1122245: booting from a cloud image hangs until virsh console is used

description: updated
Grant Emsley (grantemsley) wrote :

I ran into this bug trying to use cloud images on Hyper-V.

The workaround in #40 does work - add a serial console to the VM, and change /etc/default/grub.d/50-cloudimg-settings.cfg

If you still want to be able to use a serial console if available, but not require it to be able to boot, just change the line from 'GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"' to 'GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0 console=tty1"'

Then run update-grub.

It seems /dev/console takes on whichever console is listed LAST in the kernel options. If that's ttyS0 and there is no serial port connected, that breaks things. Swapping the order ensures /dev/console goes to tty1, and the boot process works with or without a serial port attached to the VM. If there is a serial port, the serial console will still work with this method.
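
To see which device a given command line would end up with under that rule, a small helper like the one below can be used. It is a sketch based purely on the "last console= wins" observation above; the function name effective_console is made up.

```shell
#!/bin/sh
# Print the device named by the LAST console= argument of a kernel command
# line, which is the one /dev/console ends up referring to according to the
# observation above.
effective_console() {
    last=
    # Unquoted on purpose: split the command line into words
    for arg in $1; do
        case $arg in
            console=*) last=${arg#console=} ;;
        esac
    done
    # Strip options such as ",115200n8"
    printf '%s\n' "${last%%,*}"
}
```

On a running system, `effective_console "$(cat /proc/cmdline)"` shows the current choice. With the stock cloud image setting it prints ttyS0; with the swapped order it prints tty1.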

Alejandro Torras (atec-post) wrote :

Related bug:
* bug 1829625: Vagrant box startup timeout due to no serial port

WGH (wgh) wrote :

I debugged this problem a bit. The problem stems from the initramfs attempting to use /dev/console (which refers to the nonexistent /dev/ttyS0), having its logging functions unexpectedly return errors, and breaking everything around them.

You may have already noticed that when this happens, 100% CPU time is consumed. If you enable sysrq keys with sysrq_always_enabled=1 and dump the task list (e.g. virsh send-key ubuntu18.04 KEY_LEFTALT KEY_SYSRQ KEY_T), you'll notice that there's always a combination of console_setup/loadkeys/setfont processes with ever-growing PIDs, which likely means that something is running them in a tight loop.

Now, if you patch "panic()" in /usr/share/initramfs-tools/scripts/functions so that it writes its arguments to the kernel log (echo "panic 1: " "$@" >/dev/kmsg), you'll see that the panic reason is "filesystem on /dev/vda1 requires manual fsck", and that it's printed in a loop. Indeed, checkfs() does contain a loop:

checkfs()
{
        while ! _checkfs_once "$@"; do
                panic "The $2 filesystem on $1 requires a manual fsck"
        done
}

This is actually a bogus error. The filesystem is (most likely) fine. There's no fsck included in initramfs, so what happens is that the following fragment is executed:

        if ! command -v fsck >/dev/null 2>&1; then
                log_warning_msg "fsck not present, so skipping $NAME file system"
                return
        fi

log_warning_msg, however, returns a non-zero status because stdout is broken, which causes _checkfs_once to return a non-zero status as well.

panic doesn't work correctly either: it simply can't spawn a shell on the broken /dev/console and exits immediately, and that's what causes the infinite loop.
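
The status propagation can be reproduced outside the initramfs. The sketch below uses made-up stand-ins (log_warning, check_once_buggy, check_once_fixed) rather than the real initramfs-tools functions, and it simulates the broken console by closing stdout, which is an assumption and not the exact failure mode of /dev/console.

```shell
#!/bin/sh
# Hypothetical stand-ins for the initramfs-tools helpers; the names and
# bodies are illustrative, not the real script.

log_warning() {
    # Returns non-zero when stdout cannot be written to
    printf 'Warning: %s\n' "$*"
}

check_once_buggy() {
    if ! command -v fsck_that_does_not_exist >/dev/null 2>&1; then
        log_warning "fsck not present, skipping"
        return    # bare return: passes on the status of log_warning
    fi
}

check_once_fixed() {
    if ! command -v fsck_that_does_not_exist >/dev/null 2>&1; then
        log_warning "fsck not present, skipping" || true
        return 0  # a failed log write no longer looks like a failed check
    fi
}

# Simulate a broken console by closing stdout for each call
buggy_status=0
check_once_buggy >&- 2>/dev/null || buggy_status=$?
fixed_status=0
check_once_fixed >&- 2>/dev/null || fixed_status=$?

echo "buggy: $buggy_status, fixed: $fixed_status"
```

The buggy variant reports failure even though nothing is wrong with the filesystem, which is exactly the condition that keeps the while loop in checkfs() spinning.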

What I think about the solution.

First, debugging this is a PITA. Adding a serial device might be a perfectly acceptable fix for many, but when this issue happens, absolutely nothing on the console hints that this is what's missing. Even if it's necessary to leave ttyS0 as the main console, initramfs should at least warn the user (through kmsg) that /dev/console is broken.

Second, errors returned by the logging functions causing _checkfs_once to return an error as well is a bug. I think errors in _log_msg should be suppressed. If you do that, then unless a panic happens (which is rare), the boot will succeed.

Third, as Grant Emsley said, maybe ttyS0 doesn't really have to be the main console?

Scott Moser (smoser) wrote :

The real fix here is kernel improvement (or bug fix if you want to consider current kernel behavior a bug). Anything else is just going to push around the failure.

That is what was determined in 2013, and it's probably still true now.
  https://bugs.launchpad.net/ubuntu/+source/cloud-initramfs-tools/+bug/1123220/comments/8

WGH (wgh) wrote :

Although I completely agree that the kernel could automatically choose a working /dev/console "backend", and that this would be the best fix, it won't be fixed soon. Right now users without a serial port get an inexplicable hang that is pretty hard to debug.

Having the initramfs init script report a broken /dev/console would help this situation tremendously, and the fix is very easy: just add

printf "$@" || echo "/dev/console appears broken"

to _log_msg, and users will at least know the source of the problem.

WGH (wgh) wrote :

I of course meant

printf "$@" || echo "/dev/console appears broken" >/dev/kmsg
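
Spelled out as a complete function, the idea looks roughly like the sketch below. The name log_msg and the KMSG variable are made up so the snippet can be tried on a normal system without write access to /dev/kmsg; the real _log_msg in initramfs-tools also honours the "quiet" option, which is left out here.

```shell
#!/bin/sh
# Hypothetical override so the sketch can be tried without /dev/kmsg access
KMSG=${KMSG:-/dev/kmsg}

log_msg() {
    # If the write to stdout fails, leave a hint in the kernel log instead.
    # The function's status is then that of the fallback echo.
    # shellcheck disable=SC2059
    printf "$@" || echo "initramfs: /dev/console appears broken" >"$KMSG"
}
```

This only reports a failure that the shell actually sees on that particular write, so it is a diagnostic aid rather than a fix.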

Scott Moser (smoser) wrote :

@wgh,
My experience is that it is unfortunately not that simple.
It may have worked for you.

Once it starts to fail, it will fail repeatedly.
But up until some point, writes to stdout work fine. I believe this is because there is a buffer, and it only begins failing once the buffer has filled and a flush is attempted.

I have a script that I had put into the initramfs in one of the other bugs that shows this. It's quite possible that the behavior has changed in 8 years, but back then you basically just had to write some amount of data to determine whether it would fail.

First, I'd like to thank Scott (for comment #43) and Alejandro (comment #45). It seems
there's a bunch of LP bugs orbiting around the same issue: Ubuntu isn't bootable
if we set an invalid serial console on the kernel command line (and have no "quiet"
option there).

In particular, I'd like to thank WGH for the great debugging work in comment #46; it saved
me a lot of time, and you're right, it's a pain to debug issues related
to the console, since it's not easy to output anything. I used the trick of echoing debug
messages to the right console, which helped me narrow things down. In the end, your
debugging exposed what seems to be the major problem here: because a return value
is carried over between functions (starting with printf returning 1 due to the bad console),
we end up looping in checkfs(), preventing the boot.

I respectfully disagree with Scott: although I think there are potential improvements
in the kernel on the console "front", we have here a userspace bug in the init scripts,
due to an error in printf if the console is not correctly set. I don't agree that we should
leave it alone and pursue a kernel-only solution, especially given how easy it is
to reproduce the issue and how hard it is to debug. Also, there
are long-standing bugs reporting similar issues, so it bothers a lot of people.

I cannot be 100% sure we don't have more issues than the checkfs() one found by
WGH, but this one is definitely an issue and an easy one to fix; I proposed a pretty
simple fix in the below test-only PPA:
https://launchpad.net/~gpiccoli/+archive/ubuntu/lp1573095

The more users test it, the more confidence we'll have that there are no more
initramfs bugs when the console is wrongly set. I agree with the idea of showing an
output message on kmsg if the serial console is broken; it's helpful. We can do that
as part of an improvement, maybe in the same "commit" as the fix.

More opinions are welcome on this matter, of course. If my solution either doesn't
resolve the issue for users or is not the optimal one, let's discuss alternatives
to fix this initramfs long-term flaw.

Thanks,

Guilherme

Sorry for the bad formatting of my last comment; I should have fixed the line breaks before submitting.
I'd like to point out another duplicate, reported by a colleague: LP #1879987.
I'll close that one to keep the effort in this single LP.

Cheers,

Guilherme

tags: added: id-5b49154499e416396a3e983c

We had reports of good results from a user using my PPA. Has anybody else been able to test it?
Cheers,

Guilherme

Scott Moser (smoser) wrote :

@Guilherme,

Simply returning non-error (0) in one function in the initramfs isn't going to solve the problem. Anything that is checking the return value of a write() to its stdout will fail.
That could be a shell 'echo', it could be a C write().

To take that path to completion, you'd have to have all programs ignore errors when writing to stdout, which might happen to be /dev/console.

Here's an example of something else (growpart) caring: https://bugs.launchpad.net/ubuntu/+source/cloud-initramfs-tools/+bug/1123220

Hi Scott, thanks for your comment. While I agree with you that simply returning 0 in one function won't solve *all* problems, it'll solve this one, in a cheap and fast way.

I tend to think initramfs-tools is quite an important package; it's part of the boot process. And yet we have plenty of 5+ year old bugs complaining about this, while we haven't found a perfect/generic solution.

So I propose we fix this one, for the sake of the giant user base of Ubuntu and Debian, and in small steps pursue a generic solution for the write() problem, which may involve a discussion with kernel developers and a change in long-standing behavior. I'd rather not leave users waiting while we do that...

Oh, and I read the other LP you mentioned; it seems to be a different place of failure, in a different project. I say we go and fix it there too, while we work on a more generic/elegant solution. But that bug (cloud-init growpart related) is not as common as this one (here we just need to remove 'quiet' and set the wrong console to break boot completely), so it has a somewhat lower priority. Initramfs-tools is full of quirks to prevent issues, given its important role in the boot process.

Thanks,

Guilherme

Eric Desrochers (slashd) wrote :

I agree that if we can solve one problem, with the certainty of not introducing more harm or regressions, let's do it. We should do it instead of waiting for a fix/refactoring/... that we all know won't happen in the near future.

My 2 cents are that if we can convince Debian upstream, let's do it.
The Debian maintainer will be the ultimate approver/merger.

- Eric

Erlon R. Cruz (sombrafam) wrote :

I have tested the PPA and it works great for me[1]. Given the simplicity of this fix and how far we would need to go to provide a generic fix, this looks like a good trade-off to me.

[1] https://pastebin.ubuntu.com/p/z3Jsfnf5fK/

tags: added: sts

Good news! Debian maintainer merged my fix today: https://salsa.debian.org/kernel-team/initramfs-tools/-/commit/c3cbf355 (after only 10 weeks heh).

The Ubuntu SRU process is ongoing on LP #1879987.
Cheers,

Guilherme

The specific issue (the bad return from printf) was handled in LP #1879987. It was recently fixed in the following initramfs-tools versions:
0.122ubuntu8.17 (Ubuntu 16.04 - Xenial)
0.130ubuntu3.11 (Ubuntu 18.04 - Bionic)
0.136ubuntu6.3 (Ubuntu 20.04 - Focal)
0.137ubuntu12 (Ubuntu 20.10 - Groovy)
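
To check a given system against that list, something like the following sketch could be used. The function names are made up; the version strings are copied from the list above, and the second function assumes a Debian/Ubuntu system where lsb_release, dpkg-query and dpkg are available.

```shell
#!/bin/sh
# Map an Ubuntu release codename to the first initramfs-tools version that
# carries the fix, as listed in this comment.
fixed_initramfs_version() {
    case $1 in
        xenial) echo 0.122ubuntu8.17 ;;
        bionic) echo 0.130ubuntu3.11 ;;
        focal)  echo 0.136ubuntu6.3 ;;
        groovy) echo 0.137ubuntu12 ;;
        *) return 1 ;;
    esac
}

# Returns 0 if the installed initramfs-tools is at least the fixed version,
# 1 if it is older, and 2 if the answer could not be determined.
has_initramfs_fix() {
    codename=$(lsb_release -cs) || return 2
    wanted=$(fixed_initramfs_version "$codename") || return 2
    installed=$(dpkg-query -W -f='${Version}' initramfs-tools) || return 2
    dpkg --compare-versions "$installed" ge "$wanted"
}
```

Releases not in the list make has_initramfs_fix return 2 rather than guess.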

I'll nominate this LP for initramfs-tools and set as Fix Released. If that's still reproducible in any way, likely there's another correlated issue, please open another LP and mention it here.
Cheers,

Guilherme

Changed in cloud-images:
status: Confirmed → Invalid
Changed in initramfs-tools (Ubuntu):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
importance: Undecided → Medium
status: New → Fix Released