Bug #1573095 “Cloud images fail to boot when a serial port is no...” : Bugs : cloud-images

zero (x-rbuntu-z) on 2016-04-21

affects:	livecd-rootfs (Ubuntu) → ubuntu
summary:
- Cloud image hangs at first boot
+ 16.04 cloud image hangs at first boot

zero (x-rbuntu-z) on 2016-04-24

tags:

added: xenial

Nick Douma (lordgaav) wrote on 2016-04-24: Re: 16.04 cloud image hangs at first boot

#1

xenial-boot-freeze.png (15.6 KiB, image/png)

Can confirm this bug, attached is a screenshot. The VM will hang and have a CPU load of 100%, but the boot will never continue.

Launchpad Janitor (janitor) wrote on 2016-04-24:

#2

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu:
status:	New → Confirmed

Kenneth Østrup (kennetho) wrote on 2016-04-25:

#3

I am also seeing this issue, with the same results as the screenshot submitted by Nick Douma.

Dan Watkins (oddbloke) wrote on 2016-04-25:

#4

Hi zero, Kenneth, Nick,

Thanks for reporting and confirming this bug! Could one of you include a list of instructions to reliably reproduce this, please? That will make it much easier for someone investigating the bug to be sure that they are hitting the same issue that you are. :)

Thanks,

Dan

affects:	ubuntu → cloud-images
Changed in cloud-images:
status:	Confirmed → Incomplete

zero (x-rbuntu-z) wrote on 2016-04-27:

#5

Hello,

Here are the steps I followed to reproduce the bug on Proxmox 4.1:

1. Download current cloud image (cow image version) at : https://uec-images.ubuntu.com/xenial/current/xenial-server-cloudimg-amd64-disk1.img
2. Put image inside image storage location on proxmox
3. Run qemu-img resize $image_path 20G (might be optional to reproduce the issue)
4. Launch VM with the following command :

pvesh create /nodes/$hostname/qemu -name $hostname -bootdisk virtio0 -vmid $vmid -memory 1024 -sockets 1 -cores 1 -net0 virtio,bridge=vmbr0 -virtio0=local:$vmid/$image_path

5. Start the created VM and display the console
6. Boot will hang at "Btrfs loaded"

Rodrigo Bahiense (rodbzro) wrote on 2016-04-29:

#6

I'm also having this issue.

Tried with the .img and .vmdk distributions of "xenial-server-cloudimg-amd64-disk1".

Using VirtualBox 5.0.16r105871 on Windows 10 Pro x64 Build 10586

The boot freezes at the same point demonstrated in the #1 comment screenshot: https://bugs.launchpad.net/cloud-images/+bug/1573095/+attachment/4645921/+files/xenial-boot-freeze.png

John Petrini (john-d-petrini) wrote on 2016-04-30:

#7

I'm experiencing this bug also. Running KVM on a 16.04 host. Hangs at Btrfs loaded.

John Petrini (john-d-petrini) wrote on 2016-04-30:

#8

I should add that the cloud image does work in our OpenStack environment which is running KVM on 14.04 qemu-kvm version 1:2.5+dfsg-5ubuntu10. It does not work on 16.04 with qemu-kvm version 1:2.5+dfsg-5ubuntu10.

John Petrini (john-d-petrini) wrote on 2016-04-30:

#9

Sorry copy paste mistake. OpenStack is running qemu-kvm version 2.0.0+dfsg-2ubuntu1.22.

zero (x-rbuntu-z) wrote on 2016-05-03:

#10

Hello,

I tried again with the build 20160502 and have the same issue.

zero (x-rbuntu-z) wrote on 2016-05-11:

#11

Hello,

Does anyone have an idea of what might be the root cause of this issue?

I'm happy to help but don't really know where to look or what to investigate.

Patricia Gaughen (gaughen) on 2016-05-20

Changed in cloud-images:
status:	Incomplete → New
milestone:	none → y-2016-06-02

Scott Moser (smoser) wrote on 2016-05-23:

#12

I suspect the issue is related to cloud-init writing networking configuration data.
Could you please shut down the system, mount the image (mount-image-callback will mount it easily enough), and copy out /var/log/cloud-init.log?

The other possibility is related to bug 1577844.

In both cases there should eventually be timeouts (maybe at the 5-minute mark) after which boot continues, but likely without networking.

Julian Edwards (julian-edwards) wrote on 2016-05-24:

#13

I can confirm there is no timeout, it hangs forever (at least, I left it overnight).

Julian Edwards (julian-edwards) wrote on 2016-05-24:

#14

Shutdown doesn't work either, I needed a hard stop. After mounting the image, there's no cloud-init.log.

Fryderyk Dziarmagowski (freddix) wrote on 2016-05-31:

#15

Here is a workaround (or, better said, two) that I am using (after converting the image to raw) to get it to work in Proxmox:

sudo kpartx -a xenial-server-cloudimg-amd64-disk1.raw
sudo mkdir -p /tmp/foo && sudo mount /dev/mapper/loop0p1 /tmp/foo

replace console=ttyS0 in

/tmp/foo/boot/grub/grub.cfg
/tmp/foo/etc/default/grub

with net.ifnames=0

sudo umount /tmp/foo
sudo kpartx -d xenial-server-cloudimg-amd64-disk1.raw
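
The "replace" step of this workaround can be sketched with sed. This is only a minimal sketch, assuming GNU sed and an image already mounted at /tmp/foo as shown above; the helper name swap_console_arg is hypothetical and not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper: replace each bare console=ttyS0 kernel argument in
# the given files with net.ifnames=0. Assumes GNU sed (-i and -E).
swap_console_arg() {
    for cfg in "$@"; do
        # Match console=ttyS0 only when followed by whitespace, a double
        # quote, or end of line, so variants such as console=ttyS0,115200n8
        # are left untouched for manual review.
        sed -i -E 's/console=ttyS0([[:space:]"]|$)/net.ifnames=0\1/g' "$cfg"
    done
}
```

Run as root between the mount and umount steps, it would be called as: swap_console_arg /tmp/foo/boot/grub/grub.cfg /tmp/foo/etc/default/grub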

Dan Watkins (oddbloke) wrote on 2016-05-31:

#16

Julian, Fryderyk, or someone else who's affected,

If you aren't seeing a cloud-init.log on affected instances, could you instead tar up all of /var/log and put it somewhere we can examine?

Thanks,

Dan

Łukasz Leszczuk (lukasz-leszczuk) wrote on 2016-06-02:

#17

I am experiencing the same issue when booting on a bare metal server with Ironic.

Julian Edwards (julian-edwards) wrote on 2016-06-16:

#18

I am now seeing the same with the 12.04 images currently up
20160607/ 07-Jun-2016 06:49 -
20160610.1/ 11-Jun-2016 05:13 -
20160610/ 10-Jun-2016 12:13 -

Julian Edwards (julian-edwards) wrote on 2016-06-16: Re: [Bug 1573095] Re: 16.04 cloud image hangs at first boot

#20

On Tuesday, 31 May 2016 16:18:49 AEST you wrote:
> Julian, Fryderyk, or someone else who's affected,
>
> If you aren't seeing a cloud-init.log on affected instances, could you
> instead tar up all of /var/log and put it somewhere we can examine?
>

The problem is that the disk image doesn't get flushed at any point, so
there's nothing in the logs at all - it's the original qcow. And because I
have to hard kill the VM, it will never flush.

root@proxmox15:/var/lib/vz/images/204# qemu-nbd --connect=/dev/nbd0 vm-disk-0.qcow2
root@proxmox15:/var/lib/vz/images/204# mount /dev/nbd0p1 /mnt/tmp
root@proxmox15:/var/lib/vz/images/204# ls /mnt/tmp/var/log
apt btmp dist-upgrade fsck landscape lastlog unattended-upgrades upstart wtmp

Julian Edwards (julian-edwards) wrote on 2016-06-16:

#19

15.10 images seem to work, however.

Julian Edwards (julian-edwards) wrote on 2016-06-16: Re: 16.04 cloud image hangs at first boot

#21

I can confirm the workaround above, removing console=ttyS0 from the kernel parameters stops it from hanging.

Julian Edwards (julian-edwards) wrote on 2016-07-04:

#22

Is a permanent resolution imminent on this? The fault renders the cloud image useless on various platforms.

Dan Watkins (oddbloke) wrote on 2016-07-04:

#23

Hi Julian,

It's still not 100% clear to me what is actually causing the problem, and what workaround fixed it. Can you describe precisely what workaround you used to get a booting image?

Thanks,

Dan

Julian Edwards (julian-edwards) wrote on 2016-07-05:

#24

Hi - I just loop mounted the image and removed the console=ttyS0 from the kernel args in the grub config, and it boots fine.

Dan Watkins (oddbloke) wrote on 2016-07-05:

#25

Hi Julian,

Does enabling serial consoles in proxmox[0] fix the issue for you?

Dan

[0] https://pve.proxmox.com/wiki/Serial_Terminal

Raju (rajustha2000) on 2016-07-12

Changed in cloud-init:
status:	New → Fix Released

Vladimir Rutsky (rutsky) wrote on 2016-07-27:

#26

This bug looks similar to https://bugs.launchpad.net/ubuntu/+source/livecd-rootfs/+bug/1546108

Mark - Syminet (mark-syminet) wrote on 2016-09-09:

#27

Most recent image as of today also hard-locked, ttyS0 fix described above worked.

MaxZhang (maxzhangx) wrote on 2016-11-30:

#28

Hi,

I think the problem may be that the ttyS0 parameter is incomplete, because the speed is not set.
Changing it from:
console=ttyS0
to:
console=ttyS0,115200n8

would fix it.

KingJ (kj-kingj) wrote on 2016-12-27:

#29

I can confirm that I am affected by this, running on ESXi 6.5.

I took a slightly different approach to fixing it - adding a virtual serial port to the VM's hardware allowed it to boot successfully.

Sebastian (sebek-h) on 2017-01-24

no longer affects:

tuxlab

Sebastian (sebek-h) wrote on 2017-01-24:

#30

This problem affected my environment with MAAS and the 16.04/16.10/17.04 images.
On some servers we use a console on ttyS0, on others ttyS1.
Removing console=ttyS1,115200n8 from Global Kernel Parameters in MAAS resolves the problem (partly).
The problem does not occur on 14.04.

Launchpad Janitor (janitor) wrote on 2017-05-30:

#31

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu:
status:	New → Confirmed

Mathieu Mitchell (mat128) wrote on 2017-08-21:

#32

Any update on when the updated kernel parameters can make it to official cloud images?

Also worth noting, OpenStack image docs [1] also indicate ttyS0 at 115200n8.

1: https://docs.openstack.org/image-guide/openstack-images.html#ensure-image-writes-boot-log-to-console

Evan Felix (karcaw) wrote on 2017-12-21:

#33

I am seeing this issue when booting 16.04 images under oVirt. If I add a serial console to the VM, it boots fine.

Evan Felix (karcaw) wrote on 2017-12-21:

#34

I can also confirm that this issue happens in the cloud images for xenial, zesty, artful, and current bionic

Dan Watkins (oddbloke) on 2017-12-21

Changed in cloud-images:
milestone:	y-2016-06-02 → none

Andrew Paxson (paxsonsa) wrote on 2018-01-10:

#35

I am not sure if this is relevant to your inquiry, but I also found that I had to add an isa-serial device (in virt-manager that's Serial PPTY) to the machine; it then went past that section.

Keenan Verbrugge (keenanv) wrote on 2018-01-23:

#36

Same issue here. Using ubuntu 16.04

Adding a console for qemu/kvm was able to get me past this:

virsh edit vmname

add:

ironstorm (ironstorm-gmail) wrote on 2018-04-04:

#37

The same problem exists on VirtualBox using the Apr 2 nightly of bionic cloud image... :(

The workaround on VirtualBox is to add a disconnected serial port, which allows booting to continue, using the following:

VBoxManage modifyvm "${VM}" --uart1 0x3f8 4 --uartmode1 disconnected

Edward Vielmetti (edward-vielmetti) wrote on 2018-06-07:

#38

This problem also reported at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/356

If someone who has seen this has a workaround specifically for OpenStack, I'd appreciate hearing about it.

Jose Phillips (jose-phillips) wrote on 2018-11-23:

#39

Hi Everyone

Just add a serial port and that will fix the issue.
Cloud images try to log the boot to serial port 1.

Robie Basak (racb) wrote on 2019-01-21:

#40

Here are full steps to reproduce this issue using tooling from Ubuntu only:

uvt-simplestreams-libvirt sync release=bionic arch=amd64 label=release
uvt-kvm create --no-start lp1573095 release=bionic arch=amd64 label=release
virsh edit lp1573095 # delete <serial/> and <console/> blocks
virsh start lp1573095
uvt-kvm wait lp1573095

Expected behaviour: succeeds when the VM is available
Actual behaviour: hangs and eventually times out

Additionally you can examine the screen with virt-manager. On that screen, I
expect a login prompt. Instead I see nothing beyond the normal kernel messages
(nothing from userspace).

If you skip the serial/console definition deletion in the steps above, you'll
see that the VM works. In other words, the VM stops working if a serial port is
not available.

Workaround: remove console=ttyS0 from GRUB_CMDLINE_LINUX_DEFAULT in
/etc/default/grub.d/50-cloudimg-settings.cfg, leaving only console=tty1, and
then run "sudo update-grub". However, this must either be done on a system with
a serial port, or you have to jump through the appropriate hoops to apply the
result of "update-grub" without having booted the system. Note
that editing /etc/default/grub is insufficient since
/etc/default/grub.d/50-cloudimg-settings.cfg overrides it (see bug 1812752).

summary:
- 16.04 cloud image hangs at first boot
+ Cloud images fail to boot when a serial port is not available
Changed in ubuntu:
status:	Confirmed → Invalid
Changed in cloud-images:
status:	New → Confirmed
Changed in cloud-init:
status:	Fix Released → Invalid
description:	updated
description:	updated

Jeremy Busk (busk) wrote on 2019-03-01:

#41

While you can work around the issue with

```
sudo sed -i 's/ console=ttyS0//g' /etc/default/grub.d/50-cloudimg-settings.cfg
sudo update-grub
```

You need ttyS0 in grub in order to interact with vm guest using

```
virsh console <vm-name>
```

I filed a bug with VirtualBox, as it could be a compound issue or an issue with how they handle ttyS0 from the OS. https://www.virtualbox.org/ticket/18463

David (davidjaquier) wrote on 2019-03-18:

#42

Have the same trouble when I try to deploy cloud images based templates in a cloudstack managed environment on top of esxi 6.5 (GTT VDC).

Is there a way to remove that without deploying a virtual machine? I tried to tar -x the ova, modify the vmdk via guestmount on ubuntu 18 or via fuse for osx, without success.

If someone can tell me an efficient and short way to remove this setting from the .ova, it could be really great.

Scott Moser (smoser) wrote on 2019-03-26:

#43

I just added a bunch of other bugs that really are dups of this.
The goal of doing so is just to give whoever might be looking at making a change more context on the unfortunate complexity of doing so.

Related bugs:
* bug 1016695: add console=tty1 to cloud-image kernel boot parameters
* bug 1123220: cloud-image VM causes kernel panic if image is resized
* bug 1061977: Machine fails to commission when console=ttyS0 is present on kernel opts
* bug 1573095: Cloud images fail to boot when a serial port is not available
* bug 1122245: booting from a cloud image hangs until virsh console is used

description:

updated

Grant Emsley (grantemsley) wrote on 2019-05-02:

#44

I ran into this bug trying to use cloud images on Hyper-V.

The workaround in #40 does work - add a serial console to the VM, and change /etc/default/grub.d/50-cloudimg-settings.cfg

If you still want to be able to use a serial console if available, but not require it to be able to boot, just change the line from 'GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"' to 'GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0 console=tty1"'

Then run update-grub.

It seems /dev/console takes on whichever console is listed LAST in the kernel options. If that's ttyS0 and there is no serial port connected, that breaks things. Swapping the order ensures /dev/console goes to tty1, and the boot process works with or without a serial port attached to the VM. If there is a serial port, the serial console will still work with this method.
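
The ordering rule described in this comment can be checked against any kernel command line with a few lines of shell. This is a minimal sketch; the function name last_console is hypothetical, and it simply reports the device named by the last console= argument, which this comment says is the one /dev/console follows.

```shell
#!/bin/sh
# Hypothetical helper: print the device named by the last console=
# argument in a kernel command line string passed as $1.
last_console() {
    last=""
    # $1 is deliberately unquoted so the command line splits on whitespace.
    for arg in $1; do
        case "$arg" in
            console=*)
                last=${arg#console=}   # strip the "console=" prefix
                last=${last%%,*}       # drop options such as ",115200n8"
                ;;
        esac
    done
    printf '%s\n' "$last"
}

# On a running system: last_console "$(cat /proc/cmdline)"
```

If the result is a serial device that the VM does not actually have, the boot is at risk of the hang discussed in this bug.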

Alejandro Torras (atec-post) wrote on 2020-04-25:

#45

Related bug:
* bug 1829625: Vagrant box startup timeout due to no serial port

WGH (wgh) wrote on 2020-05-07:

#46

I debugged this problem a bit. The problem stems from the initramfs attempting to use /dev/console (which refers to the nonexistent /dev/ttyS0), having its logging functions unexpectedly return errors, and breaking everything around them.

You may have already noticed that when this happens, 100% CPU time is consumed. If you enable sysrq keys with sysrq_always_enabled=1, and dump the task list (e.g. virsh send-key ubuntu18.04 KEY_LEFTALT KEY_SYSRQ KEY_T), you'll notice that there's always a combination of console_setup/loadkeys/setfont processes with ever-growing PIDs, which likely means that something is running them in a tight loop.

Now, if you patch "panic()" in /usr/share/initramfs-tools/scripts/functions so it would print its argument to the console (echo "panic 1: " "$@" >/dev/kmsg), you'll see that the panic reason is that "filesystem on /dev/vda1 requires manual fsck", and it's printed in a loop. Indeed, the function does contain a loop:

checkfs()
{
        while ! _checkfs_once "$@"; do
                panic "The $2 filesystem on $1 requires a manual fsck"
        done
}

This is actually a bogus error. The filesystem is (most likely) fine. There's no fsck included in initramfs, so what happens is that the following fragment is executed:

        if ! command -v fsck >/dev/null 2>&1; then
                log_warning_msg "fsck not present, so skipping $NAME file system"
                return
        fi

log_warning_msg, however, returns non-zero status due to stdout being broken, which causes _checkfs_once to return non-zero status as well, because a bare "return" passes on the status of the last command.

panic doesn't work correctly either: it simply can't spawn a shell on broken /dev/console, and exits immediately, and that's what causes the infinite loop.

My thoughts on a solution:

First, debugging this is a PITA. Adding a serial device might be a perfectly acceptable fix for many, but when this issue happens, absolutely nothing in the console points to that being what's missing. Even if it's necessary to leave ttyS0 as the main console, initramfs should at least warn the user (through kmsg) that /dev/console is broken.

Second, errors returned by the logging function causing _checkfs_once to return an error as well is a bug. I think errors in _log_msg should be suppressed. If you do that, unless a panic happens (which is rare), the boot will succeed.

Third, as Grant Emsley said, maybe ttyS0 doesn't really have to be the main console?
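
The status propagation described in this comment can be reproduced outside an initramfs. The following is only a sketch with stand-in functions (none of the bodies are the real initramfs-tools code); closing stdout with >&- is assumed here as a substitute for a /dev/console that cannot be written to.

```shell
#!/bin/sh
# Stand-in for the initramfs logging helper: it just writes to stdout.
log_warning_msg() {
    printf 'Warning: %s\n' "$*"
}

# Mirrors the quoted fragment: a bare "return" passes on the exit status
# of the last command, which here is the log call.
check_once_fragile() {
    log_warning_msg "fsck not present, so skipping file system"
    return
}

# Variant that ignores a failed write, in the spirit of the suggestion
# that logging errors be suppressed.
check_once_tolerant() {
    log_warning_msg "fsck not present, so skipping file system" || true
    return 0
}

# Closing stdout (>&-) simulates a console that rejects writes; the
# resulting statuses are reported on stderr.
check_once_fragile >&- 2>/dev/null;  echo "fragile status:  $?" >&2
check_once_tolerant >&- 2>/dev/null; echo "tolerant status: $?" >&2
```

With a caller shaped like the checkfs() loop quoted above, the fragile variant would keep the loop spinning for as long as the write fails, while the tolerant variant would let it exit.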

Scott Moser (smoser) wrote on 2020-05-07:

#47

The real fix here is kernel improvement (or bug fix if you want to consider current kernel behavior a bug). Anything else is just going to push around the failure.

That is what was determined in 2013, and it's probably still true now.
https://bugs.launchpad.net/ubuntu/+source/cloud-initramfs-tools/+bug/1123220/comments/8

WGH (wgh) wrote on 2020-05-08:

#48

Although I completely agree that the kernel could automatically choose a working /dev/console "backend", and that would be the best fix, this won't be fixed soon. Right now users without a serial port get an inexplicable hang that is pretty hard to debug.

Having the initramfs init script report a broken /dev/console would help this situation tremendously, and the fix is very easy: just add

print "$@" || echo "/dev/console appears broken"

to _log_msg, and users will at least know the source of the problem.

WGH (wgh) wrote on 2020-05-08:

#49

I of course meant

print "$@" || echo "/dev/console appears broken" >/dev/kmsg
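
A slightly fuller version of this idea, written as a standalone function, might look like the sketch below. It is not the initramfs-tools code: the name log_msg_checked and the KMSG variable are hypothetical, printf is used as the write call, and KMSG exists only so the sketch can be tried against an ordinary file (writing to /dev/kmsg normally requires root).

```shell
#!/bin/sh
# Fallback sink for the "console is broken" note; defaults to /dev/kmsg.
KMSG=${KMSG:-/dev/kmsg}

# Hypothetical wrapper around a console log call. If the write to stdout
# fails, leave a note in the fallback sink and still report success so
# callers do not treat a logging failure as a real error.
log_msg_checked() {
    if ! printf '%s\n' "$*" 2>/dev/null; then
        echo "initramfs: /dev/console appears broken" >"$KMSG" 2>/dev/null || true
    fi
    return 0
}
```

As a later comment in this thread notes, a write to a broken console may appear to succeed until a buffer fills, so a check like this may only fire after some amount of output.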

Scott Moser (smoser) wrote on 2020-05-11:

#50

@wgh,
My experience is that it is unfortunately not that simple.
It may have worked for you.

Once it starts to fail, it will fail repeatedly.
But up until some point, writes to stdout work fine. I believe this is because there is a buffer, and writes only begin failing once the buffer has filled and a flush is attempted.

I have a script that I had put into the initramfs in one of the other bugs that shows this. It's quite possible that the behavior has changed in 8 years, but back then you basically just had to write some amount of data to determine if it would fail.

Guilherme G. Piccoli (gpiccoli) wrote on 2020-05-22:

#51

First, I'd like to thank Scott (for comment #43) and Alejandro (comment #45). It seems there's a bunch of LP bugs orbiting around the same issue: Ubuntu isn't bootable if we set an invalid serial console on the kernel command line (and have no "quiet" option there).

In particular, I'd like to thank WGH for the great debug work in comment #46. It saved me a lot of time debugging, and you're right, it's a pain to debug console-related issues because it's not easy to output anything. I used the trick of echoing debug messages to the right console, which helped me narrow things down. But in the end, your debugging exposed what seems to be the major problem here: because a return value is carried through a chain of functions (starting with printf returning 1 due to the bad console), we end up looping in checkfs(), preventing the boot.

I respectfully disagree with Scott: although I think there are potential improvements to the kernel on the console "front", we have here a userspace bug in the init scripts, caused by an error in printf when the console is not correctly set. I don't agree we should leave it alone and pursue a kernel-only solution, especially given how easy the issue is to reproduce and how hard it is to debug. Also, there seem to be long-standing bugs reporting similar issues, so it bothers a bunch of people.

I cannot be 100% sure we don't have more issues than the checkfs() one found by WGH, but this one is definitely an issue and an easy one to fix; I proposed a pretty simple fix in the test-only PPA below:
https://launchpad.net/~gpiccoli/+archive/ubuntu/lp1573095

The more users test it, the more confidence we'll have that there are no more initramfs bugs when the console is wrongly set. I agree with the idea of showing a message on kmsg if the serial console is broken; it's helpful. We can do that as part of an improvement, maybe in the same "commit" as the fix.

More opinions are welcome on this matter, of course. If my solution either doesn't resolve the issue for users or is not the optimal one, let's discuss alternatives to fix this long-standing initramfs flaw.

Thanks,

Guilherme

Guilherme G. Piccoli (gpiccoli) wrote on 2020-05-22:

#52

Sorry for the bad formatting of the last comment; I should have fixed the line breaks before submitting.
I'd like to point out another duplicate, reported by a colleague: LP #1879987.
I'll close that one to keep the effort in this single LP.

Cheers,

Guilherme

Francis Ginther (fginther) on 2020-06-03

tags:

added: id-5b49154499e416396a3e983c

Guilherme G. Piccoli (gpiccoli) wrote on 2020-06-03:

#53

We had reports of good results from a user using my PPA. Was anybody else able to test it?
Cheers,

Guilherme

Scott Moser (smoser) wrote on 2020-06-03:

#54

@Guilherme,

Simply returning non-error (0) in one function in the initramfs isn't going to solve the problem. Anything that is checking the return value of a write() to its stdout will fail.
That could be a shell 'echo', it could be a C write().

To take that path to completion, you'd have to have all programs ignore errors when writing to stdout, which might happen to be /dev/console.

Here's an example of something else (growpart) caring: https://bugs.launchpad.net/ubuntu/+source/cloud-initramfs-tools/+bug/1123220

Guilherme G. Piccoli (gpiccoli) wrote on 2020-06-03:

#55

Hi Scott, thanks for your comment. While I agree with you that simply returning 0 in one function won't solve *all* problems, it'll solve this one, in a cheap and fast way.

I tend to think initramfs-tools is quite an important package; it's part of the boot process. And yet, we have plenty of bugs more than 5 years old complaining about this, while we couldn't find a perfect/generic solution.

So I propose we fix this one, for the sake of the giant user base of Ubuntu and Debian, and in small steps pursue a generic solution for the write() problem, which may involve a discussion with the kernel side and a change in long-standing behavior. I'd rather not leave users waiting while we do that...

Oh, and I read the other LP you mentioned; it seems to be a different place of failure, in a different project. I say we go and fix it there too, while we work on a more generic/elegant solution. But that bug (cloud-init growpart related) is not as common as this one (here we just need to remove 'quiet' and set the wrong console to break boot completely), so it is a bit lower priority than this one. Initramfs-tools is full of quirks to prevent issues, given its important role in the boot process.

Thanks,

Guilherme

Eric Desrochers (slashd) wrote on 2020-06-03:

#56

I agree that if we can solve one problem, with certainty of not introducing more harm or regressions, let's do it. We should do it instead of waiting for a fix/refactoring/... that we all know won't happen in the near future.

My 2 cents are that if we can convince Debian upstream, let's do it.
The Debian maintainer will be the ultimate approver/merger.

- Eric

Erlon R. Cruz (sombrafam) wrote on 2020-06-04:

#57

I have tested the PPA and it works great for me[1]. Given the simplicity of this fix and how far we would need to go to provide a generic fix, this looks like a good trade-off to me.

[1] https://pastebin.ubuntu.com/p/z3Jsfnf5fK/

Guilherme G. Piccoli (gpiccoli) on 2020-07-14

tags:

added: sts

Guilherme G. Piccoli (gpiccoli) wrote on 2020-08-20:

#58

Good news! The Debian maintainer merged my fix today: https://salsa.debian.org/kernel-team/initramfs-tools/-/commit/c3cbf355 (after only 10 weeks, heh).

The Ubuntu SRU process is ongoing on LP #1879987.
Cheers,

Guilherme

Guilherme G. Piccoli (gpiccoli) wrote on 2020-10-08:

#59

The specific issue (the bad return from printf) was worked on in LP #1879987 and was recently fixed in these initramfs-tools versions:
0.122ubuntu8.17 (Ubuntu 16.04 - Xenial)
0.130ubuntu3.11 (Ubuntu 18.04 - Bionic)
0.136ubuntu6.3 (Ubuntu 20.04 - Focal)
0.137ubuntu12 (Ubuntu 20.10 - Groovy)

I'll nominate this LP for initramfs-tools and set it as Fix Released. If this is still reproducible in any way, there is likely another correlated issue; please open another LP and mention it here.
Cheers,

Guilherme
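
To check whether a given system already carries one of the fixed versions listed in this comment, something along these lines could be used. It is a sketch under stated assumptions: the function names are hypothetical, the version table is copied from the list above, and the comparison relies on dpkg --compare-versions being available.

```shell
#!/bin/sh
# Return 0 if version $1 is at least version $2, using dpkg's own
# Debian version comparison.
version_at_least() {
    dpkg --compare-versions "$1" ge "$2"
}

# Fixed initramfs-tools version for an Ubuntu codename, taken from the
# list in this comment; prints nothing for other releases.
fixed_version_for() {
    case "$1" in
        xenial) echo "0.122ubuntu8.17" ;;
        bionic) echo "0.130ubuntu3.11" ;;
        focal)  echo "0.136ubuntu6.3" ;;
        groovy) echo "0.137ubuntu12" ;;
    esac
}

# Example on an Ubuntu system:
#   installed=$(dpkg-query -W -f='${Version}' initramfs-tools)
#   version_at_least "$installed" "$(fixed_version_for "$(lsb_release -cs)")"
```

A non-zero result from version_at_least would suggest the image still ships a package without the printf fix.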

Changed in cloud-images:
status:	Confirmed → Invalid
Changed in initramfs-tools (Ubuntu):
assignee:	nobody → Guilherme G. Piccoli (gpiccoli)
importance:	Undecided → Medium
status:	New → Fix Released

Javier (userjavier) wrote on 2020-11-19:

#60

I've been able to reproduce this with a 16.04 Amazon EC2 exported image on VMware vSphere 6.7. The VM boots fine after adding a serial port.
We will try to reproduce the same workaround by adding a serial console connection in OCI (which is the final target for that cloud image).

Guilherme G. Piccoli (gpiccoli) wrote on 2020-11-19:

#61

Thanks for the report, Javier! What version of initramfs-tools are you using?

Javier (userjavier) wrote on 2020-11-19:

#62

ubuntu16.04.PNG (25.5 KiB, image/png)

You are welcome. This is what I see in the Ubuntu image I'm running on VMware for debugging purposes:

Guilherme G. Piccoli (gpiccoli) wrote on 2020-11-19:

#63

Oh, so it's explained! Thanks Javier for the data. You're using an outdated version of the package that doesn't contain this fix. You need initramfs-tools version 0.122ubuntu8.17 - you can either try to get an updated cloud image, or after the first boot you may be able to update it (you could disable the console=ttySX entry in the cmdline as a workaround in the 1st boot).

If possible, please try the new version and let us all know how it goes.
Cheers,

Guilherme

James Falcon (falcojr) wrote on 2023-05-10:

#64

Tracked in Github Issues as https://github.com/canonical/cloud-init/issues/2657

	Status	Importance	Assigned to
cloud-images	Invalid	Undecided	Unassigned
cloud-init	Invalid	Undecided	Unassigned
Ubuntu	Invalid	Undecided	Unassigned
initramfs-tools (Ubuntu)	Fix Released	Medium	Guilherme G. Piccoli