Ubuntu Bionic freezes on Supermicro hardware when console redirection is configured in kernel parameters

Bug #1865145 reported by Przemyslaw Hausman
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

STEPS TO REPRODUCE

1. Use MAAS for deploying Ubuntu on bare metal servers.

2. Enable serial-over-lan in BIOS.

3. Enable console redirection in kernel parameters:
   console=tty0 console=ttyS1,115200n8

4. Connect to the IPMI console to watch the boot and installation process:
   ipmiconsole -u <user> -p <pass> -h <host>

5. Start deploying the server with MAAS (bionic, GA kernel)
Several re-deployments may be needed before the issue appears. In my case, I'm deploying 9 bare-metal machines at the same time, and the issue typically surfaces on one random node about 50% of the time.

6. During the deployment, after the first reboot, typically while cloud-init scripts are executing, the IPMI console freezes, usually in the middle of printing some output.
You can still SSH to the node. While logged into the node:
- running 'ps aux' takes more than 20 seconds,
- running 'systemctl' times out,
- the number of zombie processes increases.

Sometimes, running 'sosreport' unblocks the node, so that cloud-init scripts finish executing and the node shuts down as expected.
Sometimes, simply logging into the node over SSH is enough to unblock it.

7. Eventually, after 30 minutes, MAAS marks the node as "Failed deployment".
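The growing zombie count described in step 6 can be spotted quickly with a small filter over `ps` output. This is a sketch, not part of the original report; the `list_zombies` helper name is hypothetical, and the sample input below is illustrative, not real data from the affected machines:

```shell
# Hypothetical helper: filter zombie (state Z*) tasks from
# `ps -eo pid,ppid,stat,comm` output and print each with its parent PID.
list_zombies() {
  awk '$3 ~ /^Z/ { print "zombie " $1 " (parent " $2 ")" }'
}

# On a live node you would run:
#   ps -eo pid,ppid,stat,comm | list_zombies
# Illustrative sample input:
printf '1234 900 Zs defunct\n2000 1 Ss systemd\n' | list_zombies
# → zombie 1234 (parent 900)
```

The parent PIDs are the interesting part: a zombie stays around only because its parent has not reaped it, so a rising zombie count points at parents that are stuck.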

WORKAROUND

Remove console redirection from kernel parameters (console=tty0 console=ttyS1,115200n8).
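For reference, removing the parameters by hand on a GRUB-based node can be sketched as below. On MAAS-deployed machines the parameters are normally injected through MAAS's kernel options setting, so they should be removed there instead; the file under /tmp is only a stand-in for /etc/default/grub:

```shell
# Stand-in for /etc/default/grub (a sketch of the change, assuming a
# GRUB-based boot; do not edit the real file without a backup).
cat > /tmp/grub-default <<'EOF'
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS1,115200n8"
EOF

# Drop both console= redirection entries:
sed -i -e 's/console=tty0 *//' -e 's/console=ttyS1,115200n8 *//' /tmp/grub-default
cat /tmp/grub-default   # → GRUB_CMDLINE_LINUX_DEFAULT=""

# On a real node: make the same edit in /etc/default/grub, then run
# `sudo update-grub` and reboot.
```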

AFFECTED HARDWARE

Supermicro SYS-2029U-TR4
https://www.supermicro.com/en/products/system/2U/2029/SYS-2029U-TR4.cfm

OTHER

The issue does not affect Xenial with GA kernel.

Tags: bionic sts
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1865145

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Revision history for this message
Przemyslaw Hausman (phausman) wrote :

Subscribing field-critical as we're currently facing this issue on two customer OpenStack deployments.

Revision history for this message
Andrea Righi (arighi) wrote :

If you can still ssh to the system when the problem happens, can you run dmesg and post the output here? Thanks!

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

Hi Andrea, I am not able to access the machines now to run dmesg. However, the attached sosreport contains /var/log/kern.log and /var/log/syslog. I hope that helps.

Victor Tapia (vtapia)
tags: added: sts
Revision history for this message
Andrea Righi (arighi) wrote :

@phausman sorry for the late response, is this bug still happening? Unfortunately I don't see any error or potential problem in the attached kern.log or syslog. I guess the only way to debug this issue is to reproduce the problem and run some commands via ssh...

Revision history for this message
Przemyslaw Hausman (phausman) wrote :

@arighi, unfortunately I do not have access to the hardware anymore. As far as I know, @agrebennikov is working on getting access to some test machines from the vendor. Once we have it and the problem is still reproducible, I think you might be able to access the machine to look around.

Revision history for this message
Andrea Righi (arighi) wrote :

@phausman thanks for the update! To be more specific, it would be interesting to take a look at dmesg while the problem is happening (I would expect to see a kernel oops: hung task timeout). It would be also interesting to take a look at the parent of these zombie tasks and try to figure out if it's stuck somewhere (cat /proc/<pid>/stack).
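The checks suggested above can be sketched as a small helper. This is not from the original thread: the `parent_stack` name is hypothetical, reading another user's /proc/<pid>/stack requires root, and the hung-task messages only appear in dmesg if the kernel's hung-task watchdog actually fires:

```shell
# Sketch: given a task PID, find its parent and dump the parent's
# kernel stack. Field 4 of /proc/<pid>/stat is the PPID (this simple
# awk split assumes the comm field contains no spaces).
parent_stack() {
  local ppid
  ppid=$(awk '{ print $4 }' "/proc/$1/stat")
  echo "parent of $1 is $ppid"
  cat "/proc/$ppid/stack" 2>/dev/null \
    || echo "(need root to read /proc/$ppid/stack)"
}

# Look for hung-task warnings first:
#   dmesg | grep -i 'blocked for more than'
# Then inspect each zombie's parent:
parent_stack $$   # demo on the current shell; use a zombie's PID on the node
```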

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired