Unable to network boot Ubuntu 16.04 installer normally on Briggs

Bug #1615021 reported by bugproxy
This bug affects 1 person
Affects                      Status        Importance  Assigned to       Milestone
busybox (Ubuntu)             Fix Released  Undecided   Unassigned
  Xenial                     Won't Fix     Undecided   Unassigned
  Yakkety                    Fix Released  Undecided   Unassigned
debian-installer (Ubuntu)    Invalid       Undecided   Taco Screen team
  Xenial                     Invalid       Undecided   Unassigned
  Yakkety                    Invalid       Undecided   Taco Screen team
systemd (Ubuntu)             Fix Released  Undecided   Martin Pitt
  Xenial                     Fix Released  Undecided   Martin Pitt
  Yakkety                    Fix Released  Undecided   Martin Pitt

Bug Description

== Comment: #7 - Guilherme Guaglianoni Piccoli <email address hidden> - 2016-08-19 10:08:07 ==
The normal procedure to perform a netboot installation of Ubuntu 16.04 is to download the latest vmlinux and initrd.gz files available and kexec them with no parameters (at least on ppc64el).

We're experiencing a strange issue in which the installer freezes before the menus are shown. The system hangs at the point shown below, right after the i40e driver initialization:

[ 11.052832] i40e 0002:01:00.0 enP2p1s0f0: renamed from eth0
[ 11.073976] i40e 0002:01:00.1 enP2p1s0f1: renamed from eth1
[ 11.117799] i40e 0002:01:00.2 enP2p1s0f2: renamed from eth2
[ 11.225745] i40e 0002:01:00.3 enP2p1s0f3: renamed from eth3
***HANG***

The most difficult part of this issue is that it seems to be a timing issue/race condition, and many debugging attempts end up preventing the issue from reproducing (a heisenbug).

We did succeed in getting logs, though, by booting the kernel with the command-line option "BOOT_DEBUG=2" and by changing the initrd to enable systemd debugging; only the files "init" and "start-udev" were changed in the initrd, and both are attached here.

We've attached here a saved screen session that shows the entire boot process until it gets flooded with lots of messages like:

"starting '/bin/readlink /etc/udev/rules.d/80-net-setup-link.rules'
'/bin/readlink /etc/udev/rules.d/80-net-setup-link.rules'(err) 'failed to execute '/bin/readlink' '/bin/readlink /etc/
udev/rules.d/80-net-setup-link.rules': No such file or directory'

seq 3244 queued, 'add' 'pci_bus'
starting '/bin/readlink /etc/udev/rules.d/80-net-setup-link.rules'
passed 408 byte device to netlink monitor 0x1003cfe8020seq 3236 running'/bin/readlink /etc/udev/rules.d/80-net-setup-l
ink.rules'(err) 'failed to execute '/bin/readlink' '/bin/readlink /etc/udev/rules.d/80-net-setup-link.rules': No such
file or directory'
'/bin/readlink /etc/udev/rules.d/80-net-setup-link.rules'(err) 'failed to execute '/bin/readlink' '/bin/readlink /etc/
udev/rules.d/80-net-setup-link.rules': No such file or directory'
Process '/bin/readlink /etc/udev/rules.d/80-net-setup-link.rules' failed with exit code 2.
PROGRAM '/bin/readlink /etc/udev/rules.d/80-net-setup-link.rules' /lib/udev/rules.d/73-usb-net-by-mac.rules:6
passed device to netlink monitor 0x1003d01f730
"

Then it stays hung at this stage. We re-tested after changing the file 73-usb-net-by-mac.rules in the initrd, replacing "/etc/udev/rules.d/80-net-setup-link.rules" with "/lib/udev/rules.d/80-net-setup-link.rules", since the former does not exist whereas the latter does. The same issue was observed!

Notice that if we boot the installer with the command-line option "net.ifnames=0" or "net.ifnames=1", the problem no longer reproduces.

We want to ask Canonical's help in investigating this issue.
Thanks,

Guilherme

SRU INFORMATION for systemd
===========================

Test case:
 * Check what happens for uevents on devices which are not USB network interfaces:
   udevadm test /sys/devices/virtual/mem/null
   udevadm test /sys/class/net/lo

 With the current version these will run

  PROGRAM '/bin/readlink /etc/udev/rules.d/80-net-setup-link.rules' /lib/udev/rules.d/73-usb-net-by-mac.rules:6

 which is pointless. With the proposed version these should be gone.

 * Ensure that the rule still works as intended by connecting a USB network device that has a permanent MAC address (e. g. Android tethering uses a temporary MAC): You should get a MAC-based name like "enx12345678" for it. Now disconnect it again, disable ifnames with

    sudo ln -s /dev/null /etc/udev/rules.d/80-net-setup-link.rules

and reconnect the device. You should now get a kernel name like "usb0" for it.

* Regression potential: Errors in the rule could break persistent naming - or its disabling - of USB network interfaces. Running the above test carefully is important to ensure this keeps working. This has little to no actual effect on anything else on the system (aside from a performance impact and spamming logs), so overall the regression potential is low.

Revision history for this message
bugproxy (bugproxy) wrote : screen session output

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-145180 severity-high targetmilestone-inin16041
Revision history for this message
bugproxy (bugproxy) wrote : init (modified on initrd)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : start-udev (modified on initrd)

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → systemd (Ubuntu)
Changed in systemd (Ubuntu):
status: New → Confirmed
Revision history for this message
Steve Langasek (vorlon) wrote :

Examining the initrd shows that readlink is provided as /usr/bin/readlink -> /bin/busybox, not as /bin/readlink where systemd expects it (and where it's shipped on an installed system). This is a bug in debian-installer's construction of that image - though gee it would be nice if systemd didn't require hard-coded paths to everything.

There's no guarantee that fixing the bug that's causing this error message will fix the underlying problem preventing your boot, but it will at least fix the message spam.

affects: systemd (Ubuntu) → debian-installer (Ubuntu)
Revision history for this message
Steve Langasek (vorlon) wrote :

I've thought about this some more, and while the /bin/readlink vs. /usr/bin/readlink discrepancy in busybox is a bug, fixing this is definitely not going to fix the problem in the installer. In the installer, /etc/udev/rules.d/80-net-setup-link.rules will never exist since this is an admin override; so the readlink command - if it existed - would still return false. I'm reasonably sure the lack of /bin/readlink is not causing the udev rule to behave differently; so it's sufficient to fix this particular issue for 16.10 and later and not SRU it.

What is *more* of an issue is that the structure of /lib/udev/rules.d/73-usb-net-by-mac.rules causes a separate call out to readlink for every single udev event, because the readlink check happens *before* checking the ACTION/SUBSYSTEM/SUBSYSTEMS attributes of the event, unless net.ifnames=0 is set.

So regardless of whether this is the root cause of the install failure, this udev rule is causing hundreds of thousands of extra calls out to /bin/readlink on boot, which should definitely be fixed by reordering these checks.
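
The cost of that ordering can be illustrated with a toy model in shell. This is not udev and not the real rule: the event list, the `expensive_check` function (standing in for the readlink call), and `count_calls` are all hypothetical, and the cheap filter is reduced to a simple ACTION/SUBSYSTEM match.

```shell
#!/bin/sh
# Toy model, not udev: count how often a costly check runs for a stream of
# events, depending on whether the cheap filter is applied before or after it.
expensive_calls=0
expensive_check() {
    expensive_calls=$((expensive_calls + 1))
    return 1   # stands in for the external readlink call failing
}

# Hypothetical events, each written as "ACTION:SUBSYSTEM".
events="add:pci_bus add:mem add:net add:tty change:block add:net"

count_calls() {
    order=$1
    expensive_calls=0
    for ev in $events; do
        action=${ev%%:*}
        subsystem=${ev#*:}
        if [ "$order" = "check-first" ]; then
            # Costly check runs for every event, relevant or not.
            expensive_check || true
            [ "$action" = add ] && [ "$subsystem" = net ] || continue
        else
            # Cheap filter first: irrelevant events never reach the check.
            [ "$action" = add ] && [ "$subsystem" = net ] || continue
            expensive_check || true
        fi
    done
    echo "$expensive_calls"
}

count_calls check-first    # prints: 6
count_calls filter-first   # prints: 2
```

With only six sample events the difference is small, but it scales with the number of uevents processed during boot.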

Martin, can you please look into fixing this for xenial+yakkety?

Changed in systemd (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
status: New → Triaged
Changed in busybox (Ubuntu Xenial):
status: New → Won't Fix
Changed in busybox (Ubuntu Yakkety):
status: New → Fix Committed
Changed in debian-installer (Ubuntu Xenial):
status: New → Triaged
Changed in debian-installer (Ubuntu Yakkety):
status: Confirmed → Triaged
Changed in systemd (Ubuntu Xenial):
status: New → Triaged
assignee: nobody → Martin Pitt (pitti)
Martin Pitt (pitti)
description: updated
Revision history for this message
Martin Pitt (pitti) wrote :

Thanks for reporting this! Indeed this is a silly rule construction, *brown paperbag*. I fixed this for the next Debian/Yakkety upload in https://anonscm.debian.org/cgit/pkg-systemd/systemd.git/commit/?id=b42e1f8af2 and backported it to Xenial in https://anonscm.debian.org/cgit/pkg-systemd/systemd.git/commit/?h=ubuntu-xenial&id=d244c9acd .

Changed in systemd (Ubuntu Yakkety):
status: Triaged → Fix Committed
Changed in systemd (Ubuntu Xenial):
status: Triaged → In Progress
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-08-24 12:34 EDT-------
Thanks very much vorlon and pitti. Pretty nice findings!

But...the issue still persists. I'll summarize the tests I made:

1) Firstly, I changed the link /usr/bin/readlink and, as vorlon predicted, this didn't solve the issue.

2) Then, independently of (1), I applied pitti's patch to xenial's "73-usb-net-by-mac.rules" and...unfortunately it also didn't solve the issue.

What strikes me most is how much even the simplest debugging interferes with the issue! After testing pitti's patch, and with the patch still applied, I changed the systemd-udevd invocation in start-udev like this:

(before)
SYSTEMD_LOG_LEVEL=notice /lib/systemd/systemd-udevd --daemon --resolve-names=never
(after my change)
SYSTEMD_LOG_LEVEL=debug /lib/systemd/systemd-udevd --daemon --resolve-names=never --debug

Well, the issue reproduced and I didn't see a single extra log message.

After this, I kept both pitti's patch and this systemd debug parameter, but I booted with the command-line option "BOOT_DEBUG=1". Guess what? I was flooded with messages, but the installer showed up. This is really weird to me... I'll attach a screen session of this last trial.

I appreciate any suggestions you have for debugging the issue further. By the way, using "net.ifnames=1" works around the issue too. Basically, any command-line option seems to solve it, even the simplest debug parameter.

Thanks very much for the help and advice,

Guilherme

Revision history for this message
bugproxy (bugproxy) wrote : NEW screen output

------- Comment (attachment only) From <email address hidden> 2016-08-24 12:39 EDT-------

Revision history for this message
Steve Langasek (vorlon) wrote :

If this screen output is for a case when the installer *did* show up, I don't think it's going to tell us much about where things have hung in the case that it *didn't* show up.

If there's a particular invocation of udev that lets you reproduce the problem, I suggest sticking with that, and capturing the output of 'udevadm info -e' (possibly by using a fixed delay).

Revision history for this message
Martin Pitt (pitti) wrote :

I don't actually know what BOOT_DEBUG does -- I've never seen it before, it does not appear anywhere in my yakkety system, and it's for sure not something the kernel, initramfs-tools, or systemd look at. My best guess is that this is a debian-installer specific debug flag.

So from what I can tell, the readlink path issue is merely a red herring -- it's good to fix it of course, but it's unrelated to the boot failure.

Since this is a heisenbug, it rather seems to me that this is some timing issue -- any extra debugging, or time spent with changing boot parameters in the boot loader will change the behaviour (e. g. make the detection of network devices by the hardware finish earlier).

ATM I'm afraid there isn't enough useful information here yet to understand what's going on -- indeed having a screen output where the problem does happen would be helpful. dmesg logs and "udevadm info -e" as well, as Steve says.

Revision history for this message
bugproxy (bugproxy) wrote : Boot log for failed run, no debug options

------- Comment on attachment From <email address hidden> 2016-08-25 10:51 EDT-------

Added a boot log showing the hang, no boot options (no debug options).

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package busybox - 1:1.22.0-19ubuntu2

---------------
busybox (1:1.22.0-19ubuntu2) yakkety; urgency=medium

  * debian/patches/readlink-in-slash-bin.patch: put readlink in /bin/
    like coreutils. Closes LP: #1615021.

 -- Steve Langasek <email address hidden> Tue, 23 Aug 2016 12:36:39 -0700

Changed in busybox (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 231-5

---------------
systemd (231-5) unstable; urgency=medium

  [ Iain Lane ]
  * Let graphical-session-pre.target be manually started (LP: #1615341)

  [ Felipe Sateler ]
  * Add basic version of git-cherry-pick
  * Replace Revert-units-add-a-basic-SystemCallFilter-3471.patch with upstream
    patch
  * sysv-generator: better error reporting. (Closes: #830257)

  [ Martin Pitt ]
  * 73-usb-net-by-mac.rules: Test for disabling 80-net-setup-link.rules more
    efficiently. Stop calling readlink at all and just test if
    /etc/udev/rules.d/80-net-setup-link.rules exists -- a common way to
    disable an udev rule is to just "touch" it in /etc/udev/rule.d/ (i. e.
    empty file), and if the rule is customized we cannot really predict anyway
    if the user wants MAC-based USB net names or not. (LP: #1615021)
  * Ship kernel-install (Closes: #744301)
  * Add debian/extra/kernel-install.d/60-initrd.install.
    This kernel-install drop-in copies the initrd of the selected kernel to
    the EFI partition.
  * bootctl: Automatically detect ESP partition.
    This makes bootctl work with Debian's /boot/efi/ mountpoint without having
    to explicitly specify --path.
    Patches cherry-picked from upstream master.
  * systemd.NEWS: Point out that alternatively rcS scripts can be moved to
    rc[2-5]. Thanks to Petter Reinholdtsen for the suggestion!

  [ Michael Biebl ]
  * Enable iptables support (Closes: #787480)
  * Revert "logind: really handle *KeyIgnoreInhibited options in logind.conf"
    The special 'key handling' inhibitors should always work regardless of
    any *IgnoreInhibited settings – otherwise they're nearly useless.
    Update man pages to clarify that *KeyIgnoreInhibited only apply to a
    subset of locks (Closes: #834148)

 -- Martin Pitt <email address hidden> Fri, 26 Aug 2016 10:58:07 +0200

Changed in systemd (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-08-30 16:01 EDT-------
pitti/vorlon, thanks for your suggestions. Unfortunately, I wasn't able to get more information by placing a fixed delay in the init script. What I did was start a small script in the background at the beginning of init that waits for 8 seconds and then runs the command "udevadm info -e".
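
A minimal sketch of such a delayed background dump, for illustration only. The helper name `delayed_dump` and the output path are hypothetical and are not taken from the attached init script.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background after a fixed delay
# and save its stdout and stderr to a file.
delayed_dump() {
    delay=$1
    outfile=$2
    shift 2
    ( sleep "$delay"; "$@" > "$outfile" 2>&1 ) &
}

# Possible use near the top of init (the path is an assumption):
# delayed_dump 8 /tmp/udev-db.txt udevadm info -e
```

Because the subshell is put in the background, init continues immediately and the dump only happens if the system is still running when the delay expires.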

The problem is that init does not seem to be executed at all; the hang happens first. I added a simple "echo" command as the first thing in init, but never saw the message it should print.

Any more suggestions you have are really appreciated.

Thanks,

Guilherme

Revision history for this message
Andy Whitcroft (apw) wrote : Please test proposed package

Hello bugproxy, or anyone else affected,

Accepted systemd into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/229-4ubuntu8 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in systemd (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-09-07 10:25 EDT-------
Waiting for xenial-proposed installer to be updated. Currently, still shows 2016-09-02.

------- Comment From <email address hidden> 2016-09-07 10:26 EDT-------
Err, should be "2016-08-02" as current content date on xenial-proposed.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-09-08 07:34 EDT-------
Still waiting for updated installer images. http://ports.ubuntu.com/ubuntu-ports/dists/xenial-proposed/main/installer-ppc64el/current/images/netboot/ubuntu-installer/ppc64el/ still showing images from August.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-09-10 08:29 EDT-------
The installer images for xenial-proposed have not yet been updated.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-09-12 08:33 EDT-------
Still no new installer images for xenial-proposed.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-09-12 11:09 EDT-------
Reopening: xenial-proposed installer has not been updated, cannot verify fix. Set back to fixed/verify once http://ports.ubuntu.com/ubuntu-ports/dists/xenial-proposed installer has been updated with fixed systemd to be tested.

Revision history for this message
Martin Pitt (pitti) wrote :

Note, there hasn't been any debian installer fix yet, as we don't even understand what's actually happening there. There has just been an SRU to systemd/udev to fix the "No such file or directory" error message in udev rules, but apparently that was not the actual problem.

Revision history for this message
Martin Pitt (pitti) wrote :

I ran the test case for systemd on a 16.04.1 desktop live system with a USB ethernet device. I confirm that naming still works as intended, MAC naming can be disabled with the /dev/null symlink, and the readlink calls are gone.

(Again, note that this was merely the side issue, not the main boot problem here.)

tags: added: verification-done
removed: verification-needed
Revision history for this message
Breno Leitão (breno-leitao) wrote :

Martin,

Per previous comment, I understand that this bug is still not fixed, correct?

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 1615021] Re: Unable to network boot Ubuntu 16.04 installer normally on Briggs

Breno Leitão [2016-09-12 20:53 -0000]:
> Per previous comment, I understand that this bug is still not fixed,
> correct?

Yes, as it isn't even understood yet.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-09-13 14:16 EDT-------
This bug was opened because of a hang being experienced while booting the Ubuntu 16.04 network installer on Briggs & Stratton machines with their X710 ethernet adapters, using the i40e driver.

During the investigation, a problem/mistake was found in systemd, but it is almost certainly not the cause of the hang. The fixed systemd was supposedly being made available in the xenial-proposed repositories, but so far it does not seem to have appeared there.

This bug was placed in "verify" state and it started causing email to be sent several times a day reminding me to verify the fix. Since we don't believe that the "fix to systemd" will fix the hang during the installer boot, and since this new systemd has not been pushed out to the xenial-proposed installer after 6 days, I have taken this bug out of "verify" state by re-opening it.

When there actually is something to be tested, and it has made its way into the xenial-proposed installer, then this bug can be set back to "verify" and I will test the fix.

------- Comment From <email address hidden> 2016-09-13 14:18 EDT-------
I should also amend my previous comment by saying that if Canonical has suggestions on how to gather more information to help debug this, they should let us know and we can make test runs for them.

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 1615021] Comment bridged from LTC Bugzilla

On Tue, Sep 13, 2016 at 06:20:49PM -0000, bugproxy wrote:
> During investigation, a problem/mistake was found with systemd but is
> almost-certainly not the cause of the hang.

Agreed.

> This fixed systemd was supposedly being made available in xenial-proposed
> repositories, but so far does not seem to have appeared there.

The systemd package is present in the xenial-proposed repository, but no
updated installer image has yet been produced that includes it.

We have had sufficient verification of the systemd change that it will be
released to xenial users for the general problem; we will also update the
debian-installer images as a matter of course.

Based on the feedback from <email address hidden>, it does not appear that the
buggy udev rule is blocking progress on this bug.

> This bug was placed in "verify" state and it started causing email to be
> sent several times a day reminding me to verify the fix.

I don't know why this would be. Our process generates a single message to
the bug when a package is accepted into the -proposed repository, it does
not send daily reminder messages.

> ------- Comment From <email address hidden> 2016-09-13 14:18 EDT-------
> I should also amend my previous comment by saying, if Canonical has some
> suggestions of how to gather more information in order to help debug this,
> they should let us know and we can make test runs for them.

My previous suggestion to gpiccoli on IRC was to modify the initrd to dump
the state of the udev database at a point after the hang. I haven't seen
such output attached here; does that mean it's not possible to produce such
results because the kernel hard locks? Currently the only debugging
information I've seen is that the /lib/debian-installer/start-udev script
never returns, but that does not mean the kernel has locked up - it only
shows that udev believes it has not finished processing. I would still like
to see a dump of the udev database at the point of the hang, not just a udev
debug log showing processing up to that point.

Is this problem only reproducible with the X710 ethernet adapter? Is this a
removable ethernet adapter, and have you tested what happens if it's
removed? If it's not removable, have you tested what happens if you
blacklist the i40e driver? The ethernet driver may be a complete red
herring, and the problem may be with something that normally happens after
ethernet driver initialization rather than with the ethernet driver itself.

I would also have asked whether this could be an issue with the console
output being redirected to some different device, but since Guilherme
indicated that the problem appeared to be racy, with boot to the installer
sometimes succeeding, that seems unlikely to be the problem.

If you can reproduce this problem with the cloud image from
<http://cloud-images.ubuntu.com/xenial/current/xenial-server-cloudimg-ppc64el-disk1.img>,
that would present additional debugging opportunities since that uses a
standard Ubuntu initramfs instead of the installer initramfs and will
support various 'break=' options to interrupt the boot and introspect the
system state.

Revision history for this message
Martin Pitt (pitti) wrote : Update Released

The verification of the Stable Release Update for systemd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 229-4ubuntu8

---------------
systemd (229-4ubuntu8) xenial-proposed; urgency=medium

  * Queue loading transient units after setting their properties. Fixes
    starting VMs with libvirt. (LP: #1529079)
  * Connect pid1's stdin/out/err fds to /dev/null also for containers. This
    fixes generators which expect a valid stdout/err fd in some container
    technologies. (LP: #1608953)
  * 73-usb-net-by-mac.rules: Do not run readlink for *every* uevent, and
    merely check if /etc/udev/rules.d/80-net-setup-link.rules exists.
    A common way to disable an udev rule is to just "touch" it in
    /etc/udev/rule.d/ (i. e. empty file), and if the rule is customized we
    cannot really predict anyway if the user wants MAC-based USB net names or
    not. (LP: #1615021)
  * systemd-networkd-resolvconf-update.service: Also pick up DNS servers from
    individual link leases, as they sometimes don't appear in the global
    ifstate. (LP: #1620559)

 -- Martin Pitt <email address hidden> Tue, 06 Sep 2016 14:16:29 +0200

Changed in systemd (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-09-15 17:13 EDT-------
> On Tue, Sep 13, 2016 at 06:20:49PM -0000, bugproxy wrote:
[...]
> Based on the feedback from <email address hidden>, it does not appear that the
> buggy udev rule is blocking progress on this bug.
>
[...]
> > I should also amend my previous comment by saying, if Canonical has some
> > suggestions of how to gather more information in order to help debug this,
> > they should let us know and we can make test runs for them.
>
> My previous suggestion to gpiccoli on IRC was to modify the initrd to dump
> the state of the udev database at a point after the hang. I haven't seen
> such output attached here; does that mean it's not possible to produce such
> results because the kernel hard locks? Currently the only debugging
> information I've seen is that the /lib/debian-installer/start-udev script
> never returns, but that does not mean the kernel has locked up - it only
> shows that udev believes it has not finished processing. I would still like
> to see a dump of the udev database at the point of the hang, not just a udev
> debug log showing processing up to that point.
>
> Is this problem only reproducible with the X710 ethernet adapter? Is this a
> removable ethernet adapter, and have you tested what happens if it's
> removed? If it's not removable, have you tested what happens if you
> blacklist the i40e driver? The ethernet driver may be a complete red
> herring, and the problem may be with something that normally happens after
> ethernet driver initialization rather than with the ethernet driver itself.
>
> I would also have asked whether this could be an issue with the console
> output being redirected to some different device, but since Guilherme
> indicated that the problem appeared to be racy, with boot to the installer
> sometimes succeeding, that seems unlikely to be the problem.
>
> If you can reproduce this problem with the cloud image from
> <http://cloud-images.ubuntu.com/xenial/current/xenial-server-cloudimg-
> ppc64el-disk1.img>,
> that would present additional debugging opportunities since that uses a
> standard Ubuntu initramfs instead of the installer initramfs and will
> support various 'break=' options to interrupt the boot and introspect the
> system state.

Vorlon, thanks very much for your assistance. In fact, your ideas were useful and we tried many of them. And finally we seem to have figured out what's going on hehehe

Firstly, our failed attempts:

i) "udevadm info -e" was impossible to run in a bad boot, because even if I try to run it as one of the first things in init, the system still seems to hang.

ii) Adding a modprobe blacklist for any driver makes things work. In fact, I added the command-line option "vorlon" and it worked too hehehe

iii) I wasn't able to test this cloud image. I have never installed it before; is it a complete, functional image? I wondered if it needs to be written directly to the disk, perhaps...

Anyway, after all the analysis we finally observed something important: by passing any command-line option we ended up overwriting the default cmdline, and that was the reason _any_ command-line option worked.
Now, the default cmdline was: "console=hvc0 co...


Revision history for this message
Martin Pitt (pitti) wrote :

gpiccoli, great finding! Indeed it seems debian-installer interprets console= arguments and prefers the *last* one:

rootskel-1.115ubuntu1/src/sbin/reopen-console-linux:

        if [ -z "$console" ]; then
                # Locate the last enabled console present on the command line
                for arg in $(cat /proc/cmdline); do
                        case $arg in
                            console=*)
                                arg=${arg#console=}
                                cons=${arg%%,*}
                                if echo "$consoles" | grep -q "^$cons$"; then
                                        console=$cons
                                fi
                                ;;
                        esac
                done
        fi

        if [ -z "$console" ]; then
                # Still nothing? Default to /dev/console.
                console=console
        fi

Revision history for this message
Martin Pitt (pitti) wrote :

That d-i behaviour agrees with how the kernel interprets those: https://www.kernel.org/doc/Documentation/serial-console.txt -- i. e. kernel messages appear on all "console="s, but the last one defines what /dev/console points to.
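
The "last console= wins" selection can be sketched as a small shell function. This is a simplified illustration, not the d-i code: the name `last_console` is hypothetical, the command line is passed as an argument instead of being read from /proc/cmdline, and the check against the list of enabled consoles that reopen-console-linux performs is left out.

```shell
#!/bin/sh
# Hypothetical helper: print the device named by the last console= argument
# of a kernel command line, or "console" when there is none.
last_console() {
    console=
    for arg in $1; do
        case $arg in
            console=*)
                arg=${arg#console=}
                # Drop trailing options such as ",115200n8".
                console=${arg%%,*}
                ;;
        esac
    done
    echo "${console:-console}"
}

last_console "console=hvc0 console=tty0 quiet"        # prints: tty0
last_console "console=tty0 console=ttyS0,115200n8"    # prints: ttyS0
last_console "quiet splash"                           # prints: console
```

Swapping the order of two console= arguments changes the result, which is why the order on the kernel command line matters for where the installer UI appears.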

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-09-19 15:26 EDT-------
Very nice pitti, thanks for the clarification. And thanks a lot for your effort in this bug.

There seems to be a request/RFC to add GNU screen support to debian-installer, so it can show the menu in multiple terminals [1] [2]. It would be a great addition to the Ubuntu installer as well, especially since it would allow a fully functional console to be open while the installer is running, allowing quick debugging.

We will close this bugzilla/LP now, since it's not a bug. I also added some documentation [3] about the issue to help customers and anyone else who might face this situation.

Thanks,

Guilherme

---
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=819988
[2] https://lists.debian.org/debian-boot/2016/08/msg00056.html

[3] https://wiki.ubuntu.com/ppc64el/Recommendations#Netboot_installation_over_IPMI

Revision history for this message
Martin Pitt (pitti) wrote :

Thanks for the followup. Closing the debian-installer part now. The work on the Debian side wrt. multiplexing the installer UI to multiple consoles sounds interesting, but I don't think we should hold our breath for it -- I think this is going to be tricky given how different the capabilities of VT and serial consoles are wrt. geometry, colors, special chars, etc.

Changed in debian-installer (Ubuntu Yakkety):
status: Triaged → Invalid
Changed in debian-installer (Ubuntu Xenial):
status: Triaged → Invalid
Revision history for this message
Peter Maydell (pmaydell) wrote :

Just a note that the udev rules change from comment 6 seems to be necessary to reliably get an image booted under QEMU to bring up a getty on the serial console. What seems to happen without it is that udevd spends all its time running copies of 'readlink', and it doesn't get around to telling systemd about the presence of ttyAMA0 until after systemd's 1m30 timeout has expired and it gives up, reporting "Timed out waiting for device dev-ttyAMA0.device". (This happens most of the time on an emulated QEMU CPU and at least occasionally on one running with single-vcpu KVM, probably dependent on speed of the host hardware.)
