219-1ubuntu1 regression: boot hangs, logind fails

Bug #1423811 reported by Martin Pitt
58
This bug affects 11 people
Affects Status Importance Assigned to Milestone
systemd
Fix Released
Undecided
Martin Pitt
systemd (Debian)
Fix Released
Unknown
systemd (Fedora)
Won't Fix
High
systemd (Ubuntu)
Fix Released
Critical
Martin Pitt

Bug Description

Since yesterday, systemd does not boot properly any more. It takes very long, ends up in an X failsafe session, and eventually you just get a getty on VT1.

Downgrading libpam-systemd libsystemd0 systemd systemd-sysv to https://launchpad.net/ubuntu/+source/systemd/218-10ubuntu2 fixes the boot again.

This happens both with the standard "quiet splash $vt_handoff" as well as without these tree options, i. e. text mode boot.

$ sudo systemctl list-jobs
 JOB UNIT TYPE STATE
1634 sound.target stop waiting
1176 NetworkManager.service start running
1598 cgproxy.service start waiting
1632 failsafe-graphical.target stop waiting
1563 alsa-restore.service start waiting
1594 plymouth-quit-wait.service start waiting
1588 getty-static.service start waiting
1562 alsa-state.service start waiting
 121 systemd-update-utmp-runlevel.service start waiting
1640 <email address hidden> stop waiting
1544 systemd-ask-password-plymouth.path start waiting
1507 friendly-recovery.service start waiting
1607 anacron.service start waiting
1489 systemd-binfmt.service start waiting
1637 system-systemd\x2drfkill.slice stop waiting
1636 <email address hidden> stop waiting
1631 failsafe-x.service stop waiting
1638 systemd-backlight@backlight:intel_backlight.service stop waiting
1610 pppd-dns.service start waiting
1483 systemd-machine-id-commit.service start waiting
1635 system-systemd\x2dbacklight.slice stop waiting
1557 systemd-hwdb-update.service start waiting
1630 systemd-backlight@backlight:acpi_video0.service stop waiting
1500 sys-kernel-config.mount start waiting
  92 multi-user.target start waiting
1509 debian-fixup.service start waiting
1641 <email address hidden> stop waiting
1613 plymouth-quit.service start waiting
1639 system-ifup.slice stop waiting
1642 <email address hidden> stop waiting
1633 acpid.service stop waiting
1643 systemd-logind.service restart waiting
1543 plymouth-start.service start waiting
1499 plymouth-read-write.service start waiting

Attaching debug journal.

ProblemType: Bug
DistroRelease: Ubuntu 15.04
Package: systemd 219-1ubuntu1
ProcVersionSignature: Ubuntu 3.18.0-13.14-generic 3.18.5
Uname: Linux 3.18.0-13-generic x86_64
ApportVersion: 2.16.1-0ubuntu2
Architecture: amd64
CurrentDesktop: Unity
Date: Fri Feb 20 07:39:03 2015
EcryptfsInUse: Yes
InstallationDate: Installed on 2014-11-20 (91 days ago)
InstallationMedia: Ubuntu 15.04 "Vivid Vervet" - Alpha amd64 (20141119)
MachineType: LENOVO 2324CTO
ProcKernelCmdLine: BOOT_IMAGE=/@/boot/vmlinuz-3.18.0-13-generic.efi.signed root=UUID=f86539b0-3a1b-4372-83b0-acdd029ade68 ro rootflags=subvol=@ quiet splash vt.handoff=7 init=/sbin/upstart
SourcePackage: systemd
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/09/2013
dmi.bios.vendor: LENOVO
dmi.bios.version: G2ET95WW (2.55 )
dmi.board.asset.tag: Not Available
dmi.board.name: 2324CTO
dmi.board.vendor: LENOVO
dmi.board.version: 0B98401 Pro
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvrG2ET95WW(2.55):bd07/09/2013:svnLENOVO:pn2324CTO:pvrThinkPadX230:rvnLENOVO:rn2324CTO:rvr0B98401Pro:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 2324CTO
dmi.product.version: ThinkPad X230
dmi.sys.vendor: LENOVO

Revision history for this message
Martin Pitt (pitti) wrote :
Changed in systemd (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
importance: Undecided → Critical
status: New → In Progress
milestone: none → ubuntu-15.02
Martin Pitt (pitti)
description: updated
description: updated
Revision history for this message
Martin Pitt (pitti) wrote :

Notekeeping:

- installing self-built .debs: boot ok
- reinstalling ubuntu debs: boot fail
- Removing persistant journal (sudo rm -r /var/log/journal) with the ubuntu binaries: boot ok
- rebooting a few times with re-enabling persistant journal: boot ok
- QEMU with enabling persistant journal and booting a few times: bootok
- Didier has no persistant journal and got the issue once, so this only appears to change timing, not a fundamental thing

So now I need to figure out how to make boot fail again.

Revision history for this message
Martin Pitt (pitti) wrote :

I can reproduce this quite reliably with:

  sudo rm -r /var/log/journal
  sudo mkdir /var/log/journal
  sudo apt-get install --reinstall systemd

I now managed to reproduce the hang in a VM as well, with these commands.

Revision history for this message
Martin Pitt (pitti) wrote :

I continued to bisect binaries (/lib/systemd/{systemd,systemd-logind,systemd-journald,systemd-fsck,systemd-fsckd}.

I never got a hang with 219 + journald from 218, and I did get a hang with 218 + journald from 219, which indicates that this is related to journald. Didier confirmed the latter hang.

Revision history for this message
Martin Pitt (pitti) wrote :

This is an automated reproducer in the form of an autopkgtest. You can call it like this:

  adt-run --built-tree ./systemd-bootsmoke --- qemu /srv/vm/adt-pid1.img

This reboots the VM up to 20 times, and check for stuck jobs or non-running polkit on each iteration. You can also test with modified binaries by copying them into the testbed's /tmp/systemd-replace, like this:

  adt-run --built-tree ./systemd-bootsmoke --copy /tmp/systemd-journald:/tmp/systemd-replace/systemd-journald --- ....

With e. g. systemd-journald from 218 the test passes. Note that there is sometimes some testbed timeout and other testbed setup failures, due to various race conditions.

I added a few robustifications to autopkgtest 3.9.7 (just uploaded to Debian sid, will sync into vivid this evening). You can of course also just checkout latest autopkgtest git and call /checkout/dir/run-from-checkout instead of "adt-run".

In principle this works with a standard adt-buildvm-ubuntu-cloud minimal image (the test installs lightdm, network-manager, etc.), but I still find it rather hard to reproduce there. I instead built a custom "adt-pid1.img" which is a standard vivid desktop install with the following modifications:

  - Prepare the VM for autopkgtest (enable serial console and provide the root shell on ttyS1 mostly) by running /usr/share/autopkgtest/adt-setup-vm in it as root once (you *need* autopkgtest 3.9.7 for this, or grab it from git).

 - Install systemd-sysv

With such an image I usually get the hang after 2 or 3 reboots already, so that the 20 iterations that the test does should be quite sufficient.

I'll add this test to the systemd package soon, but for now this is standalone for easier hacking.

Revision history for this message
Martin Pitt (pitti) wrote :

I'm running "git bisect start v219 v218 -- src/shared src/journal" on upstream trunk now (some 220 commits to test), using the above autopkgtest. I don't use git bisect run, as the autopkgtest sometimes fails with exit code 16 (testbed setup); in those cases it should be re-run to either succeed, or "properly" fail the text (exit code 4).

Revision history for this message
Martin Pitt (pitti) wrote :

Bisect run finished. The culprit is

  http://cgit.freedesktop.org/systemd/systemd/commit/?id=13790add4bf64

This reverts cleanly against master, and running 219 with journald with that commit reverted runs stable.

Revision history for this message
Martin Pitt (pitti) wrote :

This is an improved git bisect script which is now robust against transient testbed failures and "make" build failures. This is now suitable for a fully automated "git bisect run ./systemd-bisect-run.sh", and was able to identify the regression in < 2 h fully noninteractively.

This is now mostly of historical interest of course, but it'll provide a nice starting point for the next regression.

Revision history for this message
Martin Pitt (pitti) wrote :
Martin Pitt (pitti)
Changed in systemd (Ubuntu):
status: In Progress → Fix Committed
Changed in systemd (Debian):
status: Unknown → Confirmed
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (4.4 KiB)

This bug was fixed in the package systemd - 219-3ubuntu1

---------------
systemd (219-3ubuntu1) vivid; urgency=medium

  * Merge with Debian experimental. Remaining Ubuntu changes:
    - Hack to support system-image read-only /etc, and modify files in
      /etc/writable/ instead.
    - Keep our much simpler udev maintainer scripts (all platforms must
      support udev, no debconf).
    - initramfs init-top: Drop $ROOTDELAY, we do that in a more sensible way
      with wait-for-root. Will get applicable to Debian once Debian gets
      wait-for-root in initramfs-tools.
    - initramfs init-bottom: If LVM is installed, settle udev,
      otherwise we get missing LV symlinks. Workaround for LP #1185394.
    - Add debian/udev.lvm2.init: Dummy SysV init script to satisfy insserv
      dependencies to "lvm2" which is handled with udev rules in Ubuntu.
    - Provide shutdown fallback for upstart. (LP: #1370329)
    - debian/extra/ifup@.service: Additionally run for "auto" class. We don't
      really support "allow-hotplug" in Ubuntu at the moment, so we need to
      deal with "auto" devices appearing after "/etc/init.d/networking start"
      already ran. (LP: #1374521) Also, check if devices are actually defined
      in /etc/network/interfaces as we don't use Debian's net.agent.
    - ifup@.service: Drop dependency on networking.service (i. e.
      /etc/init.d/networking), and merely ensure that /run/network exists.
      This avoids unnecessary dependencies/waiting during boot and dependency
      cycles if hooks wait for other interfaces to come up (like ifenslave
      with bonding interfaces). (LP: #1414544)
    - Add Get-RTC-is-in-local-time-setting-from-etc-default-rc.patch: In
      Ubuntu we currently keep the setting whether the RTC is in local or UTC
      time in /etc/default/rcS "UTC=yes|no", instead of /etc/adjtime.
      (LP: #1377258)
    - Put session scopes into all cgroup controllers. This makes unprivileged
      user LXC containers work under systemd. (LP: #1346734)
    - Lower Breaks: to plymouth version which has the udev inotify fix in
      Ubuntu.
    - Lower libappamor1 dep to the Ubuntu version where it moved to /lib.
    - Make failure of boot-and-services NSpawn.test_boot non-fatal for now.
      This currently fails when being triggered by Jenkins, but is totally
      unreproducible when running this manually on the exact same machine.

    Upgrade fixes, keep until 16.04 LTS release:
    - systemd Conflicts/Replaces/Provides systemd-services.
    - Remove obsolete systemd-logind upstart job.
    - Clean up obsolete /etc/udev/rules.d/README.

systemd (219-3) experimental; urgency=medium

  * sysv-generator: fix wrong "Overwriting existing symlink" warnings.
    (Closes: #778700)
  * Add systemd-fsckd multiplexer and feed its output to plymouth. This
    provides an aggregate progress report of running file system checks and
    also allows cancelling them with ^C, in both text mode and Plymouth.
    (Closes: #775093, #758902; LP: #1316796)
  * Revert "journald: allow restarting journald without losing stream
    connections". This was a new feature in 219, but currently causes boot
    failures due to logind an...

Read more...

Changed in systemd (Ubuntu):
status: Fix Committed → Fix Released
Changed in systemd (Debian):
status: Confirmed → Fix Released
Martin Pitt (pitti)
Changed in systemd:
assignee: nobody → Martin Pitt (pitti)
Revision history for this message
Martin Pitt (pitti) wrote :

I'm getting a similar (perhaps/hopefully the same) hang if I do this:

systemctl mask --runtime systemd-journald systemd-logind; systemctl stop systemd-journald systemd-logind

SYSTEMD_LOG_LEVEL=debug /lib/systemd/systemd-journald
   # restart 5 times
SYSTEMD_LOG_LEVEL=debug /lib/systemd/systemd-logind
   # restart a few times

After a few restarts, logind fails to start with the 25s D-Bus timeout and the error message

Failed to add match for NameOwnerChanged: Connection timed out
Failed to fully start up daemon: Connection timed out

From then on it keeps failing. This can be reset by kill -9 the system dbus-daemon. systemd will auto-restart it, and then logind can start again.

Revision history for this message
Martin Pitt (pitti) wrote :

I don't get the hang with the reproducer in comment 11 on Fedora 21 with systemd 219 and dbus 1.8.16 from rawhide.

Revision history for this message
In , Adam (adam-redhat-bugs) wrote :

Created attachment 1001544
failure log

Since around March 6th, my Fedora 22 desktop system has frequently failed to boot properly.

Unfortunately the last time I booted the system before that was February 23rd, so there's quite a big list of possible changes in there.

I have tried downgrading from systemd-219-8.fc22 to systemd-219-5.fc22 and from kernel-4.0.0-0.rc3.git1.2 to 3.20.0-0.rc0.git10.1.fc22; neither seems to change anything, the bug still happens.

When the boot fails, various services don't attempt to start for some time, and when they try and start, many fail. The system never manages to start any consoles or gdm.

From the debug console (tty9), running 'systemctl' appears to do nothing for a long time, then shows an error like "Failed to register match for disconnected message".

I will attach logs of both failed and successful boots with systemd.log_level = debug.

The boot fails on average ~3 in 4 times.

Revision history for this message
In , Adam (adam-redhat-bugs) wrote :

Created attachment 1001545
success log

Revision history for this message
In , Zbigniew (zbigniew-redhat-bugs) wrote :

In the upcoming systemd update there's a fix for socket path handling. I'm not sure if this could be the same issue, but it'd be good if you could test if it helps.

Revision history for this message
In , Fedora (fedora-redhat-bugs) wrote :

systemd-219-9.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/systemd-219-9.fc22

Revision history for this message
In , Adam (adam-redhat-bugs) wrote :

Nope, sorry, doesn't help :/ saw the bug on 2 of 3 test boots with 219-9.

Revision history for this message
In , Fedora (fedora-redhat-bugs) wrote :

Package systemd-219-9.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing systemd-219-9.fc22'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-3949/systemd-219-9.fc22
then log in and leave karma (feedback).

Revision history for this message
In , Lukáš (luk-redhat-bugs) wrote :

This is definitely a bug in 219. I have the same issue with my rhel build.
Unfortunately on my machine it occurs quite rarely so its hard to debug.

Revision history for this message
In , Marek (marek-redhat-bugs) wrote :

I see similar problem, my case is described in bug 1204023.

Revision history for this message
In , Adam (adam-redhat-bugs) wrote :

Lukas: it occurs frequently for me, I'm happy to do any debugging anyone requests, but I'd really like to see this fixed. It's kind of unfortunate that my system neither starts up nor shuts down properly most of the time! (I suffer from one or other of the bugs in the disaster zone that is https://bugzilla.redhat.com/show_bug.cgi?id=1088619 ).

Revision history for this message
In , Michal (michal-redhat-bugs) wrote :

Just guessing... This could be the same problem:
http://lists.freedesktop.org/archives/systemd-devel/2015-February/028640.html
Martin Pitt is reverting commit 13790add4bf648fed816361794d8277a75253410 in Ubuntu as a workaround.

Revision history for this message
In , Adam (adam-redhat-bugs) wrote :

I just tested with systemd-219-11.fc22 and 8 boots in a row didn't hit the bug once. It may be a bit early to declare 'mission accomplished', but it looks good.

zbyszek, any chance you could submit an update for 219-11?

Revision history for this message
Adam Williamson (awilliamson) wrote :

FWIW I was seeing something that sounds very similar on F22:

https://bugzilla.redhat.com/show_bug.cgi?id=1201964

but it *seems* to have gone away with systemd-219-11.fc22, which is basically the same as systemd 'v219-stable' - http://cgit.freedesktop.org/systemd/systemd-stable/log/?h=v219-stable - at 85a6fabdd3e43cfab0fc6359e9f2a9e368d4a3ed , "core/namespace: fix path sorting". Seems like something fixed it for me.

219-9.fc22 was definitely broken for me; that build had 'v219-stable' as of commit 4acdc3835b2c9d395f1df95bb1dd5620a4a0e7e5 , "service: don't add After= dependencies on .busname units if kdbus support is disabled". So it seems like something between two those commits on v219-stable fixed it for me, assuming it stays fixed.

Revision history for this message
Martin Pitt (pitti) wrote :

I just applied all v219-stable patches on top of our package, and dropped the reverted patch. I now get the boot failures again, so I'm afraid that wasn't it.

Revision history for this message
In , Martin (martin-redhat-bugs) wrote :

I just applied all v219-stable patches on top of our package, and re-applied 13790a again (i. e. dropped the revert). I now get the boot failures again, so I'm afraid the patches in v219-stable aren't sufficient.

Revision history for this message
In , Adam (adam-redhat-bugs) wrote :

Hum, interesting. So either I got really lucky yesterday, or we're actually seeing two different bugs, or (I suppose) latest stable changes things enough to hide whatever's triggering the bug when I boot my system, but not to fix your reproducer...

Revision history for this message
Martin Pitt (pitti) wrote :

Also tracked in Arch Linux: https://bugs.archlinux.org/task/44016

Revision history for this message
Martin Pitt (pitti) wrote :

I cannot reproduce this with current upstream git master any more, so I ran another bisect. Apparently http://cgit.freedesktop.org/systemd/systemd/commit/?id=64144440 fixes this, so I guess the regression was introduced in http://cgit.freedesktop.org/systemd/systemd/commit/?id=a354329f72 and the reverted commit http://cgit.freedesktop.org/systemd/systemd/commit/?id=13790add4bf64 merely exposed the bug as journald was (one of the) first user of fd storage.

http://cgit.freedesktop.org/systemd/systemd/commit/?id=64144440 applies cleanly on top of v219-stable, so I replaced our reverted patch with a backport of that fix:

  http://anonscm.debian.org/cgit/pkg-systemd/systemd.git/commit/?h=experimental&id=894d0f2d21d8ed18

Now the boot-smoke test still succeeds with that package.

Changed in systemd:
status: New → Fix Released
Revision history for this message
In , Martin (martin-redhat-bugs) wrote :

Good news! It seems this got finally fixed (involuntarily) in upstream git master, and the patch cleanly backports to 219. See

  http://lists.freedesktop.org/archives/systemd-devel/2015-April/031348.html

for the details.

Revision history for this message
In , Adam (adam-redhat-bugs) wrote :

+1 resurrection of kittens

Revision history for this message
In , Ferry (ferry-redhat-bugs) wrote :

Guys,
Is this already fixed in F22?

I came here via a search on 'fedora 22 avahi-daemon fails'.
avahi-daemon fails to start on one of my servers, but not on others

Revision history for this message
In , Ferry (ferry-redhat-bugs) wrote :

(In reply to Ferry Huberts from comment #15)
> Guys,
> Is this already fixed in F22?
>
> I came here via a search on 'fedora 22 avahi-daemon fails'.
> avahi-daemon fails to start on one of my servers, but not on others

Nevermind, this was an install failure for the avahi package: the avahi user was not created.

I've seen this with the tcpdump packages as well.
I think there might be some rpm/dnf issue there.

Revision history for this message
quequotion (quequotion) wrote :

I am very, very painfully going through this process right now.

I REALLY need dpkg to stop trying to remove colord and network-manager (dependent on policykit-1). It would also help if Ubuntu Web Browser were a little more responsive.. Took 10 minutes to type this far.... probably unrelated...

Is the reboot absolutely necessary? I don't think my system will come back.

Revision history for this message
eiro (eiro1980) wrote :

I'm on Ubuntu 15.10 and still have the same problem:

Feb 10 09:07:09 DELL-PC systemd-logind[710]: Failed to add match for JobRemoved: Connection timed out
Feb 10 09:07:09 DELL-PC systemd-logind[710]: Failed to fully start up daemon: Connection timed out
Feb 10 09:07:35 DELL-PC dbus[729]: [system] Failed to activate service 'org.freedesktop.PolicyKit1': timed out
Feb 10 09:07:35 DELL-PC dbus[729]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Feb 10 09:07:35 DELL-PC systemd-logind[3237]: Failed to enable subscription: Failed to activate service 'org.freedesktop.systemd1': timed out
Feb 10 09:07:35 DELL-PC systemd-logind[3237]: Failed to fully start up daemon: Connection timed out
Feb 10 09:07:36 DELL-PC dbus[729]: [system] Failed to activate service 'org.freedesktop.ColorManager': timed out
Feb 10 09:07:36 DELL-PC dbus[729]: [system] Failed to activate service 'org.freedesktop.hostname1': timed out
Feb 10 09:07:45 DELL-PC dbus[729]: [system] Failed to activate service 'fi.w1.wpa_supplicant1': timed out
Feb 10 09:08:10 DELL-PC dbus[729]: [system] Failed to activate service 'fi.w1.wpa_supplicant1': timed out
Feb 10 09:08:10 DELL-PC dbus[729]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Feb 10 09:10:47 DELL-PC dbus[765]: [system] Failed to activate service 'org.freedesktop.ColorManager': timed out
Feb 10 09:10:47 DELL-PC systemd-logind[1919]: Failed to enable subscription: Connection timed out
Feb 10 09:10:47 DELL-PC dbus[765]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Feb 10 09:10:47 DELL-PC systemd-logind[1919]: Failed to fully start up daemon: Connection timed out
Feb 10 09:10:47 DELL-PC dbus[765]: [system] Failed to activate service 'org.freedesktop.PolicyKit1': timed out
Feb 10 09:10:48 DELL-PC dbus[765]: [system] Failed to activate service 'org.freedesktop.hostname1': timed out
Feb 10 09:10:56 DELL-PC dbus[765]: [system] Failed to activate service 'fi.w1.wpa_supplicant1': timed out
Feb 10 09:11:21 DELL-PC dbus[765]: [system] Failed to activate service 'fi.w1.wpa_supplicant1': timed out
Feb 10 09:11:21 DELL-PC dbus[765]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out

Revision history for this message
In , Fedora (fedora-redhat-bugs) wrote :

Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Changed in systemd (Fedora):
importance: Unknown → High
status: Unknown → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.