Reloading services can result in a deadlock under systemd

Bug #1417010 reported by Tim Lunn
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
samba (Debian)
Fix Released
Unknown
systemd (Debian)
Fix Released
Unknown
systemd (Ubuntu)
Fix Released
High
Martin Pitt
sysvinit (Debian)
Fix Released
Unknown
sysvinit (Ubuntu)
Fix Released
Medium
Martin Pitt

Bug Description

since today (although haven't rebooted in a few days), systemd init is hanging trying to bring up ifup@eth0. I had to revert to upstart boot to get in, although sometimes it would successfully boot with systemd.debug-shell

JOB UNIT TYPE STATE
120 winbind.service start waiting
175 nmbd.service start waiting
159 libvirt-bin.service start waiting
151 dns-clean.service start waiting
116 vmware-workstation-server.service start waiting
177 smbd.service start waiting
178 fail2ban.service start waiting
331 rpcbind.service start waiting
109 apache2.service start waiting
124 openvpn.service start waiting
184 x-display-manager.target start waiting
173 vmware.service start waiting
143 <email address hidden> start waiting
141 samba-ad-dc.service start waiting
  2 multi-user.target start waiting
332 rpcbind.target start waiting
108 hddtemp.service start waiting
172 plymouth-quit-wait.service start waiting
152 autofs.service start waiting
123 schroot.service start waiting
130 mail-transport-agent.target start waiting
129 postfix.service start waiting
128 isc-dhcp-server.service start waiting
131 mysql.service start waiting
142 getty.target start waiting
164 sabnzbdplus.service start waiting
  1 graphical.target start waiting
176 apt-cacher-ng.service start waiting
183 gdm.service start waiting
174 ssh.service start waiting
 77 network-online.target start waiting
112 systemd-update-utmp-runlevel.service start waiting
133 urfkill.service start waiting
111 kerneloops.service start waiting
161 rc-local.service start waiting
106 lxc.service start waiting
107 lxc-net.service start waiting
155 virtualbox.service start waiting
191 <email address hidden> start running
 50 network.target start waiting

40 jobs listed.
* <email address hidden> - ifup for eth0
   Loaded: loaded (/lib/systemd/system/ifup@.service; static; vendor preset: enabled)
   Active: activating (start-post) since Mon 2015-02-02 19:17:25 AEDT; 9min ago
  Process: 800 ExecStart=/sbin/ifup --allow=hotplug %I (code=exited, status=0/SUCCESS)
 Main PID: 800 (code=exited, status=0/SUCCESS); : 803 (ifup)
   CGroup: /<email address hidden>
           `-control
             |- 803 /sbin/ifup --allow=auto eth0
             |- 832 /bin/sh -c dhclient -1 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0
             |- 833 dhclient -1 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0
             |-1269 /bin/sh /sbin/dhclient-script
             |-1273 /bin/sh /usr/sbin/invoke-rc.d smbd reload
             `-1294 systemctl reload smbd.service

Feb 02 19:17:25 duhast dhclient[833]: Sending on Socket/fallback
Feb 02 19:17:25 duhast dhclient[833]: DHCPREQUEST of 192.168.15.67 on eth0 to 255.255.255.255 port 67 (xid=0x502572c4)
Feb 02 19:17:25 duhast ifup[803]: Listening on LPF/eth0/1c:6f:65:c4:63:1b
Feb 02 19:17:25 duhast ifup[803]: Sending on LPF/eth0/1c:6f:65:c4:63:1b
Feb 02 19:17:25 duhast ifup[803]: Sending on Socket/fallback
Feb 02 19:17:25 duhast ifup[803]: DHCPREQUEST of 192.168.15.67 on eth0 to 255.255.255.255 port 67 (xid=0x502572c4)
Feb 02 19:17:28 duhast dhclient[833]: DHCPREQUEST of 192.168.15.67 on eth0 to 255.255.255.255 port 67 (xid=0x502572c4)
Feb 02 19:17:28 duhast ifup[803]: DHCPREQUEST of 192.168.15.67 on eth0 to 255.255.255.255 port 67 (xid=0x502572c4)
Feb 02 19:17:28 duhast dhclient[833]: DHCPACK of 192.168.15.67 from 192.168.15.1
Feb 02 19:17:28 duhast ifup[803]: DHCPACK of 192.168.15.67 from 192.168.15.1

ProblemType: Bug
DistroRelease: Ubuntu 15.04
Package: systemd 218-6ubuntu1
ProcVersionSignature: Ubuntu 3.18.0-12.13-generic 3.18.4
Uname: Linux 3.18.0-12-generic x86_64
NonfreeKernelModules: nvidia
ApportVersion: 2.15.1-0ubuntu4
Architecture: amd64
CurrentDesktop: GNOME
Date: Mon Feb 2 19:47:21 2015
InstallationDate: Installed on 2012-09-23 (862 days ago)
InstallationMedia: Ubuntu GNOME Remix 12.10 "Quantal Quetzal" - Alpha amd64(20120922)
MachineType: Gigabyte Technology Co., Ltd. P67A-UD3R-B3
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.18.0-12-generic root=UUID=71eeb904-068b-41c2-9d34-59365ca4c3ad ro quiet splash
SourcePackage: systemd
SystemdDelta:
 [EXTENDED] /lib/systemd/system/graphical.target → /lib/systemd/system/graphical.target.d/xdiagnose.conf

 1 overridden configuration files found.
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 03/21/2012
dmi.bios.vendor: Award Software International, Inc.
dmi.bios.version: F9
dmi.board.name: P67A-UD3R-B3
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: x.x
dmi.chassis.type: 3
dmi.chassis.vendor: Gigabyte Technology Co., Ltd.
dmi.modalias: dmi:bvnAwardSoftwareInternational,Inc.:bvrF9:bd03/21/2012:svnGigabyteTechnologyCo.,Ltd.:pnP67A-UD3R-B3:pvr:rvnGigabyteTechnologyCo.,Ltd.:rnP67A-UD3R-B3:rvrx.x:cvnGigabyteTechnologyCo.,Ltd.:ct3:cvr:
dmi.product.name: P67A-UD3R-B3
dmi.sys.vendor: Gigabyte Technology Co., Ltd.

Revision history for this message
Tim Lunn (darkxst) wrote :
Revision history for this message
Martin Pitt (pitti) wrote :

It seems this is a dependency loop:

 (1) network-online.target waits for <email address hidden> as it's in /etc/network/interfaces
 (2) <email address hidden> starts during boot, triggered by udev; through samba's smbd dhclient-enter hook it calls invoke-rc.d smbd reload
 (3) /etc/init.d/smbd has "Requires: $network"
 (4) the SysV generator translates $network to network-online.target

We really want (1), (2) and (3) are justified. According to http://refspecs.linuxbase.org/LSB_4.1.0/LSB-Core-generic/LSB-Core-generic/facilname.html, $network should be mapped to network.service, not network-online.service. If others disagree and $network should have the "strong" is-online meaning, we need to fix it at (3) instead, e. g. by adding a proper systemd unit to samba. (We should really teach socket activation to stuff..)

Changed in systemd (Ubuntu):
importance: Undecided → High
status: New → Triaged
Revision history for this message
Tim Lunn (darkxst) wrote :

journalctl debug logs from hang

Revision history for this message
Martin Pitt (pitti) wrote :

Keeping notes:

> 50 network.target start waiting

This is another bug, because ifup@.service has Before=network.target. That's too strict, and should be Before=network-online.target.

Revision history for this message
Martin Pitt (pitti) wrote :

I was able to fudge a similar situation in a VM. Could you please try and edit /etc/dhcp/dhclient-enter-hooks.d/samba to append a "||true" in the third last line where it calls invoke-rc.d smbd reload?

For reference: While smbd is not yet started (as it waits for $network) this will fail with code 1 as the unit can't be reloaded yet.

Revision history for this message
Martin Pitt (pitti) wrote :

Further explaining: ... and if the DHCP enter hook fails, then the whole DHCP request fails, this ifup@eth0 never succeeds.

Revision history for this message
Tim Lunn (darkxst) wrote :

Martin, that doesn't work either since the invoke-rc.d call never actually returns

Revision history for this message
Martin Pitt (pitti) wrote :

> since the invoke-rc.d call never actually returns

Incidentally this is being discussed upstream at http://lists.freedesktop.org/archives/systemd-devel/2015-February/027877.html at the moment.

Can you please try to patch /usr/sbin/invoke-rc.d with this: http://paste.ubuntu.com/10048858/
This isn't a complete or completely correct fix, but confirming that it works would be appreciated.

Thanks!

Revision history for this message
Martin Pitt (pitti) wrote :

The samba DHCP enter hook needs to be fixed too, so adding a task.

Changed in samba (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
importance: Undecided → Medium
status: New → In Progress
Revision history for this message
Martin Pitt (pitti) wrote :

Keeping notes: This is a well-known property/problem of systemd job transactions. In its simplest form it looks like this:

# cat /etc/systemd/system/foo.service
[Unit]
Description=some foo service
Requires=dep.service
After=dep.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/echo foo is running
ExecReload=/bin/echo reloading foo

# cat /etc/systemd/system/dep.service
[Unit]
Description=dependency of foo.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/bin/sleep 3
ExecStart=/bin/echo dep is running
ExecStartPost=-/bin/systemctl reload foo.service

# systemctl daemon-reload; systemctl start foo.service
→ hangs forever, and systemctl list-jobs shows:

JOB UNIT TYPE STATE
886 dep.service start running
818 foo.service start waiting

In this form it's obviously bogus (nobody would write a unit like this). But it becomes quite common if the "reload" happens through some indirection, like a DHCP enter/exit or an ifupdown hook.

For these situations the reload should happen with --no-block, to not wait until it's finished, but just to enqueue it. While that's easy enough if a systemd unit itself does that, our various hooks didn't take that into account when they were written, and as they usually use "invoke-rc.d" or "service" (both of which don't know about --no-block) we need a more generic solution/workaround.

I proposed some bits on the upstream ML: http://lists.freedesktop.org/archives/systemd-devel/2015-February/027966.html

Revision history for this message
Martin Pitt (pitti) wrote :

I thought I had DHCP fail yesterday on a failing hook, but I cannot reproduce this today. So the || true was apparently a red herring.

I also linked to the Debian bug which describes the same problem, and has a proposed patch. We could apply that as a short-term solution, or try to fix this in systemd properly.

Changed in samba (Ubuntu):
assignee: Martin Pitt (pitti) → nobody
status: In Progress → Triaged
Changed in systemd (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
Martin Pitt (pitti)
summary: - systemd init gets stuck reloading smbd.service
+ Reloading services can result in a deadlock under systemd
Martin Pitt (pitti)
Changed in systemd (Ubuntu):
status: Triaged → In Progress
Changed in samba (Ubuntu):
milestone: none → ubuntu-15.04
milestone: ubuntu-15.04 → none
Changed in systemd (Ubuntu):
milestone: none → ubuntu-15.03
Changed in samba (Debian):
status: Unknown → Fix Committed
Revision history for this message
Martin Pitt (pitti) wrote :

For keeping references: This is another case like https://bugs.debian.org/635777 and https://bugs.debian.org/624599 just a bit more general.

Revision history for this message
Martin Pitt (pitti) wrote :

I pondered this for a while; quite naturally there is no perfect solution as synchronous reload/restart requests in hooks which may be part of a systemd unit transaction are fundamentally against the design of systemd. But this patch now seems to fix this bug and the three referenced Debian bugs, and supersedes Avoid-reload-and-re-start-requests-during-early-boot.patch and Avoid-reloading-services-when-shutting-down.patch, so I think it's better.

Attaching here so that I don't lose them.

Revision history for this message
Martin Pitt (pitti) wrote :

This is another patch I experimented with which helps with a different kind of deadlock problem. But with the previous patch all of the practically reported issues are already covered, so I just keep it here in case we need it in the future.

Revision history for this message
Martin Pitt (pitti) wrote :

Tim: Can you please revert the change to "service" (apt-get install --reinstall sysvinit-utils) and test the systemd in https://launchpad.net/~pitti/+archive/ubuntu/ppa ?

tags: added: patch
Revision history for this message
Martin Pitt (pitti) wrote :

Actually, please forget the PPA; after further discussions with the Debian and upstream systemd maintainers we have settled for a different solution. This now works for me with samba and postfix.

Can you please replace /usr/sbin/service with this script and confirm that it also fixes your boot hang? Thanks!

Revision history for this message
Tim Lunn (darkxst) wrote :

I tested the updated service scripts and indeed does also fix my boot hang!

Revision history for this message
Martin Pitt (pitti) wrote :

As explained in https://bugs.debian.org/777113 I want to fix this in a more generic way in "service", "invoke-rc.d", and /lib/lsb/init-functions.d/40-systemd.

affects: samba (Ubuntu) → sysvinit (Ubuntu)
Changed in sysvinit (Ubuntu):
milestone: none → ubuntu-15.03
assignee: nobody → Martin Pitt (pitti)
Changed in sysvinit (Debian):
status: Unknown → New
Changed in samba (Debian):
status: Fix Committed → New
Changed in systemd (Debian):
status: Unknown → New
Martin Pitt (pitti)
Changed in systemd (Ubuntu):
status: In Progress → Fix Committed
Martin Pitt (pitti)
Changed in sysvinit (Ubuntu):
status: Triaged → Fix Committed
Changed in systemd (Debian):
status: New → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package sysvinit - 2.88dsf-53.2ubuntu8

---------------
sysvinit (2.88dsf-53.2ubuntu8) vivid; urgency=medium

  * service, invoke-rc.d: Avoid deadlocks during bootup and shutdown from
    units/hooks which call "invoke-rc.d service reload" and similar, since the
    synchronous wait plus systemd's normal behaviour of transactionally
    processing all dependencies first easily causes dependency loops. Thus
    during boot/shutdown operate only on the unit and not on its dependencies,
    just like SysV behaves. (Closes: #777113, LP: #1417010)
  * Make sysvinit-utils and sysv-rc break systemd << 215 to ensure we have the
    "systemctl is-system-running" command.
 -- Martin Pitt <email address hidden> Mon, 09 Feb 2015 11:07:10 +0100

Changed in sysvinit (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 218-8ubuntu2

---------------
systemd (218-8ubuntu2) vivid; urgency=medium

  * debian/tests/logind: With dropped systemd-logind-launch we don't have a
    visible /sys/fs/cgroup/systemd/ any more under cgmanager. So adjust the
    test to check /proc/self/cgroup instead.
 -- Martin Pitt <email address hidden> Tue, 10 Feb 2015 10:37:01 +0100

Changed in systemd (Ubuntu):
status: Fix Committed → Fix Released
Changed in systemd (Debian):
status: Fix Committed → Fix Released
Changed in sysvinit (Debian):
status: New → Fix Committed
Changed in sysvinit (Debian):
status: Fix Committed → Fix Released
Changed in samba (Debian):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.