cookie overruns can cause org.freedesktop.systemd1 dbus to hang

Bug #1876600 reported by Heitor Alves de Siqueira on 2020-05-03
This bug affects 1 person
Affects            Status        Importance  Assigned to               Milestone
systemd (Ubuntu)   Fix Released  Undecided   Unassigned
  Xenial           In Progress   High        Heitor Alves de Siqueira
  Bionic           Fix Released  High        Heitor Alves de Siqueira

Bug Description

[Impact]
Long-running services can overflow the sd_bus->cookie counter, which causes further communication with org.freedesktop.systemd1 to stall.

[Description]
Systemd dbus messages include a "cookie" value that uniquely identifies them in their bus context. This value comes from a counter kept in the bus object (sd_bus->cookie), which is incremented for each message sent through that bus object and written into the message header. For services that run for long periods of time and keep communicating through dbus, this counter can overflow, causing further messages to org.freedesktop.systemd1 to fail. These services can then become unresponsive, as they get stuck trying to communicate with invalid bus cookie values.
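
A simplified model of this failure mode is sketched below. It is hypothetical and not the actual sd-bus code. It assumes (consistent with the "Operation not supported" error in the test run further down) that the counter is 64 bits wide, that the classic dbus header only has room for a 32-bit cookie, and that a message whose cookie exceeds 0xffffffff is rejected with -EOPNOTSUPP. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a bus object: only the cookie counter is kept. */
struct model_bus {
        uint64_t cookie; /* last cookie handed out; starts at 0 */
};

/* Assign the next cookie to an outgoing message (pre-fix behavior, as
 * assumed by this sketch). In this model the counter is bumped before the
 * size check, so it keeps growing even after sealing starts to fail.
 * Returns 0 and stores the cookie, or -EOPNOTSUPP when the value no longer
 * fits in the 32-bit header field. */
static int model_seal_old(struct model_bus *b, uint32_t *ret_cookie) {
        assert(b);
        assert(ret_cookie);

        uint64_t cookie = ++b->cookie;
        if (cookie > UINT32_MAX)
                return -EOPNOTSUPP;

        *ret_cookie = (uint32_t) cookie;
        return 0;
}
```

Because the counter only ever grows in this model, every message after the first failure fails as well, which matches the stall described above.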

This issue has been fixed upstream by the commit below:
- sd-bus: deal with cookie overruns (1f82f5bb4237)

$ git describe --contains 1f82f5bb4237
v242-rc1~228

$ rmadison systemd
 systemd | 229-4ubuntu4 | xenial | source, ...
 systemd | 229-4ubuntu21.27 | xenial-security | source, ...
 systemd | 229-4ubuntu21.27 | xenial-updates | source, ...
 systemd | 229-4ubuntu21.28 | xenial-proposed | source, ...
 systemd | 237-3ubuntu10 | bionic | source, ...
 systemd | 237-3ubuntu10.38 | bionic-security | source, ...
 systemd | 237-3ubuntu10.39 | bionic-updates | source, ...
 systemd | 237-3ubuntu10.40 | bionic-proposed | source, ... <----
 systemd | 242-7ubuntu3 | eoan | source, ...

Releases starting with Eoan already have this fix.

[Test Case]
There doesn't seem to be an easy test case for this, as cookie values start at zero and won't overflow until they reach (1<<32). Users have reported hitting this on Kubernetes clusters running continuously for long periods (~5 months).
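
A rough calculation shows why this takes so long to hit in practice. The sketch below assumes a constant message rate on a single bus connection. The example figures are derived from numbers in this report (the test run below reaches cookie 0x10000 after about 16s, i.e. roughly 4096 messages per second, and the ~5 month reports, taken here as about 150 days), not measured independently.

```c
#include <assert.h>

#define COOKIE_SPACE 4294967296.0 /* 2^32 cookie values before overflow */

/* Seconds until a connection sending `rate` messages per second exhausts
 * the 32-bit cookie space, starting from cookie 0 (constant rate assumed). */
static double seconds_to_overflow(double rate) {
        assert(rate > 0.0);
        return COOKIE_SPACE / rate;
}

/* Average message rate needed to overflow within `seconds`. */
static double rate_for_overflow(double seconds) {
        assert(seconds > 0.0);
        return COOKIE_SPACE / seconds;
}
```

At ~4096 messages per second the cookie space lasts 2^20 seconds (about 12 days), while overflowing within ~150 days only takes an average of roughly 330 messages per second.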
Using GDB, we can construct an artificial test case for the cookie overflow. The test case below performs the following steps:

1. Create a new system bus object through sd_bus_default_system()
2. Allocate and append a new method_call message to the bus
3. Send the message through sd_bus_call()
4. Handle the response message and free up the message objects

It's essentially the example code from the sd_bus_message_new_method_call() manpage, with minor modifications: the steps run in a loop, so that the bus cookie value keeps incrementing. We step in with GDB when the cookie reaches 0x10000 and set it to 0xffffff00, which causes the test program to fail shortly afterwards. An example test run on an affected system:
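
The test run below can be approximated without a bus at all. This hypothetical emulation (not the attached test.c) treats each loop iteration as one sd_bus_call(), assumes the pre-fix rule that a cookie above 0xffffffff cannot be sent, and replaces the GDB breakpoint with a plain assignment when the cookie reaches a trigger value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Result of one emulated test run (hypothetical helper, not from test.c). */
struct run_result {
        uint64_t calls_ok;      /* calls that got a cookie fitting in 32 bits */
        uint64_t failed_cookie; /* first cookie that did not fit */
};

/* Emulate the test loop: each iteration stands in for one sd_bus_call().
 * When the cookie reaches `trigger`, it is overwritten with `forced`, as
 * the GDB script does. Under the assumed pre-fix rule, the run ends at the
 * first cookie above UINT32_MAX. */
static struct run_result emulate_test_run(uint64_t trigger, uint64_t forced) {
        struct run_result res = { 0, 0 };
        uint64_t cookie = 0;

        /* forced > trigger guarantees the jump happens only once */
        assert(trigger > 0 && forced > trigger && forced <= UINT32_MAX);

        for (;;) {
                uint64_t next = cookie + 1;

                if (next > UINT32_MAX) {     /* pre-fix: this call would fail */
                        res.failed_cookie = next;
                        return res;
                }
                cookie = next;
                res.calls_ok++;

                if ((cookie & 0xffff) == 0)  /* progress line, like the log */
                        printf("cookie: 0x%08llx\n", (unsigned long long) cookie);

                if (cookie == trigger)       /* "GDB" steps in here */
                        cookie = forced;
        }
}
```

Calling emulate_test_run(0x10000, 0xffffff00) from a main() prints one progress line (cookie 0x00010000) and reports 0x10000 + 0xff successful calls: after the jump, only the 255 cookies 0xffffff01 through 0xffffffff remain, which is why the real test program fails shortly afterwards.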

ubuntu@bionic:~$ gcc -Wall test.c -o cookie -lsystemd -g
ubuntu@bionic:~$ gdb --batch --command=test.gdb --args ./cookie
Breakpoint 1 at 0xe61: file test.c, line 38.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
(16s) cookie: 0x00010000 reply-cookie: 0x00010000

Breakpoint 1, print_unit_path (bus=0x555555757290) at test.c:38
38 r = sd_bus_message_new_method_call(bus, &m,
$1 = 0x10000
$2 = 0xffffff00
Call failed: Operation not supported
Sleeping and retrying...
Call failed: Invalid argument
Assertion 'm->n_ref > 0' failed at ../src/libsystemd/sd-bus/bus-message.c:934, function sd_bus_message_unref(). Aborting.

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=0x6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

To compile and debug the test case above, libsystemd-dev and libsystemd0-dbgsym are required.
The source code for both test.c and test.gdb is attached to this LP bug.

[Regression Potential]
This fix changes how cookie values are incremented. Once the counter has overflowed, fewer values are available (2^31 instead of roughly 2^32), since the patch uses the high-order bit to indicate that an overflow has happened. Potential issues could arise from two distinct messages reusing the same cookie value, or from the cookie reuse not being handled properly. In practice, this shouldn't cause serious problems: a cookie could only collide with a still-pending one if a dbus message stalled long enough for the counter to cycle through the entire 2^31 space, which most messages should never do. The patch is already present upstream and in other stable Ubuntu series, and has been validated and tested through the systemd test suite and autopkgtests.
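
The cookie handling described above can be sketched as follows. This is a simplified, hypothetical model based on the description of upstream commit 1f82f5bb4237, not a copy of it: the function names, the array of pending cookies (standing in for the bus's table of outstanding replies) and the -EBUSY result when every candidate is taken are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Advance a cookie. At the 32-bit limit, restart at 0x80000000 rather than
 * 0, so the high bit marks "the counter has overflowed at least once". */
static uint32_t cookie_inc_sketch(uint32_t cookie) {
        if (cookie == UINT32_MAX)
                return UINT32_C(0x80000000);
        return cookie + 1;
}

/* Hypothetical stand-in for "is a reply to this cookie still pending?". */
static bool cookie_in_use(const uint32_t *pending, size_t n, uint32_t cookie) {
        for (size_t i = 0; i < n; i++)
                if (pending[i] == cookie)
                        return true;
        return false;
}

/* Pick the next cookie. Cookies without the high bit cannot have wrapped
 * yet and are used directly. Cookies with the high bit set (which includes
 * every cookie after the first overflow) are skipped while a reply to them
 * is still pending. Returns -EBUSY if all 2^31 candidates are in use. */
static int next_cookie_sketch(uint32_t *cookie, const uint32_t *pending, size_t n) {
        uint32_t c;

        assert(cookie);
        c = cookie_inc_sketch(*cookie);

        if (c & UINT32_C(0x80000000)) {
                for (uint64_t tries = 0; tries < UINT32_C(0x80000000); tries++) {
                        if (!cookie_in_use(pending, n, c)) {
                                *cookie = c;
                                return 0;
                        }
                        c = cookie_inc_sketch(c);
                }
                return -EBUSY;
        }

        *cookie = c;
        return 0;
}
```

In this model cookie_inc_sketch(0xffffffff) returns 0x80000000, which is consistent with the verification run later in this bug, where the cookie continues at 0x80000000 after being set to 0xffffff00 instead of failing.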


Changed in systemd (Ubuntu):
status: New → Fix Released
Changed in systemd (Ubuntu Xenial):
status: New → Confirmed
Changed in systemd (Ubuntu Bionic):
status: New → Confirmed
Changed in systemd (Ubuntu Xenial):
importance: Undecided → High
Changed in systemd (Ubuntu Bionic):
importance: Undecided → High
assignee: nobody → Heitor Alves de Siqueira (halves)
Changed in systemd (Ubuntu Xenial):
assignee: nobody → Heitor Alves de Siqueira (halves)
description: updated
summary: - cookie overruns cause org.freedesktop.systemd1 dbus to hang
+ cookie overruns can cause org.freedesktop.systemd1 dbus to hang
description: updated

Test builds for the proposed merge can be found at the lp1876600 PPA [0].

[0] https://launchpad.net/~halves/+archive/ubuntu/lp1876600

tags: added: sts-sponsor
description: updated
Changed in systemd (Ubuntu Xenial):
status: Confirmed → In Progress
Changed in systemd (Ubuntu Bionic):
status: Confirmed → In Progress
Dan Streetman (ddstreet) on 2020-05-06
tags: added: sts-sponsor-ddstreet
removed: sts-sponsor

Hello Heitor, or anyone else affected,

Accepted systemd into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/237-3ubuntu10.41 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will help us get this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in systemd (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic

All autopkgtests for the newly accepted systemd (237-3ubuntu10.41) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

python-dbusmock/unknown (armhf)
policykit-1/unknown (armhf)
multipath-tools/unknown (armhf)
debci/unknown (ppc64el)
netplan.io/0.99-0ubuntu3~18.04.1 (i386)
pdns-recursor/unknown (armhf)
umockdev/0.11.1-1 (armhf)
sssd/unknown (armhf)
linux-raspi2-5.3/unknown (armhf)
suricata/unknown (armhf)
lxc/unknown (armhf)
casync/2+61.20180112-1 (s390x)
openssh/1:7.6p1-4ubuntu0.3 (arm64, s390x, amd64, i386, ppc64el, armhf)
python-systemd/unknown (armhf)
puppet/unknown (armhf)
prometheus-postgres-exporter/unknown (armhf)
lxc/3.0.3-0ubuntu1~18.04.1 (arm64)
postgresql-10/unknown (armhf)
polkit-qt-1/unknown (armhf)
munin/unknown (armhf)
systemd/237-3ubuntu10.41 (armhf)
pulseaudio/unknown (armhf)
php7.2/unknown (armhf)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#systemd

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Validated systemd 237-3ubuntu10.41 from bionic-proposed, according to the test case from the bug description:

ubuntu@systemd-cookie-bionic:~$ dpkg -l systemd | grep systemd
ii systemd 237-3ubuntu10.41 amd64 system and service manager
ubuntu@systemd-cookie-bionic:~$ gcc -Wall test.c -o cookie -lsystemd -g
ubuntu@systemd-cookie-bionic:~$ gdb --batch --command=test.gdb --args ./cookie
Breakpoint 1 at 0xe61: file test.c, line 38.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
(15s) cookie: 0x00010000 reply-cookie: 0x00010000

Breakpoint 1, print_unit_path (bus=0x555555757290) at test.c:38
38 r = sd_bus_message_new_method_call(bus, &m,
$1 = 0x10000
$2 = 0xffffff00
(15s) cookie: 0x80000000 reply-cookie: 0x80000000
(29s) cookie: 0x80010000 reply-cookie: 0x80010000
(43s) cookie: 0x80020000 reply-cookie: 0x80020000

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 237-3ubuntu10.41

---------------
systemd (237-3ubuntu10.41) bionic; urgency=medium

  [ Dan Streetman ]
  * d/p/lp1867375/0001-network-Allow-to-configure-GW-even-UseRoutes-false.patch,
    d/p/lp1867375/0002-network-add-a-flag-to-ignore-gateway-provided-by-DHC.patch,
    d/p/lp1867375/0003-network-change-UseGateway-default-to-UseRoutes-setti.patch:
    - Move gateway ignoring from UseRoutes= to UseGateway= (LP: #1867375)
  * d/p/lp1873607/0002-core-make-sure-to-restore-the-control-command-id-too.patch:
    - Avoid segfault during serialization (LP: #1873607)
  * d/p/lp1529152/0001-bash-completion-systemctl-use-systemctl-no-pager.patch,
    d/p/lp1529152/0002-bash-completion-systemctl-pass-current-partial-unit-.patch,
    d/p/lp1529152/0003-shell-completion-systemctl-pass-current-word-to-all-.patch,
    d/p/lp1529152/0004-bash-completion-systemctl-re-implement-__filter_unit.patch,
    d/p/lp1529152/0005-strip-value-from-property-names.patch:
    - fix slow systemctl tab completion (LP: #1529152)
  * d/p/lp1877159-networkd-fix-attribute-length-for-wireguard-10380.patch:
    - avoid kernel err msg setting wireguard param (LP: #1877159)

  [ Heitor Alves de Siqueira ]
  * d/p/lp1876600-sd-bus-deal-with-cookie-overruns.patch:
    - deal with dbus cookie overruns (LP: #1876600)

 -- Heitor Alves de Siqueira <email address hidden> Sun, 03 May 2020 11:30:25 +0000

Changed in systemd (Ubuntu Bionic):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for systemd has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
