xenial systemd reports 'inactive' instead of 'failed' for service units that repeatedly failed to restart / failed permanently

Bug #1795658 reported by Mauricio Faria de Oliveira
This bug affects 2 people
Affects           Status         Importance   Assigned to           Milestone
systemd (Ubuntu)  Invalid        Undecided    Unassigned
Xenial            Fix Released   Medium       Dimitri John Ledkov

Bug Description

[Impact]

 * In case a service unit has repeatedly failed to restart, it should be
   reported as 'failed' permanently, but currently it's instead reported
   as 'inactive'.

 * System monitoring tools that evaluate the status of systemd service units
   and act upon it (for example: restart service, report permanent failure)
   are currently misled by information in 'systemctl status <unit>.service'.

 * System management tools based on such information may take wrong and/or
   sub-optimal actions in the managed systems regarding such service units.

 * This systemd patch [1] directly addresses this issue (see systemd GitHub
   PR #3166 [2]), and its code is still effective in upstream systemd today,
   without further fixes/changes (the only later changes were to doc text and
   to the busname files, which were since removed; none of them touched this fix).
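
To illustrate why the reported state matters to such tools, here is a minimal sketch of a monitoring check that keys off the 'Active:' line of 'systemctl status'. The function names and the policy (escalate only 'failed' units) are assumptions made for illustration, not taken from any real monitoring product.

```shell
# Hypothetical helper: print the state word from an "Active:" status line.
active_state() {
    # $1: a single "Active: ..." line as printed by 'systemctl status'
    printf '%s\n' "$1" | awk '$1 == "Active:" { print $2 }'
}

# Hypothetical policy: only a 'failed' unit is escalated as a permanent failure.
monitor_action() {
    case "$(active_state "$1")" in
        failed)   echo "report-permanent-failure" ;;
        inactive) echo "ignore" ;;
        *)        echo "no-action" ;;
    esac
}

# The two "Active:" lines quoted in the Test Case section of this report.
monitor_action "   Active: inactive (dead)"
monitor_action "   Active: failed (Result: start-limit-hit) since Sat 2018-09-29 11:01:34 UTC; 4s ago"
```

Under this assumed policy the unpatched output is silently ignored, while the patched output is escalated, which is the difference the fix is meant to restore.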

[Test Case]

 * This is copied from systemd PR #3166 [2].

 * This has been tested by a customer as well, and with its system monitoring
   and management solution, for interoperability verification.

    $ cat <<EOF | sudo tee /etc/systemd/system/fail-on-restart.service
    [Service]
    ExecStart=/bin/false
    Restart=always
    EOF

    $ sudo systemctl daemon-reload
    $ sudo systemctl start fail-on-restart

    Before) "Active: inactive (dead)"

    $ systemctl status -n0 fail-on-restart
    fail-on-restart.service
       Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor preset: enabled)
       Active: inactive (dead)

    After) "Active: failed (Result: start-limit-hit)"

    $ systemctl status -n0 fail-on-restart
    fail-on-restart.service
       Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor preset: enabled)
       Active: failed (Result: start-limit-hit) since Sat 2018-09-29 11:01:34 UTC; 4s ago
      Process: 7066 ExecStart=/bin/false (code=exited, status=1/FAILURE)
     Main PID: 7066 (code=exited, status=1/FAILURE)
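
The Before/After comparison above could also be scripted rather than read by eye. The sketch below assumes the KEY=VALUE output format of 'systemctl show' and checks the ActiveState and Result properties; the helper name check_fixed is hypothetical, and the sample input is fed in by hand so the snippet can run without systemd.

```shell
# Hypothetical checker: reads KEY=VALUE lines (the format printed by
# 'systemctl show') on stdin and succeeds only for the fixed behaviour.
check_fixed() {
    awk -F= '
        $1 == "ActiveState" { state = $2 }
        $1 == "Result"      { result = $2 }
        END { exit !(state == "failed" && result == "start-limit-hit") }
    '
}

# Intended use on a test machine (not run here):
#   systemctl show -p ActiveState -p Result fail-on-restart | check_fixed

# Hand-written sample input standing in for the fixed and unfixed cases.
if printf 'ActiveState=failed\nResult=start-limit-hit\n' | check_fixed; then
    echo "fixed"
fi
if ! printf 'ActiveState=inactive\n' | check_fixed; then
    echo "not fixed"
fi
```

The plain 'systemctl show -p' form is used here rather than any value-only output option, on the assumption that the latter may not be available in the systemd 229 shipped in xenial.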

[Regression Potential]

 * This code changes the point at which the number of (re)start attempts
   is checked, so regressions in (re)starting units are theoretically
   possible.

 * However, this code actually reverts a change that caused a regression,
   so it goes back to the code that was known to work correctly before ..

 * .. and it is still in this form in upstream systemd nowadays,
   without further fixes/changes (see comment in the Impact section).

[Other Info]

 * Test package was built on Launchpad PPA for all architectures,
   with dependencies from Proposed enabled (more up-to-date for SRU).

 * The testsuite (run at package build time; a failure blocks the package
   build) has identical results to those in the buildlog of current xenial-updates.

    ============================================================================
    Testsuite summary for systemd 229
    ============================================================================
    # TOTAL: 128
    # PASS: 109
    # SKIP: 19
    # XFAIL: 0
    # FAIL: 0
    # XPASS: 0
    # ERROR: 0
    ============================================================================

[Links]

[1] https://github.com/systemd/systemd/commit/072993504e3e4206ae1019f5461a0372f7d82ddf
[2] https://github.com/systemd/systemd/issues/3166
[3] https://launchpad.net/~mfo/+archive/ubuntu/sf199312

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Changed in systemd (Ubuntu):
assignee: nobody → Mauricio Faria de Oliveira (mfo)
status: New → In Progress
Mauricio Faria de Oliveira (mfo) wrote :

More details on the verification of test package from Launchpad PPA)
---

Test-case)

$ cat <<EOF | sudo tee /etc/systemd/system/fail-on-restart.service
[Service]
ExecStart=/bin/false
Restart=always
EOF

Before) "Active: inactive (dead)"

$ dpkg -s systemd | grep Version
Version: 229-4ubuntu21.4

$ sudo systemctl daemon-reload

$ sudo systemctl start fail-on-restart

$ systemctl status -n0 fail-on-restart
● fail-on-restart.service
   Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor preset: enabled)
   Active: inactive (dead)

$ journalctl --no-pager -u fail-on-restart
<...>
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:01 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduli...
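
As a rough aid for reading a log like the one above, the number of start attempts can be counted instead of scanned by eye. This is only a sketch: the helper name count_starts is hypothetical, and it relies on the "Started <unit>." message text seen in this journal excerpt.

```shell
# Hypothetical helper: count "Started <unit>." journal lines read from stdin.
count_starts() {
    # -F matches the unit name literally (the dots are not regex wildcards).
    grep -c -F "Started $1."
}

# Sample input: a few lines copied from the journal excerpt in this comment.
count_starts fail-on-restart.service <<'EOF'
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
EOF
```

On a live system the same function could be fed from 'journalctl --no-pager -u fail-on-restart' in place of the sample text.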

Dimitri John Ledkov (xnox) wrote :

Note to self, everything after xenial has this patch already.

Changed in systemd (Ubuntu Xenial):
status: New → Triaged
assignee: nobody → Dimitri John Ledkov (xnox)
importance: Undecided → Medium
Changed in systemd (Ubuntu):
status: In Progress → Invalid
assignee: Mauricio Faria de Oliveira (mfo) → nobody
Changed in systemd (Ubuntu Xenial):
status: Triaged → In Progress
Robie Basak (racb) wrote : Please test proposed package

Hello Mauricio, or anyone else affected,

Accepted systemd into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/229-4ubuntu21.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in systemd (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Mauricio Faria de Oliveira (mfo) wrote :

Verification done on Xenial.
Changing verification tags.

--

Testcase)

    $ cat <<EOF | sudo tee /etc/systemd/system/fail-on-restart.
    [Service]
    ExecStart=/bin/false
    Restart=always
    EOF

    $ sudo systemctl daemon-reload

Before) "Active: inactive (dead)"

    $ dpkg -s systemd | grep ^Version:
    Version: 229-4ubuntu21.4

    $ sudo systemctl start fail-on-restart

    $ systemctl status -n0 fail-on-restart
       fail-on-restart.service
       Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor p
       Active: inactive (dead)

Install/Reboot)

    $ sudo apt-get install systemd
    $ sudo reboot

After) "Active: failed (Result: start-limit-hit) since <...>"

    $ dpkg -s systemd | grep ^Version:
    Version: 229-4ubuntu21.5

    $ sudo systemctl start fail-on-restart

    $ systemctl status -n0 fail-on-restart
       fail-on-restart.service
       Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor preset: enabled)
       Active: failed (Result: start-limit-hit) since Wed 2018-10-10 20:53:17 UTC; 2min 34s ago
      Process: 1450 ExecStart=/bin/false (code=exited, status=1/FAILURE)
     Main PID: 1450 (code=exited, status=1/FAILURE)

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Mauricio Faria de Oliveira (mfo) wrote :

Positive test results with xenial-proposed have also been verified by a company interested in this fix, in their scenario/application.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 229-4ubuntu21.5

---------------
systemd (229-4ubuntu21.5) xenial; urgency=medium

  [ Dimitri John Ledkov ]
  * systemctl: correctly proceed to immediate shutdown if scheduling fails
    (LP: #1670291)
  * hwdb: update micmute on Dell laptops. (LP: #1738153)
  * hwdb: Use wlan keycode for all Dell systems. (LP: #1762385)
  * units: Disable journald Watchdog (LP: #1773148)

  [ Mauricio Faria de Oliveira ]
  * core: Fix for service to enter the 'failed' state (rather than 'inactive') after it repeatedly fails restart.
    (LP: #1795658)

  [ Dimitri John Ledkov ]
  * Disable dh_installinit generation of tmpfiles for the systemd package.
    (LP: #1748147)

 -- Dimitri John Ledkov <email address hidden> Mon, 08 Oct 2018 16:10:42 +0100

Changed in systemd (Ubuntu Xenial):
status: Fix Committed → Fix Released
Robie Basak (racb) wrote : Update Released

The verification of the Stable Release Update for systemd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
