xenial systemd reports 'inactive' instead of 'failed' for service units that repeatedly failed to restart / failed permanently

Bug #1795658 reported by Mauricio Faria de Oliveira on 2018-10-02
This bug affects 2 people
Affects            Importance   Assigned to
systemd (Ubuntu)   Undecided    Unassigned
Xenial             Medium       Dimitri John Ledkov

Bug Description

[Impact]

 * In case a service unit has repeatedly failed to restart, it should be
   reported as 'failed' permanently, but currently it's instead reported
   as 'inactive'.

 * System monitoring tools that evaluate the status of systemd service units
   and act upon it (for example: restart service, report permanent failure)
   are currently misled by information in 'systemctl status <unit>.service'.

 * System management tools based on such information may take wrong and/or
   sub-optimal actions in the managed systems regarding such service units.

 * This systemd patch [1] directly addresses this issue (see systemd GitHub
   PR #3166 [2]), and its code is still in effect in upstream systemd today
   without further fixes (the only later changes were to documentation text
   and to the busname files, which were removed; neither touched this fix).
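
To illustrate why the reported state matters to monitoring tools, here is a minimal sketch of the kind of check such a tool might run. The function names and the state-to-action mapping are hypothetical and are not taken from any particular tool. The prefix of the 'systemctl show -p' output is stripped with sed so the sketch does not depend on the --value switch, which older systemctl versions may lack.

```shell
#!/bin/sh
# Hypothetical policy: map a unit's ActiveState to a monitoring action.
# The action names ("ok", "alert", "ignore") are illustrative only.
classify_active_state() {
    case "$1" in
        active|activating|reloading) echo "ok" ;;
        failed)                      echo "alert" ;;
        inactive|deactivating)       echo "ignore" ;;
        *)                           echo "unknown" ;;
    esac
}

# Print the ActiveState of a unit. 'systemctl show -p ActiveState' prints
# "ActiveState=<value>", so the "ActiveState=" prefix is removed here.
unit_active_state() {
    systemctl show -p ActiveState "$1" | sed 's/^ActiveState=//'
}

# Usage (on a systemd host):
#   classify_active_state "$(unit_active_state fail-on-restart.service)"
```

With the bug present, a unit that hit its start limit yields "inactive" and this sketch would print "ignore"; with the fix it yields "failed" and the sketch would print "alert". 'systemctl is-failed <unit>' reports the same state through its exit status.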

[Test Case]

 * This is copied from systemd PR #3166 [2].

 * This has also been tested by a customer with their system monitoring
   and management solution, for interoperability verification.

    $ cat <<EOF | sudo tee /etc/systemd/system/fail-on-restart.service
    [Service]
    ExecStart=/bin/false
    Restart=always
    EOF

    $ sudo systemctl daemon-reload
    $ sudo systemctl start fail-on-restart

    Before) "Active: inactive (dead)"

    $ systemctl status -n0 fail-on-restart
    fail-on-restart.service
       Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor preset: enabled)
       Active: inactive (dead)

    After) "Active: failed (Result: start-limit-hit)"

    $ systemctl status -n0 fail-on-restart
    fail-on-restart.service
       Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor preset: enabled)
       Active: failed (Result: start-limit-hit) since Sat 2018-09-29 11:01:34 UTC; 4s ago
      Process: 7066 ExecStart=/bin/false (code=exited, status=1/FAILURE)
     Main PID: 7066 (code=exited, status=1/FAILURE)
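
For scripted runs of this test case, the Before/After comparison can be reduced to the state and result on the "Active:" line. The following is only a sketch; both helper names are hypothetical, and they parse 'systemctl status' text read from stdin rather than querying systemd directly.

```shell
#!/bin/sh
# Hypothetical helper: print the state word from the "Active:" line,
# e.g. "inactive" (before the fix) or "failed" (after the fix).
active_state_from_status() {
    awk '$1 == "Active:" { print $2; exit }'
}

# Hypothetical helper: print the value inside "(Result: ...)" on the
# "Active:" line, e.g. "start-limit-hit". Prints nothing if absent.
active_result_from_status() {
    sed -n 's/^ *Active: .*(Result: \([^)]*\)).*/\1/p'
}

# Usage:
#   systemctl status -n0 fail-on-restart | active_state_from_status
#   systemctl status -n0 fail-on-restart | active_result_from_status
```

Note that 'systemctl status' itself exits non-zero for inactive or failed units, so a script using 'set -e' should only rely on the exit status of the helper at the end of the pipeline.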

[Regression Potential]

 * This code changes the point at which the check for the number of (re)start
   attempts is made, so regressions in (re)starting units are theoretically
   possible.

 * However, this code actually reverts a change that caused a regression,
   so it goes back to the code that was known to work correctly before ..

 * .. and it is still in this form in upstream systemd nowadays,
   without further fixes/changes (see comment in the Impact section).

[Other Info]

 * Test package was built on a Launchpad PPA [3] for all architectures,
   with dependencies from Proposed enabled (more up-to-date for SRU).

 * The testsuite (run at package build time; a failure blocks the package
   build) has results identical to those in the build log of the current
   xenial-updates package.

    ============================================================================
    Testsuite summary for systemd 229
    ============================================================================
    # TOTAL: 128
    # PASS: 109
    # SKIP: 19
    # XFAIL: 0
    # FAIL: 0
    # XPASS: 0
    # ERROR: 0
    ============================================================================

[Links]

[1] https://github.com/systemd/systemd/commit/072993504e3e4206ae1019f5461a0372f7d82ddf
[2] https://github.com/systemd/systemd/issues/3166
[3] https://launchpad.net/~mfo/+archive/ubuntu/sf199312

Changed in systemd (Ubuntu):
assignee: nobody → Mauricio Faria de Oliveira (mfo)
status: New → In Progress

More details on the verification of the test package from the Launchpad PPA:
---

Test-case)

$ cat <<EOF | sudo tee /etc/systemd/system/fail-on-restart.service
[Service]
ExecStart=/bin/false
Restart=always
EOF

Before) "Active: inactive (dead)"

$ dpkg -s systemd | grep Version
Version: 229-4ubuntu21.4

$ sudo systemctl daemon-reload

$ sudo systemctl start fail-on-restart

$ systemctl status -n0 fail-on-restart
● fail-on-restart.service
   Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor preset: enabled)
   Active: inactive (dead)

$ journalctl --no-pager -u fail-on-restart
<...>
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:01 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduli...
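
The journal above repeats the same start / exit / restart cycle several times. To see how many start attempts were made before systemd stopped restarting the unit, the "Started" lines can be counted. This is only a sketch: the helper name is hypothetical, and it assumes the message format shown in this log.

```shell
#!/bin/sh
# Hypothetical helper: count "Started ..." messages logged by PID 1 in
# journal text read from stdin. grep -c prints 0 (and exits non-zero)
# when there are no matches.
count_start_attempts() {
    grep -c 'systemd\[1\]: Started '
}

# Usage:
#   journalctl --no-pager -u fail-on-restart | count_start_attempts
```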


Dimitri John Ledkov (xnox) wrote :

Note to self, everything after xenial has this patch already.

Changed in systemd (Ubuntu Xenial):
status: New → Triaged
assignee: nobody → Dimitri John Ledkov (xnox)
importance: Undecided → Medium
Changed in systemd (Ubuntu):
status: In Progress → Invalid
assignee: Mauricio Faria de Oliveira (mfo) → nobody
Changed in systemd (Ubuntu Xenial):
status: Triaged → In Progress

Hello Mauricio, or anyone else affected,

Accepted systemd into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/229-4ubuntu21.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in systemd (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial

Verification done on Xenial.
Changing verification tags.

--

Testcase)

    $ cat <<EOF | sudo tee /etc/systemd/system/fail-on-restart.
    [Service]
    ExecStart=/bin/false
    Restart=always
    EOF

    $ sudo systemctl daemon-reload

Before) "Active: inactive (dead)"

    $ dpkg -s systemd | grep ^Version:
    Version: 229-4ubuntu21.4

    $ sudo systemctl start fail-on-restart

    $ systemctl status -n0 fail-on-restart
       fail-on-restart.service
       Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor p
       Active: inactive (dead)

Install/Reboot)

    $ sudo apt-get install systemd
    $ sudo reboot

After) "Active: failed (Result: start-limit-hit) since <...>"

    $ dpkg -s systemd | grep ^Version:
    Version: 229-4ubuntu21.5

    $ sudo systemctl start fail-on-restart

    $ systemctl status -n0 fail-on-restart
       fail-on-restart.service
       Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor preset: enabled)
       Active: failed (Result: start-limit-hit) since Wed 2018-10-10 20:53:17 UTC; 2min 34s ago
      Process: 1450 ExecStart=/bin/false (code=exited, status=1/FAILURE)
     Main PID: 1450 (code=exited, status=1/FAILURE)

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial

Positive test results with xenial-proposed have also been verified by a company interested in this fix, in their scenario/application.
