Systemd unit cannot restart sometimes

Bug #1845257 reported by Benjamin Allot
This bug affects 1 person
Affects: Daisy
Status: Triaged
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hello,

Here is a snippet from debugging a production incident in which many retracer-amd64 services were in a "failed" state.

I looked at some of the systemd parameters and here are my conclusions:

  sudo systemctl show retracer-amd64 | grep -E 'Burst|LimitInterval|Restart'
    Restart=always
    RestartUSec=100ms
    StartLimitInterval=60000000
    StartLimitBurst=5

So my guess is that the unit restarts 5 times, waiting only 100ms between attempts, and is then prevented from restarting, even with Restart=always.

I suggest setting RestartSec to a higher value and either significantly increasing StartLimitBurst or greatly reducing StartLimitInterval (currently 60 seconds).
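
To make the arithmetic behind this guess concrete, here is a minimal Python sketch of the start rate limit. It is a simplified model, not systemd's actual implementation: it assumes at most StartLimitBurst starts are allowed per StartLimitInterval window, that the window opens at the first start, and that the service fails after a fixed runtime. The function names and the `runtime_sec` parameter are hypothetical, introduced only for illustration.

```python
def start_limit_hit(restart_sec, burst, interval_sec, runtime_sec=0.0):
    """Return True if a service that always fails would trip the start limit.

    Simplified model (an assumption, not systemd's code): at most `burst`
    starts are allowed inside one window of `interval_sec` seconds.
    """
    if interval_sec <= 0:
        # An interval of 0 disables start rate limiting.
        return False

    period = restart_sec + runtime_sec  # time between consecutive start attempts
    window_start = 0.0
    starts_in_window = 0
    t = 0.0

    # With a fixed period, burst + 1 attempts are enough to decide.
    for _ in range(burst + 1):
        if t - window_start >= interval_sec:
            # The window expired: open a new one at this attempt.
            window_start = t
            starts_in_window = 0
        if starts_in_window >= burst:
            return True  # this start attempt would be refused
        starts_in_window += 1
        t += period
    return False


def min_period_to_avoid_limit(burst, interval_sec):
    """Smallest time between start attempts that never trips the limit (same model)."""
    return interval_sec / burst


# Values from the report: RestartSec=100ms, StartLimitBurst=5, StartLimitInterval=60s
print(start_limit_hit(0.1, 5, 60))       # limit is hit under this model
print(min_period_to_avoid_limit(5, 60))  # seconds needed between start attempts
```

Under this model, a service that crashes immediately reaches the limit after about half a second with the reported settings, while spacing start attempts at least 12 seconds apart (with burst 5 and interval 60s) would never reach it.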

I'm opening the bug here because I think this is related to the charm, not the application itself.


Brian Murray (brian-murray) wrote :

The hook, located at https://code.launchpad.net/~daisy-pluckers/charms/xenial/daisy-retracer/trunk, only sets StartLimitInterval and StartLimitBurst. I guess the others are systemd defaults.

Do you have an idea of how often this has been a problem so we can properly prioritize the work?

Changed in daisy:
status: New → Triaged
Benjamin Allot (ballot) wrote :

As for how often: I would say this is responsible for almost 100% of the alerts we receive regarding rabbitmq queue length on error-tracker.

Only 40% of the retracer-app-amd64 services were up this morning, and this is the 6th time this has alerted us this month (December 2019).

Benjamin Allot (ballot) wrote :

Also, I noticed that this may be related to bug https://bugs.launchpad.net/daisy/+bug/1620823 (which I failed to see when I opened this one).

Feel free to mark as duplicate if needed.

tags: added: id-5e2ab61280a0ed81905cff72
Dimitri John Ledkov (xnox) wrote :

Journal logs would be useful to see if and why systemd decided to stop spawning the service. Knowing that a unit is in a failed state is not enough: the unit settings alone don't make it easy to tell why it entered that state.

Do you have the binary .journal file covering the unit entering the failed state that you could upload? (Maybe mark the bug private first.)

Dimitri John Ledkov (xnox) wrote :

Also, is this on xenial? I'm not sure, but I thought there were Restart=always bugs that got fixed in later releases.

For example, subiquity.service, when crashing in the live system, was restarted indefinitely until it filled up all of the RAM and disk space with apport crash reports, to the point of crashing the system.

Brian Murray (brian-murray) wrote :

Dimitri and I looked at some log files from when the retracers were dying and noticed that there were python processes OOM'ing on the retracers. We suspect it is the retracer processes themselves, so I've added a MemoryLimit option to the retracer-$arch processes on the "regular" retracers. The failed-queue retracers will not have the MemoryLimit set and should be able to retrace anything the regular retracers cannot. Additionally, my testing in staging (setting a very low MemoryLimit) indicates that the retracing process doesn't die, so I think this will resolve the cannot-restart issue.

Changed in daisy:
status: Triaged → Incomplete
status: Incomplete → Triaged