Daisy

Systemd unit cannot restart sometimes

Bug #1845257 reported by Benjamin Allot on 2019-09-24

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Daisy	Triaged	Undecided	Unassigned

Bug Description

Hello,

Here is a snippet from debugging a production incident in which many retracer-amd64 services were in a "failed" state.

I looked at some systemd parameters, and here are my conclusions:

  sudo systemctl show retracer-amd64 | grep -E 'Burst|LimitInterval|Restart'
    Restart=always
    RestartUSec=100ms
    StartLimitInterval=60000000
    StartLimitBurst=5

So my guess is that the unit restarts 5 times, waiting only 100ms between attempts, and is then prevented from restarting, even with Restart=always.

I suggest setting RestartSec to a higher value and either significantly increasing StartLimitBurst or greatly reducing StartLimitInterval (which is currently 60 seconds).
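To make the arithmetic behind this guess concrete, here is a minimal sketch in Python. It is not systemd's actual implementation: it assumes a simplified fixed-window rate limit (at most StartLimitBurst starts per StartLimitInterval window, with the window opening at the first start), and the function name and the `run_ms` parameter (how long the process survives before dying) are hypothetical.

```python
from typing import Optional


def time_until_start_limit(run_ms: int, restart_ms: int,
                           interval_ms: int, burst: int,
                           horizon_ms: int = 3_600_000) -> Optional[int]:
    """Return the time (ms) at which a start attempt is refused, or None.

    Simplified model (an assumption, not systemd's code): starts are counted
    in a fixed window of interval_ms that opens at the first start; once
    `burst` starts have happened inside the window, the next one is refused.
    """
    window_begin: Optional[int] = None
    count = 0
    t = 0  # time of the next start attempt, in milliseconds
    while t <= horizon_ms:
        # Open a new window if none exists or the current one has elapsed.
        if window_begin is None or t - window_begin >= interval_ms:
            window_begin = t
            count = 0
        if count >= burst:
            return t  # start limit hit: the unit would stay "failed"
        count += 1
        # The process runs for run_ms, dies, and is restarted after restart_ms.
        t += run_ms + restart_ms
    return None  # limit never reached within the simulated horizon


# Values from the `systemctl show` output above, process dying immediately.
print(time_until_start_limit(0, 100, 60_000, 5))
# Hypothetical higher RestartSec of 15 seconds with the same limits.
print(time_until_start_limit(0, 15_000, 60_000, 5))
```

Under this model, a process that dies immediately exhausts the 5 allowed starts after half a second with RestartSec=100ms, while the hypothetical 15-second delay never reaches the limit, which is consistent with the suggestion above.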

I'm opening the bug here because I think this is related to the charm, not the application itself.

Related branches

lp:~ballot/charms/xenial/daisy-retracer/lp1845257

Rejected for merging into lp:~daisy-pluckers/charms/xenial/daisy-retracer/trunk

Brian Murray: Disapprove on 2020-06-09

Junien F: Approve on 2020-04-22

Brian Murray (brian-murray) wrote on 2019-09-25:

The hook, located at https://code.launchpad.net/~daisy-pluckers/charms/xenial/daisy-retracer/trunk, only sets StartLimitInterval and StartLimitBurst; I guess the others are systemd defaults.

Do you have an idea of how often this has been a problem so we can properly prioritize the work?

Changed in daisy:
status:	New → Triaged

Benjamin Allot (ballot) wrote on 2020-01-01:

As for how often, I would say this is responsible for almost 100% of the alerts we receive regarding rabbitmq queue length on error-tracker.

I had only 40% of the retracer-app-amd64 services up this morning, and this is the 6th time this has alerted us this month (December 2019).

Benjamin Allot (ballot) wrote on 2020-01-01:

Also, I noticed that this may be related to bug https://bugs.launchpad.net/daisy/+bug/1620823 (which I failed to see when I opened this one).

Feel free to mark as duplicate if needed.

Francis Ginther (fginther) on 2020-01-24

tags:

added: id-5e2ab61280a0ed81905cff72

Dimitri John Ledkov (xnox) wrote on 2020-05-07:

Journal logs would be useful to see if and why systemd decided to stop spawning the service. When a unit is in a failed state, it is not easy to tell why it entered that state just by looking at the unit settings.

Do you have the binary .journal file covering the unit entering the failed state that you can upload (maybe after marking the bug private first)?

Dimitri John Ledkov (xnox) wrote on 2020-05-07:

Also, is this on xenial? I'm not sure, but I thought there were Restart=always bugs that got fixed in later releases.

For example, subiquity.service crashing in the live system was restarted continuously until it filled up all of the RAM and disk space with apport crash reports, to the point of crashing the system.

Brian Murray (brian-murray) wrote on 2020-06-09:

Dimitri and I looked at some log files from when the retracers were dying and noticed that there were Python processes OOM'ing on the retracers. We suspect these are the retracer processes themselves, so I've added a MemoryLimit option to the retracer-$arch processes on the "regular" retracers. The failed-queue retracers will not have MemoryLimit set and should be able to retrace anything the regular retracers cannot. Additionally, my testing in staging (setting a very low MemoryLimit) indicates that the retracing process doesn't die, so I think this will resolve the cannot-restart issue.

Changed in daisy:
status:	Triaged → Incomplete
status:	Incomplete → Triaged
