softlockup DoS causes systemd-journald.service to abort with SIGABORT

Bug #1696970 reported by Colin Ian King on 2017-06-09
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
systemd (Ubuntu)
High
Unassigned
Artful
High
Balint Reczey
Bionic
High
Unassigned

Bug Description

I was running the new stress-ng softlockup stressor and observed that systemd-journald gets killed with an abort and this corrupts the systemd journal.

How to reproduce:

git clone git://kernel.ubuntu.com/cking/stress-ng
cd stress-ng
make clean; make

sudo ./stress-ng --softlockup 0 -t 360 -v

..and wait for 360 seconds. dmesg shows the following, 100% reproduceable:

[ 875.310331] systemd[1]: systemd-timesyncd.service: Watchdog timeout (limit 3min)!
[ 875.310740] systemd[1]: systemd-timesyncd.service: Killing process 574 (systemd-timesyn) with signal SIGABRT.
[ 875.327289] systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=6/ABRT
[ 875.327666] systemd[1]: systemd-timesyncd.service: Unit entered failed state.
[ 875.327686] systemd[1]: systemd-timesyncd.service: Failed with result 'watchdog'.
[ 875.327917] systemd[1]: systemd-timesyncd.service: Service has no hold-off time, scheduling restart.
[ 875.327954] systemd[1]: Stopped Network Time Synchronization.
[ 875.328845] systemd[1]: Starting Network Time Synchronization...
[ 875.525071] systemd[1]: Started Network Time Synchronization.
[ 875.539619] systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT
[ 875.544257] systemd-journald[5214]: File /run/log/journal/440e485e550040e3b93b66b2faae8525/system.journal corrupted or uncleanly shut down, renaming and replacing.

Changed in systemd (Ubuntu):
importance: Undecided → High
tags: added: rls-aa-incoming
Steve Langasek (vorlon) wrote :

The watchdogs are by design. The unclean journal may not be. OTOH if this only happens with the ephemeral journal in /run, the impact is minor. We should check whether the unclean journal problem is reproducible with /var/log/journal.

tags: removed: rls-aa-incoming
Changed in systemd (Ubuntu Artful):
assignee: nobody → Canonical Foundations Team (canonical-foundations)
Steve Langasek (vorlon) on 2017-06-15
Changed in systemd (Ubuntu Artful):
assignee: Canonical Foundations Team (canonical-foundations) → Balint Reczey (rbalint)
Balint Reczey (rbalint) wrote :

The journal restart can be observed on zesty as well.

However, the journal is not corrupted, the renaming and replacing took place as a safety measure due to unclean shutdown.
$ sudo journalctl --verify --file=/run/log/journal/*/*
PASS: /run/log/journal//6ff4e09ca827435bbe1b6efc2236a105/system.journal
PASS: /run/log/journal//6ff4e09ca827435bbe1b6efc2236a105/system@00055289b166aebb-17bfde05076f84d2.journal~
PASS: /run/log/journal//6ff4e09ca827435bbe1b6efc2236a105/system@b7f19f66c640442d854033592fdedbb6-0000000000013de4-0005528690181932.journal
PASS: /run/log/journal//6ff4e09ca827435bbe1b6efc2236a105/system@b7f19f66c640442d854033592fdedbb6-0000000000013569-000552843559e7e6.journal
PASS: /run/log/journal//6ff4e09ca827435bbe1b6efc2236a105/system@b7f19f66c640442d854033592fdedbb6-0000000000012d03-00055282a0e0b31b.journal
PASS: /run/log/journal//6ff4e09ca827435bbe1b6efc2236a105/system@b7f19f66c640442d854033592fdedbb6-00000000000124a8-000552806e21938c.journal
PASS: /run/log/journal//6ff4e09ca827435bbe1b6efc2236a105/system@b7f19f66c640442d854033592fdedbb6-0000000000011ca8-0005527efc9ec6de.journal
PASS: /run/log/journal//6ff4e09ca827435bbe1b6efc2236a105/system@b7f19f66c640442d854033592fdedbb6-00000000000114a4-0005527d2438261c.journal

Was the journal corrupted in your case?

Changed in systemd (Ubuntu Artful):
status: New → Incomplete
Dimitri John Ledkov (xnox) wrote :

Note that journal too has watchdogsec limit.

I do wonder if softlockup should or should not be counted towards the watchdog counter =/

Balint Reczey (rbalint) wrote :

@xnox IMO watchdogs should not care if it is a "soft" or other kind of lockup.

I'm closing the bug because the workaround for recovering from a potentially broken journal is working and there is no indication of a journal which is actually broken.

Changed in systemd (Ubuntu Artful):
status: Incomplete → Opinion
Balint Reczey (rbalint) wrote :

I suggest setting this bug Won't Fix.

Colin Ian King (colin-king) wrote :

Hrm, softlockups that break the journal aren't particularly rare on highly loaded servers. I'm a little concerned that a potential user space DoS attack could cause the log to discard important information that the system admin may use to help track down misbehaving system behaviour.

tags: added: id-597a8343fc8685288751db7a
Balint Reczey (rbalint) wrote :

@colin-king In this case the journal was not broken. Do you have many examples of really broken journals?

If so, do you have any solution in mind that would work better in scenarios you care about but don't cause regressions?

Proper infrastructure planning with (CPU) quotas in place where it matters seems to be a good practice instead of running highly loaded servers.

The 3 min default timer for systemd-journald seems appropriate or even generous.

Dimitri John Ledkov (xnox) wrote :

I see more chatter about journald aborting upstream and on the mailing lists. Imho, just because journald was not scheduled to run kind of means that maybe it has too low of a priority, or the system is overloaded. It does not show that journald is actually at fault here, and/or failing. Imho, one shouldn't be killing journald, when it is otherwise obviously operating fine (aka waiting to be run).

I'm concerned as to why there is a watchdog on journald now. It should be rocksolid, and either work or crash, there is no need to crash it on a fixed schedule just because.

Changed in systemd (Ubuntu Bionic):
status: Opinion → Confirmed
assignee: Balint Reczey (rbalint) → nobody
Balint Reczey (rbalint) wrote :

> and/or failing. Imho, one shouldn't be killing journald, when it is otherwise obviously
> operating fine (aka waiting to be run).

@xnox: How do you tell if there is no live lock and it is operating fine despite it timed out on a 3 minute timer?

> I'm concerned as to why there is a watchdog on journald now. It should be rocksolid,
> and either work or crash, there is no need to crash it on a fixed schedule just because.

If we drop the watchdog we won't get any new journal entries if it enters/tricked into an infinite loop. I don't think that would be wise.

There are upstream bugs with too little information for similar issues:
https://github.com/systemd/systemd/issues/2899
https://github.com/systemd/systemd/issues/2924

@xnox Do you have links with enough info for debugging?

Dimitri John Ledkov (xnox) wrote :

@colin-king

I'm thinking to make "journald" more important by setting Nice=-1, to hopefully make it prioritized more often.

There are also CPUSchedulingPolicy=, CPUSchedulingPriority=, IOSchedulingClass=, IOSchedulingPriority= -> i wonder if I should be adjusting these settings too, to make journald more "real-time" processed.

Changed in systemd (Ubuntu Bionic):
status: Confirmed → Fix Committed
tags: added: id-5ab94cf1ab21d3c7e6eee4ec
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 237-3ubuntu6

---------------
systemd (237-3ubuntu6) bionic; urgency=medium

  * Adjust the new dropin test, for v237 systemd.
  * Refresh the keyring patch, to the one merged.

 -- Dimitri John Ledkov <email address hidden> Tue, 27 Mar 2018 13:40:09 +0100

Changed in systemd (Ubuntu Bionic):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.