systemd-journald, udev, logind crashed with SIGABRT -- get killed by 1 min watchdog timeout on longer kernel lockups

Bug #1433320 reported by Fila Kolodny
510
This bug affects 41 people
Affects Status Importance Assigned to Milestone
apport (Ubuntu)
Fix Released
High
Martin Pitt
systemd (Ubuntu)
Won't Fix
Medium
Unassigned

Bug Description

Errors Bucket
-------------
https://errors.ubuntu.com/bucket/?id=/lib/systemd/systemd-journald%3A6%3A__epoll_wait_nocancel%3Asd_event_wait%3Asd_event_run%3Amain

Sometimes userspace get locked up hard: A QEMU instance sees long hangs while the host gets suspended, USB devices/drivers lock up for some time (like in this bug's dmesg), or Google hangouts in bug 1495585.
The kernel detects that some process(es) get hard-locked up, the hardware watchdog (which systemd configures for some crucial services to auto-restart them via "WatchdogSec=1min") picks this up and restarts the process.

In that case, these crash reports only describe the symptom, not the cause. They should be suppressed since they provide no information that can be used to diagnose the actual root problem.

ProblemType: CrashDistroRelease: Ubuntu 15.04
Package: systemd 219-4ubuntu5 [modified: usr/share/dbus-1/system-services/org.freedesktop.systemd1.service]
ProcVersionSignature: Ubuntu 3.19.0-8.8-generic 3.19.1
Uname: Linux 3.19.0-8-generic x86_64
ApportVersion: 2.16.2-0ubuntu3
Architecture: amd64
Date: Tue Mar 17 17:18:08 2015
ExecutablePath: /lib/systemd/systemd-journald
InstallationDate: Installed on 2014-02-24 (386 days ago)
InstallationMedia: Ubuntu 13.10 "Saucy Salamander" - Release amd64 (20131016.1)
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub
 Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: VMware, Inc. VMware Virtual Platform
ProcCmdline: /lib/systemd/systemd-journald
ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.19.0-8-generic root=UUID=7f776823-8240-460b-ab2d-45c8a2b2f2b6 ro quiet splash
Signal: 6SourcePackage: systemd
StacktraceTop:
 __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
 ?? ()
 ?? ()
 __libc_start_main (main=0x7f94ef54c760, argc=1, argv=0x7ffe7c6e2fa8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe7c6e2f98) at libc-start.c:287
 ?? ()
Title: systemd-journald crashed with SIGABRT in __epoll_wait_nocancel()
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

dmi.bios.date: 05/20/2014
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: 6.00
dmi.board.name: 440BX Desktop Reference Platform
dmi.board.vendor: Intel Corporation
dmi.board.version: None
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 1
dmi.chassis.vendor: No Enclosure
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd05/20/2014:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:
dmi.product.name: VMware Virtual Platform
dmi.product.version: None
dmi.sys.vendor: VMware, Inc.

CVE References

Revision history for this message
Fila Kolodny (fila) wrote :
information type: Private → Public
Revision history for this message
Apport retracing service (apport) wrote :

StacktraceTop:
 __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
 sd_event_wait (e=0x7f94f0fc5230, timeout=<optimized out>) at ../src/libsystemd/sd-event/sd-event.c:2373
 sd_event_run (timeout=<optimized out>, e=<optimized out>) at ../src/libsystemd/sd-event/sd-event.c:2500
 main (argc=8, argv=0x7ffe7c6e2850) at ../src/journal/journald.c:109

Revision history for this message
Apport retracing service (apport) wrote : Stacktrace.txt
Revision history for this message
Apport retracing service (apport) wrote : StacktraceSource.txt
Revision history for this message
Apport retracing service (apport) wrote : ThreadStacktrace.txt
Changed in systemd (Ubuntu):
importance: Undecided → Medium
tags: removed: need-amd64-retrace
Martin Pitt (pitti)
summary: - systemd-journald crashed with SIGABRT in __epoll_wait_nocancel()
+ systemd-journald crashed with SIGABRT in sd_event_wait()
Revision history for this message
Launchpad Janitor (janitor) wrote : Re: systemd-journald crashed with SIGABRT in sd_event_wait()

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
description: updated
Revision history for this message
Martin Pitt (pitti) wrote :
Changed in systemd (Ubuntu):
status: Confirmed → Fix Committed
Revision history for this message
Martin Pitt (pitti) wrote :

Sorry, wrong bug. This *might* be a duplicate of bug 1437896 due to memory corruption, but it's too early to tell as none of the duplicates here have a reproducer. We'll see if we will stop getting new reports once the fix for bug 1437896 lands.

Changed in systemd (Ubuntu):
status: Fix Committed → Confirmed
tags: added: wily
Revision history for this message
Till Kamppeter (till-kamppeter) wrote :

For me this happens only on a QEMU-based virtual machine, not on real iron.

Martin Pitt (pitti)
summary: - systemd-journald crashed with SIGABRT in sd_event_wait()
+ systemd-journald, udev, logind get killed by 1 min watchdog timeout
Revision history for this message
Martin Pitt (pitti) wrote : Re: systemd-journald, udev, logind get killed by 1 min watchdog timeout

There is nothing specific to epoll_wait(), that's just the function that these daemons wait in for practially all the time. The SIGABRT is due to systemd killing/restarting the processes as they don't respond withing 1 minute (watchdog). In Fila's case the kernel reports hung tasks (in a page fault, i. e. slow/broken swap?), but in Till's reports there is no other apparent error message.

This is not just limited to QEMU, e. g. duplicate bug 1438805 is a real machine with a hardware watchdog. So we still only merely know the symptoms -- something stalls these QEMU or real laptop processes, but we don't know yet how. Does anyone have some observations/recipes how to reproduce this? I. e. did you start a lot of QEMU instances in parallel or something else what could make these slow? Maybe a running SSD TRIM in the background?

Changed in systemd (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Till Kamppeter (till-kamppeter) wrote :

My situation is more or less the following: On my laptop I have several virtual machines (usually 1 or 2 running at the same time) to have different versions of Ubuntu handy and also to test network printing use cases using only one physical machine, both for development of the printing part of Ubuntu.

Now to move the laptop from on spot to the other or over night I simply close the lid which does a suspend-to-RAM. stopping the system but keeping its state. When I resume by opening the lid everything gets running again from the point where I have left off, but with the clock beinbg on the current hour then and probably suggesting to certain processes that hours have passed waiting for an answer and causing timeouts. This couls probably be the cause of the crashes.

Revision history for this message
Martin Pitt (pitti) wrote :

Till> ah, so this involves suspending the VMs? I figure QEMU's watchdog then doesn't handle this very well. Do other submitters also experience this during suspend?

Martin Pitt (pitti)
summary: - systemd-journald, udev, logind get killed by 1 min watchdog timeout
+ systemd-journald, udev, logind crashed with SIGABRT in
+ __epoll_wait_nocancel() -- get killed by 1 min watchdog timeout
Revision history for this message
Martin Pitt (pitti) wrote : Re: systemd-journald, udev, logind crashed with SIGABRT in __epoll_wait_nocancel() -- get killed by 1 min watchdog timeout

Steve analyzed this in the duplicate bug 1495585. I'll adjust the description here to make it easier to find/understand.

Changed in systemd (Ubuntu):
status: Incomplete → Won't Fix
Changed in apport (Ubuntu):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Martin Pitt (pitti)
description: updated
summary: - systemd-journald, udev, logind crashed with SIGABRT in
- __epoll_wait_nocancel() -- get killed by 1 min watchdog timeout
+ systemd-journald, udev, logind crashed with SIGABRT -- get killed by 1
+ min watchdog timeout on longer kernel lockups
Changed in apport (Ubuntu):
status: Triaged → In Progress
Martin Pitt (pitti)
Changed in systemd (Ubuntu):
status: Won't Fix → Triaged
Martin Pitt (pitti)
Changed in apport (Ubuntu):
status: In Progress → Fix Committed
Changed in systemd (Ubuntu):
status: Triaged → Won't Fix
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package apport - 2.19-0ubuntu1

---------------
apport (2.19-0ubuntu1) wily; urgency=medium

  * New upstream release:
    - apport: Drop re-nicing. This might decrease the time a user has to wait
      for apport to finish the core dump for a crashed/hanging foreground
      process. (See LP #1278780)
    - kernel_crashdump: Enforce that the log/dmesg files are not a symlink.
      This prevents normal users from pre-creating a symlink to the
      predictable .crash file, and thus triggering a "fill up disk" DoS attack
      when the .crash report tries to include itself. Thanks to halfdog for
      discovering this! (CVE-2015-1338, part of LP #1492570)
    - SECURITY FIX: Fix all writers of report files (package_hook,
      kernel_crashdump, and similar) to open the report file exclusively,
      i. e. fail if they already exist. This prevents privilege escalation
      through symlink attacks. Note that this will also prevent overwriting
      previous reports with the same same. Thanks to halfdog for discovering
      this! (CVE-2015-1338, LP: #1492570)
    - apport: Ignore process restarts from systemd's watchdog. Their traces
      are usually useless as they don't have any information about the actual
      reasaon why processes hang (like VM suspends or kernel lockups with bad
      hardware) (LP: #1433320)

 -- Martin Pitt <email address hidden> Thu, 24 Sep 2015 14:41:54 +0200

Changed in apport (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Till Kamppeter (till-kamppeter) wrote :

For my QEMU-based virtual machines this fix seems not to have worked out. I still get the same amount of crash reports even with the fixed Apport installed.

Revision history for this message
Till Kamppeter (till-kamppeter) wrote :

See my duplicate bug report bug 1500968 as an example for a crash report which occured with the fixed Apport installed.

Revision history for this message
Till Kamppeter (till-kamppeter) wrote :

pitti, the problem seems not to be fixed yet, today I got this one: bug 1504167

Revision history for this message
Till Kamppeter (till-kamppeter) wrote :

pitti, seems still not to be fixed, I started my QEMU virtual machine with Wily and got: bug 1506796, bug 1506797, bug 1506798, bug 1506799, bug 1506800, and bug 1506801.

Revision history for this message
Till Kamppeter (till-kamppeter) wrote :

pitti, I still get the same crashes as I get all the time. Seems still not be fixed.

Revision history for this message
Martin Pitt (pitti) wrote :

Well, the "crash" itself can't be fixed in userspace -- your kernel causes very long hangs in userspace processes for some reason, and the watchdog timer sees this and kills/restarts the hung processes.

Revision history for this message
Till Kamppeter (till-kamppeter) wrote :

pitti, the problem occurs only on a virtual machine which is on a laptop which gets frequently suspended (suspend to RAM). So processing gets stopped at a point of time and when it carries on after waking up from suspend the watchdog timers have all timed out. They should be stopped during the suspend, too.

Revision history for this message
Yuan Song (ysong) wrote :

I got this bug pretty frequently on my 16.04 desktop(4.4.0-36-generic #55)

Program received signal SIGABRT, Aborted.
0x00007f8b6afa160a in timerfd_settime () at ../sysdeps/unix/syscall-template.S:84
#0 0x00007f8b6afa160a in timerfd_settime () at ../sysdeps/unix/syscall-template.S:84
#1 0x000055d36e6c416e in ?? ()
#2 0x000055d36e69f4ea in ?? ()
#3 0x00007f8b6aeba830 in __libc_start_main (main=0x55d36e69e540, argc=1, argv=0x7fffecd24e58, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffecd24e48)
    at ../csu/libc-start.c:291
#4 0x000055d36e69f8c9 in ?? ()
A debugging session is active.

Revision history for this message
Joe Marty (josephmarty+ubuntu) wrote :

I'm having this same error occur on a new Ubuntu 18.04 VM, on VMWare Player. I don't understand why it was marked as "won't fix" for systemd... ?

This is happening when I suspend my host machine (windows 10). Upon resuming, the VM has crashed and goes through the startup process, then reports this error.

Revision history for this message
Joe Marty (josephmarty+ubuntu) wrote :

I don't know if this will help, but here's the vmware log:

Revision history for this message
Joe Marty (josephmarty+ubuntu) wrote :

Incidentally, I only get this bug in a virtual machine that I had "copied" from one computer to another. I was able to re-create the virtual machine from scratch (using the same install image, and installing the same software, etc) on the new computer, and did not get the same issue in the new VM.

Revision history for this message
Joe Marty (josephmarty+ubuntu) wrote :

I spoke too soon. I'm now getting the same error in the new VM, which was created from scratch.

Revision history for this message
itszn (itszn) wrote :

I have just started using an Ubuntu 18.04.1 LTS server image in VMWare Workstation 15 this week.
When ever my host suspends, the systemd's watchdog times out many services including logind. This causes the x server to crash. (Which is really annoying since it kills all my terminals)

This is reproducible every time I suspend my host.

Systemd logs the following:
ubuntu systemd[1]: systemd-logind.service: Watchdog timeout (limit 3min)!
ubuntu systemd[1]: systemd-logind.service: Killing process 1122 (systemd-logind) with signal SIGABRT.

Increasing the watchdog time does not seem to help as it still is killed after the longer time.

I'm going to try removing the WatchdogSec option but that seems like a hack.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.