systemd: Failed to send signal

Bug #1783499 reported by Shuang Liu
This bug affects 2 people
Affects           Status    Importance  Assigned to  Milestone
dbus (Ubuntu)     Invalid   Undecided   Unassigned
systemd (Ubuntu)  Invalid   Undecided   Unassigned

Bug Description

systemd: Failed to send signal.

[ 3.137257] systemd[1]: Failed to send job remove signal for 109: Connection reset by peer
[ 3.138119] systemd[1]: run-rpc_pipefs.mount: Failed to send unit change signal for run-rpc_pipefs.mount: Transport endpoint is not connected
[ 3.138185] systemd[1]: dev-mapper-ubuntu\x2d\x2dvg\x2droot.device: Failed to send unit change signal for dev-mapper-ubuntu\x2d\x2dvg\x2droot.device: Transport endpoint is not connected
[ 3.138512] systemd[1]: run-rpc_pipefs.mount: Failed to send unit change signal for run-rpc_pipefs.mount: Transport endpoint is not connected
[ 3.142719] systemd[1]: Failed to send job remove signal for 134: Transport endpoint is not connected
[ 3.142958] systemd[1]: auth-rpcgss-module.service: Failed to send unit change signal for auth-rpcgss-module.service: Transport endpoint is not connected
[ 3.165359] systemd[1]: Failed to send job remove signal for 133: Transport endpoint is not connected
[ 3.165505] systemd[1]: proc-fs-nfsd.mount: Failed to send unit change signal for proc-fs-nfsd.mount: Transport endpoint is not connected
[ 3.165541] systemd[1]: dev-mapper-ubuntu\x2d\x2dvg\x2droot.device: Failed to send unit change signal for dev-mapper-ubuntu\x2d\x2dvg\x2droot.device: Transport endpoint is not connected
[ 3.166854] systemd[1]: Failed to send job remove signal for 66: Transport endpoint is not connected
[ 3.167072] systemd[1]: proc-fs-nfsd.mount: Failed to send unit change signal for proc-fs-nfsd.mount: Transport endpoint is not connected
[ 3.167130] systemd[1]: systemd-modules-load.service: Failed to send unit change signal for systemd-modules-load.service: Transport endpoint is not connected
[ 2.929018] systemd[1]: Failed to send job remove signal for 53: Transport endpoint is not connected
[ 2.929220] systemd[1]: systemd-random-seed.service: Failed to send unit change signal for systemd-random-seed.service: Transport endpoint is not connected
[ 3.024320] systemd[1]: sys-devices-platform-serial8250-tty-ttyS12.device: Failed to send unit change signal for sys-devices-platform-serial8250-tty-ttyS12.device: Transport endpoint is not connected
[ 3.024421] systemd[1]: dev-ttyS12.device: Failed to send unit change signal for dev-ttyS12.device: Transport endpoint is not connected
[ 3.547019] systemd[1]: proc-sys-fs-binfmt_misc.automount: Failed to send unit change signal for proc-sys-fs-binfmt_misc.automount: Connection reset by peer
[ 3.547144] systemd[1]: Failed to send job change signal for 207: Transport endpoint is not connected

How to reproduce (a minimal sketch of these steps follows below):
1. Enable debug-level journal logging: set LogLevel=debug in /etc/systemd/system.conf
2. Reboot the system
3. Run journalctl | grep "Failed to send"
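
A rough sketch of the reproduction steps above (run as root; assumes the stock commented-out LogLevel line in system.conf):

sed -i 's/^#\?LogLevel=.*/LogLevel=debug/' /etc/systemd/system.conf   # enable debug logging
reboot
journalctl | grep "Failed to send"   # inspect the journal after the reboot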

sliu@vmlxhi-094:~$ lsb_release -rd
Description: Ubuntu 16.04.4 LTS
Release: 16.04

sliu@vmlxhi-094:~$ systemctl --version
systemd 229
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN

sliu@vmlxhi-094:~$ dbus-daemon --version
D-Bus Message Bus Daemon 1.10.6
Copyright (C) 2002, 2003 Red Hat, Inc., CodeFactory AB, and others
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Tags: dbus systemd
Revision history for this message
Attila (acraciun) wrote :

Here at Mozilla, we have 200 servers running on an HP Moonshot system; all have the same hardware configuration and Ubuntu 16.04.2. The OS is not up to date; we use it as it was released. We use a program to test Firefox source code, and after each test we reboot the servers using /sbin/reboot. After a while (between 24-48h, during which ~6 reboots/h are made), the servers randomly get stuck at reboot - see the ILO capture - and to bring them back we have to power cycle each of them.

On one of the beta servers, we made the updates/changes below, enabled debug logging, and set a cron job to reboot the server every 5-10 minutes; however, the reboot freeze is still present (the GRUB options were applied roughly as sketched after this list):
- upgraded the OS to Ubuntu 16.04.5 with the latest packages;
- used GRUB_CMDLINE_LINUX_DEFAULT="reboot=bios"
- used GRUB_CMDLINE_LINUX_DEFAULT="acpi=off"
- used GRUB_CMDLINE_LINUX_DEFAULT="reboot=force"
- upgraded the kernel to v4.15 (the main one from Ubuntu's repo);
- upgraded the kernel to v4.20 from https://kernel.ubuntu.com/~kernel-ppa/mainline/
- now we are testing the reboot with 4.20.3 from the above repo and working to update systemd.
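
A sketch of how a GRUB command-line option like the ones above is typically applied on Ubuntu (not necessarily the reporter's exact procedure):

sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT=.*/GRUB_CMDLINE_LINUX_DEFAULT="reboot=force"/' /etc/default/grub
update-grub   # regenerate /boot/grub/grub.cfg with the new kernel command line
reboot        # the option takes effect on the next boot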

Attached you can find the debug logs for:
- kernel 4.4.0-66-generic #87-Ubuntu - shutdown-debuglogkernel-4.4.txt
- kernel 4.15 - shutdown-log-kernel4-15.txt
- kernel 4.20 - shutdown-log-kernel420.txt
- ILO capture with the freeze - ILO-reboot-freeze.PNG

Please check all these logs/captures and let us know a solution. Thanks.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dbus (Ubuntu):
status: New → Confirmed
Changed in systemd (Ubuntu):
status: New → Confirmed
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Have you tried other reboot quirks like "reboot=pci"? It may help.

Revision history for this message
Attila (acraciun) wrote :

I have not tried reboot=pci; I will try it now. Thanks.

Revision history for this message
Attila (acraciun) wrote :

reboot=pci does not help; the server got stuck after 19 hours (rebooting once every 5 minutes). Attached is the debug log. On ILO we get the same as in my first post - see the capture in the archive.

We'll update systemd to 237 and test it.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Please let the hardware vendor know. This is highly likely a platform bug.
AFAICT, systemd has done its part, so the bug is either in the kernel (less likely) or in the BIOS.

Revision history for this message
Attila (acraciun) wrote :

We will notify them if the systemd upgrade does not fix the issue. So far, after the systemd upgrade, the server has rebooted fine in a 24h test (rebooting once every 5 minutes).

Revision history for this message
Attila (acraciun) wrote :

After 36h of rebooting, the system is stuck. Attached is the debug log.

Revision history for this message
Attila (acraciun) wrote :

We sent all the logs and capture data to HPE, and they asked us to update the BIOS to the latest version. We are already using the latest firmware, which is 4 months newer than the recommended one. Also, we have 200 servers with Windows 10 doing the same test - reboot after each test - and there are no hangs. HPE stated that they have no other reports of hangs on reboots or shutdowns for this hardware.

Any other suggestion from your side?

Meanwhile, I'll do a dist-upgrade from 16.04 to 18.04 and run the reboot test.

Revision history for this message
Attila (acraciun) wrote :

After 5 hours of rebooting, the server is stuck. See the attached debug log and ILO capture.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

> Any other suggestion from your side?
The system hang in this bug is really hardware-related, so HPE needs to take a deeper look.

For someone like us who cannot dig into firmware/hardware and needs a solution at the software level, I would in general start by disabling runtime power management for all devices (for example, as sketched below).
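
A minimal sketch of disabling runtime power management for all PCI devices via sysfs (one common way to do this; not a command given in this thread, and not persistent across reboots):

for d in /sys/bus/pci/devices/*/power/control; do
    echo on > "$d"   # "on" keeps the device fully powered and disables runtime suspend
done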

Revision history for this message
Attila (acraciun) wrote :

I have disabled the acpid units (acpid.path, acpid.service, acpid.socket) - roughly as sketched below - and set up the reboot once every 5 minutes. The server got stuck after 13h.
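
A sketch of how those units would typically be disabled (the exact commands are not shown in the report):

systemctl stop acpid.path acpid.service acpid.socket      # stop them now
systemctl disable acpid.path acpid.service acpid.socket   # keep them from starting at boot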

Something else to try?

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Try kernel parameter "acpi=off".

Revision history for this message
Attila (acraciun) wrote :

That was already tested, with no luck.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Well, I don't think trial and error will get any fruitful result; this issue really needs the hardware vendor to investigate.

Revision history for this message
Attila (acraciun) wrote :

We installed CentOS 7 and have started the reboot test now. If this fails again, it is a hardware issue; if not, it may be an Ubuntu issue.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Or a kernel regression; CentOS 7 uses an older kernel.

Revision history for this message
Attila (acraciun) wrote :

This got stuck within 5 hours at most. We'll try to install a non-systemd Ubuntu, like 14.04, to test it.

Revision history for this message
Attila (acraciun) wrote :

Tested with multiple OSes, all stuck: Ubuntu 14.04, Fedora 29, Arch Linux current (2019.02.01).

We received a detailed note from the next level of support at HPE and decided to test with Ubuntu 16.04 (our production release) and add "ps -ef >> /shutdown-log.txt" to the debug.sh script (a sketch of such a hook follows). Maybe we can see something that is not closed/terminated and prevents the server from rebooting.
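
A minimal sketch of a shutdown logging hook, assuming debug.sh is the usual systemd system-shutdown hook (e.g. /usr/lib/systemd/system-shutdown/debug.sh) rather than the reporter's exact script:

#!/bin/sh
mount -o remount,rw /            # make the root filesystem writable so we can log
dmesg > /shutdown-log.txt        # capture the kernel log at shutdown time
ps -ef >> /shutdown-log.txt      # record any processes still running, as suggested above
mount -o remount,ro /            # restore read-only root before the final reboot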

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

I think it is stuck in firmware/hardware rather than in a userspace process.

Revision history for this message
Attila (acraciun) wrote :

The test server got stuck after ~9h; here is the debug log.

Revision history for this message
Attila (acraciun) wrote :

I have activated RuntimeWatchdogSec=20s and ShutdownWatchdogSec=1min in /etc/systemd/system.conf (roughly as sketched below) and set up the reboot; the server got stuck after 105 minutes.
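
A sketch of the watchdog settings described above, assuming they were added to the [Manager] section of /etc/systemd/system.conf (run as root; systemd picks them up after a daemon re-exec or on the next boot):

cat <<'EOF' >> /etc/systemd/system.conf
[Manager]
RuntimeWatchdogSec=20s
ShutdownWatchdogSec=1min
EOF
systemctl daemon-reexec   # re-execute systemd so the watchdog settings take effect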

Revision history for this message
Attila (acraciun) wrote :

I have found that if I set the reboot to once every 15 minutes, the server works for 5-6 days. If the reboots are done once every 10 minutes, the server gets stuck within 24-48h at most. Now I'm testing reboots once every 20 minutes; it should take 10 days or more until it gets stuck (a sketch of such a cron entry is below).

Also, I have tried all "reboot=" GRUB options; nothing helps. The only way to keep the server online is to make the reboots less frequent. Odd!
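
A hypothetical sketch of the reboot cron entry described above (the exact crontab line is not shown in the report); placed in root's crontab, this reboots the machine every 20 minutes:

*/20 * * * * /sbin/reboot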

Revision history for this message
Dan Streetman (ddstreet) wrote :

Please reopen if this is still an issue.

Changed in systemd (Ubuntu):
status: Confirmed → Invalid
Changed in dbus (Ubuntu):
status: Confirmed → Invalid