AWS ubuntu became unreachable after ssh login

Bug #1794169 reported by Andrii Petrenko
This bug affects 1 person
Affects           Status     Importance  Assigned to  Milestone
linux (Ubuntu)    Confirmed  Undecided   Unassigned
systemd (Ubuntu)  Invalid    Undecided   Unassigned

Bug Description

I've run into a strange situation with Ubuntu 18.04 LTS and the latest kernel on an AWS m5.xlarge instance.

The system became unreachable after a series of successful ssh logins. systemd --user became a zombie and blocked the main systemd daemon (PID 1).

I filed https://github.com/systemd/systemd/issues/10123, but it was closed with "there's a problem with your kernel": https://github.com/systemd/systemd/issues/10123#issuecomment-423984751

The symptoms are very similar to https://github.com/systemd/systemd/issues/8598

apetren+ 26679 0.0 0.0 0 0 ? Z 02:56 0:00 \_ [(sd-pam)] <defunct>
apetren+ 26855 0.0 0.0 76636 7816 ? Ds 02:57 0:00 /lib/systemd/systemd --user
apetren+ 26856 0.0 0.0 0 0 ? Z 02:57 0:00 \_ [(sd-pam)] <defunct>
apetren+ 26954 0.0 0.0 0 0 ? Zs 02:57 0:00 \_ [kill] <defunct>
apetren+ 27053 0.0 0.0 76636 7496 ? Ss 02:58 0:00 /lib/systemd/systemd --user
apetren+ 27054 0.0 0.0 193972 2768 ? S 02:58 0:00 \_ (sd-pam)
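
For anyone triaging a similar hang, the rows that matter in the listing above are those in state Z (zombie) and D (uninterruptible sleep). Below is a minimal sketch for pulling just those rows out of ps output. The function name stuck_processes is hypothetical, and the column order (PID, PPID, STAT, COMMAND) is an assumption that has to match the ps invocation shown in the comment.

```shell
# Keep only rows whose STAT column (3rd field) starts with Z (zombie)
# or D (uninterruptible sleep). Expects "PID PPID STAT COMMAND" rows
# on stdin; the function name is hypothetical, not an existing tool.
stuck_processes() {
    awk '$3 ~ /^[ZD]/ { print }'
}

# Assumed invocation on a live system (procps ps, headers suppressed):
#   ps -e -o pid= -o ppid= -o stat= -o comm= | stuck_processes
```

The PPID column shows which parent has not yet reaped each zombie, which is the relationship the tree view above expresses with "\_".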

This reproduces on 7 instances, 1-2 times per week.

How to reproduce:
1. Install Ubuntu 18.04 LTS from the official Ubuntu image.
2. Update the kernel and packages to the latest version.
3. From another instance, run:

while true; do ssh <email address hidden> "hostname; ps -ef | grep defunct | grep -v grep"; done

With this command, after a couple of days I get 2 -> 4 -> 6 -> 8... zombies, and within an hour the system is frozen.
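
The reproduction loop above runs forever and needs someone watching its output. A bounded variant that stops at the first defunct process could look like the sketch below. The function name repro_until_defunct, the iteration limit, and the host user@host.example in the usage comment are all hypothetical; the command to run is passed as arguments so that any ssh invocation can be substituted.

```shell
# Run the given command up to MAX times and stop as soon as its output
# contains "defunct". Usage: repro_until_defunct MAX command [args...]
# (hypothetical helper, not part of any package).
repro_until_defunct() {
    max=$1
    shift
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        # grep runs locally, so it never shows up in the remote ps output.
        if "$@" 2>/dev/null | grep -q 'defunct'; then
            echo "defunct process seen on iteration $i"
            return 0
        fi
    done
    echo "no defunct process after $max iterations"
    return 1
}

# Assumed usage (hypothetical host):
#   repro_until_defunct 100000 ssh user@host.example 'ps -ef'
```

Stopping at the first zombie leaves time to collect logs before PID 1 becomes unresponsive.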

sudo reboot does not work because systemd (PID 1) is unresponsive. kill -9 1 does not work either.

# uname -a
Linux mainframe04 4.15.0-1021-aws #21-Ubuntu SMP Tue Aug 28 10:23:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"

# systemd --version
systemd 237
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid

AWS instance m5.xlarge

Please let me know if you need any information.
---
ProblemType: Bug
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
CRDA: N/A
DistroRelease: Ubuntu 18.04
Ec2AMI: ami-91caece9
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-west-2b
Ec2InstanceType: m5.xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Amazon EC2 m5.xlarge
Package: systemd 237-3ubuntu10.3
PackageArchitecture: amd64
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-1021-aws root=UUID=a4278387-5a07-46eb-a726-ae1e22673af4 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
ProcVersionSignature: Ubuntu 4.15.0-1021.21-aws 4.15.18
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-1021-aws N/A
 linux-backports-modules-4.15.0-1021-aws N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: bionic ec2-images
Uname: Linux 4.15.0-1021-aws x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: docker
WifiSyslog:

_MarkForUpload: True
dmi.bios.date: 10/16/2017
dmi.bios.vendor: Amazon EC2
dmi.bios.version: 1.0
dmi.board.asset.tag: i-02cfe7fcd8a997085
dmi.board.vendor: Amazon EC2
dmi.chassis.asset.tag: Amazon EC2
dmi.chassis.type: 1
dmi.chassis.vendor: Amazon EC2
dmi.modalias: dmi:bvnAmazonEC2:bvr1.0:bd10/16/2017:svnAmazonEC2:pnm5.xlarge:pvr:rvnAmazonEC2:rn:rvr:cvnAmazonEC2:ct1:cvr:
dmi.product.name: m5.xlarge
dmi.sys.vendor: Amazon EC2

Andrii Petrenko (aplsms)
tags: added: systemd
tags: added: kernel pam
Steve Beattie (sbeattie) wrote :

Not sure whether the issue is a poor interaction with sd-pam and the kernel or strictly a kernel issue.

Kernel timeout backtrace:

Sep 21 03:00:33 mainframe01 kernel: [292411.276266] Not tainted 4.15.0-1021-aws #21-Ubuntu
Sep 21 03:00:33 mainframe01 kernel: [292411.277931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 03:00:33 mainframe01 kernel: [292411.280331] kworker/u8:5 D 0 25806 2 0x80000080
Sep 21 03:00:33 mainframe01 kernel: [292411.280339] Workqueue: events_unbound fsnotify_mark_destroy_workfn
Sep 21 03:00:33 mainframe01 kernel: [292411.280340] Call Trace:
Sep 21 03:00:33 mainframe01 kernel: [292411.280347] __schedule+0x291/0x8a0
Sep 21 03:00:33 mainframe01 kernel: [292411.280349] schedule+0x2c/0x80
Sep 21 03:00:33 mainframe01 kernel: [292411.280350] schedule_timeout+0x1cf/0x350
Sep 21 03:00:33 mainframe01 kernel: [292411.280354] ? add_timer+0x124/0x280
Sep 21 03:00:33 mainframe01 kernel: [292411.280357] wait_for_completion+0xba/0x140
Sep 21 03:00:33 mainframe01 kernel: [292411.280362] ? wake_up_q+0x80/0x80
Sep 21 03:00:33 mainframe01 kernel: [292411.280365] __synchronize_srcu.part.13+0x85/0xb0
Sep 21 03:00:33 mainframe01 kernel: [292411.280367] ? trace_raw_output_rcu_utilization+0x50/0x50
Sep 21 03:00:33 mainframe01 kernel: [292411.280369] synchronize_srcu+0x66/0xe0
Sep 21 03:00:33 mainframe01 kernel: [292411.280370] ? synchronize_srcu+0x66/0xe0
Sep 21 03:00:33 mainframe01 kernel: [292411.280372] fsnotify_mark_destroy_workfn+0x7b/0xe0
Sep 21 03:00:33 mainframe01 kernel: [292411.280375] process_one_work+0x1de/0x410
Sep 21 03:00:33 mainframe01 kernel: [292411.280377] worker_thread+0x253/0x410
Sep 21 03:00:33 mainframe01 kernel: [292411.280379] kthread+0x121/0x140
Sep 21 03:00:33 mainframe01 kernel: [292411.280380] ? process_one_work+0x410/0x410
Sep 21 03:00:33 mainframe01 kernel: [292411.280382] ? kthread_create_worker_on_cpu+0x70/0x70
Sep 21 03:00:33 mainframe01 kernel: [292411.280385] ? do_syscall_64+0x73/0x130
Sep 21 03:00:33 mainframe01 kernel: [292411.280387] ? SyS_exit+0x17/0x20
Sep 21 03:00:33 mainframe01 kernel: [292411.280391] ret_from_fork+0x35/0x40
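
In the backtrace above, frames prefixed with "?" are addresses that were found on the stack but not confirmed as part of the call chain, so the remaining frames show where the worker is actually blocked. Below is a minimal sketch for reducing such a trace to those frames. The helper name reliable_frames is hypothetical, and it assumes the "symbol+0xOFFSET/0xSIZE" frame format seen in this log.

```shell
# Read kernel log lines on stdin and print the symbol name of every
# call-trace frame that is NOT marked with a leading "?".
# Hypothetical helper; assumes frames look like "] symbol+0x1f/0x80".
reliable_frames() {
    sed -n 's|^.*\] *\([A-Za-z_][A-Za-z0-9_.]*\)+0x[0-9a-f]*/0x[0-9a-f]*[[:space:]]*$|\1|p'
}

# Assumed usage on a live system:
#   journalctl -k | reliable_frames
#   dmesg | reliable_frames
```

Applied to this trace, the filter keeps the chain from fsnotify_mark_destroy_workfn through synchronize_srcu down to __schedule, and drops entries such as "? add_timer".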

information type: Private Security → Public Security
tags: added: bionic
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1794169

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Andrii Petrenko (aplsms) wrote : AudioDevicesInUse.txt

apport information

tags: added: apport-collected ec2-images
description: updated
Andrii Petrenko (aplsms) wrote : CurrentDmesg.txt

apport information

Andrii Petrenko (aplsms) wrote : Dependencies.txt

apport information

Andrii Petrenko (aplsms) wrote : Lspci.txt

apport information

Andrii Petrenko (aplsms) wrote : ProcCpuinfo.txt

apport information

Andrii Petrenko (aplsms) wrote : ProcCpuinfoMinimal.txt

apport information

Andrii Petrenko (aplsms) wrote : ProcInterrupts.txt

apport information

Andrii Petrenko (aplsms) wrote : ProcModules.txt

apport information

Andrii Petrenko (aplsms) wrote : SystemdDelta.txt

apport information

Andrii Petrenko (aplsms) wrote : UdevDb.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Andrii Petrenko (aplsms) wrote :

Information attached and status changed.

I'm using linux-image-4.15.0-1021-aws. According to https://packages.ubuntu.com/bionic/linux-image-4.15.0-1021-aws, it is maintained by Ubuntu Core.

Also, I'm not 100% sure that the issue is in the kernel.

information type: Public Security → Public
Brad Figg (brad-figg)
tags: added: cscc
Dan Streetman (ddstreet) wrote :

Please reopen if this is still an issue.

Changed in systemd (Ubuntu):
status: New → Invalid