AWS ubuntu became unreachable after ssh login
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) | Confirmed | Undecided | Unassigned | | |
systemd (Ubuntu) | Invalid | Undecided | Unassigned | | |
Bug Description
I've run into a strange situation with Ubuntu 18.04 LTS with the latest kernel on an AWS m5.xlarge instance.
The system became unreachable after a series of successful ssh logins. systemd --user became a zombie and blocked the main systemd daemon (PID 1).
I've created bug https:/
Symptoms are very similar to https:/
apetren+ 26679 0.0 0.0 0 0 ? Z 02:56 0:00 \_ [(sd-pam)] <defunct>
apetren+ 26855 0.0 0.0 76636 7816 ? Ds 02:57 0:00 /lib/systemd/
apetren+ 26856 0.0 0.0 0 0 ? Z 02:57 0:00 \_ [(sd-pam)] <defunct>
apetren+ 26954 0.0 0.0 0 0 ? Zs 02:57 0:00 \_ [kill] <defunct>
apetren+ 27053 0.0 0.0 76636 7496 ? Ss 02:58 0:00 /lib/systemd/
apetren+ 27054 0.0 0.0 193972 2768 ? S 02:58 0:00 \_ (sd-pam)
This situation recurs on 7 instances, 1-2 times per week.
How to repeat:
1. Install Ubuntu 18.04 LTS from the official Ubuntu image.
2. Update the kernel and packages to the latest versions.
3. From another instance, run:
while true; do ssh <email address hidden> "hostname; ps -ef | grep defunct | grep -v grep"; done
With this command, within a couple of days I get 2 -> 4 -> 6 -> 8... zombies, and within an hour the system is frozen.
sudo reboot does not work, because systemd (PID 1) is unreachable. kill -9 1 does not work either.
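To track how the zombie count grows between logins, something like the following could be run on the affected instance. This is only a minimal sketch and not part of the original reproduction: the function name `count_zombies` is hypothetical, and it assumes input lines of the form "PID STAT COMMAND" as produced by procps `ps`.

```shell
# count_zombies: hypothetical helper, not from the original report.
# Reads "PID STAT COMMAND" lines on stdin and prints how many of them
# have a process state starting with Z (zombie / defunct).
count_zombies() {
    awk '$2 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Possible usage on a live system (the trailing "=" suppresses headers):
#   ps -eo pid=,stat=,comm= | count_zombies
```

Logging this number with a timestamp every few minutes should show the 2 -> 4 -> 6 -> 8 growth described above before the instance stops responding.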
# uname -a
Linux mainframe04 4.15.0-1021-aws #21-Ubuntu SMP Tue Aug 28 10:23:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_
DISTRIB_
DISTRIB_
# systemd --version
systemd 237
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-
AWS instance m5.xlarge
Please let me know if you need any further information.
---
ProblemType: Bug
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
CRDA: N/A
DistroRelease: Ubuntu 18.04
Ec2AMI: ami-91caece9
Ec2AMIManifest: (unknown)
Ec2Availability
Ec2InstanceType: m5.xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Amazon EC2 m5.xlarge
Package: systemd 237-3ubuntu10.3
PackageArchitec
PciMultimedia:
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
XDG_RUNTIME_
LANG=C.UTF-8
SHELL=/bin/bash
ProcFB:
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: bionic ec2-images
Uname: Linux 4.15.0-1021-aws x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: docker
WifiSyslog:
_MarkForUpload: True
dmi.bios.date: 10/16/2017
dmi.bios.vendor: Amazon EC2
dmi.bios.version: 1.0
dmi.board.
dmi.board.vendor: Amazon EC2
dmi.chassis.
dmi.chassis.type: 1
dmi.chassis.vendor: Amazon EC2
dmi.modalias: dmi:bvnAmazonEC
dmi.product.name: m5.xlarge
dmi.sys.vendor: Amazon EC2
tags: added: systemd
tags: added: kernel pam
information type: Public Security → Public
tags: added: cscc
Not sure whether the issue is a poor interaction between sd-pam and the kernel or strictly a kernel issue.
Kernel timeout backtrace:
Sep 21 03:00:33 mainframe01 kernel: [292411.276266] Not tainted 4.15.0-1021-aws #21-Ubuntu
Sep 21 03:00:33 mainframe01 kernel: [292411.277931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 03:00:33 mainframe01 kernel: [292411.280331] kworker/u8:5 D 0 25806 2 0x80000080
Sep 21 03:00:33 mainframe01 kernel: [292411.280339] Workqueue: events_unbound fsnotify_mark_destroy_workfn
Sep 21 03:00:33 mainframe01 kernel: [292411.280340] Call Trace:
Sep 21 03:00:33 mainframe01 kernel: [292411.280347] __schedule+0x291/0x8a0
Sep 21 03:00:33 mainframe01 kernel: [292411.280349] schedule+0x2c/0x80
Sep 21 03:00:33 mainframe01 kernel: [292411.280350] schedule_timeout+0x1cf/0x350
Sep 21 03:00:33 mainframe01 kernel: [292411.280354] ? add_timer+0x124/0x280
Sep 21 03:00:33 mainframe01 kernel: [292411.280357] wait_for_completion+0xba/0x140
Sep 21 03:00:33 mainframe01 kernel: [292411.280362] ? wake_up_q+0x80/0x80
Sep 21 03:00:33 mainframe01 kernel: [292411.280365] __synchronize_srcu.part.13+0x85/0xb0
Sep 21 03:00:33 mainframe01 kernel: [292411.280367] ? trace_raw_output_rcu_utilization+0x50/0x50
Sep 21 03:00:33 mainframe01 kernel: [292411.280369] synchronize_srcu+0x66/0xe0
Sep 21 03:00:33 mainframe01 kernel: [292411.280370] ? synchronize_srcu+0x66/0xe0
Sep 21 03:00:33 mainframe01 kernel: [292411.280372] fsnotify_mark_destroy_workfn+0x7b/0xe0
Sep 21 03:00:33 mainframe01 kernel: [292411.280375] process_one_work+0x1de/0x410
Sep 21 03:00:33 mainframe01 kernel: [292411.280377] worker_thread+0x253/0x410
Sep 21 03:00:33 mainframe01 kernel: [292411.280379] kthread+0x121/0x140
Sep 21 03:00:33 mainframe01 kernel: [292411.280380] ? process_one_work+0x410/0x410
Sep 21 03:00:33 mainframe01 kernel: [292411.280382] ? kthread_create_worker_on_cpu+0x70/0x70
Sep 21 03:00:33 mainframe01 kernel: [292411.280385] ? do_syscall_64+0x73/0x130
Sep 21 03:00:33 mainframe01 kernel: [292411.280387] ? SyS_exit+0x17/0x20
Sep 21 03:00:33 mainframe01 kernel: [292411.280391] ret_from_fork+0x35/0x40
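To compare hung-task traces like the one above across the affected instances, the stack frames can be reduced to bare symbol names. The following is a rough sketch under assumptions: `trace_symbols` is a hypothetical helper name, and it expects complete syslog lines whose last field has the form `symbol+0xOFFSET/0xSIZE`.

```shell
# trace_symbols: hypothetical helper, not from the original report.
# Reads kernel log lines on stdin. For each line whose last field looks
# like "symbol+0xOFFSET/0xSIZE", prints only the symbol name. Lines such
# as "Call Trace:" are skipped, and the "? " marker on unreliable frames
# is ignored because only the last field is inspected.
trace_symbols() {
    awk '{
        f = $NF
        if (f ~ /^[A-Za-z0-9_.]+\+0x[0-9a-f]+\/0x[0-9a-f]+$/) {
            sub(/\+0x.*/, "", f)
            print f
        }
    }'
}
```

Feeding saved kernel logs from each instance through this and diffing the results is one way to check whether every hang goes through the same wait path.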