Kernel Panic while rebooting cloud instance

Bug #1822118 reported by Sean Feole on 2019-03-28
This bug affects 1 person
Affects | Importance | Assigned to
linux-azure (Ubuntu) | High | Colin Ian King
systemd (Ubuntu) | Undecided | Unassigned

Bug Description

Description: When a particular Azure cloud instance is rebooted, it may never recover, leaving the instance broken indefinitely.

In my case, it was a kernel panic. See specifics below.

Series: Disco
Instance Size: Basic_A3
Region: (Default) US-WEST-2
Kernel Version: 4.18.0-1013-azure #13-Ubuntu SMP Thu Feb 28 22:54:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I had a simple script to reboot an instance a set number of times (I chose 50). The machine power cycles by issuing a "reboot" from the terminal prompt, just as a user would. Once the machine comes back up, the script captures dmesg and other bits, then reboots again until it reaches 50.

After the 4th attempt, my script timed out. I took a look at the instance console log and found the following:
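The script attached to this bug is not reproduced on this page. As a rough illustration only, a reboot loop of the kind described could look like the sketch below. The `TARGET` address, the use of `ssh` with `sudo reboot`, the polling intervals, and every function name here are assumptions made for the example, not details taken from the attachment.

```python
import subprocess
import time
from typing import Callable

# Hypothetical target; replace with the real username@ip of the instance.
TARGET = "user@192.0.2.10"


def ssh(target: str, command: str, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run one command on the instance over ssh and capture its output."""
    return subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", "-o", "BatchMode=yes", target, command],
        capture_output=True, text=True, timeout=timeout,
    )


def save_dmesg(cycle: int, text: str) -> None:
    """Default handler: write the captured dmesg to a per-cycle file."""
    with open(f"dmesg-{cycle}.log", "w") as handle:
        handle.write(text)


def wait_until_up(target: str, run: Callable = ssh, attempts: int = 60,
                  delay: float = 10.0, sleep: Callable = time.sleep) -> bool:
    """Poll with a trivial command until it succeeds or attempts run out."""
    for _ in range(attempts):
        try:
            if run(target, "true").returncode == 0:
                return True
        except (subprocess.TimeoutExpired, OSError):
            pass  # instance not reachable yet
        sleep(delay)
    return False


def reboot_loop(target: str, cycles: int = 50, run: Callable = ssh,
                sleep: Callable = time.sleep,
                on_boot: Callable[[int, str], None] = save_dmesg) -> int:
    """Reboot up to `cycles` times; return how many cycles completed."""
    for cycle in range(1, cycles + 1):
        try:
            run(target, "sudo reboot")  # assumption: passwordless sudo
        except (subprocess.TimeoutExpired, OSError):
            pass  # the connection is expected to drop during reboot
        sleep(30)  # let shutdown begin before polling
        if not wait_until_up(target, run=run, sleep=sleep):
            return cycle - 1  # the instance never came back
        on_boot(cycle, run(target, "dmesg").stdout)
    return cycles
```

Calling `reboot_loop(TARGET)` would run the 50 cycles. A return value below 50 would correspond to the timeout described in this report, at which point the instance console log is the place to look.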

[ OK ] Reached target Reboot.
/shutdown: error while loading shared libra[ 89.498980] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[ 89.498980]
[ 89.500042] CPU: 0 PID: 1 Comm: shutdown Not tainted 4.18.0-1013-azure #13-Ubuntu
[ 89.508026] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[ 89.508026] Call Trace:
[ 89.508026] dump_stack+0x63/0x8a
[ 89.508026] panic+0xe7/0x247
[ 89.508026] do_exit.cold.23+0x26/0x75
[ 89.508026] do_group_exit+0x43/0xb0
[ 89.508026] __x64_sys_exit_group+0x18/0x20
[ 89.508026] do_syscall_64+0x5a/0x110
[ 89.508026] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 89.508026] RIP: 0033:0x7f7bf0154d86
[ 89.508026] Code: Bad RIP value.
[ 89.508026] RSP: 002b:00007ffd6be693b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[ 89.508026] RAX: ffffffffffffffda RBX: 00007f7bf015e420 RCX: 00007f7bf0154d86
[ 89.508026] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
[ 89.508026] RBP: 00007f7bef9449c0 R08: 00000000000000e7 R09: 00000000ffffffff
[ 89.508026] R10: 00007ffd6be6974c R11: 0000000000000206 R12: 0000000000000018
[ 89.508026] R13: 00007f7bef944ac8 R14: 00007f7bef944a00 R15: 0000000000000000
[ 89.508026] Kernel Offset: 0x16000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 89.508026] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[ 89.508026] ]---

This only occurred once in my testing.

Sean Feole (sfeole) on 2019-03-28
Changed in linux-azure (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Changed in linux-azure (Ubuntu):
importance: Undecided → High
status: New → In Progress
Colin Ian King (colin-king) wrote :

Minor note: this appears to be a kernel panic during shutdown rather than a boot hang issue.

Colin Ian King (colin-king) wrote :

So I don't think this is a kernel-related issue. The error message "error while loading shared libraries" is emitted from the "fatal_error" path of the libc shared library dynamic loader while it is running as init, and exit() is then called. This causes the kernel to report that init has exited, which leads to the stack dump and the shutdown failure.

I suspect that there is some kind of race on shutdown causing systemd to be unable to load a necessary shared library (perhaps the file system is unmounted prematurely?), causing process 1 to fail and hence to exit.
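The numbers in the panic output appear consistent with this reading. The `exitcode=0x00007f00` value uses the usual Linux wait-status layout, where the high byte holds the exit code and the low bits hold any terminating signal. A minimal sketch of the decoding follows; the function name is hypothetical.

```python
def decode_wait_status(status: int) -> dict:
    """Split a Linux wait status into its exit code and signal fields."""
    return {
        "exit_code": (status >> 8) & 0xFF,   # value passed to exit()/exit_group()
        "signal": status & 0x7F,             # 0 means not killed by a signal
        "core_dumped": bool(status & 0x80),
    }


panic_exitcode = 0x00007F00  # from "Attempted to kill init! exitcode=0x00007f00"
print(decode_wait_status(panic_exitcode))
# {'exit_code': 127, 'signal': 0, 'core_dumped': False}

# Register values from the same trace, converted to decimal:
print(int("e7", 16))  # ORIG_RAX: 231, the exit_group syscall number on x86_64
print(int("7f", 16))  # RDI: 127, the status argument passed to exit_group
```

So PID 1 exited voluntarily with status 127 rather than being killed by a signal. As far as I know, 127 is the status glibc's dynamic loader uses when it hits a fatal error, which would match the "error while loading shared libraries" message on the console.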

tags: added: kernel-hyper-v
Joseph Salisbury (jsalisbury) wrote :

Are you able to share the script that reproduces this bug and bug 1822133?

Colin Ian King (colin-king) wrote :

@Sean, can you add the info for Joseph?

Sean Feole (sfeole) wrote :

Hey Joe, please see attached. Simply update the system variable with the username@ip and run it.

Joseph Salisbury (jsalisbury) wrote :

Thanks Sean and Colin! I'm going to see if I can reproduce it. If I can, I'll do the usual investigation to see if it's a regression, if it's fixed upstream, and if a bisect will help.

tags: added: id-5c9e3d984ba1ad4df84a6b1c
Joseph Salisbury (jsalisbury) wrote :

I modified your script to perform 5000 reboots. I'm up to 3508 reboots now without hitting the bug. I'll let it run for a while longer. I'll also compare our environments to see if there is a difference.

Joseph Salisbury (jsalisbury) wrote :

I still haven't been able to reproduce this. If you can reproduce this easily, maybe I can provide some kernels to be tested?

I'll keep trying to reproduce here. I compared my VM settings to what is in the bug description, and I have the same config, except the region. I wouldn't think that would have an effect, but I'll look into it.

Robert Važan (robert.vazan) wrote :

I can confirm this bug. My VPS servers are configured to automatically reboot daily, and one of them started hanging with a similar error a few days ago. I am using Ubuntu 18.10 x64 running on a Vultr VPS. When the VPS hung today, the screen ended with these messages:

[ OK ] Reached target Shutdown.
[ OK ] Reached target Final Step.
         Starting Reboot...
[

(several blank lines here)

/shutdown: error while loading shared libraries: libidn.so.11: cannot open shared object file: No such file or directory

The name of the library changes randomly with every hangup.
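To check whether the failing library really varies from hang to hang, the loader error lines from saved console logs could be tallied. This is only a sketch: the regular expression is written against the message format shown above, and the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the dynamic loader message and captures the library name.
LOADER_ERROR = re.compile(r"error while loading shared libraries: ([^:\s]+):")


def failing_libraries(console_lines: Iterable[str]) -> Counter:
    """Count how often each library name appears in loader error lines."""
    counts: Counter = Counter()
    for line in console_lines:
        match = LOADER_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "/shutdown: error while loading shared libraries: libidn.so.11: "
    "cannot open shared object file: No such file or directory",
]
print(failing_libraries(sample))  # Counter({'libidn.so.11': 1})
```

A truncated message, such as the one in the original Azure console log, does not match the pattern and is simply skipped.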
