Kernel Panic while rebooting cloud instance

Bug #1822118 reported by Sean Feole
This bug affects 2 people
Affects               Status   Importance  Assigned to          Milestone
linux-azure (Ubuntu)  Expired  High        Unassigned
systemd (Ubuntu)      Invalid  High        Dimitri John Ledkov

Bug Description

Very occasionally, systemd panics when an Azure instance reboots. A workaround for this issue is described in comment #20.

------------

Description: When an Azure cloud instance is rebooted, it may never recover, leaving the instance broken indefinitely.

In my case, it was a kernel panic. See the specifics below.

Series: Disco
Instance Size: Basic_A3
Region: (Default) US-WEST-2
Kernel Version: 4.18.0-1013-azure #13-Ubuntu SMP Thu Feb 28 22:54:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I had a simple script that reboots an instance a set number of times; I chose 50. The script power-cycles the machine by issuing "reboot" from the terminal prompt, just as a user would. Once the machine came back up, the script captured dmesg and other details, then rebooted again until it reached 50.
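
The script itself is not shown in this description. A minimal Python sketch of the loop described above might look like the following. It assumes key-based ssh access and passwordless sudo on the instance. The function names, the timeouts, and the use of /proc/sys/kernel/random/boot_id to detect a completed reboot are illustrative assumptions, not details of the original script.

```python
import subprocess
import time


def ssh(target, command, timeout=30.0):
    """Run one command on the instance over ssh (assumes key-based login)."""
    return subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", target, command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def read_boot_id(target, run):
    """Return the kernel's boot id, or None if the instance is unreachable."""
    try:
        result = run(target, "cat /proc/sys/kernel/random/boot_id")
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def reboot_loop(target, count, run=ssh, boot_timeout=600.0, poll=10.0):
    """Reboot `target` up to `count` times.

    Returns the dmesg text captured after each reboot that came back,
    so a result shorter than `count` means the loop stopped early.
    """
    logs = []
    for _ in range(count):
        old_id = read_boot_id(target, run)
        if old_id is None:
            break  # instance unreachable before the reboot was issued
        try:
            run(target, "sudo reboot")
        except (subprocess.TimeoutExpired, OSError):
            pass  # the ssh session is often cut off by the reboot itself
        deadline = time.monotonic() + boot_timeout
        while time.monotonic() < deadline:
            new_id = read_boot_id(target, run)
            if new_id is not None and new_id != old_id:
                break  # a new boot id means the instance came back
            time.sleep(poll)
        else:
            break  # timed out waiting: treat as a hang and stop
        logs.append(run(target, "sudo dmesg").stdout)
    return logs
```

A call such as reboot_loop("user@host", 50) would then correspond to the 50-reboot run; if fewer than 50 logs come back, the instance failed to return from a reboot and its console log is worth checking.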

After the 4th attempt, my script timed out. I took a look at the instance console log and found the following on the console:

[ OK ] Reached target Reboot.
/shutdown: error while loading shared libra[ 89.498980] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[ 89.498980]
[ 89.500042] CPU: 0 PID: 1 Comm: shutdown Not tainted 4.18.0-1013-azure #13-Ubuntu
[ 89.508026] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[ 89.508026] Call Trace:
[ 89.508026] dump_stack+0x63/0x8a
[ 89.508026] panic+0xe7/0x247
[ 89.508026] do_exit.cold.23+0x26/0x75
[ 89.508026] do_group_exit+0x43/0xb0
[ 89.508026] __x64_sys_exit_group+0x18/0x20
[ 89.508026] do_syscall_64+0x5a/0x110
[ 89.508026] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 89.508026] RIP: 0033:0x7f7bf0154d86
[ 89.508026] Code: Bad RIP value.
[ 89.508026] RSP: 002b:00007ffd6be693b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[ 89.508026] RAX: ffffffffffffffda RBX: 00007f7bf015e420 RCX: 00007f7bf0154d86
[ 89.508026] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
[ 89.508026] RBP: 00007f7bef9449c0 R08: 00000000000000e7 R09: 00000000ffffffff
[ 89.508026] R10: 00007ffd6be6974c R11: 0000000000000206 R12: 0000000000000018
[ 89.508026] R13: 00007f7bef944ac8 R14: 00007f7bef944a00 R15: 0000000000000000
[ 89.508026] Kernel Offset: 0x16000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 89.508026] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[ 89.508026] ]---

This only occurred once in my testing.

Sean Feole (sfeole)
Changed in linux-azure (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Changed in linux-azure (Ubuntu):
importance: Undecided → High
status: New → In Progress
Colin Ian King (colin-king) wrote :

Minor note: this appears to be a kernel panic during shutdown rather than a boot hang issue.

Colin Ian King (colin-king) wrote :

So I don't think this is a kernel-related issue. The error message "error while loading shared libraries" is emitted by the "fatal_error" path of the libc dynamic loader running in init, which then calls exit(). That causes the kernel to report that init has exited, which leads to the stack dump and the shutdown failure.

I suspect that there is some kind of race on shutdown that leaves systemd unable to load a necessary shared library (perhaps the file system is unmounted prematurely?), causing process 1 to fail and hence exit.
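
The exitcode=0x00007f00 value in the panic line fits this reading. Assuming it uses the usual Linux wait-status encoding, the sketch below splits it into its parts; the function name is made up for illustration. An exit status of 127 is, as far as I know, what glibc's dynamic loader uses when it cannot load a library, and it also matches RDI=0x7f in the register dump, where ORIG_RAX=0xe7 is exit_group on x86_64.

```python
def decode_wait_status(status):
    """Split a Linux wait status into its parts.

    Stopped and continued states are not handled in this sketch.
    """
    signal = status & 0x7F             # low 7 bits: terminating signal, 0 if none
    core_dumped = bool(status & 0x80)  # bit 7: core dump flag
    exit_code = (status >> 8) & 0xFF   # next 8 bits: value passed to exit()
    if signal == 0:
        return {"exited": True, "exit_code": exit_code}
    return {"exited": False, "signal": signal, "core_dumped": core_dumped}


print(decode_wait_status(0x00007F00))  # {'exited': True, 'exit_code': 127}
```

So PID 1 was not killed by a signal; it exited on its own with status 127.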

tags: added: kernel-hyper-v
Joseph Salisbury (jsalisbury) wrote :

Are you able to share the script that reproduces this bug and bug 1822133?

Colin Ian King (colin-king) wrote :

@Sean, can you add the info for Joseph?

Sean Feole (sfeole) wrote :

Hey Joe, please see attached. Simply update the system variable with the username@ip and run it.

Joseph Salisbury (jsalisbury) wrote :

Thanks Sean and Colin! I'm going to see if I can reproduce it. If I can, I'll do the usual investigation to see if it's a regression, whether it's fixed upstream, and whether a bisect will help.

tags: added: id-5c9e3d984ba1ad4df84a6b1c
Joseph Salisbury (jsalisbury) wrote :

I modified your script to perform 5000 reboots. I'm up to 3508 reboots now without hitting the bug. I'll let it run for a while longer. I'll also compare our environments to see if there is a difference.

Joseph Salisbury (jsalisbury) wrote :

I still haven't been able to reproduce this. If you can reproduce this easily, maybe I can provide some kernels to be tested?

I'll keep trying to reproduce here. I compared my VM settings to what is in the bug description, and I have the same config, except for the region. I wouldn't think that would have an effect, but I'll look into it.

Robert Važan (robert.vazan) wrote :

I can confirm this bug. My VPS servers are configured to reboot automatically every day, and one of them started hanging with a similar error a few days ago. I am using Ubuntu 18.10 x64 running on a Vultr VPS. When the VPS hung today, the screen ended with these messages:

[ OK ] Reached target Shutdown.
[ OK ] Reached target Final Step.
         Starting Reboot...
[

(several blank lines here)

/shutdown: error while loading shared libraries: libidn.so.11: cannot open shared object file: No such file or directory

The name of the library changes randomly with every hangup.
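
Since the library name varies between hangs, it may help to tally which names show up across saved console logs. The following is a rough sketch under that assumption; the regular expression and function name are illustrative and only match the complete form of the loader message quoted above, not a line that was cut off by the panic output.

```python
import re
from collections import Counter

# Matches the loader message and captures the library name that follows it.
LOADER_ERROR = re.compile(r"error while loading shared libraries: (\S+): ")


def missing_libraries(console_text):
    """Count how often each library is named in loader errors."""
    return Counter(LOADER_ERROR.findall(console_text))


console = (
    "/shutdown: error while loading shared libraries: libidn.so.11: "
    "cannot open shared object file: No such file or directory\n"
)
print(missing_libraries(console))  # Counter({'libidn.so.11': 1})
```

If the names really are random across the binary's dependencies, that would point toward the whole library path becoming unavailable rather than one damaged file.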

Colin Ian King (colin-king) wrote :

@Joseph, any ideas how we can progress on this?

Changed in linux-azure (Ubuntu):
status: In Progress → Incomplete
Colin Ian King (colin-king) wrote :

@Robert, was there a specific class of virtual machine you were using when this issue occurred?

Colin Ian King (colin-king) wrote :

IP addr Mac Addr Kernel Reboots
13.64.67.186 00:0d:3a:3a:dd:04 5.0.0-1016-azure 50
104.42.152.115 00:0d:3a:35:b1:e6 5.0.0-1016-azure 50
65.52.121.205 00:0d:3a:3b:0f:52 5.0.0-1016-azure 50
13.88.28.42 00:0d:3a:3b:c7:da 5.0.0-1016-azure 50
40.118.165.237 00:0d:3a:3b:c2:4e 5.0.0-1016-azure 50
40.118.190.105 00:0d:3a:36:c6:d7 5.0.0-1016-azure 50
40.78.90.95 00:0d:3a:37:c0:d9 5.0.0-1016-azure 50
13.83.84.150 00:0d:3a:37:c0:15 5.0.0-1016-azure 50
104.42.74.129 00:0d:3a:36:c2:3e 5.0.0-1016-azure 50
40.85.154.162 00:0d:3a:37:cc:dd 5.0.0-1016-azure 50
40.78.43.4 00:0d:3a:37:c5:07 5.0.0-1016-azure 50
13.93.142.147 00:0d:3a:37:c8:5f 5.0.0-1016-azure 50
40.78.44.229 00:0d:3a:3b:e4:80 5.0.0-1016-azure 50
40.118.189.62 00:0d:3a:3b:e8:8e 5.0.0-1016-azure 50
40.78.85.10 00:0d:3a:3b:e6:37 5.0.0-1016-azure 50
40.78.13.203 00:0d:3a:3a:c2:b0 5.0.0-1016-azure 50
104.42.112.81 00:0d:3a:30:71:fb 5.0.0-1016-azure 50
40.80.156.132 00:0d:3a:30:2f:7c 5.0.0-1016-azure 50
13.64.173.138 00:0d:3a:30:73:b2 5.0.0-1016-azure 50
13.64.189.105 00:0d:3a:30:a4:6f 5.0.0-1016-azure 50
13.64.189.127 00:0d:3a:30:a4:1f 5.0.0-1016-azure 50
104.45.237.232 00:0d:3a:32:1e:3b 5.0.0-1016-azure 50
104.42.233.11 00:0d:3a:32:34:68 5.0.0-1016-azure 50
104.42.233.20 00:0d:3a:34:ed:42 5.0.0-1016-azure 50
23.101.202.206 00:0d:3a:32:32:b0 5.0.0-1016-azure 50
104.42.233.18 00:0d:3a:34:ee:ba 5.0.0-1016-azure 50
104.42.233.151 00:0d:3a:34:e9:0d 5.0.0-1016-azure 50
104.40.51.248 00:0d:3a:32:27:c6 5.0.0-1016-azure 50
104.40.69.158 00:0d:3a:34:f1:5d 5.0.0-1016-azure 50
52.160.41.95 00:0d:3a:35:9f:c8 5.0.0-1016-azure 50
104.42.158.74 00:0d:3a:34:c7:91 5.0.0-1016-azure 50

IP addr Mac Addr Kernel Reboots
40.83.145.235 00:0d:3a:5a:01:f9 5.0.0-1016-azure 250
104.210.50.91 00:0d:3a:35:b2:48 5.0.0-1016-azure 250
13.88.186.166 00:0d:3a:5a:0b:86 5.0.0-1016-azure 250
40.118.185.194 00:0d:3a:35:b9:59 5.0.0-1016-azure 250
104.42.37.175 00:0d:3a:5a:06:ff 5.0.0-1016-azure 250
13.88.186.188 00:0d:3a:5a:05:da 5.0.0-1016-azure 250
104.210.48.49 00:0d:3a:35:b8:a7 5.0.0-1016-azure 250
104.210.50.215 00:0d:3a:35:ba:13 5.0.0-1016-azure 250
40.78.52.50 00:0d:3a:35:b6:50 5.0.0-1016-azure 250
40.118.186.25 00:0d:3a:35:b5:93 5.0.0-1016-azure 250
13.93.233.26 00:0d:3a:37:06:7e 5.0.0-1016-azure 156 crashed
13.93.136.144 00:0d:3a:37:0e:2f 5.0.0-1016-azure 250
40.118.241.192 00:0d:3a:32:c4:5d 5.0.0-1016-azure 250
40.83.160.52 00:0d:3a:37:47:48 5.0.0-1016-azure 250
104.42.9.61 00:0d:3a:36:d3:a0 5.0.0-1016-azure 250

IP addr Mac Addr Kernel Reboots
104.40.1.50 00:0d:3a:30:8d:ae 5.0.0-1016-azure 500
104.40.3.205 00:0d:3a:30:81:d5 5.0.0-1016-azure 500
104.40.9.37 00:0d:3a:30:86:bb 5.0.0-1016-azure 500
104.40.0.242 00:0d:3a:30:88:6b 5.0.0-1016-azure 500
104.40.12.184 00:0d:3a:30:8e:e0 5.0.0-1016-azure 500
137.135.46.72 00:0d:3a:30:ac:df 5.0.0-1016-azure 500
137.135.47.169 00:0d:3a:30:9a:3f 5.0.0-1016-azure 500
104.40.10.226 00:0d:3a:30:2d:fb 5.0.0-1016-azure 500
104.40.10.244 00:0d:3a:30:8e:12 5.0.0-1016-azure 500
104.40.15.160 00:0d:3a:30:87:4d 5.0.0-1016-azure 500
40.112.132.2 00:0d:3a:59:b5:80 5.0.0-1016-azure 500
13.64.97.59 00:0d:3a:59:b1:e7 5.0.0-1016-azure 500...


Colin Ian King (colin-king) wrote :

See above, I ran several thousand reboot tests on a lot of Basic_A3 instances, ranging from 50, 250 to 500 reboots. Only one failed. So this is *really* hard to reproduce.
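
Result listings in this format can be tallied mechanically. The sketch below is one way to do it, assuming each data row is "IP MAC kernel reboots" with any trailing text (such as "crashed" or "[ HANG ]") marking a failure; the function name is hypothetical and the sample rows are copied from the 250-reboot listing above.

```python
def summarize(table_text):
    """Return (instances, total_reboots, failed_ips) for a results listing."""
    instances = 0
    total_reboots = 0
    failed_ips = []
    for line in table_text.splitlines():
        parts = line.split()
        # Skip headers, blank lines, and truncated rows: the 4th field
        # must be a plain reboot count.
        if len(parts) < 4 or not parts[3].isdigit():
            continue
        instances += 1
        total_reboots += int(parts[3])
        if len(parts) > 4:  # anything after the count marks a failure
            failed_ips.append(parts[0])
    return instances, total_reboots, failed_ips


sample = """\
IP addr Mac Addr Kernel Reboots
40.118.186.25 00:0d:3a:35:b5:93 5.0.0-1016-azure 250
13.93.233.26 00:0d:3a:37:06:7e 5.0.0-1016-azure 156 crashed
13.93.136.144 00:0d:3a:37:0e:2f 5.0.0-1016-azure 250
"""
print(summarize(sample))  # (3, 656, ['13.93.233.26'])
```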

Colin Ian King (colin-king) wrote :

I kicked off another ~20K reboot tests with Standard_B2S instances and hit hangs again:

IP addr Mac Addr Kernel Reboots
104.42.3.161 00:0d:3a:37:82:ee 5.0.0-1020-azure 100
13.91.5.23 00:0d:3a:5a:74:23 5.0.0-1020-azure 57 [ HANG ]
13.91.5.222 00:0d:3a:5a:75:1a 5.0.0-1020-azure 100
13.64.117.146 00:0d:3a:5a:74:da 5.0.0-1020-azure 100
13.64.117.17 00:0d:3a:37:67:0e 5.0.0-1020-azure 100
13.91.6.207 00:0d:3a:3a:cc:2c 5.0.0-1020-azure 100
40.78.30.129 00:0d:3a:36:6e:eb 5.0.0-1020-azure 100
104.210.36.238 00:0d:3a:5a:73:da 5.0.0-1020-azure 100
13.91.6.143 00:0d:3a:3a:c8:ec 5.0.0-1020-azure 100
40.83.249.58 00:0d:3a:3a:c0:7a 5.0.0-1020-azure 100
104.45.216.53 00:0d:3a:3b:8a:55 5.0.0-1020-azure 100
104.210.42.18 00:0d:3a:5a:73:5c 5.0.0-1020-azure 100
40.78.27.21 00:0d:3a:3a:c9:19 5.0.0-1020-azure 100
40.83.252.110 00:0d:3a:5a:79:93 5.0.0-1020-azure 100
13.64.119.204 00:0d:3a:5a:7e:bc 5.0.0-1020-azure 100

104.210.34.4 00:0d:3a:31:18:ee 5.0.0-1020-azure 250
138.91.197.202 00:0d:3a:31:1d:c1 5.0.0-1020-azure 94 [ HANG ]
138.91.196.241 00:0d:3a:31:15:2b 5.0.0-1020-azure 250
104.210.33.44 00:0d:3a:31:16:f3 5.0.0-1020-azure 250
40.83.248.76 00:0d:3a:32:af:a7 5.0.0-1020-azure 250
40.83.253.204 00:0d:3a:32:ba:09 5.0.0-1020-azure 250
168.62.202.8 00:0d:3a:32:a0:11 5.0.0-1020-azure 250
40.83.249.8 00:0d:3a:32:bd:ce 5.0.0-1020-azure 250
40.83.249.93 00:0d:3a:32:b7:32 5.0.0-1020-azure 250
40.83.253.187 00:0d:3a:32:b9:cd 5.0.0-1020-azure 250
23.99.9.88 00:0d:3a:37:96:c9 5.0.0-1020-azure 250
104.40.29.184 00:0d:3a:36:9f:e0 5.0.0-1020-azure 250
137.135.40.122 00:0d:3a:36:9f:eb 5.0.0-1020-azure 250
137.135.49.43 00:0d:3a:36:92:aa 5.0.0-1020-azure 250
138.91.251.8 00:0d:3a:37:9e:ef 5.0.0-1020-azure 250

13.64.146.175 00:0d:3a:31:de:ee 5.0.0-1020-azure 500
104.42.23.145 00:0d:3a:31:da:d7 5.0.0-1020-azure 500
104.42.29.99 00:0d:3a:31:d4:4f 5.0.0-1020-azure 500
40.78.106.12 00:0d:3a:31:d9:8a 5.0.0-1020-azure 500
138.91.233.210 00:0d:3a:31:df:84 5.0.0-1020-azure 500
104.42.25.30 00:0d:3a:31:c9:a4 5.0.0-1020-azure 500
13.64.150.69 00:0d:3a:31:dd:47 5.0.0-1020-azure 321 [ HANG ]
104.42.25.23 00:0d:3a:31:d3:c9 5.0.0-1020-azure 500
104.42.24.176 00:0d:3a:31:d8:36 5.0.0-1020-azure 500
13.64.79.133 00:0d:3a:31:d5:b4 5.0.0-1020-azure 500
104.42.29.146 00:0d:3a:31:de:73 5.0.0-1020-azure 500
104.42.19.191 00:0d:3a:31:d4:78 5.0.0-1020-azure 500
40.118.249.118 00:0d:3a:31:db:20 5.0.0-1020-azure 500
40.112.219.112 00:0d:3a:31:dc:da 5.0.0-1020-azure 500
104.42.17.115 00:0d:3a:31:d3:21 5.0.0-1020-azure 500
40.83.212.164 00:0d:3a:5a:ab:48 5.0.0-1020-azure 500
52.160.123.4 00:0d:3a:36:0d:6a 5.0.0-1020-azure 500
52.160.83.37 00:0d:3a:5a:ab:79 5.0.0-1020-azure 500
52.160.122.92 00:0d:3a:36:00:4c 5.0.0-1020-azure 500
52.160.122.71 00:0d:3a:36:0f:bd 5.0.0-1020-azure 500
52.160.123.12 00:0d:3a:36:04:39 5.0.0-1020-azure 500
104.210.60.218 00:0d:3a:36:b6:25 5.0.0-1020-azure 500
52.160.123.221 00:0d:3a:5a:a9:a3 5.0.0-1020-azure 500
52.160.123.234 00:0d:3a:5a:a7:1c 5.0.0-1020-azure 500
104.210.61.139 00:0d:3a:37:b7:84 5.0.0-1020-azure 500
104.210.61.43 00:0d:3a:36:b5:96 5.0.0-1020-azure 500
40.83.212.185 00:0d:3a:5a:af:9c 5.0.0-1020-azure 500
52.160.82.11...


Colin Ian King (colin-king) wrote :

So the best way to reproduce this issue is to run ~500 reboots across multiple instances rather than 5000-10000 reboots on one instance.
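
A fan-out of that kind could be sketched as follows. This is not the harness actually used for these runs; it assumes a caller-supplied worker(target, reboots) function (hypothetical) that performs the reboots on one instance and returns how many of them completed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fleet(targets, reboots_per_instance, worker, max_workers=16):
    """Run `worker` against every target in parallel.

    Returns a dict mapping each target to its completed reboot count.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            target: pool.submit(worker, target, reboots_per_instance)
            for target in targets
        }
        return {target: future.result() for target, future in futures.items()}


def hung_instances(results, reboots_per_instance):
    """List the targets that stopped before reaching the requested count."""
    return sorted(t for t, done in results.items() if done < reboots_per_instance)
```

Threads are sufficient here because each worker spends nearly all of its time waiting on the network for an instance to come back.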

Colin Ian King (colin-king) wrote :

I get more failures with Standard_B1ms:

IP addr Mac Addr Kernel Reboots
52.160.101.11 00:0d:3a:5b:a0:7c 5.0.0-1020-azure 10
137.135.51.101 00:0d:3a:31:20:fc 5.0.0-1020-azure 500
137.135.50.133 00:0d:3a:31:27:0f 5.0.0-1020-azure 396 [hang]
137.135.51.198 00:0d:3a:31:28:d7 5.0.0-1020-azure 500
137.135.49.89 00:0d:3a:31:22:c1 5.0.0-1020-azure 500
137.135.48.14 00:0d:3a:33:05:7d 5.0.0-1020-azure 500
104.40.5.23 00:0d:3a:32:e7:27 5.0.0-1020-azure 228 [hang]
13.93.223.213 00:0d:3a:32:e8:59 5.0.0-1020-azure 500
104.40.0.151 00:0d:3a:31:32:09 5.0.0-1020-azure 500
40.118.128.130 00:0d:3a:32:f5:71 5.0.0-1020-azure 500
23.101.200.119 00:0d:3a:36:c5:94 5.0.0-1020-azure 500
104.40.8.52 00:0d:3a:33:07:6e 5.0.0-1020-azure 500
104.40.19.222 00:0d:3a:33:01:0d 5.0.0-1020-azure 500
104.42.135.72 00:0d:3a:3b:e9:15 5.0.0-1020-azure 500
104.40.22.205 00:0d:3a:33:0d:d8 5.0.0-1020-azure 500

104.40.7.22 00:0d:3a:37:85:ff 5.0.0-1020-azure 500
13.88.17.94 00:0d:3a:5a:54:c6 5.0.0-1020-azure 500
104.40.8.196 00:0d:3a:59:56:f3 5.0.0-1020-azure 500
13.88.21.125 00:0d:3a:5a:50:00 5.0.0-1020-azure 500
13.88.23.139 00:0d:3a:5a:55:c3 5.0.0-1020-azure 500
23.99.81.188 00:0d:3a:5a:52:0f 5.0.0-1020-azure 500
13.88.20.132 00:0d:3a:5a:55:f1 5.0.0-1020-azure 500
13.88.20.126 00:0d:3a:5a:58:65 5.0.0-1020-azure 500
13.88.20.237 00:0d:3a:5a:55:42 5.0.0-1020-azure 500
13.88.17.35 00:0d:3a:5a:57:5f 5.0.0-1020-azure 500
13.91.54.225 00:0d:3a:5a:57:5a 5.0.0-1020-azure 500
13.88.21.57 00:0d:3a:5a:52:ce 5.0.0-1020-azure 500
13.88.21.67 00:0d:3a:5a:5b:b7 5.0.0-1020-azure 500
13.88.18.46 00:0d:3a:5a:5d:02 5.0.0-1020-azure 229 [hang]
13.88.16.222 00:0d:3a:37:80:d1 5.0.0-1020-azure 500

Colin Ian King (colin-king) wrote :

And for Standard_D2s_v3:

IP addr Mac Addr Kernel Reboots
104.42.252.54 00:0d:3a:32:df:92 5.0.0-1020-azure 500
104.42.150.26 00:0d:3a:31:1b:50 5.0.0-1020-azure 500
104.42.147.144 00:0d:3a:32:d8:2f 5.0.0-1020-azure 500
40.112.129.232 00:0d:3a:32:d5:7c 5.0.0-1020-azure 500
40.112.134.251 00:0d:3a:32:d9:2d 5.0.0-1020-azure 500
13.64.195.21 00:0d:3a:5a:7b:51 5.0.0-1020-azure 500
40.83.214.204 00:0d:3a:36:47:98 5.0.0-1020-azure 500
13.64.195.27 00:0d:3a:5a:7f:05 5.0.0-1020-azure 500
13.64.195.31 00:0d:3a:5a:78:55 5.0.0-1020-azure 500
13.64.195.69 00:0d:3a:5a:7c:72 5.0.0-1020-azure 500
104.42.51.23 00:0d:3a:37:47:0a 5.0.0-1020-azure 500
13.64.233.120 00:0d:3a:37:46:ab 5.0.0-1020-azure 500
13.64.233.216 00:0d:3a:37:49:fb 5.0.0-1020-azure 500
13.64.239.157 00:0d:3a:37:43:cf 5.0.0-1020-azure 16 [hang]
52.160.87.177 00:0d:3a:35:fc:b1 5.0.0-1020-azure 500

Colin Ian King (colin-king) wrote :

@Joseph, so I can reproduce this hang/crash issue across a variety of instances. I can't get any info back on a console, so debugging this is not easy.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Finom Davoi (xiwakaw) wrote :

Hello,

I get a similar "Kernel panic" after running the following command and restarting the VM:

sudo gedit /etc/ld.so.conf.d/glibc_2.29.conf

And adding the full path to the library as instructed here:

https://stackoverflow.com/a/13428971

Serial log attached.

Colin Ian King (colin-king) wrote :

@Finom, that's a good observation, much appreciated.

Changed in systemd (Ubuntu):
importance: Undecided → High
assignee: nobody → Dimitri John Ledkov (xnox)
description: updated
Dan Streetman (ddstreet) wrote :

Please reopen if this is still an issue.

Changed in systemd (Ubuntu):
status: Confirmed → Invalid
Changed in linux-azure (Ubuntu):
assignee: Colin Ian King (colin-king) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for linux-azure (Ubuntu) because there has been no activity for 60 days.]

Changed in linux-azure (Ubuntu):
status: Incomplete → Expired