Azure Instance never recovered during series of instance reboots.

Bug #1822133 reported by Sean Feole
Affects: linux-azure (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Colin Ian King

Bug Description

Description: During SRU testing of various Azure instances, there are some cases where the instance will not respond following a system reboot. SRU testing only restarts a given instance once, after it preps all of the necessary files to be tested.

Series: Disco
Instance Size: Basic_A3
Region: (Default) US-WEST-2
Kernel Version: 4.18.0-1013-azure #13-Ubuntu SMP Thu Feb 28 22:54:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I initiated a series of tests which rebooted Azure cloud instances 50 times. During the 49th reboot, an instance failed to return from the reboot. Upon grabbing the console output, the following was seen scrolling endlessly. I have also seen this failure in cases where the instance had only been restarted a handful of times (>5).

[ 84.247704]hyperv_fb: unable to send packet via vmbus
[ 84.247704]hyperv_fb: unable to send packet via vmbus
[ 84.247704]hyperv_fb: unable to send packet via vmbus
[ 84.247704]hyperv_fb: unable to send packet via vmbus
[ 84.247704]hyperv_fb: unable to send packet via vmbus
[ 84.247704]hyperv_fb: unable to send packet via vmbus
[ 84.247704]hyperv_fb: unable to send packet via vmbus
[ 84.247704]hyperv_fb: unable to send packet via vmbus

In another test attempt I saw the following failure:

ERROR ExtHandler /proc/net/route contains no routes
ERROR ExtHandler /proc/net/route contains no routes
ERROR ExtHandler /proc/net/route contains no routes
ERROR ExtHandler /proc/net/route contains no routes
ERROR ExtHandler /proc/net/route contains no routes
ERROR ExtHandler /proc/net/route contains no routes
ERROR ExtHandler /proc/net/route contains no routes

Both of these failures broke networking, and each was seen two to three times, which may explain why in some cases we never recover from an instance reboot.

Sean Feole (sfeole)
description: updated
Changed in linux-azure (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Sean Feole (sfeole)
description: updated
Changed in linux-azure (Ubuntu):
importance: Undecided → High
status: New → In Progress
Revision history for this message
Colin Ian King (colin-king) wrote :

The "hyperv_fb: Unable to send packet via vmbus" message is from synthvid_send(), drivers/video/fbdev/hyperv_fb.c (the Microsoft Hyper-V Synthetic Video Frame Buffer Driver). This error occurs when vmbus_sendpacket() fails to send a packet via the write ring buffer (see hv_ringbuffer_write())

This failure happens either because:

1. the channel has been rescinded (channel->rescind is non-zero), or
2. the write ring buffer is full.

Adding some more debug to the error paths in hv_ringbuffer_write() may shed more light on which one it is.
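
For what it's worth, on kernels of this vintage the per-channel ring buffer state is also exposed via sysfs, so possibility 2 (a full write ring) can be checked from userspace without patching the kernel. A minimal sketch, assuming the usual vmbus sysfs layout (/sys/bus/vmbus/devices/<uuid>/channels/<relid>/write_avail; exact paths may differ by kernel version):

#!/bin/bash
# Print the remaining write space of every vmbus channel ring buffer.
# A value stuck at or near 0 would point at a full write ring.
for ch in /sys/bus/vmbus/devices/*/channels/*; do
    dev=$(basename "$(dirname "$(dirname "$ch")")")
    rel=$(basename "$ch")
    avail=$(cat "$ch/write_avail" 2>/dev/null || echo "n/a")
    echo "device $dev channel $rel write_avail=$avail"
done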

Revision history for this message
Colin Ian King (colin-king) wrote :

I believe the "ERROR ExtHandler /proc/net/route contains no routes" message is from the "Azure Linux Agent"; that is not kernel code, so I have no idea how or why it is emitting that message.

tags: added: kernel-hyper-v
Revision history for this message
Joshua R. Poulson (jrp) wrote :

walinuxagent is a package from the Ubuntu repository, but the upstream is here: https://github.com/Azure/WALinuxAgent

The error message is seen in line 884 here: https://github.com/Azure/WALinuxAgent/blob/ce58412bc50cee1d066bd01f0e52ad89d75f1a0d/azurelinuxagent/common/osutil/default.py

The agent is likely finding no routes because the network wasn't up. When you are seeing other vmbus errors, it is possible netvsc isn't working either.
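
For reference, /proc/net/route always contains one header line, so "contains no routes" just means there is nothing beyond that header. A quick way to reproduce the check from a shell (my own sketch, not the agent's code):

# /proc/net/route has a single header line; anything beyond it is a route entry.
routes=$(($(wc -l < /proc/net/route) - 1))
if [ "$routes" -le 0 ]; then
    echo "ERROR: /proc/net/route contains no routes"
else
    echo "$routes route(s) present"
fi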

We're taking a look. The error is familiar and there might be a missing patch in 18.04's 4.18.0-1013 kernel.

Revision history for this message
Sean Feole (sfeole) wrote :

I revisited this particular problem today. I ran a series of restart tests across westus2 and westus, utilizing the same instance and series, but with the latest kernel in -proposed. I was unable to reproduce this problem as originally reported in the bug. I also have not seen any of our SRU testing stall due to problems of this nature, where the instance fails to respond after a restart.

I think it's safe to say we can close this bug and mark it resolved for the time being.

Revision history for this message
Sean Feole (sfeole) wrote :

Test Parameters for comment #4

Series: Disco
Regions: westus2 & westus
Instance: Basic_A3
Kernel: 5.0.0-1007.7

Revision history for this message
Dexuan Cui (decui) wrote :

It's good to see that the issue cannot be reproduced with the 5.0.0-1007.7 kernel.

@sfeole: The line "[ 84.247704]hyperv_fb: unable to send packet via vmbus" usually means the VM has panicked. Do you still happen to have the full serial log containing this error line? It would be good to understand this error on the 4.18.0-1013-azure kernel.
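
For anyone hitting this: besides the portal (Boot Diagnostics -> Serial log), the serial log can also be pulled with the Azure CLI if boot diagnostics are enabled. A sketch, with the resource group and VM name as placeholders:

# Fetch the boot diagnostics serial log for a VM (boot diagnostics must be enabled).
az vm boot-diagnostics get-boot-log --resource-group MyResourceGroup --name MyVM > serial.log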

Revision history for this message
Kirk Elliott (kirk9706) wrote :

This bit me tonight on a few-days-old 18.04 VM with the stock kernel. So far I cannot get back in via reboot or redeploy to upgrade the kernel.

Revision history for this message
Kirk Elliott (kirk9706) wrote :

kernel is 4.18.0-1018-azure

Revision history for this message
Dexuan Cui (decui) wrote :

@Kirk: Can you please share the VM's serial log, which can be obtained from Azure portal's Boot Diagnostics -> Serial log?

Revision history for this message
Dexuan Cui (decui) wrote :

@Kirk: I suppose you can get your VM back by force-restarting the VM via the Azure Portal (or the Azure command line)?
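
If the portal restart doesn't help, a forced stop (deallocate) followed by a start can also be done from the CLI. A sketch, with placeholder resource group and VM names:

# Force the VM off (this deallocates it) and bring it back up.
az vm deallocate --resource-group MyResourceGroup --name MyVM
az vm start --resource-group MyResourceGroup --name MyVM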

Revision history for this message
Kirk Elliott (kirk9706) wrote :

@Dexuan I emailed you my serial log. I restarted via the portal a number of times without success. I tried to redeploy twice and it failed. Trying to restart via the CLI now.

Revision history for this message
Kirk Elliott (kirk9706) wrote :

How do I upgrade the kernel to 5.0.0-1007.7, if that is what is recommended to avoid this issue? apt-get dist-upgrade on another VM with 4.18.0-1018-azure did not upgrade the kernel. Thank God it restarted! Or is 5.0.0-1007.7 not an Azure kernel, meaning I would need to switch to the stock Ubuntu kernel?

Revision history for this message
Kirk Elliott (kirk9706) wrote :

I got the VM back by force stopping, then starting. It looks like 5.0.0-1007.7 only comes with 19.10, which is not available yet on Azure?

Revision history for this message
Kirk Elliott (kirk9706) wrote :

I see now 5.0.0-1007.7 is also available for 19.04, but I'll wait and see what you advise @Dexuan after looking at my serial log.

Revision history for this message
Terry Rudd (terrykrudd) wrote : Re: [Bug 1822133] Re: Azure Instance never recovered during series of instance reboots.

On 5/29/19 6:58 PM, Sean Feole wrote:
> I revisited this particular problem today. I ran a series of restart
> tests across westus2 and westus, utilizing the same instance and
> series, but with the latest kernel in -proposed. I was unable to
> reproduce this problem as originally reported in the bug. I also have
> not seen any of our SRU testing stall due to problems of this nature,
> where the instance fails to respond after a restart.
>
> I think it's safe to say we can close this bug and mark it resolved for
> the time being.
>

Sean, we plan to do this in today's call.

Terry

Revision history for this message
Dexuan Cui (decui) wrote :

I tried to reproduce the bug but couldn't.

My Ubuntu 18.04 VM (in West US 2, VM size Basic A3: 4 vcpus, 7 GiB memory) is still running fine after I rebooted it 100 times with the /etc/rc.local script below:

#!/bin/bash
# Record each boot, then keep rebooting until 100 boots have been logged.
date >> /root/reboot.log
NUM=$(wc -l /root/reboot.log | cut -d' ' -f1)
[ "$NUM" -le 100 ] && reboot

In Kirk's log, it looks like the VM failed to obtain an IP via DHCP (or somehow the VM's network went down?), and there are a lot of lines of

WARNING ExtHandler CGroup walinuxagent.service: Crossed the Memory Threshold. Current Value: 627945472 bytes, Threshold: 512 megabytes.

I don't know what the line means.

I don't see any issue in the Linux kernel and drivers. I guess the issue may be in the waagent/cloud-init daemons, or somewhere outside the VM.

Since I cannot reproduce the issue, and it looks like nobody can reproduce it now (Kirk's VM appears to work fine after it was stopped and started again), I cannot debug it further.

If somebody manages to reproduce the issue again, please check whether you can still log in to the VM via the Azure serial console. If so, check whether the VM has an IP address with "ifconfig -a". If the VM has no IP, we'll have to check the syslog (and/or the network manager's log?) to find out what went wrong. If the VM has an IP but is unable to communicate with the outside world, something may be wrong with the Linux network device driver.
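
A minimal sketch of those checks, run from the Azure serial console (the exact log files and services vary per image, so adjust as needed):

# 1. Does the VM have an IP address and a default route?
ifconfig -a          # or: ip -br addr
ip route             # should show a "default via ..." entry
# 2. If there is no IP, look for DHCP / netvsc trouble in the logs.
grep -iE 'dhcp|netvsc' /var/log/syslog | tail -n 50
# 3. If there is an IP but no outside connectivity, suspect the network driver.
ping -c 3 8.8.8.8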

Revision history for this message
Kirk Elliott (kirk9706) wrote :

I rebooted our VM over 60 times today without any further issues. Since the VM's network went down before I restarted the VM or lost access to the serial console login, I'm now also thinking it was a network issue to begin with. It could be a coincidence that I was able to get the VM back by force-stopping and then starting it from a stopped state, when neither restarting nor redeploying helped, but there is one thing (correct me if this is wrong) that is different about force-stopping: it deallocates the VM. If the network issue were internal to the VM, I wonder whether both restarting and redeploying could have, with the best of intentions, persisted the faulty network state to the point where those operations failed, while deallocating cleared the issue and let the VM start over clean.

Revision history for this message
Kirk Elliott (kirk9706) wrote :

@Dexuan, the error 'WARNING ExtHandler CGroup walinuxagent.service: Crossed the Memory Threshold' started when the VM was first created on May 24, persisted until I stopped/deallocated the VM, and has not returned in syslog. Nor, for what it's worth, has this error been seen in over six months on another Ubuntu 18.04 VM we have.

Revision history for this message
Tom Wilberding (tom-wilberding) wrote :

It looks like people are resolving this by upgrading to the latest Disco with the 5.0.0-1007.7 kernel?

We are on 18.04.1 LTS and having similar symptoms. Is there a fix in the works for 18.04.1?

Revision history for this message
Sean Feole (sfeole) wrote :

@tom-wilberding Hey Tom, are you able to capture the console logs of the instance when this occurs?

Revision history for this message
Tom Wilberding (tom-wilberding) wrote :

@sfeole, checking with our devops folks. I don't have Azure access to the hosts in question.

Revision history for this message
rs (rnsc) wrote :

Hello, I noticed the same behavior on one of my Azure VMs.
It hadn't been rebooted, though; it just started blurting out "hyperv_fb" errors in a loop (that's all I could display in the Serial Console on Azure). Even sending SysRq to reboot the machine wasn't doing anything, so I was forced to de-allocate my VM.

Revision history for this message
Marcelo Cerri (mhcerri) wrote :

@rnsc, what kernel were you using?

Revision history for this message
Dexuan Cui (decui) wrote :

@rnsc: Can you please share the VM's full serial log, which can be obtained from Azure portal's Boot Diagnostics -> Serial log?

Terry Rudd (terrykrudd)
Changed in linux-azure (Ubuntu):
status: In Progress → Incomplete
Revision history for this message
Vlad Rudenko (vlar) wrote :

Hi,

We got this issue many times last week, even after redeploying the VM (i.e., moving it to another physical host)!

Attached is the serial log with the kernel panic.

Revision history for this message
Dexuan Cui (decui) wrote :

@Vlad: This is from your attachment:

[21965.367843] kernel BUG at /build/linux-azure-njdnVX/linux-azure-4.15.0/net/ipv4/ip_output.c:636!
[21965.377590] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-1056-azure #61-Ubuntu
[21965.435777] RIP: 0010:ip_do_fragment+0x571/0x860
[21965.435777] ip_fragment.constprop.47+0x43/0x90
[21965.435777] ip_finish_output+0xf6/0x270
[21965.435777] ip_output+0x75/0xf0
[21965.435777] ip_forward_finish+0x51/0x80
[21965.435777] ip_forward+0x38a/0x480
[21965.435777] ip_rcv_finish+0x122/0x410
[21965.435777] ip_rcv+0x292/0x360
[21965.435777] __netif_receive_skb_core+0x809/0xbc0
[21965.435777] __netif_receive_skb+0x18/0x60
[21965.435777] netif_receive_skb_internal+0x37/0xe0
[21965.435777] napi_gro_receive+0xd0/0xf0
[21965.435777] netvsc_recv_callback+0x16d/0x220 [hv_netvsc]
[21965.435777] rndis_filter_receive+0x23b/0x580 [hv_netvsc]
[21965.435777] netvsc_poll+0x17e/0x630 [hv_netvsc]
[21965.435777] net_rx_action+0x265/0x3b0
[21965.435777] __do_softirq+0xf5/0x2a8
[21965.435777] irq_exit+0x106/0x110
[21965.435777] hyperv_vector_handler+0x63/0x76

So you're running a 4.15.0-1056-azure kernel, which panics at RIP: 0010:ip_do_fragment+0x571/0x860 (net/ipv4/ip_output.c:636).

This is a known issue in this version of the kernel, and we're working on a new version which will fix it.

The identified upstream fix is:

commit 5d407b071dc369c26a38398326ee2be53651cfe4
Author: Taehee Yoo <email address hidden>
Date: Mon Sep 10 02:47:05 2018 +0900
Subject: ip: frags: fix crash in ip_do_fragment()
( https://github.com/torvalds/linux/commit/5d407b071dc369c26a38398326ee2be53651cfe4 )

Meanwhile, if possible, please downgrade the kernel from -1056 to -1052, which, according to support, does not crash.
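
A possible way to do that downgrade, assuming the -1052 packages are still in the archive (the package names below follow the usual Ubuntu azure kernel naming and should be verified with apt-cache first):

# Install the older azure kernel alongside the current one (package names assumed).
sudo apt-get update
sudo apt-get install linux-image-4.15.0-1052-azure linux-modules-4.15.0-1052-azure \
    linux-modules-extra-4.15.0-1052-azure
# Then pick 4.15.0-1052-azure from the GRUB menu (or adjust GRUB_DEFAULT) and reboot.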

Revision history for this message
Joshua R. Poulson (jrp) wrote :

"ip: frags: fix crash in ip_do_fragment()" is in 4.15.0-1057 in proposed right now for a separate issue.

Revision history for this message
Colin Ian King (colin-king) wrote :

Indeed, the commit is in 4.15.0-1057 and has been released, so I'm marking this bug as Fix Released. A quick check/upgrade sketch follows the commit message below.

commit b502cfeffec81be8564189e5498fd3f252b27900
Author: Taehee Yoo <email address hidden>
Date: Wed Sep 4 14:40:49 2019 -0300

    ip: frags: fix crash in ip_do_fragment()

    BugLink: https://bugs.launchpad.net/bugs/1842447

    commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

    A kernel crash occurrs when defragmented packet is fragmented
    in ip_do_fragment().
    In defragment routine, skb_orphan() is called and
    skb->ip_defrag_offset is set. but skb->sk and
    skb->ip_defrag_offset are same union member. so that
    frag->sk is not NULL.
    Hence crash occurrs in skb->sk check routine in ip_do_fragment() when
    defragmented packet is fragmented.
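
For anyone still on an affected 4.15 azure kernel, upgrading should pull in 4.15.0-1057 or later, assuming the linux-azure meta-package is installed. A quick check/upgrade sketch:

# Confirm the running kernel and upgrade to the fixed linux-azure kernel.
uname -r                          # 4.15.0-1057-azure or later contains the fix
sudo apt-get update
sudo apt-get install linux-azure  # meta-package tracking the latest azure kernel
sudo reboot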

Changed in linux-azure (Ubuntu):
status: Incomplete → Fix Released