Ubuntu 18.04 guest hangs during diskio stress

Bug #1778717 reported by Sujith Pandel on 2018-06-26
This bug affects 1 person
Affects          Importance   Assigned to
dellserver       Undecided    Unassigned
linux (Ubuntu)   High         Joseph Salisbury
Bionic           High         Joseph Salisbury
Cosmic           High         Joseph Salisbury

Bug Description

Setup: a Dell EMC 14th-generation AMD server running ESXi 6.7 as the host, with an Ubuntu 18.04 guest (latest updates).

The guest has three data disks in addition to the boot disk. All of these disks are exposed through the "ParaVirtual SCSI controller" from the ESXi 6.7 host.
Run disk I/O stress on all three data disks overnight.
The guest becomes unresponsive partway through the run (even ping fails).
The only way to recover is to reset the guest.
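
The report does not name the stress tool that was used. As a rough stand-in only, the sketch below writes random blocks to a file, forces them to disk with fsync, and verifies them on read-back. The function name `stress_file` and its parameters are hypothetical, and the read-back may be served from the page cache, so this is not a substitute for a real stress tool.

```python
import os
import tempfile


def stress_file(path, block_size=1 << 20, blocks=8, passes=2):
    """Write random blocks to `path`, fsync them, then verify on read-back.

    Returns the total number of bytes written across all passes.
    """
    total = 0
    for _ in range(passes):
        data = [os.urandom(block_size) for _ in range(blocks)]
        with open(path, "wb") as f:
            for chunk in data:
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data through to the block device
        with open(path, "rb") as f:
            for i, chunk in enumerate(data):
                if f.read(block_size) != chunk:
                    raise IOError(f"data mismatch in block {i} of {path}")
        total += block_size * blocks
    return total


if __name__ == "__main__":
    # A temporary directory is used here; on the guest, `path` would instead
    # point at a file on each of the data disks (mount points are not given
    # in the report).
    with tempfile.TemporaryDirectory() as d:
        written = stress_file(os.path.join(d, "stress.bin"), block_size=1 << 16)
        print(f"wrote and verified {written} bytes")
```

To approximate the reported scenario, one such loop would run against each of the three data disks in parallel for many hours.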

The logs do not reveal much about this failure.
Occasionally a crash trace appears on the guest console. Two screenshots from different occurrences are attached.

We also tried mainline kernel v4.18-rc2 and hit the issue there as well.

Changing the SCSI controller to "LSI Logic SAS adapter" helps: we have not seen any crash or hang with that controller.
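
When comparing the two controller types, it can help to confirm which SCSI driver the guest has actually loaded. The sketch below parses `/proc/modules` for `vmw_pvscsi` (the Linux driver for the VMware paravirtual controller) and `mptsas` (assumed here to be the driver that binds to the emulated LSI Logic SAS adapter). The helper name `loaded_scsi_drivers` is hypothetical, and drivers built into the kernel do not appear in `/proc/modules`.

```python
import os


def loaded_scsi_drivers(proc_modules_text, candidates=("vmw_pvscsi", "mptsas")):
    """Return the candidate driver names found in /proc/modules-style text."""
    loaded = {
        line.split()[0]  # the first field of each line is the module name
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in candidates if name in loaded]


if __name__ == "__main__" and os.path.exists("/proc/modules"):
    with open("/proc/modules") as f:
        print(loaded_scsi_drivers(f.read()))
```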


This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1778717

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Sujith Pandel (sujithpandel) wrote :

* Also reproduced with the Ubuntu 16.04.4 HWE kernel, 4.13.0-45-generic.

* Not seen with the 4.4 LTS kernel of Ubuntu 16.04.

Sujith Pandel (sujithpandel) wrote :

Moving back to Confirmed, since all the available logs are already uploaded.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the last kernel that did not have this issue.

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

4.5 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/
4.8 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/
4.10 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!
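
For reference, the search for the first affected kernel amounts to a binary search over an ordered list of versions. The sketch below is only an illustration: `first_bad` and `is_bad` are hypothetical names, the list combines the kernels mentioned in this report, and the sample predicate is invented for the demonstration rather than a test result. It assumes that every kernel after the first bad one is also bad and that the newest kernel in the list is bad.

```python
def first_bad(versions, is_bad):
    """Return the earliest version for which is_bad(version) is true.

    `versions` must be ordered oldest to newest, and the last entry is
    assumed to be bad. Each is_bad() call stands for one manual test run.
    """
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid        # regression is at mid or earlier
        else:
            lo = mid + 1    # regression is after mid
    return versions[lo]


if __name__ == "__main__":
    # 4.4 was reported good and 4.13 bad; 4.5, 4.8 and 4.10 are untested.
    kernels = ["4.4", "4.5", "4.8", "4.10", "4.13"]
    # Invented outcome, for illustration only: pretend 4.10 and later hang.
    pretend_bad = {"4.10", "4.13"}
    print(first_bad(kernels, lambda v: v in pretend_bad))
```

With five candidate versions, at most three of them need to be tested, which is why not every kernel in the list has to be tried.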

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → Incomplete
Changed in linux (Ubuntu Cosmic):
status: Confirmed → Incomplete
Changed in linux (Ubuntu Bionic):
status: Incomplete → Triaged
Changed in linux (Ubuntu Cosmic):
status: Incomplete → Triaged
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
status: Triaged → Incomplete
Changed in linux (Ubuntu Cosmic):
status: Triaged → Incomplete