kernel crash in 12.04 kvm guest root on emulated scsi

Bug #992328 reported by Scott Moser
This bug affects 1 person
Affects         Status     Importance  Assigned to   Milestone
linux (Ubuntu)  Won't Fix  High        Stefan Bader

Bug Description

I ran an instance of precise on the Eucalyptus community cloud. The host runs an unknown version of kvm. The guest is the Ubuntu cloud image 12.04 release (20120424).

I suspect it's an issue with the emulated scsi in kvm (lspci shows: SCSI storage controller: LSI Logic / Symbios Logic 53c895a).

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-23-virtual 3.2.0-23.36
ProcVersionSignature: User Name 3.2.0-23.36-virtual 3.2.14
Uname: Linux 3.2.0-23-virtual x86_64
AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 May 1 00:49 seq
 crw-rw---T 1 root audio 116, 33 May 1 00:49 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
CurrentDmesg: [ 16.560044] eth0: no IPv6 routers present
Date: Tue May 1 00:54:06 2012
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.
Lsusb: Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
PciMultimedia:

ProcEnviron:
 TERM=screen
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=LABEL=cloudimg-rootfs ro console=ttyS0 kloaded=1
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-23-virtual N/A
 linux-backports-modules-3.2.0-23-virtual N/A
 linux-firmware 1.79
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 01/01/2007
dmi.bios.vendor: QEMU
dmi.bios.version: QEMU
dmi.chassis.type: 1
dmi.modalias: dmi:bvnQEMU:bvrQEMU:bd01/01/2007:svn:pn:pvr:cvn:ct1:cvr:

Scott Moser (smoser) wrote :
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Hi Scott,

Do you have a way to reproduce the crash easily?

Changed in linux (Ubuntu):
importance: Undecided → Medium
importance: Medium → High
Scott Moser (smoser) wrote :

Yes, for some definition of "easily". To reproduce:
 - get an account at http://open.eucalyptus.com/CommunityCloud
 - launch an instance of precise with some user data (attached is what I used)

Currently:
emi-907C1D03 smoser-ubuntu-images/ubuntu-precise-12.04-amd64-server-20120424.manifest.xml
emi-66BF1C71 smoser-ubuntu-images/ubuntu-precise-12.04-i386-server-20120424.manifest.xml

I.e., run with:
euca-run-instances --key mykey --user-data-file my.userdata emi-907C1D03

Joseph Salisbury (jsalisbury) wrote :

Hi Scott,

Do you happen to know if this crash happens on earlier kernels, or is this new to Precise?

tags: added: kernel-da-key kernel-key
Scott Moser (smoser) wrote : Re: [Bug 992328] Re: kernel crash in 12.04 kvm guest root on emulated scsi

On Tue, 1 May 2012, Joseph Salisbury wrote:

> Hi Scott,
>
> Do you happen to know if this crash happens on earlier kernels, or is
> this new to Precise?

It's newish in precise. I suspect that the real bug is basically bad
hardware. I.e., that the emulated scsi device that kvm/qemu present to the
guest is buggy in some way and that the kernel is crashing as a result.

In fact, it looks to me like some versions of qemu have disabled scsi
support https://bugzilla.redhat.com/show_bug.cgi?id=621933 .

I'd like to have someone from the kernel team agree or disagree with that
assessment though.

Stefan Bader (smb) wrote :

Glancing over this quickly, I do not think this is due to disabled support: the RH bugzilla is about libvirt not even starting the guest, with a misleading error message (syntax error/some internal error?). In your case the kernel does come up, and it looks like the symbios driver oopses in its interrupt handler, whatever the exact reason may be.

It looks like the crash does not happen immediately after boot but after a bit of use (upgrade running but not too much time has elapsed).

Stefan Bader (smb)
Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
Stefan Bader (smb) wrote :

Crash looks to be here:

4422 in /build/buildd/linux-3.2.0/drivers/scsi/sym53c8xx_2/sym_hipd.c
4423 in /build/buildd/linux-3.2.0/drivers/scsi/sym53c8xx_2/sym_hipd.c
   0xffffffff81438d16 <+742>: mov 0x358(%r13),%rax
   0xffffffff81438d1d <+749>: mov 0x80(%rax),%rdx
   0xffffffff81438d24 <+756>: mov 0xb8(%rdx),%rdx
   0xffffffff81438d2b <+763>: test %rdx,%rdx
   0xffffffff81438d2e <+766>: je 0xffffffff81438ffe <sym_int_sir+1486>

which is in C code:

/*
 * The device didn't switch to MSG IN phase after
 * having reselected the initiator.
 */
case SIR_RESEL_NO_MSG_IN:
        scmd_printk(KERN_WARNING, cp->cmd,
                        "No MSG IN phase after reselection\n");
        goto out_stuck;

RAX = 0xb == SIR_RESEL_NO_MSG_IN and R13 is NULL. 0x358(R13) is cp->cmd

p &((struct sym_ccb *) 0x0)->cmd
$2 = (struct scsi_cmnd **) 0x358
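
The gdb expression above is the usual null-base trick for reading a member offset; in C the same value comes from offsetof. A minimal sketch, assuming a hypothetical stand-in struct (offset_demo_ccb is not the real struct sym_ccb; its padding is only chosen so that cmd lands at 0x358, as gdb reports for this build):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sym_ccb: only the offset of cmd matters. */
struct offset_demo_ccb {
    unsigned char other_fields[0x358]; /* placeholder for everything before cmd */
    void *cmd;                         /* stands in for struct scsi_cmnd *cmd */
};

/* Address that "mov 0x358(%r13),%rax" reads from when r13 holds base. */
static uintptr_t cmd_field_address(uintptr_t base)
{
    return base + offsetof(struct offset_demo_ccb, cmd);
}
```

With base 0 (R13 is NULL) this yields 0x358, an address in the unmapped first page, which is consistent with the load of cp->cmd being the faulting access.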

Stefan Bader (smb) wrote :

So cp is a command control block pointer. The code reads the data structure address from a register and then tries to look that address up in a hashed list of command control blocks in the host control block. There may be no such blocks at all, or none with a matching address. In both cases cp would be NULL, but the code never assumes this could happen.
Either it should not happen and the hardware emulation is broken here, or the interrupt handler should check for the NULL pointer.
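
The second option could look roughly like the sketch below. It is a user-space model, not the driver code: demo_hcb, demo_ccb, demo_insert, demo_ccb_from_dsa and demo_handle_resel_no_msg_in are hypothetical names, and the hash function and bucket count are assumptions. It only shows the lookup returning NULL in both cases described above, and the handler checking cp before it touches cp->cmd.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HASH_SIZE 16 /* assumed bucket count, for illustration only */

struct demo_ccb {
    uint32_t ccb_ba;        /* bus address the (emulated) hardware reports */
    struct demo_ccb *link;  /* next entry in the same hash bucket */
    const char *cmd;        /* stands in for the SCSI command pointer */
};

struct demo_hcb {
    struct demo_ccb *buckets[DEMO_HASH_SIZE];
};

static unsigned demo_hash(uint32_t dsa)
{
    return (dsa >> 4) % DEMO_HASH_SIZE;
}

/* Add a control block at the head of its hash bucket. */
static void demo_insert(struct demo_hcb *np, struct demo_ccb *cp)
{
    unsigned h = demo_hash(cp->ccb_ba);
    cp->link = np->buckets[h];
    np->buckets[h] = cp;
}

/* Returns NULL when the bucket is empty or no entry matches dsa. */
static struct demo_ccb *demo_ccb_from_dsa(struct demo_hcb *np, uint32_t dsa)
{
    struct demo_ccb *cp = np->buckets[demo_hash(dsa)];
    while (cp && cp->ccb_ba != dsa)
        cp = cp->link;
    return cp;
}

/* Returns 0 if a matching block with a command was found, -1 otherwise. */
static int demo_handle_resel_no_msg_in(struct demo_hcb *np, uint32_t dsa)
{
    struct demo_ccb *cp = demo_ccb_from_dsa(np, dsa);
    if (!cp)
        return -1; /* bail out instead of dereferencing cp->cmd */
    return cp->cmd ? 0 : -1;
}
```

Whether bailing out is the right recovery for the real handler is a separate question; the sketch only illustrates where the check would sit.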

Stefan Bader (smb) wrote :

Interestingly this really seems *not* new:

bug #564924 and bug #546458...

Stefan Bader (smb) wrote :

So theoretically the issue may go back to 2.6.24 (Hardy) times which saw

commit 3fb364e089e05c35ead55a08d56d3004193681f6
Author: Matthew Wilcox <email address hidden>
Date: Fri Oct 5 15:55:10 2007 -0400

    [SCSI] sym53c8xx: Use scmd_printk where appropriate

which added the printks that dereference cp->cmd directly. That is the kernel side, at least. The hardware (emulation) side may still play a part, or the crash may depend on specific error conditions that do not always occur.

Stefan Bader (smb) wrote :

Found this old posting that looks like it tried to solve those oopses:

https://lkml.org/lkml/2010/11/18/495

It does not look like it made it anywhere. So I am trying to start a new thread about it right now.

Scott Moser (smoser) wrote :

On Wed, 2 May 2012, Stefan Bader wrote:

> Quickly glancing over I would not think this is due to disabled support
> as the rh bugzilla would be about libvirt not even starting the guest
> with a misleading error message (syntax error/some internal error?). In
> your case the kernel does come up and it looks like the symbios driver
> oopses in its interrupt handler. Whatever the exact reason would be.

Right. I wasn't saying that it was disabled (obviously not in ubuntu's
build or the one that is running on Eucalyptus's host). I was saying that
they've disabled it because it is buggy.

> It looks like the crash does not happen immediately after boot but after
> a bit of use (upgrade running but not too much time has elapsed).

Right. Heavy IO triggers it.
It's either a buggy driver or buggy [virtual] hardware.

graziano obertelli (graziano.obertelli) wrote :

In case this helps, these are some specs of the ECC configuration:

kvm --version
QEMU PC emulator version 0.11.0 (qemu-kvm-0.11.0), Copyright (c) 2003-2008 Fabrice Bellard

CPU: model name : Intel(R) Xeon(R) CPU E5504 @ 2.00GHz

kernel on the NC (node controller): Linux node1 2.6.31-22-generic #73-Ubuntu SMP Fri Feb 11 19:18:05 UTC 2011 x86_64 GNU/Linux

tags: removed: kernel-key
Stefan Bader (smb) wrote :

This is so long ago I barely remember... Wasn't this in the end something where scsi emulation got removed or at least very much discouraged? From my recollection this became some form of "never mind". So let me close it for now. But if this should be looked at, feel free to reopen.

Changed in linux (Ubuntu):
status: Confirmed → Won't Fix