Ringtail on Hyper-V causes BUG: scheduling while atomic

Bug #1180419 reported by Sudrien
This bug affects 9 people
Affects         Status        Importance  Assigned to       Milestone
linux (Ubuntu)  Fix Released  High        Unassigned
Raring          Fix Released  High        Joseph Salisbury
Saucy           Fix Released  High        Unassigned

Bug Description

Ringtail 3.8.0-19-generic on Hyper-V. Bears at least a slight resemblance to #752064.

This is hand copied, as it's not showing up in messages

[ 23.921.954] BUG: scheduling while atomic: swapper/0/0x10000100
[ 168.973600] end request: I/O error, dev sda, sector 16786984
[ 168.973780] Buffer I/O error on device sda1, logical block 2098117

Similar messages repeat a few times, until it starts complaining about an offline device. The CPU is showing pretty high activity.
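
For anyone trying to confirm the same symptoms, here is a minimal sketch that counts the message types quoted above in a kernel log. It assumes the console output was captured somewhere readable (for example over a serial console), since in this report the messages did not reach the on-disk logs. The function name scan_kernel_log is hypothetical.

```shell
# Count lines on stdin that match the symptoms quoted in this report:
# the "scheduling while atomic" BUG and the follow-on disk I/O errors.
scan_kernel_log() {
    grep -E -c 'BUG: scheduling while atomic|I/O error, dev sd[a-z]+|Buffer I/O error on device'
}

# Example usage (dmesg may need root on some systems):
#   dmesg | scan_kernel_log
#   scan_kernel_log < captured-serial-console.log
```

A non-zero count only shows that the messages are present; it does not by itself identify the cause.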

Halt from the command line works on this system.
The problem occurs only when a shutdown command is issued from Hyper-V.
---
AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 May 15 2013 seq
 crw-rw---T 1 root audio 116, 33 May 15 2013 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.9.2-0ubuntu8
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 13.04
HibernationDevice: RESUME=UUID=67816604-7de5-4ce5-9805-84ecd77d9e63
IwConfig: Error: [Errno 2] No such file or directory
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
MachineType: Microsoft Corporation Virtual Machine
MarkForUpload: True
Package: linux 3.8.0.19.35
PackageArchitecture: amd64
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.8.0-19-generic root=UUID=8e612ddd-6458-4783-a7c7-6c6acf6b48ac ro nosplash quiet
ProcVersionSignature: Ubuntu 3.8.0-19.30-generic 3.8.8
RelatedPackageVersions:
 linux-restricted-modules-3.8.0-19-generic N/A
 linux-backports-modules-3.8.0-19-generic N/A
 linux-firmware 1.106
RfKill: Error: [Errno 2] No such file or directory
Tags: raring
Uname: Linux 3.8.0-19-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

dmi.bios.date: 03/19/2009
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 090004
dmi.board.name: Virtual Machine
dmi.board.vendor: Microsoft Corporation
dmi.board.version: 7.0
dmi.chassis.asset.tag: 1897-8098-1654-4953-6712-7975-54
dmi.chassis.type: 3
dmi.chassis.vendor: Microsoft Corporation
dmi.chassis.version: 7.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr090004:bd03/19/2009:svnMicrosoftCorporation:pnVirtualMachine:pvr7.0:rvnMicrosoftCorporation:rnVirtualMachine:rvr7.0:cvnMicrosoftCorporation:ct3:cvr7.0:
dmi.product.name: Virtual Machine
dmi.product.version: 7.0
dmi.sys.vendor: Microsoft Corporation

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1180419

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: raring
Revision history for this message
Sudrien (sudrien) wrote : BootDmesg.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
Sudrien (sudrien) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Sudrien (sudrien) wrote : Dependencies.txt

apport information

Revision history for this message
Sudrien (sudrien) wrote : Lspci.txt

apport information

Revision history for this message
Sudrien (sudrien) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Sudrien (sudrien) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Sudrien (sudrien) wrote : ProcModules.txt

apport information

Revision history for this message
Sudrien (sudrien) wrote : UdevDb.txt

apport information

Revision history for this message
Sudrien (sudrien) wrote : UdevLog.txt

apport information

Revision history for this message
Sudrien (sudrien) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.10 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.10-rc1-saucy/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
tags: added: kernel-da-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

These bugs typically occur in an interrupt context. Do you have a way to reproduce this bug? If so, there is a wiki page[0] that provides some information on gathering more data specific to this type of bug. If possible, can you review the wiki page and gather the additional data listed on that page?

[0] https://wiki.ubuntu.com/Kernel/DebuggingSchedulingWhileAtomic

Sudrien (sudrien)
tags: added: kernel-unable-to-test-upstream
Revision history for this message
Sudrien (sudrien) wrote :

Was unable to test upstream, hyper-v modules not there to load, apparently. On 3.8.0-19-generic x86_64 I've got

Module                  Size  Used by
vesafb                 13828  1
joydev                 17377  0
hid_generic            12540  0
hid_hyperv             13059  0
hid                   101002  2 hid_hyperv,hid_generic
microcode              22881  0
psmouse                95870  0
serio_raw              13215  0
mac_hid                13205  0
lp                     17759  0
parport                46345  1 lp
hv_storvsc             17495  2
hv_netvsc              22768  0
hv_utils               13568  0
hv_vmbus               34431  4 hv_netvsc,hid_hyperv,hv_utils,hv_storvsc
floppy                 69449  0

network wasn't coming up either, udev issues.
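
A quick way to check whether a test kernel has the Hyper-V guest drivers loaded is to filter the lsmod output for them. This is only a sketch: the function name list_hyperv_modules is hypothetical, and the match (names starting with hv_, plus hid_hyperv) is based solely on the module list shown above.

```shell
# Print the Hyper-V guest modules found in lsmod-style output on stdin.
# Matches module names starting with "hv_" and the hid_hyperv input driver.
list_hyperv_modules() {
    awk '$1 ~ /^hv_/ || $1 == "hid_hyperv" { print $1 }'
}

# Example usage:
#   lsmod | list_hyperv_modules
```

If this prints nothing on a Hyper-V guest, the drivers are probably either missing or built into the kernel rather than loaded as modules.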

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Sudrien (sudrien) wrote :

re comment 13

shutdown -h now and halt work as expected; it's only when a shutdown is issued from Hyper-V that the issues show up. As such, nothing is able to make it to disk.

Revision history for this message
Sudrien (sudrien) wrote :

For all its ugliness, I did get a kernel to work. Marking 'kernel-fixed-upstream'.

git clone git://kernel.ubuntu.com/ubuntu/ubuntu-saucy.git
cd ubuntu-saucy
fakeroot debian/rules clean
make localmodconfig
make-kpkg --append-to-version hyperv1 --initrd kernel_image
apt-get install linux-tools-comm
dpkg -i linux-tools-3.9.0-2_3.9.0-2.6_amd64.deb
dpkg -i linux-image-3.9.2hyperv1+_3.9.2hyperv1+-10.00.Custom_amd64.deb

tags: added: kernel-fixed-upstream
removed: kernel-unable-to-test-upstream
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to also test the latest 3.8 stable kernel, which is 3.8.13:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8.13-raring/

You may need to perform the same steps as in comment #16. If you need to do that, the latest stable kernel can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git

If the bug still exists in 3.8.13, we can perform a reverse bisect to identify the commit that fixes this in 3.9.
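
For reference, a reverse bisect searches for the commit that fixes a bug rather than the one that introduced it. The sketch below uses git bisect with custom terms, which needs Git 2.7 or later. The function name start_reverse_bisect is hypothetical, and the example revisions assume the bug is present in mainline v3.8 and absent in v3.9.

```shell
# start_reverse_bisect FIXED_REV BROKEN_REV
# Start a bisect that looks for the first commit where the bug is fixed.
# "broken" plays the role of the old state, "fixed" the new state.
start_reverse_bisect() {
    git bisect start --term-old=broken --term-new=fixed &&
    git bisect fixed "$1" &&
    git bisect broken "$2"
}

# Example (run inside a mainline kernel clone):
#   start_reverse_bisect v3.9 v3.8
# Build and boot each kernel git checks out, trigger a shutdown from Hyper-V,
# then report the result with "git bisect fixed" or "git bisect broken".
# Finish with "git bisect reset".
```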

Revision history for this message
Ionel Cristian Mărieș (ionel-mc) wrote :

Any workaround for this?

This still reproduces on 3.8.0-25-generic.

Revision history for this message
Sitsofe Wheeler (sitsofe) wrote :

I'm seeing this too and it's highly reproducible (just click Action -> Shutdown in Virtual Machine Connection). It causes the VM to hang and its disks to go offline (network connections will fail too). For those searching the net, my backtrace is attached. For anyone else wondering how to get such a backtrace: set up COM1 to use a named pipe (e.g. ubuntu) in Hyper-V Manager, connect to the pipe using a program like Tera Term, and add console=ttyS0,115200n8 to grub.
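
As a rough sketch of the guest side of that serial console setup, on Ubuntu the kernel parameter can go into /etc/default/grub. The console=tty0 entry is an assumption added here so that the normal VM console keeps working; only console=ttyS0,115200n8 comes from the description above.

```shell
# Fragment for /etc/default/grub (the file uses shell variable syntax).
# console=ttyS0,115200n8 sends kernel messages to COM1 at 115200 baud, 8N1.
# console=tty0 is an assumption: it keeps output on the VM's normal console too.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"

# Then regenerate the GRUB configuration and reboot:
#   sudo update-grub
```

Removing "quiet" from the default command line may also help, so that more kernel messages appear on the console.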

To me this looks like the issue that was fixed by https://patchwork.kernel.org/patch/2027361/ , which seems to be part of the patch series mentioned in https://lkml.org/lkml/2013/1/23/570 . This patch is not in the 3.8.13 stable kernel ( http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/drivers/hv/hv_util.c?h=linux-3.8.y ). The patch did go mainline eventually, in the 3.9 kernel ( https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/hv/hv_util.c?id=3dd6cb497198a0533a2530b6a345c60c9a29b9bc ; git describe --contains 3dd6cb497198a0533a2530b6a345c60c9a29b9bc returns v3.9-rc1~124^2~27).
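
The git describe --contains output names the earliest tag that contains the commit (v3.9-rc1), followed by a path of ancestor steps from that tag back to the commit (~124^2~27). A small sketch for pulling out just the tag name follows; the helper name first_containing_tag is hypothetical.

```shell
# Strip the ancestry suffix (everything from the first "~" or "^") from
# "git describe --contains" output, leaving only the tag name.
first_containing_tag() {
    printf '%s\n' "${1%%[~^]*}"
}

first_containing_tag 'v3.9-rc1~124^2~27'   # prints v3.9-rc1

# Example usage inside a kernel clone:
#   first_containing_tag "$(git describe --contains 3dd6cb497198a0533a2530b6a345c60c9a29b9bc)"
```

To ask directly whether a given release includes the commit, git merge-base --is-ancestor COMMIT TAG exits with status 0 when it does.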

Fedora 18's 3.8.9-200 kernel has the same problem as Ubuntu Raring's 3.8.0-19 kernel, but Fedora 18's 3.9.6-200 does not have this issue.

Revision history for this message
Mi Tom (crazy-k) wrote :

Confirmed in all Raring installations running under both Hyper-V Server 2008 R2 and Hyper-V Server 2012, with 30+ virtual machines.
Unfortunately I've upgraded all of my Ubuntu 12.10 virtual machines.

The problem exists not only when shutting down the guest operating system manually but also when the hypervisor wants to shut down the virtual machine (e.g. during a hypervisor restart, or when stopping due to a UPS low-battery signal).

In that situation the Ubuntu virtual machine crashes without reporting anything to the hypervisor. Because of the crash, sd rejects I/O to the offline device, so e.g. MySQL replication or MySQL itself usually ends up broken. The hypervisor's restart or shutdown then cannot finish, which can also crash the hypervisor's hard disks (e.g. when the UPS battery is exhausted).

I'm a little astonished that this bug is not solved yet and not even assigned to anyone, as it TOTALLY RULES OUT USING Ubuntu 13.04 IN PRODUCTION environments based on Hyper-V virtualization.

In my opinion it shouldn't be categorized as medium importance; it is a CRITICAL one.

Changed in linux (Ubuntu Raring):
status: New → Confirmed
importance: Undecided → High
Changed in linux (Ubuntu Saucy):
importance: Medium → High
status: Confirmed → Fix Released
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It doesn't look like commit 3dd6cb4 was Cc'd to stable. I'll cherry pick this commit into Raring and build a test kernel. I'll post a link to this kernel shortly.

Changed in linux (Ubuntu Raring):
status: Confirmed → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I created a Raring test kernel with a cherry pick of commit 3dd6cb4. The kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1180419/

Can folks affected by this bug test this kernel, and post back if it still exhibits the bug or not?

You will need to install both the linux-image and linux-image-extra .deb packages.

If this commit fixes this bug, I can request that it be included in upstream stable and submit an SRU for inclusion in Raring.

Thanks in advance!

Revision history for this message
Mi Tom (crazy-k) wrote :

Thanks!

It seems the fix solved the issue.
I've just verified it on 2 different virtual machines having this problem and problem disappeared.

Thanks again!

When can it appear in the official kernel?

Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed' to 'verification-done'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!
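
As a rough illustration of what enabling -proposed involves, the sketch below prints the APT source line for a given release. The helper name proposed_source_line and the mirror URL are assumptions; the wiki page above is the authoritative reference.

```shell
# Print the APT source line that enables the -proposed pocket for a release.
# The archive.ubuntu.com mirror is an assumption; use your usual mirror.
proposed_source_line() {
    printf 'deb http://archive.ubuntu.com/ubuntu/ %s-proposed restricted main multiverse universe\n' "$1"
}

proposed_source_line raring

# Typical follow-up, as root:
#   proposed_source_line raring > /etc/apt/sources.list.d/proposed.list
#   apt-get update
#   apt-get install -t raring-proposed linux-image-generic
```

Removing the source file again after testing keeps other -proposed packages from being pulled in by later upgrades.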

tags: added: verification-needed-raring
Revision history for this message
Sudrien (sudrien) wrote :

The build in comment #22 worked just fine on the VM I was originally having issues with. Thank you.

tags: added: verification-done
removed: verification-needed-raring
Revision history for this message
David Medberry (med) wrote :

Also verified the actual raring-proposed (vs the #22). Thanks brad-figg and jsalisbury. Please get this into raring promptly!

Changed in linux (Ubuntu Raring):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Raring):
status: Fix Committed → Fix Released
Revision history for this message
Brad Figg (brad-figg) wrote :

@Nicholas,

The raring kernel has not been released yet (not in -updates) so it is just "Fix Committed".

Changed in linux (Ubuntu Raring):
status: Fix Released → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.8.0-27.40

---------------
linux (3.8.0-27.40) raring; urgency=low

  [ Brad Figg ]

  * UBUNTU: [Config] CONFIG_ARM_ERRATA_643719=y

linux (3.8.0-27.39) raring; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1199128

  [ Brad Figg ]

  * [Config] CONFIG_ATH9K_LEGACY_RATE_CONTROL=y

  [ Seth Forshee ]

  * SAUCE: Work around broken ACPI backlight on ThinkPad T430
    - LP: #1183856

  [ Stefan Bader ]

  * (d-i) Add dm-snapshot to md-modules
    - LP: #1191726

  [ Tim Gardner ]

  * [Config] CONFIG_SUNRPC_DEBUG=y
    - LP: #1127319

  [ Upstream Kernel Changes ]

  * Revert "ath9k_hw: Update rx gain initval to improve rx sensitivity"
    - LP: #1193126
  * Revert "serial: 8250_pci: add support for another kind of NetMos
    Technology PCI 9835 Multi-I/O Controller"
    - LP: #1190967
  * mac80211: close AP_VLAN interfaces before unregistering all
    - LP: #1193126
  * ath9k: use correct OTP register offsets for AR9550
    - LP: #1193126
  * regulator: palmas: Fix "enable_reg" to point to the correct reg for
    SMPS10
    - LP: #1193126
  * net: can: kvaser_usb: fix reception on "USBcan Pro" and "USBcan R" type
    hardware.
    - LP: #1193126
  * tg3: Add read dma workaround for 5720
    - LP: #1193126
  * xhci-mem: init list heads at the beginning of init
    - LP: #1193126
  * xhci: fix list access before init
    - LP: #1193126
  * xhci - correct comp_mode_recovery_timer on return from hibernate
    - LP: #1193126
  * xhci: Disable D3cold for buggy TI redrivers.
    - LP: #1193126
  * usb: dwc3: pci: PHY should be deleted later than dwc3 core
    - LP: #1193126
  * usb: dwc3: gadget: free trb pool only from epnum 2
    - LP: #1193126
  * usb: musb: make use_sg flag URB specific
    - LP: #1193126
  * USB: revert periodic scheduling bugfix
    - LP: #1193126
  * USB: serial: fix Treo/Kyocera interrrupt-in urb context
    - LP: #1193126
  * USB: visor: fix initialisation of Treo/Kyocera devices
    - LP: #1193126
  * USB: mos7720: fix DMA to stack
    - LP: #1193126
  * USB: mos7840: fix DMA to stack
    - LP: #1193126
  * USB: ark3116: fix control-message timeout
    - LP: #1193126
  * USB: iuu_phoenix: fix bulk-message timeout
    - LP: #1193126
  * USB: mos7720: fix message timeouts
    - LP: #1193126
  * USB: zte_ev: fix control-message timeouts
    - LP: #1193126
  * USB: Serial: cypress_M8: Enable FRWD Dongle hidcom device
    - LP: #1193126
  * USB: serial: Add Option GTM681W to qcserial device table.
    - LP: #1193126
  * USB: zte_ev: fix broken open
    - LP: #1193126
  * USB: keyspan: fix bogus array index
    - LP: #1193126
  * USB: mos7720: fix hardware flow control
    - LP: #1193126
  * x86/PCI: Map PCI setup data with ioremap() so it can be in highmem
    - LP: #1193126
  * USB: whiteheat: fix broken port configuration
    - LP: #1193126
  * USB: option: blacklist network interface on Huawei E1820
    - LP: #1193126
  * USB: option,zte_ev: move most ZTE CDMA devices to zte_ev
    - LP: #1193126
  * ecryptfs: fixed msync to flush data
    - LP: #1193126
  * dmaengine: ste_dma40: fix pm runtime ref counting
    - LP: #1193126
  * cifs: fix off-by-one bug in build_unc...

Changed in linux (Ubuntu Raring):
status: Fix Committed → Fix Released
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed' to 'verification-done'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-raring
tags: added: verification-done-raring
removed: verification-needed-raring
Revision history for this message
Sitsofe Wheeler (sitsofe) wrote :

I've tested linux-image-generic:amd64 3.8.0.30.48 running on Windows 2012 and the BUG/backtrace has gone.

Brad:
Your comments may have been a bit overly automated as the tags in this bug already contained verification-done when you added your comment (so you in turn set verification-done). Although your comment didn't mention it, I've changed verification-needed-raring to verification-done-raring so hopefully the right thing will happen.
