SATA hotplug causes I/O stack to freeze

Bug #1049013 reported by Brian Candler
This bug affects 6 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

PROBLEM:

The problem appeared after updating to 3.2.0-30-generic on an Ubuntu 12.04 x86_64 server with 24 SATA drives and md RAID.

When a hotplug event occurs for a drive (even a drive which is not part of any active md RAID set), the machine hangs badly: I can ping it and I can open a TCP connection to port 22, but an ssh session never starts. The console reports messages such as
"BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u:12:1602]"

For more details including stacktrace see
http://gluster.org/pipermail/gluster-users/2012-September/011354.html
This has a screenshot as attachment:
http://gluster.org/pipermail/gluster-users/attachments/20120909/40d0cfb0/attachment-0001.png
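
The console message quoted above has a regular shape, so a saved console or dmesg capture can be scanned for it mechanically. The following is a minimal sketch, assuming the log is available as a list of text lines; the function name find_soft_lockups is illustrative and not part of any existing tool.

```python
import re

# Pattern for the message format quoted in this report, e.g.
#   BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u:12:1602]
# The bracketed part is "<task name>:<pid>"; the task name may itself
# contain colons, so the pid is taken as the last colon-separated field.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one (cpu, seconds, task, pid) tuple per soft lockup line."""
    hits = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            hits.append(
                (
                    int(m.group("cpu")),
                    int(m.group("secs")),
                    m.group("task"),
                    int(m.group("pid")),
                )
            )
    return hits
```

Because the pattern is searched rather than anchored, it also matches lines that carry a leading dmesg timestamp.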

CAUSE AND FIX:

A patch is provided at http://gluster.org/pipermail/gluster-users/2012-September/011355.html
That message says this is a regression introduced by commit 3b661a9 "[SCSI] fix hot unplug vs async scan race",
which is 1675b80 in the ubuntu-precise repository.

After building a kernel with this patch applied, I found that:

- Hot-plugging two inactive drives while I/O was in progress on the other drives was fine. The other drives, in an md raid0 set, continued to work without a hitch (the load was generated by bonnie++).

- Even removing those two drives while dd'ing from them was fine. I/O to the other md RAID set was also unaffected.

NOTE ABOUT TESTED KERNEL:

I built this test kernel by running apt-get install linux-source (version 3.2.0-30.48), untarring /usr/src/linux-source-3.2.0.tar.bz2, applying the patch, and then running "fakeroot make-kpkg --initrd --append-to-version=-brian-20120910 kernel-image kernel-headers"

However the deb package built is "linux-image-3.2.27-brian-20120910_3.2.27-brian-20120910-10.00.Custom_amd64.deb"
and uname -a also reports "3.2.27-brian-20120910"

I was surprised to find my kernel called 3.2.27 instead of 3.2.0, and so I wonder if there are other changes in this kernel apart from the patch I applied.
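
One possible explanation can be read off the apport data further down, which records "ProcVersionSignature: Ubuntu 3.2.0-30.48-generic 3.2.27". The sketch below splits that string into its three fields, under the assumption that the last field names the upstream stable release the Ubuntu package is based on; if that reading is right, a kernel built from the unpacked source tree would report 3.2.27 without containing anything beyond the Ubuntu source plus the patch. The function names are illustrative only.

```python
def parse_version_signature(sig):
    """Split a ProcVersionSignature string into its three fields.

    Assumed layout (not confirmed in this report):
        <distro> <package version and flavour> <upstream version>
    """
    distro, package, upstream = sig.split()
    return {"distro": distro, "package": package, "upstream": upstream}


def upstream_matches(uname_release, sig):
    """True if the uname release starts with the signature's upstream version."""
    base = uname_release.split("-", 1)[0]
    return base == parse_version_signature(sig)["upstream"]


# Values copied from this report.
signature = "Ubuntu 3.2.0-30.48-generic 3.2.27"
custom_release = "3.2.27-brian-20120910"
same_base = upstream_matches(custom_release, signature)
```

With the values from this report, the custom kernel's release string and the signature's last field agree on 3.2.27.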

ADDITIONAL SYSTEM DETAILS:

Ubuntu 12.04 x86_64 server
24 SATA drives
2 x LSI HBAs (1 x 8 port, 1 x 16 port)

03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
09:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02)

# ./sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 12.00.00.00 (2011.11.08)
Copyright (c) 2008-2011 LSI Corporation. All rights reserved

 Adapter Selected is a LSI SAS: SAS2008(B2)

Num  Ctlr           FW Ver       NVDATA       x86-BIOS     PCI Addr
----------------------------------------------------------------------------

0    SAS2008(B2)    12.00.00.00  0c.00.00.05  07.23.01.00  00:03:00:00
1    SAS2116_1(B1)  12.00.00.00  0c.00.00.01  07.23.01.00  00:09:00:00

 Finished Processing Commands Successfully.
 Exiting SAS2Flash.
---
AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 Sep 10 10:46 seq
 crw-rw---T 1 root audio 116, 33 Sep 10 10:46 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu12
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
DistroRelease: Ubuntu 12.04
HibernationDevice: RESUME=UUID=ff697d29-c635-4330-9ef9-941eebef0e01
InstallationMedia: Ubuntu-Server 11.10 "Oneiric Ocelot" - Release amd64 (20111011)
MachineType: TYAN S5510
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 LANGUAGE=en_GB:en
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.27-username-20120910 root=UUID=1939af43-cfa3-47c1-9ed6-1ca741c1a5ca ro crashkernel=384M-2G:64M,2G-:128M
ProcVersionSignature: Ubuntu 3.2.0-30.48-generic 3.2.27
RelatedPackageVersions:
 linux-restricted-modules-3.2.27-brian-20120910 N/A
 linux-backports-modules-3.2.27-brian-20120910 N/A
 linux-firmware 1.79.1
RfKill: Error: [Errno 2] No such file or directory
Tags: precise
Uname: Linux 3.2.27-brian-20120910 x86_64
UpgradeStatus: Upgraded to precise on 2012-05-22 (111 days ago)
UserGroups: adm admin cdrom dialout kvm libvirtd lpadmin plugdev sambashare
dmi.bios.date: 04/12/2012
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: V1.05a
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: S5510
dmi.board.vendor: TYAN
dmi.board.version: empty
dmi.chassis.asset.tag: empty
dmi.chassis.type: 3
dmi.chassis.vendor: empty
dmi.chassis.version: empty
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrV1.05a:bd04/12/2012:svnTYAN:pnS5510:pvrempty:rvnTYAN:rnS5510:rvrempty:cvnempty:ct3:cvrempty:
dmi.product.name: S5510
dmi.product.version: empty
dmi.sys.vendor: TYAN

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1049013

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise
Brian Candler (b-candler) wrote : AcpiTables.txt

apport information

tags: added: apport-collected
description: updated
Brian Candler (b-candler) wrote : BootDmesg.txt

apport information

Brian Candler (b-candler) wrote : CurrentDmesg.txt

apport information

Brian Candler (b-candler) wrote : IwConfig.txt

apport information

Brian Candler (b-candler) wrote : Lspci.txt

apport information

Brian Candler (b-candler) wrote : Lsusb.txt

apport information

Brian Candler (b-candler) wrote : ProcCpuinfo.txt

apport information

Brian Candler (b-candler) wrote : ProcInterrupts.txt

apport information

Brian Candler (b-candler) wrote : ProcModules.txt

apport information

Brian Candler (b-candler) wrote : UdevDb.txt

apport information

Brian Candler (b-candler) wrote : UdevLog.txt

apport information

Brian Candler (b-candler) wrote : WifiSyslog.txt

apport information

Brian Candler (b-candler) wrote :

Note: apport information was from system running the patched kernel

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
Joseph Salisbury (jsalisbury) wrote :

Can you provide some information on the status of the patch with regard to getting it merged upstream? Has it been sent upstream, what sort of feedback has it received, and is it being applied to a subsystem maintainer's tree?

Brian Candler (b-candler) wrote :

I just asked on gluster-users, where I got the original patch. The response was:

"Patch is not been applied to subsystem maintainer's tree yet, James
may busy with other staff, you can send mail to James
<email address hidden> & linux scsi
<email address hidden> push this bug fix to be include in to
mainline."

Brian Candler (b-candler) wrote :

The patch is now in the mainline kernel git master branch: commit bc3f02a795d3b4faa99d37390174be2a75d091bd

Tais P. Hansen (taisph) wrote :

This problem is wreaking havoc on our servers using iSCSI+Multipath.

[847112.700047] BUG: soft lockup - CPU#7 stuck for 23s! [kworker/u:4:8911]
[847112.703572] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables kvm_intel kvm scsi_dh_emc ext2 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bridge 8021q garp stp bonding dm_round_robin i5000_edac joydev edac_core i5k_amb radeon ioatdma dca lp ttm drm_kms_helper shpchp parport drm i2c_algo_bit mac_hid psmouse serio_raw dcdbas dm_multipath usbhid hid uas usb_storage mptsas mptscsih mptbase scsi_transport_sas bnx2
[847112.703627] CPU 7
[847112.703628] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables kvm_intel kvm scsi_dh_emc ext2 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bridge 8021q garp stp bonding dm_round_robin i5000_edac joydev edac_core i5k_amb radeon ioatdma dca lp ttm drm_kms_helper shpchp parport drm i2c_algo_bit mac_hid psmouse serio_raw dcdbas dm_multipath usbhid hid uas usb_storage mptsas mptscsih mptbase scsi_transport_sas bnx2
[847112.703671]
[847112.703674] Pid: 8911, comm: kworker/u:4 Not tainted 3.2.0-32-generic #51-Ubuntu Dell Inc. PowerEdge M600/0MY736
[847112.703679] RIP: 0010:[<ffffffff8165b049>] [<ffffffff8165b049>] _raw_spin_unlock_irqrestore+0x19/0x30
[847112.703689] RSP: 0018:ffff880045fe1d98 EFLAGS: 00000286
[847112.703691] RAX: 0000000000000286 RBX: 0000000000000008 RCX: 0000000000000008
[847112.703694] RDX: ffff8801341e8808 RSI: 0000000000000286 RDI: 0000000000000286
[847112.703696] RBP: ffff880045fe1da0 R08: ffffffff81cddaa0 R09: 0000000000000100
[847112.703699] R10: 0000000000000008 R11: 0000000000000001 R12: ffffffff81cddaa0
[847112.703701] R13: 0000000000000100 R14: 0000000000000008 R15: 0000000000000001
[847112.703704] FS: 0000000000000000(0000) GS:ffff88013fdc0000(0000) knlGS:0000000000000000
[847112.703707] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[847112.703709] CR2: 00007f2c5737dc62 CR3: 0000000001c05000 CR4: 00000000000026e0
[847112.703712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[847112.703714] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[847112.703717] Process kworker/u:4 (pid: 8911, threadinfo ffff880045fe0000, task ffff88013523dc00)
[847112.703719] Stack:
[847112.704002] ffff88012b088930 ffff880045fe1dd0 ffffffff8142f834 ffff88012b088930
[847112.704002] ffff88012b088800 ffff88012b088818 0000000000000000 ffff880045fe1e00
[847112.704002] ffffffffa02aefec ffff88012b088880 ffff88009374a900 ffff88012b6c6000
[847112.704002] Call Trace:
[847112.704002] [<ffffffff8142f834>] scsi_remove_target+0xb4/0xe0
[847112.704002] [<ffffffffa02aefec>] __iscsi_unbind_session+0xbc/0x190 [scsi_transport_iscsi]
[847112...
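
The visible part of the trace above passes through scsi_remove_target, called from __iscsi_unbind_session. To check other captures for the same path, the frames under "Call Trace:" can be pulled out with a short script. This is a minimal sketch that handles only the frame format shown in this log; the function name call_trace_frames is illustrative.

```python
import re

# Matches frames in the format shown in this log, for example:
#   [<ffffffff8142f834>] scsi_remove_target+0xb4/0xe0
#   [<ffffffffa02aefec>] __iscsi_unbind_session+0xbc/0x190 [scsi_transport_iscsi]
FRAME = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def call_trace_frames(lines):
    """Return (symbol, module) pairs for frames after a 'Call Trace:' line.

    module is None when the frame carries no trailing [module] tag.
    Lines before 'Call Trace:' (such as the RIP line) are ignored.
    """
    frames = []
    in_trace = False
    for line in lines:
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        m = FRAME.search(line)
        if m:
            frames.append((m.group("symbol"), m.group("module")))
    return frames


def passes_through(lines, symbol):
    """True if any call trace frame in lines names the given symbol."""
    return any(name == symbol for name, _ in call_trace_frames(lines))
```

Truncated lines such as the last one above simply fail to match and are skipped.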

