cciss: hpacucli "ctrl slot=0 create type=ld drives=2:4" hangs, spews call trace in dmesg

Bug #1006212 reported by Paul Collins
This bug affects 10 people
Affects          Status      Importance   Assigned to   Milestone
linux (Ubuntu)   Incomplete  Medium       Unassigned
Precise          Won't Fix   Medium       Unassigned

Bug Description

On a Hewlett-Packard ProLiant DL385 G1 running precise, linux-image-3.2.0-24-generic 3.2.0-24.39, when I issue "ctrl slot=0 create type=ld drives=2:4" via hpacucli, the command hangs. In dmesg I find the following (full dmesg attached).

[ 482.228046] INFO: task .hpacucli:1384 blocked for more than 120 seconds.
[ 482.249879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 482.275399] .hpacucli D ffffffff81806240 0 1384 1367 0x00020000
[ 482.275413] ffff8801f7c6daf8 0000000000000082 ffffc90000806000 fde97dc8b0a15c03
[ 482.275431] ffff8801f7c6dfd8 ffff8801f7c6dfd8 ffff8801f7c6dfd8 0000000000013780
[ 482.275449] ffffffff81c0d020 ffff8801f81e44d0 ffffffffa004fd40 ffffffffa004fd40
[ 482.275466] Call Trace:
[ 482.275486] [<ffffffff8165a88f>] schedule+0x3f/0x60
[ 482.275495] [<ffffffff8165b697>] __mutex_lock_slowpath+0xd7/0x150
[ 482.275504] [<ffffffff8165b2aa>] mutex_lock+0x2a/0x50
[ 482.275517] [<ffffffffa0038ebe>] cciss_unlocked_open+0x2e/0xd0 [cciss]
[ 482.275528] [<ffffffff811b0212>] __blkdev_get+0xd2/0x460
[ 482.275538] [<ffffffff8108abc7>] ? bit_waitqueue+0x17/0xc0
[ 482.275546] [<ffffffff811b05fe>] blkdev_get+0x5e/0x1e0
[ 482.275556] [<ffffffff812fb152>] register_disk+0x162/0x180
[ 482.275564] [<ffffffff812fb224>] add_disk+0xb4/0x230
[ 482.275574] [<ffffffffa003a651>] cciss_add_disk+0x141/0x1b0 [cciss]
[ 482.275584] [<ffffffffa003fbbf>] cciss_update_drive_info+0x3cf/0x490 [cciss]
[ 482.275595] [<ffffffffa0040252>] rebuild_lun_table+0x282/0x3a0 [cciss]
[ 482.275605] [<ffffffff8113dca8>] ? handle_mm_fault+0x1f8/0x350
[ 482.275615] [<ffffffffa0040def>] cciss_ioctl+0x29f/0x3e0 [cciss]
[ 482.275625] [<ffffffffa0040f76>] do_ioctl+0x46/0x70 [cciss]
[ 482.275635] [<ffffffffa00412fe>] cciss_compat_ioctl+0x1e/0xd8 [cciss]
[ 482.275645] [<ffffffff81309ecd>] compat_blkdev_ioctl+0x32d/0x4b0
[ 482.275655] [<ffffffff811c838d>] compat_sys_ioctl+0xad/0x240
[ 482.275665] [<ffffffff81667470>] cstar_dispatch+0x7/0x2e
[ 602.272039] INFO: task .hpacucli:1384 blocked for more than 120 seconds.
[ 602.294416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 602.318038] .hpacucli D ffffffff81806240 0 1384 1367 0x00020000
[ 602.318044] ffff8801f7c6daf8 0000000000000082 ffffc90000806000 fde97dc8b0a15c03
[ 602.318050] ffff8801f7c6dfd8 ffff8801f7c6dfd8 ffff8801f7c6dfd8 0000000000013780
[ 602.318054] ffffffff81c0d020 ffff8801f81e44d0 ffffffffa004fd40 ffffffffa004fd40
[ 602.318059] Call Trace:
[ 602.318076] [<ffffffff8165a88f>] schedule+0x3f/0x60
[ 602.318084] [<ffffffff8165b697>] __mutex_lock_slowpath+0xd7/0x150
[ 602.318092] [<ffffffff8165b2aa>] mutex_lock+0x2a/0x50
[ 602.318104] [<ffffffffa0038ebe>] cciss_unlocked_open+0x2e/0xd0 [cciss]
[ 602.318114] [<ffffffff811b0212>] __blkdev_get+0xd2/0x460
[ 602.318123] [<ffffffff8108abc7>] ? bit_waitqueue+0x17/0xc0
[ 602.318131] [<ffffffff811b05fe>] blkdev_get+0x5e/0x1e0
[ 602.318140] [<ffffffff812fb152>] register_disk+0x162/0x180
[ 602.318147] [<ffffffff812fb224>] add_disk+0xb4/0x230
[ 602.318157] [<ffffffffa003a651>] cciss_add_disk+0x141/0x1b0 [cciss]
[ 602.318167] [<ffffffffa003fbbf>] cciss_update_drive_info+0x3cf/0x490 [cciss]
[ 602.318177] [<ffffffffa0040252>] rebuild_lun_table+0x282/0x3a0 [cciss]
[ 602.318186] [<ffffffff8113dca8>] ? handle_mm_fault+0x1f8/0x350
[ 602.318195] [<ffffffffa0040def>] cciss_ioctl+0x29f/0x3e0 [cciss]
[ 602.318204] [<ffffffffa0040f76>] do_ioctl+0x46/0x70 [cciss]
[ 602.318214] [<ffffffffa00412fe>] cciss_compat_ioctl+0x1e/0xd8 [cciss]
[ 602.318223] [<ffffffff81309ecd>] compat_blkdev_ioctl+0x32d/0x4b0
[ 602.318232] [<ffffffff811c838d>] compat_sys_ioctl+0xad/0x240
[ 602.318241] [<ffffffff81667470>] cstar_dispatch+0x7/0x2e
[ 722.316070] INFO: task .hpacucli:1384 blocked for more than 120 seconds.
[ 722.338386] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

... and so on until I reboot the machine. When the machine is back up and I examine the array configuration in hpacucli, the new volume is present and marked "OK".
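For anyone comparing their own logs against the trace above, here is a minimal Python sketch that pulls the hung-task reports out of saved dmesg text and lists the frames of each call trace along with the module they belong to. The function name `parse_hung_tasks` and the regular expressions are my own illustration, written against the line format shown in this report only; other kernel versions may format these lines differently.

```python
import re

# Matches e.g. "INFO: task .hpacucli:1384 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# Matches e.g. "[<ffffffffa0038ebe>] cciss_unlocked_open+0x2e/0xd0 [cciss]"
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\? )?(?P<func>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def parse_hung_tasks(dmesg_text):
    """Return one dict per hung-task report, with its reliable frames in order."""
    reports = []
    current = None
    for line in dmesg_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "task": hung["task"],
                "pid": int(hung["pid"]),
                "secs": int(hung["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and not frame["unreliable"]:
            # module is None for frames in the core kernel
            current["frames"].append((frame["func"], frame["module"]))
    return reports


if __name__ == "__main__":
    import sys

    for report in parse_hung_tasks(sys.stdin.read()):
        print(report["task"], report["pid"], "blocked >", report["secs"], "s")
        for func, module in report["frames"]:
            print("   ", func, "[%s]" % module if module else "")
```

Feeding it the output of `dmesg` on standard input prints each report with its frames, which makes it easier to check whether a hang goes through the same cciss functions as the one reported here.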

The problem is also present in kernel-ppa's v3.4-precise:

pjdc@prat:~$ cat /proc/version
Linux version 3.4.0-030400-generic (apw@gomeisa) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #201205210521 SMP Mon May 21 09:22:02 UTC 2012

so I will tag as unfixed upstream.

I am also marking as "Confirmed", being unable to run apport-collect when reproducing on 3.2, since the machine has no network due to bug #1005699.

Revision history for this message
Paul Collins (pjdc) wrote :
Changed in linux (Ubuntu Precise):
importance: Undecided → Medium
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Precise):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.4 kernel [1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-quantal/

Changed in linux (Ubuntu Precise):
status: Confirmed → Incomplete
Joseph Salisbury (jsalisbury) wrote :

@Paul,

Sorry, I see in the description you already tested the upstream kernel. Do you happen to know if this bug also exists in previous releases, such as Oneiric or Lucid?

Changed in linux (Ubuntu Precise):
status: Incomplete → Confirmed
tags: added: kernel-da-key
Paul Collins (pjdc) wrote :

Whoops, I forgot to explain I had tagged as "regression-release" because the problem is not present in lucid.

I have also tested with the following kernels from previous releases:

oneiric: Ubuntu 3.0.0-20.34-server 3.0.30 - failure
natty: Ubuntu 2.6.38-15.59-server 2.6.38.8 - failure
maverick: Ubuntu 2.6.35-32.67-server 2.6.35.14 - success

Joseph Salisbury (jsalisbury) wrote :

So it appears the issue was introduced in natty. Would you be able to test a few kernels? If so, I can perform a bisect to try and identify the commit that caused this regression.
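To illustrate the idea behind the proposed bisect, here is a minimal Python sketch of the search itself. It is only an illustration: a real kernel bisect walks the commits between the last good and first bad kernel, whereas this sketch runs over the release kernels already tested in this thread. The function name `first_bad` and the `observed` table are hypothetical; the results in the table are the ones reported above.

```python
def first_bad(versions, is_bad):
    """Binary-search an ordered list for the first failing version.

    Assumes versions are ordered oldest to newest, the first one is good,
    the last one is bad, and once the bug appears it stays in later versions.
    """
    lo, hi = 0, len(versions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Results reported in this bug (True means the hpacucli command hung).
observed = {
    "2.6.35-32.67": False,  # maverick
    "2.6.38-15.59": True,   # natty
    "3.0.0-20.34": True,    # oneiric
    "3.2.0-24.39": True,    # precise
}
kernels = list(observed)  # already ordered oldest to newest

if __name__ == "__main__":
    print(first_bad(kernels, observed.get))
```

With the results above the search lands on the natty kernel, so the interesting range for a commit-level bisect would be between 2.6.35 and 2.6.38. Each step halves the remaining range, which is why only a few test kernels are needed.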

Paul Collins (pjdc) wrote :

Can do.

penalvch (penalvch)
tags: added: kernel-bug-exists-upstream-v3.4-precise
tags: added: natty oneiric
Gary Cuozzo (ua5r) wrote :

Hello,
As a data point, I'm seeing this same issue on 3 different DL360-G5's all running Precise. 2 systems are seeing the issue with the P400i controllers and the 3rd has the issue with a P800 controller and external disk arrays. I experienced the issue while creating RAID1 sets as well as 5 disk and 10 disk RAID5 sets.

All are running with kernel 3.2.0-31-generic.

Thanks,
gary

Niklas Edmundsson (niklas-edmundsson) wrote :

Out of curiosity, which version of hpacucli are you guys using? 32bit or 64bit?

Running Precise, I see this problem on controllers using the old cciss driver (P400 for example), but not on controllers using the hpsa driver (P810 for example). This is with hpacucli 9.10 64bit.

This all worked on lucid, albeit with an earlier hpacucli release which was 32bit, but it indeed smells like a regression.

Gary Cuozzo (ua5r) wrote :

One of my systems which experiences the problem is running hpacucli v8.70-8.0. The system is 64-bit. I'm running the utility using the following command: setarch x86_64 --uname-2.6 hpacucli

Without the setarch, the utility does not detect any controllers.

Hope the information helps,
gary

Niklas Edmundsson (niklas-edmundsson) wrote :

hpacucli 8.x is only available as 32bit binaries.

9.x is available as both 32bit and 64bit binaries.

sc (soumen-chakrabarti) wrote :

Confirming the regression on Precise 12.04, but with an unpatched 3.6.2 kernel from kernel.org, using hpacucli 8.x and 9.x. It was working fine on Lucid 10.04 with kernel 2.6.32-21-server. Once hpacucli bombs, the NFS server also breaks down until the computer is rebooted.

Robstarusa (rob-naseca) wrote :

This is affecting me as well.

DL380G5, Latest bios/drivers from Feb 2013, Kernel 3.2.0-40.

BillCarlson (bill-carlson) wrote :

Same issue, hpacucli-9.40-12.0.x86_64.rpm on a related distro (*cough* Debian Wheezy).

Proliant DL365 with e200i on 1.8.6 firmware.

Note that even though the command hangs, the array is created and works after a reboot.

The new drive shows up right away, but anything that tries to access it blocks as well. In my case that was pvcreate, and the kernel started logging blocked-task messages for pvcreate too.

penalvch (penalvch)
tags: added: needs-kernel-logs
penalvch (penalvch) wrote :

Paul Collins, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

tags: added: needs-upstream-testing
removed: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Ryan Tandy (rtandy) wrote :

On a DL380 G5 with a P400 array, this seems to be working again with the lts-raring kernel (3.8.0-32-generic) and newer (e.g. lts-saucy 3.11.0-13-generic from ppa:canonical-kernel-team/ppa). Can anyone else verify that?

Steve Langasek (vorlon) wrote :

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release.

Changed in linux (Ubuntu Precise):
status: Confirmed → Won't Fix