HP cciss / SmartArray responding slowly

Bug #337419 reported by gardron
This bug affects 4 people
Affects          Status      Importance   Assigned to   Milestone
Linux            Expired     Medium
linux (Ubuntu)   Won't Fix   Undecided    Unassigned

Bug Description

Hi,

We've had a couple of HP servers running 8.10 happily for months. I did an update earlier today and noticed (using hdparm) that the drive speed went from our usual 160-170MB/s (from a 0+1 array over 4x146GB 10k rpm SAS 2.5" drives) to topping out at 35MB/s. This upgrade set the kernel to 2.6.27-11-server from our previous release of 2.6.27-7-server. Changing grub to boot using this older kernel got the drive speed back up to what we were expecting as soon as the server was rebooted.

We saw something similar with RHEL5 about a year ago, when HP released a new software version with a bug in it, on (I think) 2.6.18.

All other packages were up to date as of the time of writing.

This has been tested on the following HP kit:
DL180
DL140
DL320
DL380

Running the following RAID cards:
P200
P400
P800

Any further information required, I'm happy to provide as this is quite a performance hit on raw sequential speed.

Revision history for this message
MikeM (michaelm) wrote :

Hi,

I have seen this too using custom kernels on Hardy and on the current RC of Karmic Koala. The performance regression I have seen using the standard Linux CCISS kernel driver is a drop of about 66% in performance. The regression was introduced in 2.6.29-rc1 and is in current kernels up to and including 2.6.32-rc5. Since Karmic Koala is based on 2.6.31, it shares the same performance drop when compared with Hardy.

Since I have seen mention of Canonical trying to get Ubuntu certified on HP kit, I would expect there to be a vested interest for both Canonical and HP to resolve this problem.

I have put detailed performance tests at the following URL to highlight this issue.
http://bugzilla.kernel.org/show_bug.cgi?id=13127

In summary using the standard Hardy kernel (2.6.24-24-server) and custom kernels (using the kernel config from 2.6.24-24-server to keep the same options) listed below I see around 90MB/s while reading off a single SAS 146GB 10k RPM drive in a RAID-0 logical drive on a DL360G5 and DL380G5. From 2.6.29-rc1 through to current kernels I see around 34MB/s while doing the same test.

The following give around 90MB/s read of the logical drive:
2.6.24-24-server
2.6.24.7
2.6.25.20
2.6.26.8
2.6.27.37
2.6.28.10

The following give around 34MB/s read of the same logical drive:
2.6.29-rc1
2.6.29
2.6.29.1 through 2.6.29.6
2.6.30.9
2.6.31.4
2.6.32-rc5

The test I performed on all the above was a simple:
dd if=/dev/cciss/c0d1 of=/dev/null bs=1024k count=1024
dd if=/dev/cciss/c0d1 of=/dev/null bs=1024k count=2048
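
For anyone who wants to repeat the measurement without dd, here is a minimal Python sketch of the same sequential read. The function names are illustrative and not from this thread. Like the plain dd invocation above, it reads through the page cache, so a repeated run over the same blocks may report an inflated figure.

```python
import time


def sequential_read(path, block_size=1024 * 1024, count=1024):
    """Read up to `count` blocks of `block_size` bytes from `path`.

    Roughly what `dd if=path of=/dev/null bs=1024k count=1024` does.
    Returns (bytes_read, elapsed_seconds).
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 opens a raw file object, so Python adds no buffer of its own
    with open(path, "rb", buffering=0) as dev:
        for _ in range(count):
            chunk = dev.read(block_size)
            if not chunk:  # end of device or file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, elapsed


def throughput_mb_s(bytes_read, seconds):
    """Throughput in decimal megabytes per second, the unit dd prints."""
    return bytes_read / seconds / 1e6
```

On a real system this would be called as `sequential_read("/dev/cciss/c0d1")` with root privileges, and the two return values passed to `throughput_mb_s`.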

Performance from Hardy:
--------------------------
Wed Oct 21 16:40:54 BST 2009
Linux uk-ditnas902 2.6.24-24-server #1 SMP Fri Sep 18 16:47:05 UTC 2009 x86_64
GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.83 s, 90.8 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.6072 s, 91.0 MB/s
--------------------------

Performance from current Karmic Koala:
---------------------
Linux uk-ditnas903 2.6.31-14-server #48-Ubuntu SMP Fri Oct 16 15:07:34 UTC 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0771 s, 34.6 MB/s
------
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.1101 s, 34.5 MB/s
---------------------
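
To put a number on the gap between the two runs above, a small Python sketch can parse the dd summary lines and compute the relative drop. The helper names are made up for illustration; the two input lines are copied from the output above.

```python
import re

# Matches the GNU dd summary line, e.g.
# "1073741824 bytes (1.1 GB) copied, 11.83 s, 90.8 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes .*copied, ([\d.]+) s")


def parse_dd_summary(line):
    """Return (bytes_copied, seconds) from a dd summary line."""
    match = DD_SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a dd summary line: {line!r}")
    return int(match.group(1)), float(match.group(2))


def mb_per_s(line):
    """Recompute throughput in decimal MB/s from bytes and seconds."""
    nbytes, seconds = parse_dd_summary(line)
    return nbytes / seconds / 1e6


def percent_drop(before, after):
    """Relative slowdown of `after` versus `before`, in percent."""
    return (before - after) / before * 100.0


hardy = "1073741824 bytes (1.1 GB) copied, 11.83 s, 90.8 MB/s"
karmic = "1073741824 bytes (1.1 GB) copied, 31.0771 s, 34.6 MB/s"
drop = percent_drop(mb_per_s(hardy), mb_per_s(karmic))
print(f"{drop:.1f}% slower")
```

For these two 1 GiB runs the drop works out to roughly 62%.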

MikeM (michaelm)
tags: added: karmic
tags: added: cciss performance regression-potential
tags: added: hp
removed: regression-potential
MikeM (michaelm)
Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
MikeM (michaelm) wrote :

dmesg from karmic on DL380g5 that showed the slower performance

Revision history for this message
MikeM (michaelm) wrote :

lspci -vvnn from karmic on DL380g5 that showed the slower performance

Revision history for this message
MikeM (michaelm) wrote :

uname -a from karmic on DL380g5 that showed the slower performance

Revision history for this message
MikeM (michaelm) wrote :

/proc/version_signature from karmic on DL380g5 that showed the slower performance

Revision history for this message
MikeM (michaelm) wrote :

# cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array P400 Controller
Board ID: 0x3234103c
Firmware Version: 5.20
IRQ: 37
Logical drives: 2
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 5
Max # commands on controller since init: 159
Max SG entries since init: 31
Sequential access devices: 0

cciss/c0d0: 146.77GB RAID 0
cciss/c0d1: 146.77GB RAID 0

As reported on bugzilla.kernel.org, I have seen this on a DL360G5 with latest HP firmware and on older HP firmware.

Changed in linux:
status: Unknown → Confirmed
Revision history for this message
Maxym (max-kutsevol) wrote :

Any luck with it?

Revision history for this message
MikeM (michaelm) wrote :

Maxym,

still awaiting a resolution from HP I'm afraid.

My thoughts remain as I previously posted on the kernel.org bug posting:
Anyone using a Smart Array controller should, in my opinion, do thorough
testing prior to using a 2.6.29 or more recent kernel for production work
loads. At a minimum the latest 2.6.28 kernel's performance should be compared
with the more recent kernel being considered.

None of my Linux based HP servers are running newer than 2.6.28. Hopefully this will be resolved before I am forced to upgrade from 8.04LTS...

Revision history for this message
Maxym (max-kutsevol) wrote :

MikeM,
Thanks for your reply. On the E200i I'm having problems on all kernels from 2.6.18 through 2.6.31, even on CentOS 5.4 with the binary driver from hp.com installed.

Revision history for this message
MikeM (michaelm) wrote :

Hi,

For me, this bug relates specifically to a performance regression between 2.6.28 and 2.6.29.

The E200i is an entry level controller. Unless you have the battery backed write cache (BBWC) enabler installed you will see "poor" write performance. The read performance on the E200i is not brilliant either. You don't mention what problems you are seeing, but I suspect you are simply seeing "poor" performance - this is probably due to over optimistic expectations from the E200i.

If you are having functionality problems I would recommend opening a new bug. If you are seeing "poor" performance across all Linux kernels when compared with, say, Windows, then maybe contact HP. If you are seeing a specific performance regression between 2.6.28 and 2.6.29 then post your results in this bug thread.

Sorry I cannot be more help.

Revision history for this message
MikeM (michaelm) wrote :

This performance regression saga continues....

I installed Ubuntu Lucid Lynx rc2 which is based on 2.6.32-something. The performance is as expected (89/90MB/s) doing my simple dd test.

I downloaded 2.6.32.7 from kernel.org and compiled a kernel using the Ubuntu config file as a starting point (added bnx driver as compiled in rather than a module and enabled sysfs deprecated support).

I installed the resulting .deb and booted up (ignoring the sysfs warning...). My dd test (on the Lucid Lynx system running my 2.6.32.7 kernel) gave around 90MB/s - so performance is as expected.

I copied the compiled .deb to an Ubuntu 8.04 system and installed it. The dd test (on the Hardy Heron system running the same 2.6.32.7 kernel as above) gave around 21MB/s.

I have performed these tests on two different HP DL360G5s just in case there is some oddity with the hardware.

So based on this new information, I am now not 100% sure that the performance regression lies with the cciss driver itself, but could perhaps be a result of some kernel parameters that the Ubuntu distribution tweaks.

I compared the hardy and lucid based systems' sysctl, /proc/sys and /sys parameters. Nothing in there jumped out as being obviously related to this issue.
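
A comparison like this can be scripted. Below is a minimal Python sketch that diffs two saved `sysctl -a` dumps, assuming the usual `key = value` output format. The function names and the sample values in the usage note are hypothetical, not taken from the systems in this thread.

```python
def parse_sysctl(text):
    """Parse `sysctl -a` style output ("key = value" lines) into a dict."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that have no '=' at all
            params[key.strip()] = value.strip()
    return params


def diff_params(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs.

    A key present on only one side is reported with None for the other.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }
```

For example, with made-up values, `diff_params(parse_sysctl("vm.dirty_ratio = 10"), parse_sysctl("vm.dirty_ratio = 20"))` reports `vm.dirty_ratio` with the pair `("10", "20")`. The same approach could be applied to values collected from /proc/sys and /sys.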

I am still at a loss and really wish I could get 2.6.32 running on a Hardy based Ubuntu system with decent cciss driver performance.

Revision history for this message
miki (gregor-ibic) wrote :

Hi,
I performed a lot of tests on HP ProLiant hardware and RAID controllers. I installed native Linux and Linux on top of ESXi 4. The performance problem is the same, but a little more obscure, because linear copy works fine; only application servers are problematic. If I downgrade the kernel to 2.6.29 I get decent, but not top, performance, but on kernels higher than 2.6.30 I get horrible performance.
So maybe the problem is in the ext3 file system and changes to it.

Revision history for this message
gardron (gardron) wrote :

It's not ext3 specific, all our testing has been done using XFS as the only thing we use extX for is /boot

Revision history for this message
MikeM (michaelm) wrote :

Hi,

I agree that it is not file system specific as I was testing against the block device:
dd if=/dev/cciss/c0d1 of=/dev/null bs=1024k count=1024

Furthermore, the cciss driver (which this bug is about) will not be used by Linux when running on ESX. Changing the Linux kernel version will not change the cciss driver used by ESX. There are plenty of threads relating to SmartArray controllers' performance when used without BBWC - maybe that is the problem you are seeing? Or you are asking for too many IOPs (especially many small IOs when your application is running) from your disk subsystem.

Regards,
Mike

Revision history for this message
gardron (gardron) wrote :

Again, all our testing was done on P400/P800 cards with BBWC and P200 without BBWC so I don't see this being the cause.

The IOPs I can't see being an issue as it's something I'm typically seeing on smaller amounts of larger files. Larger amounts of smaller files seem ok for speed - similarly, large amounts of IOPs from mysql writes are at the level I'd be expecting. It just seems to be (or at least is more pronounced) on large transfers with small IOPs

Revision history for this message
steve (steven-drake) wrote :

Have had this problem with the cciss driver since 2006. Try changing the value in read_ahead_kb (on SLES10 x86 this is located at /sys/block/cciss_cxdx/queue) and see if performance doesn't return. Try 256.
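
The read-ahead suggestion above could be tried with something like the following Python sketch. It assumes the logical drive appears under /sys/block with the '/' in its name replaced by '!' (for example `cciss!c0d1`), which is worth confirming with `ls /sys/block` on the machine in question. The function names are illustrative, and writing the value needs root.

```python
from pathlib import Path


def sysfs_name(device):
    """Map a device such as /dev/cciss/c0d1 to its assumed sysfs name.

    Assumption: sysfs replaces '/' in the disk name with '!'.
    """
    name = device[len("/dev/"):] if device.startswith("/dev/") else device
    return name.replace("/", "!")


def read_ahead_path(device, sysfs_root="/sys/block"):
    """Path of the read_ahead_kb tunable for `device`."""
    return Path(sysfs_root) / sysfs_name(device) / "queue" / "read_ahead_kb"


def get_read_ahead_kb(device, sysfs_root="/sys/block"):
    """Return the current read-ahead size in KB."""
    return int(read_ahead_path(device, sysfs_root).read_text().strip())


def set_read_ahead_kb(device, kb, sysfs_root="/sys/block"):
    """Write a new read-ahead size in KB (requires root on a real system)."""
    read_ahead_path(device, sysfs_root).write_text(f"{kb}\n")
```

Calling `set_read_ahead_kb("/dev/cciss/c0d1", 256)` and rerunning the dd test would show whether the value suggested above makes a difference. The setting does not persist across reboots.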

Changed in linux:
importance: Unknown → Medium
Revision history for this message
Brad Figg (brad-figg) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Confirmed → Won't Fix
Changed in linux:
status: Confirmed → Expired