linux-image-2.6.24-* w/ aic79xx crash under heavy I/O load

Bug #238118 reported by Konstantin Lepikhov
Affects                        Status         Importance   Assigned to     Milestone
linux (Ubuntu)                 Fix Released   Medium       Unassigned
  Gutsy                        Invalid        Undecided    Unassigned
  Hardy                        Fix Released   Medium       Stefan Bader
linux-source-2.6.22 (Ubuntu)   Invalid        Medium       Stefan Bader
  Gutsy                        Fix Released   Medium       Stefan Bader
  Hardy                        Invalid        Undecided    Unassigned

Bug Description

Binary package hint: linux-source-2.6.24

Affected system:

Description: Ubuntu 7.10
Release: 7.10

Description: Ubuntu 8.04
Release: 8.04

Affected kernel flavours: any kernel build from 2.6.24-source

See the detailed problem description in the LKML archive:

http://www.ussg.iu.edu/hypermail/linux/kernel/0803.1/0201.html

Upstream fix:

commit e88a0c2ca81207a75afe5bbb8020541dabf606ac
Author: James Bottomley <email address hidden>
Date: Sun Mar 9 11:57:56 2008 -0500

    drivers: fix dma_get_required_mask

This bug also affects 2.6.22 kernels.

Konstantin Lepikhov (lakostis) wrote :
Leann Ogasawara (leannogasawara) wrote :

Hi lakostis,

Thanks for the report and the upstream git commit id. It's very helpful. I'll reassign this to the kernel team and open a nomination for a Hardy SRU (stable release update). This patch is already in the upcoming Intrepid Ibex 8.10 kernel. Thanks.

Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → ubuntu-kernel-team
importance: Undecided → Medium
status: New → Triaged
status: Triaged → Fix Committed
Stefan Bader (smb)
Changed in linux:
assignee: ubuntu-kernel-team → stefan-bader-canonical
status: Triaged → In Progress
Stefan Bader (smb)
Changed in linux-source-2.6.22:
assignee: nobody → stefan-bader-canonical
importance: Undecided → Medium
status: New → In Progress
Leann Ogasawara (leannogasawara) wrote :

Just adding a note I've also added Gutsy SRU nominations that Stefan is working on. Thanks.

Changed in linux:
status: New → Invalid
Changed in linux-source-2.6.22:
assignee: nobody → stefan-bader-canonical
status: New → In Progress
status: In Progress → Invalid
assignee: stefan-bader-canonical → nobody
importance: Medium → Undecided
importance: Undecided → Medium
Stefan Bader (smb) wrote :

SRU justification (Gutsy + Hardy):

Impact: Hard lockups under heavy (IMO merely elevated) I/O load. The
reason is that dma_get_required_mask returns a wrong value, which
confuses the memory management.

Fix: The upstream patch (already in Intrepid) no longer limits the
result to the current DMA mask when a larger mask is actually required.

Testcase: According to the kernel bug report, running gzip was
sufficient; it hung within a few minutes.
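
To illustrate the impact described above, here is a minimal Python sketch of the mask calculation. It is a simplified model and not the kernel code: the function names, the 4 KiB page size and the example values are assumptions, and the "clamped" variant assumes (as the Fix note suggests) that the old code limited the result to the device's current DMA mask.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def covering_mask(max_addr: int) -> int:
    """Return the smallest all-ones mask (2**n - 1) that covers max_addr."""
    return (1 << max_addr.bit_length()) - 1


def required_mask_clamped(max_pfn: int, current_dma_mask: int) -> int:
    """Hypothetical model of the faulty behaviour: the result is limited
    to whatever mask the device is currently using."""
    highest_addr = (max_pfn - 1) << PAGE_SHIFT
    return covering_mask(highest_addr) & current_dma_mask


def required_mask_unclamped(max_pfn: int) -> int:
    """Hypothetical model of the fixed behaviour: the result depends only
    on the highest physical address in the system."""
    highest_addr = (max_pfn - 1) << PAGE_SHIFT
    return covering_mask(highest_addr)


MASK_32BIT = 0xFFFFFFFF
max_pfn_8gib = (8 * 1024**3) >> PAGE_SHIFT  # example: a machine with 8 GiB of RAM

print(hex(required_mask_clamped(max_pfn_8gib, MASK_32BIT)))  # 0xffffffff
print(hex(required_mask_unclamped(max_pfn_8gib)))            # 0x1ffffffff
```

In this model, a device that still has a 32-bit mask set is told that 32 bits are enough even though memory extends beyond 4 GiB, so a driver that uses the answer to pick its addressing mode would pick one that is too narrow.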

Changed in linux-source-2.6.22:
assignee: nobody → stefan-bader-canonical
importance: Undecided → Medium
status: New → In Progress
Stefan Bader (smb) wrote :

Committed to ubuntu-gutsy as: 509dc390d293330ff9c1cbdfa0706ea2609ebef1

Changed in linux-source-2.6.22:
status: In Progress → Fix Committed
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

Committed to ubuntu-hardy as: 2595000a54e3c647cf35bb550c3f3a2699237c90

Changed in linux:
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

Re-setting to "In Progress" for SRU process reasons.

Changed in linux:
status: Fix Committed → In Progress
Stefan Bader (smb)
Changed in linux:
status: In Progress → Fix Committed
Timm Essigke (essigke) wrote :

Description: Ubuntu 8.04
Release: 8.04

Affected kernel flavors: 2.6.24-19-xen version 3.2.1-rc1-pre (buildd@buildd) (gcc version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)) Fri Apr 11 01:12:53 UTC 2008

Bug remains after patch of Xen kernel!

I installed the kernel from bug #218126 (http://www.il.is.s.u-tokyo.ac.jp/~hiranotaka/), but experienced crashes of the aic7xxx module. Therefore, I recompiled 2.6.24-19 xen with the above patch. Even light I/O (like starting aptsh) causes the hard lockup. With the original 2.6.24-19-server kernel (without the patch) I was not able to provoke the bug. With the Xen kernel it is 100% reproducible. I attached a full console dump from boot to the crash.

The system is a 2.8 GHz Xeon (single CPU with hyper-threading, 32bit), 512 MB RAM.

Help fixing this bug is highly appreciated.

Timm

root@xenserver2:~# aptsh
Generating and mapping caches...
[ 141.861949] PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
[ 141.862176] ------------[ cut here ]------------
[ 141.862309] kernel BUG at /home/essigke/linux-2.6.24/debian/build/custom-source-xen/drivers/scsi/aic7xxx/aic79xx_osm.c:1490!
[ 141.862464] invalid opcode: 0000 [#1] SMP
[ 141.862795] Modules linked in: 8021q bridge ipv6 iptable_filter ip_tables x_tables lp loop container serio_raw 8250_pnp button pata_acpi sworks_agp i2c_piix4 evdev 8250 serial_core agpgart parport_pc parport i2c_core psmouse pcspkr ext3 jbd mbcache ohci_hcd usbcore sr_mod cdrom sd_mod osst st ch sg pata_serverworks floppy mptfc mptscsih mptbase scsi_transport_fc scsi_tgt e1000 aic79xx ata_generic tg3 ssb libata aic7xxx scsi_transport_spi scsi_mod dm_mirror dm_snapshot dm_mod thermal processor fan fuse
[ 141.869315]
[ 141.869438] Pid: 51, comm: kblockd/0 Not tainted (2.6.24-19-xen #1)
[ 141.869568] EIP: 0061:[<de9f7951>] EFLAGS: 00010082 CPU: 0
[ 141.869739] EIP is at ahd_linux_queue+0x661/0x670 [aic79xx]
[ 141.869871] EAX: fffffff4 EBX: dbb6dee6 ECX: dd6b4080 EDX: 00000002
[ 141.870003] ESI: dd48acbe EDI: dd73704a EBP: dd56e000 ESP: dbb6de90
[ 141.870134] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
[ 141.870263] Process kblockd/0 (pid: 51, ti=dbb6c000 task=dbb5e270 task.ti=dbb6c000)
[ 141.870398] Stack: dd613bfc dd48ac80 01635dfb 00000000 0003f000 c0208a9d c045f020 0e325bf9
[ 141.871479] 00000021 0e325bf9 dd488ca8 dd48ac80 dd56e000 dade0e48 c0130a67 de8cb6a0
[ 141.872557] dd49f4c0 c049ce80 fffffff4 00000001 00000000 07200000 dd48ac80 00000000
[ 141.873629] Call Trace:
[ 141.873903] [<c0208a9d>] cfq_dispatch_requests+0x6d/0x2f0
[ 141.874151] [<c0130a67>] lock_timer_base+0x27/0x60
[ 141.874389] [<de8cb6a0>] scsi_times_out+0x0/0x80 [scsi_mod]
[ 141.874650] [<de8c7da7>] scsi_dispatch_cmd+0x147/0x280 [scsi_mod]
[ 141.874904] [<de8ce1ac>] scsi_request_fn+0x1fc/0x3e0 [scsi_mod]
[ 141.875155] [<c0138780>] worker_thread+0x0/0xe0
[ 141.875397] [<c0200405>] __generic_unplug_device+0x25/0x30
[ 141.875639] [<c0201265>] generic_unplug_device+0x15/0x50
[ 141.875880] [<c02023a2>] blk_unplug_work+0x42/0xa0
[ 141.876132] [<c0202360>] blk_unplug_work+0x0/0xa0
[ 141.876367] [<c0137c83>] run_workqueue+0x93/0x160
[ 141.876608] [<c0138780>] ...


Steve Langasek (vorlon) wrote :

Accepted into -proposed, please test and give feedback here. Please see https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Timm Essigke (essigke) wrote :

Dear Steve,

thank you for picking up my problem. Unfortunately the new linux-image does not fix the bug. I attached the console output.

Thank you for your help!

Timm

Goro (ggoro) wrote :

This patch is included in kernel 2.6.24-20.37; from your log I can see that you are running 2.6.24-19.x.
The new kernel is available for testing in -proposed, but the linux meta package does not pull it in automatically, so you have to install it yourself.

You can see the bug fix mentioned in the changelog:

https://launchpad.net/ubuntu/hardy/+source/linux/2.6.24-20.37
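
A quick way to make the same comparison is to parse the ABI number out of the running kernel's release string (as printed by uname -r). This is only a sketch: the helper names are hypothetical, and the ABI number says nothing about whether a custom-built kernel carries the patch.

```python
import re

# ABI of the first Hardy kernel whose changelog mentions the fix (2.6.24-20.37).
FIXED_ABI = (2, 6, 24, 20)


def kernel_abi(release: str) -> tuple:
    """Extract (major, minor, patch, abi) from a string such as '2.6.24-19-xen'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def at_least_fixed_abi(release: str) -> bool:
    """True if the release string is at or beyond the ABI listed above."""
    return kernel_abi(release) >= FIXED_ABI


print(at_least_fixed_abi("2.6.24-19-xen"))  # False
print(at_least_fixed_abi("2.6.24-20-xen"))  # True
```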

Timm Essigke (essigke) wrote :

I get the same problem with 2.6.24-20.37.

Thanks,

Timm

Steve Langasek (vorlon) wrote :

marked as 'verification-failed', and rolled back to 'confirmed'. Looks like more investigation is needed.

Changed in linux:
status: Fix Committed → Confirmed
Stefan Bader (smb) wrote :

I have put up two -xen kernels at http://people.ubuntu.com/~smb/bug238118/
While they won't fix the problem, I'd really like to see what the driver tries to settle on for the DMA size (the messages should be prefixed with xxx).
Also I saw that the xen iommu implementation seems to support swiotlb=<nrpages> and dma_bits=<bits (default 32)> which might be options to play around with.

Timm Essigke (essigke) wrote :

I deinstalled linux-image-2.6.20-xen with dpkg -r and installed your kernel with dpkg -i

ls -l /root/linux-image-2.6.24-20-xen_2.6.24-20.37_i386.deb
-rw-r--r-- 1 root root 18747646 2008-07-23 22:21 /root/linux-image-2.6.24-20-xen_2.6.24-20.37_i386.deb

I was a bit confused because the kernel is much older than the deb:
ls -l /boot/vmlinuz-2.6.24-20-xen
-rw-r--r-- 1 root root 1732665 2008-07-18 01:06 /boot/vmlinuz-2.6.24-20-xen

uname -a
Linux xenserver2 2.6.24-20-xen #1 SMP Thu Jul 17 23:00:58 UTC 2008 i686 GNU/Linux

and also the modules

ls -l /lib/modules/2.6.24-20-xen/kernel/drivers/scsi/aic7xxx/
-rw-r--r-- 1 root root 356352 2008-07-18 01:07 aic79xx.ko
-rw-r--r-- 1 root root 220892 2008-07-18 01:07 aic7xxx.ko

I also do not get the message with XXX, but

[ 132.833341] PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
[ 132.833565] ------------[ cut here ]------------
[ 132.833694] kernel BUG at /home/smb/hardy-i386/ubuntu-2.6/debian/build/custom-source-xen/drivers/scsi/aic7xxx/aic79xx_os!
[ 132.833851] invalid opcode: 0000 [#1] SMP
[ 132.834218] Modules linked in: 8021q bridge ipv6 iptable_filter ip_tables x_tables lp loop evdev 8250_pnp container parpe
[ 132.904801]
[ 132.904926] Pid: 4882, comm: kjournald Not tainted (2.6.24-20-xen #1)
[ 132.908861] EIP: 0061:[<de9df951>] EFLAGS: 00010082 CPU: 0
[ 132.929660] EIP is at ahd_linux_queue+0x661/0x670 [aic79xx]
[ 132.946306] EAX: fffffff4 EBX: dcfd7d2a ECX: dcf0a200 EDX: 00000006
[ 132.962920] ESI: dd4f4cbe EDI: dd59320a EBP: dbaa0000 ESP: dcfd7cd4
[ 132.979601] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
[ 133.000383] Process kjournald (pid: 4882, ti=dcfd6000 task=dba69210 task.ti=dcfd6000)
[ 133.016995] Stack: dd6fdb54 dd4f4c80 014d3523 00000000 00080000 c0208a3d 00000000 00000000
[ 133.062812] 00000000 00008000 dd4b94a8 dd4f4c80 dbaa0000 db2b3e48 c0130a67 de8cb6a0
[ 133.087893] db8fff40 c049ce80 fffffff4 00000001 00000000 10200000 dd4f4c80 00000000
[ 133.112828] Call Trace:
[ 133.113086] [<c0208a3d>] cfq_dispatch_requests+0x6d/0x2f0
[ 133.129685] [<c0130a67>] lock_timer_base+0x27/0x60
[ 133.146212] [<de8cb6a0>] scsi_times_out+0x0/0x80 [scsi_mod]
[ 133.158699] [<de8c7da7>] scsi_dispatch_cmd+0x147/0x280 [scsi_mod]
[ 133.175333] [<de8ce1ac>] scsi_request_fn+0x1fc/0x3e0 [scsi_mod]
[ 133.196209] [<c0130de4>] del_timer+0x64/0x80
[ 133.212851] [<c02003a5>] __generic_unplug_device+0x25/0x30
[ 133.225327] [<c0201cc8>] __make_request+0xb8/0x6f0
[ 133.242081] [<c01fe015>] generic_make_request+0x235/0x4d0
[ 133.258698] [<c018180e>] kmem_cache_alloc+0xee/0x100
[ 133.275326] [<c01fe31c>] submit_bio+0x6c/0x100
[ 133.287802] [<c0108333>] sched_clock+0x23/0x70
[ 133.304443] [<c01aadc1>] bio_alloc_bioset+0x81/0x150
[ 133.316922] [<c01a9ca0>] end_buffer_write_sync+0x0/0x70
[ 133.329401] [<c01a6df7>] submit_bh+0xd7/0x120
[ 133.346041] [<dea6ebba>] journal_do_submit_data+0x2a/0x40 [jbd]
[ 133.358549] [<dea6f943>] journal_commit_transaction+0xd53/0xda0 [jbd]
[ 133.379326] [<c0130a67>] lock_timer_base+0x27/0x60
[ 133.396089] [<c0130ae5>] try_to_del_timer_sync+0x45/0x50
[ 133.412742] [<dea72760>] kjournald+0xa0/0x200 [...


Stefan Bader (smb) wrote :

Now that I look at it this does make sense. For some odd reason I put the wrong package for i386 there (missing the smb extension for the deb I normally use). So this is some other build and looking at the build machine it seems I have to run a rebuild. Sorry for that.

Stefan Bader (smb) wrote :

I updated the original location with a fresh rebuild. If you could try that again. http://people.ubuntu.com/~smb/bug238118/linux-image-2.6.24-20-xen_2.6.24-20.37smb3_i386.deb

Timm Essigke (essigke) wrote :

Here is the result of running the new kernel. 32bit is right for this machine. Thank you!

Stefan Bader (smb) wrote :

Sorry for the late response. OK, the log shows the driver goes into 32bit DMA mode (which should be right). The question is why the SWIOMMU fails to provide DMA space. The bit size of 32 sounds valid, so the tunable to try would be swiotlb. Be aware that the unit there is a page, so using 64K there takes away 256M.
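
The arithmetic behind that warning, as a small Python sketch. It assumes 4 KiB pages (typical for i386) and takes the unit of swiotlb= to be pages, as stated above; the helper name is made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def swiotlb_mib(nr_pages: int) -> float:
    """Memory reserved for the software IOMMU, in MiB, for swiotlb=<nr_pages>."""
    return nr_pages * PAGE_SIZE / (1024 * 1024)


print(swiotlb_mib(64 * 1024))  # 256.0
```

On the 512 MB machine from the report, a value of 64K would therefore set aside half of the RAM under these assumptions.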

Timm Essigke (essigke) wrote :

I sampled swiotlb in the range from 2k to 512k in steps of a factor of 2, plus 4M as a very large value. Except for 2k, I get the same kernel bug in /home/smb/hardy-i386/ubuntu-2.6/debian/build/custom-source-xen/mm/bootmem.c:190
For 2k I get "low bootmem alloc... Kernel panic ... Out of low memory".
I couldn't find a value where the system boots normally so that I could reproduce the aic7xxx kernel bug.
Could the reason be that the system has only 512MB RAM?
If it solves the problem I can maybe convince my boss to invest in some more RAM. However, I would prefer another solution...

Thanks!

Stefan Bader (smb) wrote :

It seems a bit strange that the sw-iommu is used in the first place. IMO this is only necessary for 64bit. I am not using Xen myself so maybe the question is dumb, but anyway: is the host running exactly the same kernel as the guest? Also, have you tried the effect of swiotlb=off?

Timm Essigke (essigke) wrote :

I have the problem even without starting a guest (DomU), i.e. only the Xen hypervisor and the kernel in Dom0 are running.

With swiotlb=off I get a new kernel bug, but its message suggests using swiotlb=force. I also tried this option and it makes the system considerably more stable. I could install and uninstall apache2, but when installing texlive-full (because it is huge) the kernel crashed again.

Seems to me we are on the right track...

Timm Essigke (essigke) wrote :

I found bug #232017, which sounds very similar to me. Setting swiotlb=128 fixed the problem in that case. Here, I get an error similar to the one with swiotlb=force. For values of 64 and 16 I get "BUG: soft lockup - CPU#1 stuck for 11s!", while 16 seems a bit more stable.
With swiotlb=0, 2 or 8 I am back to the original kernel bug.
I searched for the "soft lockup" bug, but couldn't find anything useful so far. acpi=off does not help. Somewhere a kernel >= 2.6.25 was suggested... - not sure if it is worth trying to compile.
Most reports are X related, but I don't have X on the machine.

I'll keep on digging!

Martin Pitt (pitti) wrote :

One gutsy task is enough. Any progress here? This is marked verification-failed and thus sounds like a regression?

Changed in linux-source-2.6.22:
status: Fix Committed → Invalid
Stefan Bader (smb) wrote :

My impression is that this may have a different cause. The original report was for a standard kernel. The latest updates concern Xen. Unfortunately I have not had much time to make progress, but it seems the driver picks the right DMA size (32bit) while the Xen SWIOMMU may be having problems. I think I saw 24bit somewhere, but I am not sure.

Stefan Bader (smb) wrote :

The failed verification IMHO is caused by bug #247148.

Launchpad Janitor (janitor) wrote : Kernel team bugs

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Martin Pitt (pitti) wrote :

Thanks, Stefan. Timm, can you please test the current hardy-proposed kernel and report back in bug 247148, which is most probably the one that affects you?

Leann Ogasawara (leannogasawara) wrote :

As Martin pointed out, it seems that Timm's issue is different from the one originally reported here. The patch which should resolve the original bug reported here has already been released as an update. As a result, I'm marking the Hardy and Gutsy tasks as Fix Released. Again Timm, if you can follow up in bug 247148 for the issue you are seeing, that would be great. Thanks.

Changed in linux:
status: Confirmed → Fix Released
Changed in linux-source-2.6.22:
status: Fix Committed → Fix Released
Changed in linux:
status: Fix Committed → Fix Released