BUG: scheduling while atomic: archhttp64/7146/0x1000000001

Bug #235889 reported by Jean-Louis Dupond on 2008-05-29
Affects          Importance   Assigned to
linux (Ubuntu)   Medium       Colin Ian King
Hardy            Medium       Colin Ian King

Bug Description

When I start my Areca card's webconfig tool, my system locks up with:

BUG: scheduling while atomic: archhttp64/7146/0x1000000001
BUG: soft lockup - CPU#0 stuck for 11s! [archhttp64:7146]
BUG: soft lockup - CPU#0 stuck for 11s! [archhttp64:7146]
BUG: soft lockup - CPU#0 stuck for 11s! [archhttp64:7146]
And that soft lockup keeps going ...

It's just weird: I can sometimes run the webconfig tool without it crashing (usually BEFORE GNOME/X loads).
Once I have been able to load it, I can shut the tool down and restart it (even while GNOME is running) and it works ....

Ubuntu 8.04 Hardy (All updates installed)
2.6.24-16-generic #1 SMP Thu Apr 10 12:47:45 UTC 2008 x86_64 GNU/Linux
Areca 1220 PCIe card

Jean-Louis Dupond (dupondje) wrote :

2.6.24-17-generic also crashes ...

2.6.26-rc4 (from kernel.org) works perfectly

Hi Dupond,

As I noted in our IRC conversation, I believe the Ubuntu kernel team recently rebased the upcoming Intrepid Ibex 8.10 kernel onto the upstream 2.6.26-rc4 kernel from kernel.org. The Intrepid kernel is normally available for testing in the kernel-ppa (kernel personal package archive), but the latest version is still having some build issues, so it is not yet available. I'll try to update this report when it is ready for testing, and to provide instructions on how to install this newer kernel from the kernel PPA. It would be great if you could test it when it's available, just to verify this is resolved in the upcoming release. Then we can focus on trying to backport a fix for Hardy.

In the meantime, you also mentioned that this is no longer an issue with the upstream 2.6.26-rc4 kernel. It would be great to narrow down the exact patch which would need backporting. You mentioned you tried applying a few patches but were unsuccessful. Another approach you may want to try is a git bisect. That should help narrow down the exact fix. If you are not familiar with doing a git bisect, take a look at the following document:

http://www.kernel.org/doc/local/git-quick.html#bisect

The only tricky part is that you will have to switch the meanings of "good" and "bad" when performing the bisect, because it tries to isolate the patch which caused a regression rather than the patch which introduced a fix. Also, in case you didn't pull the upstream kernel source from the git tree, you can get it by doing the following:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

Let us know if you are able to narrow down the patch. Also feel free to ping me again on IRC if I forget to update this report with the instructions of how to install from the kernel-ppa. Thanks.
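
The swapped good/bad bisect described above can be rehearsed on a throwaway repository before spending time on kernel builds. The following is only a minimal sketch under stated assumptions: the eight toy commits, the file names state and counter, and the "fix lands in commit 5" layout are all invented for illustration, and the grep test stands in for "boot the kernel and start the webconfig tool".

```shell
#!/bin/sh
# Toy run of a "reversed" bisect in a throwaway repository.
# Goal: find the commit that FIXED a bug, so the labels are swapped:
#   "good" = bug still present, "bad" = bug gone.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.invalid"
git config user.name "Bisect Demo"

# Eight commits; the hypothetical fix lands in commit 5.
for i in 1 2 3 4 5 6 7 8; do
    if [ "$i" -ge 5 ]; then echo fixed > state; else echo broken > state; fi
    echo "$i" > counter
    git add state counter
    git commit -q -m "commit $i"
done

oldest=$(git rev-list --max-parents=0 HEAD)

# The newest commit works, so it is marked "bad";
# the oldest still shows the bug, so it is marked "good".
git bisect start HEAD "$oldest" >/dev/null

# The test exits 0 ("good") while the bug is present
# and non-zero ("bad") once the bug is gone.
git bisect run grep -q broken state >/dev/null

# refs/bisect/bad now points at the first commit without the bug.
first_fix=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "fix introduced by: $first_fix"
```

On a real kernel tree each step would be a manual build and boot followed by "git bisect good" (still crashes) or "git bisect bad" (works), rather than an automated "git bisect run".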

Changed in linux:
status: New → Confirmed

Also just adding a reference to the kernel-ppa to this report. It's still failing to build. Thanks for your patience.

https://edge.launchpad.net/~kernel-ppa/+archive

Jean-Louis Dupond (dupondje) wrote :

Not fixed in 2.6.24-19 either :(

Jean-Louis Dupond (dupondje) wrote :

Tried compiling http://packages.ubuntu.com/hardy/linux-image-2.6.24-16-generic

Without Ubuntu patch: works perfectly
WITH Ubuntu patch: crashes ...

Seems like there is something wrong in the patch.

Jean-Louis Dupond (dupondje) wrote :

http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux_2.6.24-16.30.diff.gz

This is the Ubuntu patch I meant in the previous post ...

Thanks for the info. However, the patch you reference is actually a rather large patch set containing fixes for multiple bugs. Would you be able to narrow that patch set down even further, to the exact changes that introduce the issue you are seeing? I really do apologize for asking you to do all this extra work, but it just isn't realistic for the kernel team to revert the entire set of changes in the patch set you've referenced. If you can narrow it down to the specific change in one or more files, it will greatly help them evaluate the change that was made and any possible solutions. Thanks again, we really do appreciate it.

Note that this scenario where you know 2.6.24 works and 2.6.24-16.30 shows a regression is where git bisection really comes in handy. It should only take a few quick builds to narrow down the offending patch. You can grab the Ubuntu Hardy git tree by doing the following:

git://kernel.ubuntu.com/ubuntu/ubuntu-hardy.git

Hope that helps a little. Thanks.

The above command should be:

git clone git://kernel.ubuntu.com/ubuntu/ubuntu-hardy.git
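
Once such a tree is cloned, a regression bisect can be limited to the driver directory so that fewer builds are needed. The following is a minimal sketch on a throwaway repository, not on the real Hardy tree: the tag name known-good, the file contents, and the commit layout are made up for illustration, and the grep test stands in for building and booting each candidate kernel.

```shell
#!/bin/sh
# Toy regression bisect restricted to one directory.
# Normal meanings here: "good" = works, "bad" = crashes.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.invalid"
git config user.name "Bisect Demo"

mkdir -p drivers/scsi/arcmsr docs
drv=drivers/scsi/arcmsr/arcmsr_hba.c
printf 'ok\n' > "$drv"
echo 0 > docs/notes
git add .
git commit -q -m "base"
git tag known-good          # hypothetical tag for the last working state

# Six later commits; only 2, 4 and 5 touch the driver, and 4 breaks it.
for i in 1 2 3 4 5 6; do
    echo "$i" > docs/notes
    case $i in
        2) printf 'ok\ncleanup\n' > "$drv" ;;
        4) printf 'regressed\ncleanup\n' > "$drv" ;;
        5) printf 'regressed\ncleanup\nmore\n' > "$drv" ;;
    esac
    git add .
    git commit -q -m "change $i"
done

# The trailing path makes bisect consider only commits touching the driver.
git bisect start HEAD known-good -- drivers/scsi/arcmsr >/dev/null
git bisect run grep -q '^ok$' "$drv" >/dev/null

culprit=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "regression introduced by: $culprit"
```

Restricting by path only helps when the regression really lives under that directory; if the culprit is elsewhere, the bisect has to be rerun without the path.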

Jean-Louis Dupond (dupondje) wrote :

Damn, Bisect is the biggest crap ever made it seems, this really doesn't work...

Jean-Louis Dupond (dupondje) wrote :

OK, Found out what diff makes it crash ...

--- linux-2.6.24.orig/drivers/scsi/arcmsr/arcmsr_hba.c
+++ linux-2.6.24/drivers/scsi/arcmsr/arcmsr_hba.c
@@ -1381,12 +1381,13 @@

        case ARCMSR_MESSAGE_READ_RQBUFFER: {
                unsigned long *ver_addr;
- dma_addr_t buf_handle;
                uint8_t *pQbuffer, *ptmpQbuffer;
                int32_t allxfer_len = 0;
+ void *tmp;

- ver_addr = pci_alloc_consistent(acb->pdev, 1032, &buf_handle);
- if (!ver_addr) {
+ tmp = kmalloc(1032, GFP_KERNEL|GFP_DMA);
+ ver_addr = (unsigned long *)tmp;
+ if (!tmp) {
                        retvalue = ARCMSR_MESSAGE_FAIL;
                        goto message_out;
                }
@@ -1422,18 +1423,19 @@
                memcpy(pcmdmessagefld->messagedatabuffer, (uint8_t *)ver_addr, allxfer_len);
                pcmdmessagefld->cmdmessage.Length = allxfer_len;
                pcmdmessagefld->cmdmessage.ReturnCode = ARCMSR_MESSAGE_RETURNCODE_OK;
- pci_free_consistent(acb->pdev, 1032, ver_addr, buf_handle);
+ kfree(tmp);
                }
                break;

        case ARCMSR_MESSAGE_WRITE_WQBUFFER: {
                unsigned long *ver_addr;
- dma_addr_t buf_handle;
                int32_t my_empty_len, user_len, wqbuf_firstindex, wqbuf_lastindex;
                uint8_t *pQbuffer, *ptmpuserbuffer;
+ void *tmp;

- ver_addr = pci_alloc_consistent(acb->pdev, 1032, &buf_handle);
- if (!ver_addr) {
+ tmp = kmalloc(1032, GFP_KERNEL|GFP_DMA);
+ ver_addr = (unsigned long *)tmp;
+ if (!tmp) {
                        retvalue = ARCMSR_MESSAGE_FAIL;
                        goto message_out;
                }
@@ -1483,7 +1485,7 @@
                                retvalue = ARCMSR_MESSAGE_FAIL;
                        }
                        }
- pci_free_consistent(acb->pdev, 1032, ver_addr, buf_handle);
+ kfree(tmp);
                }
                break;

Jean-Louis Dupond (dupondje) wrote :

OK Patch To Fix:

--- old_kernel/drivers/scsi/arcmsr/arcmsr_hba.c 2008-06-07 23:45:53.000000000 +0200
+++ arcmsr_hba.c 2008-06-07 23:36:49.000000000 +0200
@@ -1380,18 +1380,16 @@
        switch(controlcode) {

        case ARCMSR_MESSAGE_READ_RQBUFFER: {
- unsigned long *ver_addr;
+ unsigned char *ver_addr;
                uint8_t *pQbuffer, *ptmpQbuffer;
                int32_t allxfer_len = 0;
- void *tmp;

- tmp = kmalloc(1032, GFP_KERNEL|GFP_DMA);
- ver_addr = (unsigned long *)tmp;
- if (!tmp) {
+ ver_addr = kmalloc(1032, GFP_ATOMIC);
+ if (!ver_addr) {
                        retvalue = ARCMSR_MESSAGE_FAIL;
                        goto message_out;
                }
- ptmpQbuffer = (uint8_t *) ver_addr;
+ ptmpQbuffer = ver_addr;
                while ((acb->rqbuf_firstindex != acb->rqbuf_lastindex)
                        && (allxfer_len < 1031)) {
                        pQbuffer = &acb->rqbuffer[acb->rqbuf_firstindex];
@@ -1420,26 +1418,24 @@
                        }
                        arcmsr_iop_message_read(acb);
                }
- memcpy(pcmdmessagefld->messagedatabuffer, (uint8_t *)ver_addr, allxfer_len);
+ memcpy(pcmdmessagefld->messagedatabuffer, ver_addr, allxfer_len);
                pcmdmessagefld->cmdmessage.Length = allxfer_len;
                pcmdmessagefld->cmdmessage.ReturnCode = ARCMSR_MESSAGE_RETURNCODE_OK;
- kfree(tmp);
+ kfree(ver_addr);
                }
                break;

        case ARCMSR_MESSAGE_WRITE_WQBUFFER: {
- unsigned long *ver_addr;
+ unsigned char *ver_addr;
                int32_t my_empty_len, user_len, wqbuf_firstindex, wqbuf_lastindex;
                uint8_t *pQbuffer, *ptmpuserbuffer;
- void *tmp;

- tmp = kmalloc(1032, GFP_KERNEL|GFP_DMA);
- ver_addr = (unsigned long *)tmp;
- if (!tmp) {
+ ver_addr = kmalloc(1032, GFP_ATOMIC);
+ if (!ver_addr) {
                        retvalue = ARCMSR_MESSAGE_FAIL;
                        goto message_out;
                }
- ptmpuserbuffer = (uint8_t *)ver_addr;
+ ptmpuserbuffer = ver_addr;
                user_len = pcmdmessagefld->cmdmessage.Length;
                memcpy(ptmpuserbuffer, pcmdmessagefld->messagedatabuffer, user_len);
                wqbuf_lastindex = acb->wqbuf_lastindex;
@@ -1485,7 +1481,7 @@
                                retvalue = ARCMSR_MESSAGE_FAIL;
                        }
                        }
- kfree(tmp);
+ kfree(ver_addr);
                }
                break;

Changed in linux:
assignee: nobody → info-dupondje
status: Confirmed → Fix Committed

Thanks for all the testing. Am reassigning to the kernel team to consider for a Hardy SRU.

Changed in linux:
assignee: info-dupondje → ubuntu-kernel-team
status: Fix Committed → Triaged

Changed in linux:
assignee: ubuntu-kernel-team → colin-king
importance: Undecided → Medium
status: Triaged → In Progress

Colin Ian King (colin-king) wrote :

Hi Dupond Jean-Louis,

I've put up some kernel packages (linux - 2.6.24-19.34cking6 with a linux-image that contains the fix) in my PPA at https://launchpad.net/~colin-king/+archive

If you are not familiar with how to install packages from a PPA, basically do the following:

Create the file /etc/apt/sources.list.d/kernel-ppa.list containing the following two lines:

deb-src http://ppa.launchpad.net/colin-king/ubuntu hardy main
deb http://ppa.launchpad.net/colin-king/ubuntu hardy main

Then run the command:

sudo apt-get update

You should then be able to install the linux-image kernel package.

Please try this kernel and report any success/regressions. If this fixes the problem I will add the fix to Hardy 8.04.1. Thanks.
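
The PPA steps above could be scripted roughly as follows. This is a sketch, not a tested procedure for this PPA: LIST_FILE is a hypothetical override so the snippet can be tried without root, and the linux-image package name in the comments is an assumption, since the report does not name the exact package.

```shell
#!/bin/sh
set -eu

# Write the two PPA lines from this report to a list file.
# On a real system the target is /etc/apt/sources.list.d/kernel-ppa.list;
# LIST_FILE (hypothetical) lets the file be staged somewhere harmless first.
list_file=${LIST_FILE:-${TMPDIR:-/tmp}/kernel-ppa.list}

cat > "$list_file" <<'EOF'
deb-src http://ppa.launchpad.net/colin-king/ubuntu hardy main
deb http://ppa.launchpad.net/colin-king/ubuntu hardy main
EOF

echo "wrote $list_file"

# Then, as root (not run by this sketch):
#   cp "$list_file" /etc/apt/sources.list.d/kernel-ppa.list
#   apt-get update
#   apt-get install linux-image-2.6.24-19-generic   # package name is an assumption
```

After installing, "uname -a" on the rebooted system should show which kernel build is actually running before the webconfig tool is retested.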

Jean-Louis Dupond (dupondje) wrote :

I can only try the patch this weekend ... But I tried the patch I posted here, and that worked great ... So dunno :) I'll check it this weekend if needed

Thx for the time :)

Colin Ian King (colin-king) wrote :

Hi,

If you could check it this weekend then I will have a verified Hardy version that I can put into Hardy 8.04.1.

Many thanks!

Colin

Jean-Louis Dupond (dupondje) wrote :

Tested, and it's crashing my system :(

Jean-Louis Dupond (dupondje) wrote :

Hello,

I just downloaded the 2.6.24-19.34cking6.tar.gz file and looked at the source of arcmsr_hba.c, and the patch doesn't seem to be included ...
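
A quick way to make that kind of check repeatable is to grep the unpacked source for the allocation flags. This is a minimal sketch: the function name check_arcmsr_fix is made up, and it only recognises the two kmalloc calls quoted in the diffs in this report, so any other variant (such as the original pci_alloc_consistent code) is reported as "unknown".

```shell
#!/bin/sh
# Report whether a copy of arcmsr_hba.c still has the sleeping allocation
# (GFP_KERNEL|GFP_DMA) or already carries the GFP_ATOMIC change.
# check_arcmsr_fix is a hypothetical helper name.
check_arcmsr_fix() {
    src=$1
    # -F matches fixed strings, so "(" and "|" are taken literally.
    if grep -qF 'kmalloc(1032, GFP_KERNEL|GFP_DMA)' "$src"; then
        echo "unpatched"
    elif grep -qF 'kmalloc(1032, GFP_ATOMIC)' "$src"; then
        echo "patched"
    else
        echo "unknown"
    fi
}

# Example call against an unpacked tree (the path is an assumption):
#   check_arcmsr_fix drivers/scsi/arcmsr/arcmsr_hba.c
```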

Colin Ian King (colin-king) wrote :

Dupond Jean-Louis,

Apologies for this, my mistake. I will be uploading the correctly patched version in the next 24 hours. Sorry for wasting your time with my mistake.

Colin

Colin Ian King (colin-king) wrote :

linux - 2.6.24-19.34cking9 now available for re-testing. Thanks!

Jean-Louis Dupond (dupondje) wrote :

Tested, And working ! THX !!

Colin Ian King (colin-king) wrote :

SRU justification:

Impact: When starting up the Areca 1220 PCIe card webconfig tool the
system locks up with:

BUG: scheduling while atomic: archhttp64/7146/0x1000000001
BUG: soft lockup - CPU#0 stuck for 11s! [archhttp64:7146]
BUG: soft lockup - CPU#0 stuck for 11s! [archhttp64:7146]

Fix: Softlockup is caused by arcmsr_iop_message_xfer() being
called from atomic context under the queuecommand scsi_host_template
handler. The current GFP_KERNEL|GFP_DMA flags are wrong: firstly we are
in atomic context, secondly this memory is not used for DMA. The patch
attached corrects these issues.

Patch from upstream commit:
69e562c234440fb7410877b5b24f4b29ef8521d1

Testcase: Start the Areca card's webconfig tool; the system will lock up
without the patch.

Tested and verified OK by user from PPA kernel:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/235889/comments/22

Changed in linux:
status: In Progress → Fix Committed

Steve Langasek (vorlon) on 2008-07-17
Changed in linux:
assignee: nobody → colin-king
importance: Undecided → Medium
status: New → In Progress

Steve Langasek (vorlon) wrote :

Accepted into -proposed, please test and give feedback here. Please see https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in linux:
status: In Progress → Fix Committed

Jean-Louis Dupond (dupondje) wrote :

Tested, and it's working :) THX

Martin Pitt (pitti) wrote :

linux 2.6.24-21 copied to hardy-updates.

Changed in linux:
status: Fix Committed → Fix Released

Adam Schock (adamschock) wrote :

I'm not sure whether to open a new bug report or continue this one, since my problem seems similar and I'm relatively new to participating in the community, rather than as an end user.

On startup I get a kernel message: scheduling while atomic (see attached dmesg.log). Previously the same process (or driver?) caused a hard system lockup on boot with the message: cpu #7 stuck for 11 seconds: archttp64 (unforgivably, I didn't manage to write down the entire message). This event and a later system freeze left no evidence to work with. There were no other changes to the system to which I can ascribe the problem.

The bug message is intermittent and the hang and freeze only happened once.

I am using Ubuntu 8.04.1 with kernel 2.6.24-19 and a manually installed nvidia driver.
I have an Areca 1210 4-port RAID card.

The web configuration tool (manufacturer calls it proxy server) was apparently compiled for Ubuntu by my system vendor, since the documentation and applications only refer to RedHat and Suse distributions and derivatives by name. In my ignorance, I don't know if any necessary changes to the kernel will cause problems with this program installed in /usr/local/sbin.


Marking this from Fix Committed to Fix Released for the actively developed kernel as it already contains the referenced patch.

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released