skge driver broken: invalid call to spin_unlock causes system crash

Reported by Alexander Schulze on 2006-10-12
Affects: linux-source-2.6.15 (Ubuntu)
Importance: Medium
Assigned to: Kees Cook
Declined for Dapper by Kees Cook

Bug Description

After upgrading to kernel 2.6.15-27.48, which includes version 1.5 of the skge driver, we experienced network-related lockups on some of our machines. They began with some "badness" warnings, followed by "scheduling while atomic" errors, and ended in a system crash (no response to pings, local console dead). We eventually found that the skge gigabit ethernet driver was the cause. A comparison between the 1.5 version included in 2.6.15-27.48 in dapper/security and the 1.5 version in vanilla 2.6.17 (from which it is supposed to have been taken) showed that the dapper version contains additional calls to spin_unlock for the hw_lock of the skge device (and *no* spin_lock calls for hw_lock at all!), whereas the 2.6.17 vanilla version seems to have eliminated hw_lock completely. I therefore think that something went wrong when "transplanting" version 1.5 to 2.6.15-27.48, and that hw_lock was not removed in all places.

To verify this analysis, we are currently running a skge.ko module compiled from a modified source where we eliminated hw_lock and all calls to spin_* corresponding to this lock (basically the 2.6.17 version of the driver, but with the pci_device_id patches from dapper). We have not yet seen lock-ups from this modified driver.

Why does the lockup occur only on a subset of our machines? A quick glance at the 1.5 code in dapper shows another locking-related coding error: The spin_unlock is only called in the second branch of the interrupt service routine skge_intr that seems to handle transmission errors, while the main branch, handling data I/O, seems to be correct. Therefore, the error becomes visible only when the bad branch in the ISR is executed, which seems to depend on the cabling to the machine and the network load (and fortunately our server machines have better cabling and were therefore unaffected by this bug!).
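To illustrate why a stray spin_unlock on a rarely taken branch can stay hidden until that branch runs, here is a minimal user-space sketch in C. It is not the skge code: toy_preempt_count, toy_spin_lock, toy_spin_unlock, toy_intr_broken, toy_intr_fixed and toy_run are hypothetical names, and the model simply assumes a kernel configuration in which spin_lock raises and spin_unlock lowers a preemption counter.

```c
#include <stdio.h>

/* Toy stand-in for the kernel's preemption counter (hypothetical model). */
static int toy_preempt_count = 0;

/* Hypothetical stand-ins for spin_lock()/spin_unlock(): in this model,
 * taking a lock raises the counter and releasing it lowers the counter. */
void toy_spin_lock(void)   { toy_preempt_count++; }
void toy_spin_unlock(void) { toy_preempt_count--; }

/* Models the broken handler described above: the data path never touches
 * the lock, but the error branch releases a lock that was never taken. */
void toy_intr_broken(int has_error)
{
    /* main branch: normal data I/O, no locking involved */
    if (has_error) {
        /* error handling would go here */
        toy_spin_unlock();  /* stray unlock without a matching lock */
    }
}

/* Models the handler with the lock removed everywhere, as the report
 * says the vanilla 2.6.17 driver does. */
void toy_intr_fixed(int has_error)
{
    if (has_error) {
        /* error handling only, no unlock */
    }
}

/* Delivers n interrupts to the handler; every error_every-th interrupt
 * takes the error branch (error_every <= 0 means no errors at all).
 * Returns the counter value left behind. */
int toy_run(void (*handler)(int), int n, int error_every)
{
    toy_preempt_count = 0;
    for (int i = 1; i <= n; i++)
        handler(error_every > 0 && i % error_every == 0);
    return toy_preempt_count;
}

/* Prints the counter left behind by a run, for manual inspection. */
void toy_report(const char *label, int count)
{
    printf("%s: counter = %d\n", label, count);
}
```

In this model, toy_run(toy_intr_broken, 100, 10) leaves the counter at -10, while a run without error interrupts, or any run with toy_intr_fixed, leaves it at 0. That mirrors the observation that machines which rarely hit the error branch never show the problem.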

So, when verifying this bug report, don't be surprised if you can't reproduce it in many configurations; just have a look at the source and compare it to the version in 2.6.17. The difference (and the fact that the locking in 2.6.15-27.48 is broken) is obvious. *So* obvious, in fact, that I really wonder how this defective driver made its way into dapper security... (leaving aside the question of whether driver updates that are not security-related, and obviously not thoroughly tested, should be shipped at all in a security update to an LTS version targeted at server use!)
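As a rough aid for the source comparison suggested above, the following C sketch counts plain lock and unlock calls on hw_lock in a source text. It is a crude textual check under stated assumptions, not a real analysis tool: count_occurrences and lock_balance are hypothetical helper names, the source is passed in as a string, and only the exact spellings spin_lock(&hw->hw_lock) and spin_unlock(&hw->hw_lock) are matched (variants such as spin_lock_irqsave are ignored).

```c
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping occurrences of needle in haystack. */
size_t count_occurrences(const char *haystack, const char *needle)
{
    size_t count = 0;
    size_t len = strlen(needle);

    if (len == 0)
        return 0;  /* an empty pattern is treated as "no match" */

    for (const char *p = strstr(haystack, needle); p != NULL;
         p = strstr(p + len, needle))
        count++;

    return count;
}

/* Returns (number of lock calls) - (number of unlock calls) on hw_lock.
 * A negative result means more unlocks than locks appear in the text.
 * Note that "spin_lock(" does not match inside "spin_unlock(" or
 * "spin_lock_init(", so those calls are not miscounted as locks. */
long lock_balance(const char *source)
{
    size_t locks = count_occurrences(source, "spin_lock(&hw->hw_lock)");
    size_t unlocks = count_occurrences(source, "spin_unlock(&hw->hw_lock)");

    return (long)locks - (long)unlocks;
}
```

A text containing only spin_lock_init(&hw->hw_lock) and one spin_unlock(&hw->hw_lock) yields a balance of -1, which is the kind of mismatch the description reports. A balance of 0 does not prove the locking is correct, since the calls could still sit on different code paths.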

Alexander Schulze (schulze) wrote :

Included hw spin-lock patch for skge.c and skge.h (based on 2.6.15-27.48 sources). Fix seems to correct the problems in our setup.

Alexander Schulze (schulze) wrote :

I just verified that this defect in the skge driver is also present in the current git tree for edgy (URL given below). It would be good if it could be fixed in time: releasing edgy with this bug would prevent installation of edgy on affected machines (without special workarounds), because accessing the network would destabilize the kernel due to a wrong preempt_count value.

http://git.kernel.org/git/?p=linux/kernel/git/bcollins/ubuntu-edgy.git;a=blob;h=a724fea2ad52a6cb6cdbdce13f22efca88d0e7d8;hb=f23a2bfe4a30cf528c1874957f51ab86e5bb7e27;f=drivers/net/skge.c

description: updated
Alexander Schulze (schulze) wrote :

As there seems to be no activity on this bug yet, I did a binary search in the bcollins/ubuntu-edgy git tree to find the commit that introduced the problem. (I used the http interface on git.kernel.org to do this.)

The situation is quite confusing. There are two files named skge.c in the kernel, one belonging to the skge driver, the other to the sk98lin driver, and it seems that these two files were mixed up when doing feature backports. This occurs in commit df5831f36218ca9276866ad26524492810cf3d53 from 2006-05-09 and is corrected by commit 37419595bb1bc44c489daa1d101a0e633bf3b934 from 2006-05-11. Before these commits, there is no hw_lock present in the skge driver; the correction commit reintroduces it, but apparently in a consistent way, reverting skge.c from version 1.5 to 1.3 again. The following commit, 2d2a387199bf38c6628adb9c6184d7ab6e306148 from 2006-05-20, eliminates hw_lock again, and the driver version is now 1.5 (and does not contain the locking error).

Then, on 2006-08-10 in commit c168e0350ac2d286724ea985227360e9605eb9c3, the error seems to be introduced. Here it gets interesting, as the diff for skge.c in this commit does not match the state of the skge.c file. It (the diff) indicates that the line "spin_unlock(&hw->hw_lock);" causing the error is present in the file both before and after the patch (black color in the http interface, no + or - prefix in the diff), while this line is clearly *not* in the skge.c file before this commit. (However, it was present in the 1.3 branch that we temporarily reverted back to as described above.) Git (or, more precisely, patch) seems to get confused and introduces this line into the file, causing the error we are seeing here.

For me, the problem seems to be caused by a commit based on a previous skge.c version (1.3?) instead of the version that was in the git tree at that time (1.5), causing patch to malfunction and introduce the spin_unlock line into the code. As my previous reports said, a comparison with skge 1.5 in vanilla 2.6.17 shows no other significant differences, and an application of the patch I submitted above should be sufficient to correct the situation.

Lauri Nurmi (lauri-ksenos) wrote :

I can confirm this bug still exists in Dapper kernel 2.6.15-28.

We have three almost identical computers at the office, each having an Asus P4P800-SE motherboard and running Dapper. All three computers are connected to a gigabit switch. All three have had stability issues with networking especially after the switch was upgraded from 100Mbps to 1000Mbps. The computers tend to crash whenever something network-intensive is being done, such as downloading an .iso image from another computer on the LAN. On at least one computer, running a distcc build causes a crash within ten seconds _every time_.

The patch above seems to fix all the stability issues.

Changed in linux-source-2.6.15:
status: Unconfirmed → Confirmed
Camille Dominique (cduntu) wrote :

I can confirm that this bug has existed in Dapper since version 2.6.15-26.47. The changelog states that skge was then updated to the version in 2.6.17, to obtain fixes and D-Link DGE-530T support.

I was able to reproduce the bug on two identical computers (ASUS A8V Deluxe mainboard with an onboard Marvell 88E8001 Gigabit Ethernet Controller) attached to a gigabit network.

I can also confirm that Alexander's patch above fixes the issue, with no ill effects so far.

Lauri Nurmi (lauri-ksenos) wrote :

This bug still seems to exist in Dapper kernel 2.6.15-29. It's been almost a year since the bug was reported and the 3-line fix attached.

Changed in linux-source-2.6.15:
assignee: nobody → ubuntu-kernel-team
TJ (tj) wrote :

Can anyone confirm whether the fault still happens when using Gutsy (2.6.22)? I've attached a patch for Gutsy for testing.

Although the drivers/net/skge.c and drivers/net/skge.h sources have changed somewhat, they still use hw_lock:

drivers/net/skge.h::struct skge_hw {
 void __iomem *regs;
 struct pci_dev *pdev;
 spinlock_t hw_lock;

....

drivers/net/skge.c::static int __devinit skge_probe(struct pci_dev *pdev,
    const struct pci_device_id *ent)
{
 struct net_device *dev, *dev1;
 struct skge_hw *hw;

...

spin_lock_init(&hw->hw_lock);

Camille Dominique (cduntu) wrote :

I have been using Gutsy for about two weeks now on one of the machines that were affected by this bug under Dapper (This is a fresh, parallel installation, no dist-upgrade). I haven't experienced any problems so far, so it seems that this bug does not exist in Gutsy. My kernel version is 2.6.22-14-generic.

Kees Cook (kees) wrote :

Hm, I'm not sure how, but the ubuntu-security team wasn't subscribed to this bug (yet it was flagged as security). The fix for this should be available in the next kernel update for Dapper. Thanks for tracking down the cause and building a patch. :)

Kees Cook (kees) on 2007-11-21
Changed in linux-source-2.6.15:
assignee: ubuntu-kernel-team → keescook
importance: Undecided → Medium
status: Confirmed → Fix Committed
Lauri Nurmi (lauri-ksenos) wrote :

The latest Dapper kernel, 2.6.15-51-686, will still easily crash due to skge. Please see attached log.

Kees Cook (kees) wrote :

This should now be fixed in USN-578-1 (Dapper kernel 2.6.15-51.66).

Changed in linux-source-2.6.15:
status: Fix Committed → Fix Released

I'm not sure whether this is the same bug or a related one, but I am getting an error in skge.c that locks up my computer. The strange thing is that it doesn't always happen; the lock-ups seem to occur at random.
I've attached the part of the dmesg log that shows the problem.

It would be best to open a new bug. The issues in this bug have been
addressed, so new problems shouldn't be confused with it. (Especially
since it is considered to be "closed".)

Thank you, new bug is 252856.
