Comment 0 for bug 65631

Alexander Schulze (schulze) wrote :

After upgrading to kernel 2.6.15-27.48, which includes version 1.5 of the skge driver, we experienced network-related lockups on some of our machines. They began with "badness" warnings, followed by "scheduling while atomic" errors, and ended in a system crash (no response to pings, local console dead). We eventually traced the cause to the skge gigabit ethernet driver. Comparing the 1.5 version shipped in 2.6.15-27.48 in dapper/security with the 1.5 version in vanilla 2.6.17 (from which it was supposedly taken) showed that the dapper version contains additional calls to spin_unlock for the hw_lock of the skge device (and *no* spin_lock calls for hw_lock at all!), whereas the vanilla 2.6.17 version seems to have eliminated hw_lock completely. I therefore think that something went wrong when "transplanting" version 1.5 to 2.6.15-27.48, and that the removal of hw_lock was not carried out in all places.

To verify this analysis, we are currently running an skge.ko module compiled from a modified source in which we eliminated hw_lock and all spin_* calls on this lock (basically the 2.6.17 version of the driver, but with the pci_device_id patches from dapper). We have not yet seen any lockups with this modified driver.

Why does the lockup occur only on a subset of our machines? A quick glance at the 1.5 code in dapper shows another locking-related coding error: spin_unlock is called only in the second branch of the interrupt service routine skge_intr, which seems to handle transmission errors, while the main branch, which handles data I/O, seems to be correct. The error therefore becomes visible only when the bad branch of the ISR is executed, which seems to depend on the cabling to the machine and on the network load (fortunately, our server machines have better cabling and were therefore unaffected by this bug!).
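To illustrate the pattern described above, here is a minimal userspace sketch. It is not the skge source: the toy_* names, the IRQ_* status bits and the counting "lock" are all hypothetical, and the lock is only a stand-in for a kernel spinlock that lets an unlock without a matching lock be counted. It only shows why the problem stays hidden until the error branch of the interrupt handler runs.

```c
#include <assert.h>

/* Toy stand-in for a spinlock (NOT the kernel's spinlock_t): it only
 * counts how often it is held and how often it is released while not held. */
struct toy_lock {
    int depth;       /* outstanding acquire calls */
    int bad_unlocks; /* release calls made while the lock was not held */
};

static void toy_lock_acquire(struct toy_lock *l)
{
    l->depth++;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->depth >= 0);
    if (l->depth == 0) {
        l->bad_unlocks++; /* unlock without a matching lock */
        return;
    }
    l->depth--;
}

/* Hypothetical interrupt status bits, for illustration only. */
#define IRQ_DATA  0x1u
#define IRQ_ERROR 0x2u

/* Models the reported pattern: the data branch never touches the lock,
 * while the error branch releases a lock that was never acquired. */
static void toy_isr_broken(struct toy_lock *hw_lock, unsigned int status)
{
    if (status & IRQ_DATA) {
        /* normal data I/O handling, no locking involved */
    }
    if (status & IRQ_ERROR) {
        /* error handling would go here */
        toy_lock_release(hw_lock); /* leftover unlock, no matching lock */
    }
}

/* One possible balanced variant (an assumption, not the 2.6.17 code,
 * which according to the report drops hw_lock entirely). */
static void toy_isr_balanced(struct toy_lock *hw_lock, unsigned int status)
{
    if (status & IRQ_ERROR) {
        toy_lock_acquire(hw_lock);
        /* error handling would go here */
        toy_lock_release(hw_lock);
    }
}
```

In this sketch, calling toy_isr_broken with only IRQ_DATA set leaves bad_unlocks at zero, so a machine that never takes the error path never shows the imbalance. The first call with IRQ_ERROR set records an unmatched release.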

So, when verifying this bug report, don't be surprised if you can't reproduce it in many configurations; just have a look at the source and compare it to the version in 2.6.17. The difference (and the fact that the locking in 2.6.15-27.48 is broken) is obvious. *So* obvious, in fact, that I really wonder how this defective driver made its way into dapper security... (leaving aside the question of whether it is really necessary to ship driver updates that are not security-related, and obviously not thoroughly tested, in a security update to an LTS version targeted at server use at all!)