After upgrade to new kernel version, machine crashes after a few hours of uptime

Bug #1052861 reported by JanCeuleers
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: none listed

Bug Description

I upgraded the kernel on this machine to linux-image-2.6.32-43-generic yesterday, and the machine has locked up solid three times since. This is a broadband router (Soekris net5501) without video or keyboard. I am monitoring the serial console and no messages are output when the crash occurs.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-43-generic 2.6.32-43.97
Regression: Yes
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.32-42.96-generic 2.6.32.59+drm33.24
Uname: Linux 2.6.32-42-generic i586
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: i386
ArecordDevices: Error: [Errno 2] No such file or directory
Date: Wed Sep 19 12:53:20 2012
Lsusb:
 Bus 002 Device 002: ID 051d:0002 American Power Conversion Uninterruptible Power Supply
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
PciMultimedia:

ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-42-generic root=UUID=198bbce8-57d4-4b24-8732-c6bd1f56dc81 ro console=ttyS0,9600n8 ipv6.disable=1 quiet splash
ProcEnviron:
 SHELL=/bin/bash
 LANG=en_US.UTF-8
SourcePackage: linux

JanCeuleers (jan-ceuleers) wrote :
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Luis Henriques (henrix) wrote :

Hi,

can you confirm that the previous Lucid kernel version (2.6.32-42.96) was working for you? If this is true, and you're available to assist in a kernel bisect, I'll build a few test kernels (should be 2 or 3 at most) for you to test so that we can identify the commit that introduced this regression.

Thanks.

Changed in linux (Ubuntu):
importance: Undecided → High
JanCeuleers (jan-ceuleers) wrote :

I am the original bug reporter.

I can confirm that reverting to the previous kernel package (linux-image-2.6.32-42-generic, version 2.6.32-42.96) returns the machine to stability.

This is my broadband router, and my wife and I work from home a lot. I can test kernels for you, but only overnight because I need the box to be stable during the day. So a bisection run, even if you provide precompiled packages, is going to take several days to test (1 day per package). Happy to do so though.

Luis Henriques (henrix) wrote :

Great, thanks a lot for your help. So, I've uploaded the 1st kernel here:

http://people.canonical.com/~henrix/lp1052861/v1/i386/

This kernel contains the pae kernel up to commit 7db137a69f19021fe7c7614cac3883cc16992b0c.

Please let me know if this is a good (stable) or bad (unstable) kernel.

JanCeuleers (jan-ceuleers) wrote :

Looking at the security notice that describes this kernel update I see that one of the changes is to do with TCP segmentation offload. Some additional information therefore.

This machine has four Ethernet interfaces:

root@skr03:~# lspci | grep Ethernet
00:06.0 Ethernet controller: VIA Technologies, Inc. VT6105M [Rhine-III] (rev 96)
00:07.0 Ethernet controller: VIA Technologies, Inc. VT6105M [Rhine-III] (rev 96)
00:08.0 Ethernet controller: VIA Technologies, Inc. VT6105M [Rhine-III] (rev 96)
00:09.0 Ethernet controller: VIA Technologies, Inc. VT6105M [Rhine-III] (rev 96)

It uses the via-rhine driver:

root@skr03:~# ethtool -i eth0
driver: via-rhine
version: 1.4.3
firmware-version:
bus-info: 0000:00:06.0

TCP segmentation offload is not being used:

root@skr03:~# (ethtool -k eth0 ; ethtool -k eth1 ; ethtool -k eth2 ; ethtool -k eth3) | grep tcp-segmentation-offload
tcp-segmentation-offload: off
tcp-segmentation-offload: off
tcp-segmentation-offload: off
tcp-segmentation-offload: off
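
The per-interface check above can also be scripted. The following is a minimal sketch, not part of the original report: it parses text in the "name: value" form printed by ethtool -k and reports whether each requested offload feature is on. The function name parse_offloads and the sample text are illustrative assumptions; real output lists more features and differs between ethtool versions.

```python
def parse_offloads(ethtool_output, features=("tcp-segmentation-offload",
                                             "generic-segmentation-offload")):
    """Return {feature: True/False} for the features found in `ethtool -k` text."""
    wanted = set(features)
    states = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in wanted:
            # Values look like "on", "off" or "off [fixed]"; only the first word matters.
            words = value.split()
            states[name] = bool(words) and words[0] == "on"
    return states


# Hypothetical sample resembling the output quoted above (not captured from the router).
sample = """\
Offload parameters for eth3:
tcp-segmentation-offload: off
generic-segmentation-offload: on
"""

if __name__ == "__main__":
    for feature, enabled in parse_offloads(sample).items():
        print(feature, "on" if enabled else "off")
```

On a live system the text passed in would be the captured output of ethtool -k for each interface.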

JanCeuleers (jan-ceuleers) wrote :

The v1 kernel you gave me to test is BAD

Luis Henriques (henrix) wrote :

Thanks for the input. I've built the next kernel in this bisect session and uploaded it here:

http://people.canonical.com/~henrix/lp1052861/v2/i386/

Same request: just let me know whether this is a good or a bad kernel.

JanCeuleers (jan-ceuleers) wrote :

The v2 kernel is also BAD

JanCeuleers (jan-ceuleers) wrote :

By the way: I may be able to test a few more kernels than one per day over the weekend. If this is a practical proposition, could you prepare three or four packages in one go? Please only do so if it's not too much trouble.

Luis Henriques (henrix) wrote :

Great, I've just uploaded another kernel, which should tell us finally which commit has caused the regression. You can download it in the usual place:

http://people.canonical.com/~henrix/lp1052861/v3/i386/

JanCeuleers (jan-ceuleers) wrote :

v3 has stayed up overnight, which leads me to suspect that it is GOOD.

I can't be categorical about this you understand: if it freezes it is certain to be bad, but if it doesn't it might merely not yet have crashed. So I'll leave it running and report back tomorrow.
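
The asymmetry described here (a freeze proves a kernel bad, while uptime only makes "good" more likely) can be put into rough numbers. This is a minimal sketch under a loudly stated assumption: that a bad kernel's time to crash is exponentially distributed with a known mean. Neither that model nor the 5-hour mean used below is established by this report; both are illustrative.

```python
import math


def survival_probability(hours_up, mean_hours_to_crash):
    """P(a bad kernel is still running after hours_up), exponential model (assumed)."""
    return math.exp(-hours_up / mean_hours_to_crash)


def hours_needed(confidence, mean_hours_to_crash):
    """Uptime after which a bad kernel would have crashed with probability `confidence`."""
    return -mean_hours_to_crash * math.log(1.0 - confidence)


if __name__ == "__main__":
    mean = 5.0  # hypothetical mean time to crash, in hours
    print(round(survival_probability(12, mean), 3))
    print(round(hours_needed(0.95, mean), 1))
```

Under these assumptions, a 12-hour overnight run still leaves roughly a 9% chance that a bad kernel simply has not crashed yet, which is why a single quiet night is weak evidence for "good".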

JanCeuleers (jan-ceuleers) wrote :

Right. I'm glad I qualified my previous statement. It crashed after about 20 hours, which is much longer than it took v1 and v2 to crash. Could be coincidence, who knows. Going back to the -42.96 package now.

Luis Henriques (henrix) wrote :

Thank you for taking the time to test all these kernels. Something definitely went wrong in this kernel bisection. I may have messed up the bisect operation, because the test results ended up pointing to commit:

23f18f3 eCryptfs: Initialize empty lower files when opening them

And I'm pretty sure you're not using ecryptfs in your router (I may be wrong though. Can you confirm this?)

Anyway, after some internal discussions, we have decided to revert some of the commits from the last released kernel, and it is very likely that one of these commits is really responsible for this bug. I've uploaded another kernel in the usual place:

http://people.canonical.com/~henrix/lp1052861/v4/

This new kernel reverts the following commits:

b52e527 sfc: Fix maximum number of TSO segments and minimum TX queue size
a6638ab sfc: Replace some literal constants with EFX_PAGE_SIZE/EFX_BUF_SIZE
7db137a tcp: Apply device TSO segment limit earlier
bdf7397 tcp: do not scale TSO segment size with reordering degree
d7cdf67 net: Allow driver to limit number of GSO segments per skb

Could you please try this one and see if you still see the problem? It would also be great if you could post a kernel log for this kernel. Just gathering the output of dmesg after booting would be enough. Thank you!

JanCeuleers (jan-ceuleers) wrote :

I am indeed not using ecryptfs.

I am using GSO (generic segmentation offload) on one of the ethernet interfaces (the one that faces the LAN). Whereas the box has four ethernet interfaces I'm using only two. One faces the LAN (eth3), the other faces the VDSL modem (eth2), across which I set up a PPPoE link. GSO is in use on eth3 but not eth2.

Please find the requested dmesg attached.

JanCeuleers (jan-ceuleers) wrote :

v4 seems to behave similarly to v3: it also crashed after many hours, i.e. after a longer period than the released -43 package and the v1 and v2 packages did. But it still crashed.

Once again I'm back on -42. I think I now need to stay on this package for a couple of days just to make sure that I'm not sending you on a wild goose chase (i.e. some coincidence perhaps due to flaky hardware).

JanCeuleers (jan-ceuleers) wrote :

I can confirm that the box is stable when running -42.

JanCeuleers (jan-ceuleers) wrote :

By the way, I just noticed that I should have mentioned the presence in the box of a wireless adapter as well. Since the box is a broadband router it also acts as a wireless access point, using an ath9k card. No offload options (TSO, GSO) are active on this interface.

Luis Henriques (henrix) wrote :

A new Lucid kernel is available in the -proposed pocket and will soon be released. Could you please try it to check whether it fixes your problem? See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

JanCeuleers (jan-ceuleers) wrote :

Assuming I've managed to install the correct package from -proposed:

Linux skr03 2.6.32-44-generic #98-Ubuntu SMP Mon Sep 24 17:32:45 UTC 2012 i586 GNU/Linux

the new kernel is BAD (it crashed after an hour and 10mins)

Luis Henriques (henrix) wrote :

Ok, I've been re-reading all the comments in the bug report and realised I may have made a wrong assumption:

Comment #11 suggests the tested kernel was a GOOD one, while comment #12 states the opposite. My comment #13 seems to have been written after reading comment #11, so I may have missed comment #12! Had I taken #12 into account, the bisect operation would have pointed to commit e1a07a02513462dd865e06c9dcc323eee226fba0 instead.

I will build a test kernel reverting this commit, and I would like to ask you if you could give it a try. I'll post the link to the kernel once it's built.
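
To illustrate how one misread verdict redirects a bisection, here is a minimal sketch of the binary search that a bisect performs over a linear history. The commit names c1 to c8 and the verdicts are hypothetical; this is not the actual Lucid commit range, and the helper names are made up for the example.

```python
def first_bad(commits, is_bad):
    """Binary-search a linear history (oldest first) whose last commit is known bad.

    is_bad(commit) returns the recorded test verdict. The result is the first
    commit judged bad, assuming verdicts are consistent (all good, then all bad).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the culprit is mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return commits[lo]


commits = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]  # hypothetical history
truly_bad = {"c3", "c4", "c5", "c6", "c7", "c8"}            # regression enters at c3


def correct(commit):
    """Verdicts as they should have been recorded."""
    return commit in truly_bad


def misread(commit):
    """Same verdicts, except a slow-to-crash build of c3 is recorded as good."""
    return commit in truly_bad and commit != "c3"


if __name__ == "__main__":
    print(first_bad(commits, correct))
    print(first_bad(commits, misread))
```

With the correct verdicts the search ends at c3; with the single misread verdict it ends at c4, a commit that merely sits next to the real culprit.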

Luis Henriques (henrix) wrote :

Ok, kernel built and uploaded here:

http://people.canonical.com/~henrix/lp1052861/v5/

If you have a chance, please give it a try and let me know if it made any difference. As I mentioned in my previous comment, this is just the plain Lucid kernel with commit e1a07a02513462dd865e06c9dcc323eee226fba0 reverted.

Thanks.

JanCeuleers (jan-ceuleers) wrote :

I'm really sorry, but this kernel (v5) is BAD.

But I noticed that for some reason a different set of modules is being loaded. I have no idea why. I attach a file that shows the differences between the first column of lsmod's sorted output under each kernel (-42 and -44 v5). Why are a lot more (filesystem-related) modules being loaded? Furthermore, why are two ath9k-related modules not being loaded? Hmmm.
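
The comparison described above (the first column of lsmod output under each kernel) can be reproduced with a short script. This is a minimal sketch; the function names and the sample module lists are illustrative and are not the contents of the attached file.

```python
def module_names(lsmod_output):
    """Return the set of module names (first column) from `lsmod` text."""
    names = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if fields and fields[0] != "Module":
            names.add(fields[0])
    return names


def compare_modules(old_output, new_output):
    """Return (only_in_old, only_in_new) as sorted lists of module names."""
    old, new = module_names(old_output), module_names(new_output)
    return sorted(old - new), sorted(new - old)


# Hypothetical samples, not the reporter's real module lists.
old_sample = """Module                  Size  Used by
ath9k                  90000  0
via_rhine              20000  0
"""
new_sample = """Module                  Size  Used by
via_rhine              20000  0
xfs                   500000  0
"""

if __name__ == "__main__":
    missing, extra = compare_modules(old_sample, new_sample)
    print("only under old kernel:", missing)
    print("only under new kernel:", extra)
```

In practice the two inputs would be the saved lsmod output from a fresh boot of each kernel.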

JanCeuleers (jan-ceuleers) wrote :

I can now explain one class of differences between the modules loaded under -42 versus -44. I am using the compat-wireless package, a version of which does not yet exist for the -44 kernel (or at least: I hadn't installed it). The two ath9k-related modules which are not being loaded under -44 but are being loaded under -42 come from the compat-wireless package.

This still leaves me stumped as to why all of these filesystem-related modules are being loaded. They consume rather a lot of memory so I guess that (on this small embedded system) they might lead to certain kinds of memory running out. I would be grateful for any hints on finding out why these modules are being loaded. The only filesystem I'm using on this box is ext2 (on a CompactFlash card, if that makes any difference). Other than proc, sysfs, debugfs, securityfs, devtmpfs, devpts and tmpfs, that is.

I guess the above story also holds true for the -43-based kernels you have asked me to test. I sincerely hope that this does not completely invalidate the tests I've been doing (and the kernel packages you've been building etc).

Please advise.

Luis Henriques (henrix) wrote :

Hi Jan, the only reason I can think of for those FS-related modules being loaded is that you had just installed a kernel: running dpkg -i <kernel-image>.deb will eventually cause all these modules to be loaded. If you reboot, the modules shouldn't be loaded automatically again.

JanCeuleers (jan-ceuleers) wrote :

I ran the lsmod commands just after a fresh reboot, so this doesn't seem to be it.

JanCeuleers (jan-ceuleers) wrote :

I guess I need to wait for a compat-wireless package to become available that corresponds with the -44 ABI. Although I can start by installing the -43 compat-wireless package and test that with the v1-v3 kernels you previously built for me (just to see whether the results are any different from the ones I reported earlier). I will make a start with that tonight.

Luis Henriques (henrix) wrote :

After looking into the /var/log/apt/ logs (provided privately), it was verified that the initial installation of kernel linux-image-2.6.32-43-generic had failed, and this may be causing trouble. I suggested that Jan purge this kernel (and any other kernel installed after this failure) and try the upgrade again.

JanCeuleers (jan-ceuleers) wrote :

I have now purged all -43 and -44 kernels (i.e. the released -43 kernel and all test kernels v1 through v5), then reinstalled the released -43 kernel, and also pulled in the compat-wireless package for -43. After rebooting the machine has stayed up overnight.

Luis and I both had a theory about the cause:

- Luis found an installation issue in the apt logs which might have caused the initial installation of the -43 kernel to be bodged. I would however have expected that situation to have been corrected by my subsequent dpkg -i-ing of the test kernel packages.

- I noticed that I am using certain modules from the compat-wireless package, and that a compat-wireless package corresponding with the -43 kernel package had not yet been available when I first upgraded to -43, nor while I was testing the v1 through v3 test kernels which were also based on the -43 ABI.

I could try and remove the -43 compat-wireless package and test again; would you like me to do that? (Meanwhile I'll keep the current situation until the machine has been up for at least 48 hours in order to confirm stability).

JanCeuleers (jan-ceuleers) wrote :

The released 2.6.32-43-generic (with the corresponding compat-wireless package) has now been running for nearly 39 hours, so I'm going to go ahead and remove the -43 compat-wireless package and reboot.

JanCeuleers (jan-ceuleers) wrote :

The released 2.6.32-43-generic kernel crashed after about 5 hours of uptime with the corresponding compat-wireless package not installed. So I think that this validates the hypothesis that the determining factor is the presence of the compat-wireless package (i.e. whether or not newer ath9k code is in use), as opposed to a potentially only partially installed kernel package (I made sure that there were no error messages when I last installed the -43 kernel package).

If anyone's still interested in this issue I guess the next step would be to go back to the -42 kernel package without compat-wireless and so on, as far back as it takes to find the release in which the regression was introduced. But let me say that I've been running this box for some time and have always installed kernel upgrades when they became available (i.e. usually before the corresponding compat-wireless package became available), and I've not seen crashes like this before.

Luis, since I seem to be the only person affected by this problem I would not object to closing this bug.

penalvch (penalvch) wrote :

Jan Ceuleers, thank you for reporting this and helping make Ubuntu better. Lucid reached EOL on May 9, 2013.
Please see this document for currently supported Ubuntu releases:
https://wiki.ubuntu.com/Releases

We were wondering if this is still an issue in a supported release? If so, could you please test for this with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the kernel in the mainline kernels archive directory daily folder. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.11-rc4

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream
kernel-unable-to-test-upstream-VERSION-NUMBER
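
The tagging rules above can be summarised in a few lines. This is a minimal sketch that only restates those rules; the function name upstream_test_tags and the outcome labels "fixed", "exists" and "unable" are made up for illustration and are not part of any Launchpad tool.

```python
def upstream_test_tags(outcome, version):
    """Return (tags_to_add, tags_to_remove) for an upstream kernel test result.

    outcome is "fixed", "exists" or "unable" (labels invented for this sketch);
    version is the tested kernel version, e.g. "v3.11-rc4".
    """
    prefixes = {
        "fixed": "kernel-fixed-upstream",
        "exists": "kernel-bug-exists-upstream",
        "unable": "kernel-unable-to-test-upstream",
    }
    if outcome not in prefixes:
        raise ValueError("unknown outcome: %r" % (outcome,))
    prefix = prefixes[outcome]
    add = [prefix, "%s-%s" % (prefix, version)]
    # The instructions ask for needs-upstream-testing to be removed only in the
    # two cases where the mainline kernel was actually tested.
    remove = ["needs-upstream-testing"] if outcome != "unable" else []
    return add, remove


if __name__ == "__main__":
    print(upstream_test_tags("fixed", "v3.11-rc4"))
```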

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

Helpful bug reporting tips:
https://help.ubuntu.com/community/ReportingBugs

tags: added: needs-kernel-logs
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired