"Kernel unaligned access at TPC" causing network/system to become slow and/or unresponsive

Bug #569610 reported by Luke J Militello
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Incomplete
Importance: Medium
Assigned to: Unassigned

Bug Description

Binary package hint: linux-image-2.6.24-27-sparc64-smp

The following lines kept showing up in dmesg and the logs over the past few weeks. The system and network in question then became slow and unresponsive, and a hard reboot was needed to remedy it.

Apr 7 16:01:48 Hal kernel: [1076377.260268] Kernel unaligned access at TPC[6a1504] tcp_transmit_skb+0x1ac/0x8c0
Apr 7 16:01:48 Hal kernel: [1076377.260330] Kernel unaligned access at TPC[6a150c] tcp_transmit_skb+0x1b4/0x8c0
Apr 7 16:01:48 Hal kernel: [1076377.260355] Kernel unaligned access at TPC[68e724] ip_queue_xmit+0x14c/0x5a0
Apr 7 16:01:48 Hal kernel: [1076377.260390] Kernel unaligned access at TPC[68e734] ip_queue_xmit+0x15c/0x5a0
Apr 7 16:01:48 Hal kernel: [1076377.260418] Kernel unaligned access at TPC[58e904] ip_fast_csum+0xc/0x80
Apr 7 16:32:21 Hal kernel: [1078210.103332] Kernel unaligned access at TPC[6a1504] tcp_transmit_skb+0x1ac/0x8c0
Apr 7 16:32:21 Hal kernel: [1078210.103406] Kernel unaligned access at TPC[6a150c] tcp_transmit_skb+0x1b4/0x8c0
Apr 7 16:32:21 Hal kernel: [1078210.103437] Kernel unaligned access at TPC[68e724] ip_queue_xmit+0x14c/0x5a0
Apr 7 16:32:21 Hal kernel: [1078210.103474] Kernel unaligned access at TPC[68e734] ip_queue_xmit+0x15c/0x5a0
Apr 7 16:32:21 Hal kernel: [1078210.103501] Kernel unaligned access at TPC[58e904] ip_fast_csum+0xc/0x80
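
For triage it can help to see which functions and offsets the messages point at, and how often. The following is only a minimal sketch in Python, assuming the message format stays exactly as in the lines above; the names `PATTERN`, `tally_unaligned` and `SAMPLE` are made up for this illustration and are not part of any kernel or Ubuntu tool.

```python
import re
from collections import Counter

# Matches e.g. "Kernel unaligned access at TPC[6a1504] tcp_transmit_skb+0x1ac/0x8c0".
# The format is assumed from the log lines quoted in this report.
PATTERN = re.compile(
    r"Kernel unaligned access at TPC\[(?P<tpc>[0-9a-f]+)\]\s+"
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def tally_unaligned(lines):
    """Count unaligned-access messages per (symbol, offset) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match["symbol"], int(match["offset"], 16))] += 1
    return counts


# First burst from the report, copied verbatim.
SAMPLE = [
    "Apr 7 16:01:48 Hal kernel: [1076377.260268] Kernel unaligned access at TPC[6a1504] tcp_transmit_skb+0x1ac/0x8c0",
    "Apr 7 16:01:48 Hal kernel: [1076377.260330] Kernel unaligned access at TPC[6a150c] tcp_transmit_skb+0x1b4/0x8c0",
    "Apr 7 16:01:48 Hal kernel: [1076377.260355] Kernel unaligned access at TPC[68e724] ip_queue_xmit+0x14c/0x5a0",
    "Apr 7 16:01:48 Hal kernel: [1076377.260390] Kernel unaligned access at TPC[68e734] ip_queue_xmit+0x15c/0x5a0",
    "Apr 7 16:01:48 Hal kernel: [1076377.260418] Kernel unaligned access at TPC[58e904] ip_fast_csum+0xc/0x80",
]

for (symbol, offset), n in sorted(tally_unaligned(SAMPLE).items()):
    print(f"{symbol}+{offset:#x}: {n}")
```

Fed a whole log file instead of `SAMPLE`, the same function would show whether the five locations always appear together, as they do in both bursts above.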

I'm not sure whether this is a kernel problem or a side effect of running ifenslave with interface bonding in fault-tolerant mode.

My system is a Sun Enterprise 420R.
I'm running Ubuntu 8.04.4 with 4 GB of RAM.

CPU Info...

cpu : TI UltraSparc II (BlackBird)
fpu : UltraSparc II integrated FPU
prom : OBP 3.23.0 1999/06/30 13:53
type : sun4u
ncpus probed : 4
ncpus active : 4
D$ parity tl1 : 0
I$ parity tl1 : 0
Cpu0ClkTck : 000000001ad35932
Cpu1ClkTck : 000000001ad35932
Cpu2ClkTck : 000000001ad35932
Cpu3ClkTck : 000000001ad35932
MMU Type : Spitfire
State:
CPU0: online
CPU1: online
CPU2: online
CPU3: online
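
For readers unfamiliar with the sparc64 cpuinfo layout, the `CpuNClkTck` values are hexadecimal. Assuming they give the CPU clock in Hz (an assumption, not something stated in this report), a small Python sketch converts them to MHz; the helper name `clk_tck_to_mhz` is made up for this example.

```python
def clk_tck_to_mhz(hex_value: str) -> float:
    """Convert a hexadecimal ClkTck value (assumed to be Hz) to MHz."""
    return int(hex_value, 16) / 1_000_000


# Value copied from the Cpu0ClkTck line above.
print(f"{clk_tck_to_mhz('000000001ad35932'):.1f} MHz")
```

Under that assumption, all four CPUs run at roughly 450 MHz.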

Network card info via dmesg...

[ 139.086540] PCI: Enabling device: (0000:02:00.0), cmd 2
[ 139.092436] eth1: Sun Cassini+ (64bit/33MHz PCI/Cu) Ethernet[15] 00:03:ba:85:5b:01
[ 139.266151] PCI: Enabling device: (0000:02:01.0), cmd 2
[ 139.272054] eth2: Sun Cassini+ (64bit/33MHz PCI/Cu) Ethernet[16] 00:03:ba:85:5b:02
[ 139.446977] PCI: Enabling device: (0000:03:02.0), cmd 2
[ 139.452954] eth3: Sun Cassini+ (64bit/33MHz PCI/Cu) Ethernet[17] 00:03:ba:85:5b:03
[ 139.627883] PCI: Enabling device: (0000:03:03.0), cmd 2
[ 139.633963] eth4: Sun Cassini+ (64bit/33MHz PCI/Cu) Ethernet[18] 00:03:ba:85:5b:04
[ 139.807491] PCI: Enabling device: (0000:05:00.0), cmd 2
[ 139.813690] eth5: Sun Cassini+ (64bit/33MHz PCI/Cu) Ethernet[19] 00:03:ba:85:20:e5
[ 139.990262] PCI: Enabling device: (0000:05:01.0), cmd 2
[ 139.996483] eth6: Sun Cassini+ (64bit/33MHz PCI/Cu) Ethernet[20] 00:03:ba:85:20:e6
[ 140.170039] PCI: Enabling device: (0000:06:02.0), cmd 2
[ 140.176496] eth7: Sun Cassini+ (64bit/33MHz PCI/Cu) Ethernet[21] 00:03:ba:85:20:e7
[ 140.350911] PCI: Enabling device: (0000:06:03.0), cmd 2
[ 140.357394] eth8: Sun Cassini+ (64bit/33MHz PCI/Cu) Ethernet[22] 00:03:ba:85:20:e8

Bonding info via dmesg...

[ 143.262268] Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)
[ 143.339438] bonding: MII link monitoring set to 100 ms

Bonding options set in '/etc/network/interfaces'...

post-up ifenslave bond0 eth1 eth5
pre-down ifenslave -d bond0 eth1 eth5

Bonding options loaded via '/etc/modules'...

bonding mode=active-backup miimon=100 max_bonds=4
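
To make the module options easier to compare against the dmesg output, the line can be split into its parts. This is a minimal Python sketch, assuming the `/etc/modules` line has the form `<module> key=value ...` as shown above; `parse_module_line` is a hypothetical helper, not an existing tool.

```python
def parse_module_line(line: str):
    """Split '<module> key=value ...' into (module name, options dict)."""
    module, *params = line.split()
    options = {}
    for param in params:
        key, _, value = param.partition("=")
        options[key] = value
    return module, options


# Line copied from /etc/modules as quoted above.
module, options = parse_module_line("bonding mode=active-backup miimon=100 max_bonds=4")
print(module, options)
```

Here `mode=active-backup` is the bonding driver's fault-tolerance mode mentioned earlier, and `miimon=100` agrees with the "MII link monitoring set to 100 ms" line in dmesg.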

tags: added: sparc
Jeremy Foshee (jeremyfoshee) wrote :

Hi Luke,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/releases/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 569610

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Luke J Militello (kilahurtz) wrote :

Thanks for the quick response, Jeremy. Unfortunately, I cannot test the latest development release image as this is a production system. However, I can test the latest upstream kernel for you. I might also add the current kernel I am running is 2.6.24-27.69. I poked around the kernel upstream site and was wondering which kernel you would like me to try.

I'm guessing 2.6.34-999.201004261005 (daily/current), as I don't see any recent build of mainline 2.6.24.6 (2.6.24-27.69).

Either way, it looks as if I may have to compile it myself, as there is no "sparc64-smp" deb package.

Also, I have no real way to replicate this issue as it just starts appearing all of a sudden. I only caught it when my system became sluggish after a while. I'll have to watch the logs to catch it in action. When I do, shall I run the command you posted above?

Thanks again.
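
Watching the logs for the message, as mentioned above, could be automated. The following is a rough Python sketch, not a tested monitoring tool: `matching_lines` and `follow` are hypothetical helpers, the log path in the usage comment is an assumption, and log rotation is not handled.

```python
import time

MARKER = "Kernel unaligned access at TPC"


def matching_lines(lines, marker=MARKER):
    """Yield only the lines that contain the marker text."""
    for line in lines:
        if marker in line:
            yield line.rstrip("\n")


def follow(path, poll_seconds=1.0):
    """Yield lines appended to a file, similar to `tail -f`; runs until interrupted."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, 2)  # start at the current end of the file
        while True:
            line = handle.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)


# Usage (assumed log location; adjust for the system in question):
#     for hit in matching_lines(follow("/var/log/kern.log")):
#         print(hit)
```

Replacing the `print` call with a mail or pager command would give notice when the messages start, before the system becomes sluggish.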

Luke J Militello (kilahurtz) wrote :

I did some searching and found these two entries in some kernel change logs. I have no idea if it helps any but I figured it couldn't hurt to ask...

 ~ ChangeLog-2.6.28 ~

commit 33cf71cee14743185305c61625c4544885055733
Author: Petr Tesarik <email address hidden>
Date: Fri Nov 21 16:42:58 2008 -0800

    tcp: Do not use TSO/GSO when there is urgent data

    This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=12014

    Since most (if not all) implementations of TSO and even the in-kernel
    software GSO do not update the urgent pointer when splitting a large
    segment, it is necessary to turn off TSO/GSO for all outgoing traffic
    with the URG pointer set.

    Looking at tcp_current_mss (and the preceding comment) I even think
    this was the original intention. However, this approach is insufficient,
    because TSO/GSO is turned off only for newly created frames, not for
    frames which were already pending at the arrival of a message with
    MSG_OOB set. These frames were created when TSO/GSO was enabled,
    so they may be large, and they will have the urgent pointer set
    in tcp_transmit_skb().

    With this patch, such large packets will be fragmented again before
    going to the transmit routine.

    As a side note, at least the following NICs are known to screw up
    the urgent pointer in the TCP header when doing TSO:

    Intel 82566MM (PCI ID 8086:1049)
    Intel 82566DC (PCI ID 8086:104b)
    Intel 82541GI (PCI ID 8086:1076)
    Broadcom NetXtreme II BCM5708 (PCI ID 14e4:164c)

    Signed-off-by: Petr Tesarik <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

 ~ ChangeLog-2.6.29 ~

commit ef711cf1d156428d4c2911b8c86c6ce90519dc45
Author: Eric Dumazet <email address hidden>
Date: Fri Nov 14 00:53:54 2008 -0800

    net: speedup dst_release()

    During tbench/oprofile sessions, I found that dst_release() was in third position.

    CPU: Core 2, speed 2999.68 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
    samples  %        symbol name
    483726   9.0185   __copy_user_zeroing_intel
    191466   3.5697   __copy_user_intel
    185475   3.4580   dst_release
    175114   3.2648   ip_queue_xmit
    153447   2.8608   tcp_sendmsg
    108775   2.0280   tcp_recvmsg
    102659   1.9140   sysenter_past_esp
    101450   1.8914   tcp_current_mss
    95067    1.7724   __copy_from_user_ll
    86531    1.6133   tcp_transmit_skb

    Of course, all CPUS fight on the dst_entry associated with 127.0.0.1

    Instead of first checking the refcount value, then decrement it,
    we use atomic_dec_return() to help CPU to make the right memory tran...


Luke J Militello (kilahurtz) wrote :
Changed in linux (Ubuntu):
status: Incomplete → In Progress
Luke J Militello (kilahurtz) wrote :

Also, I am marking this "In Progress" as I do not want the LP Janitor to wipe it out. I will need a bit of guidance on how to proceed, seeing as I cannot simply install a deb package for the upstream kernel because I am using the SPARC port. If someone could tell me which upstream version to try and where to find the proper "Ubuntu" source with the Ubuntu kernel config files, I have no problem compiling it and testing it out for you.

Luke J Militello (kilahurtz) wrote :

Oh, and this is also a headless server install with no GUI.

As for the Apport command, how shall I go about running that so I don't have to babysit the log files?

I read Apport is not enabled on stable releases by default. I've never used it before either.

Jeremy Foshee (jeremyfoshee) wrote :

Hi Luke,
    The In Progress status is for use when a bug is assigned to a specific Kernel Team member and they are working the issue.

I've set the bug to triaged. Thank you for the further data.

Thanks!

~JFo

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: In Progress → Triaged
Luke J Militello (kilahurtz) wrote :

Jeremy, is there anything you would like me to do in the meantime, such as testing a different kernel?

Jeremy Foshee (jeremyfoshee) wrote :

Luke,
   Nothing that I can think of currently. The Kernel Team may have some requests of you once they've had the opportunity to investigate this bug further.

Thanks!

~JFo

Luke J Militello (kilahurtz) wrote :

The system had been cruising along for 10 days since my last reboot when the messages showed up again. This time they appeared only once, and the system is still stable.

[100375.580397] Kernel unaligned access at TPC[6a1504] tcp_transmit_skb+0x1ac/0x8c0
[100375.580458] Kernel unaligned access at TPC[6a150c] tcp_transmit_skb+0x1b4/0x8c0
[100375.580483] Kernel unaligned access at TPC[68e724] ip_queue_xmit+0x14c/0x5a0
[100375.580514] Kernel unaligned access at TPC[68e734] ip_queue_xmit+0x15c/0x5a0
[100375.580536] Kernel unaligned access at TPC[58e904] ip_fast_csum+0xc/0x80

penalvch (penalvch) wrote :

Luke J Militello, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily kernel folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.12-rc2

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. Also, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

Also, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

Changed in linux (Ubuntu):
status: Triaged → Incomplete