panic in task_rq_lock (race with concurrent semtimedop() timeouts and IPC_RMID)

Reported by Philipp Morger on 2012-03-01
This bug affects 2 people
Affects          Status        Importance  Assigned to           Milestone
linux (Ubuntu)   Fix Released  Medium      Unassigned
Natty            Fix Released  Medium      Herton R. Krzesinski

Bug Description

SRU justification
=================

Impact
------
Kernel crash caused by the race described in the upstream bug report: https://bugzilla.kernel.org/show_bug.cgi?id=27142
In practice, this is likely to happen on a heavily loaded web server.

Fix
---
Upstream commit d694ad62bf539dbb20a0899ac2a954555f9e4a83

Testcase
--------
https://bugzilla.kernel.org/attachment.cgi?id=66162
It's attached to this bug as well.
- Build with gcc -o timedrm timedrm.cpp -lpthread
- Run with ./timedrm 250. You may have to run it more than once to trigger the oops, but the crash is very easy to reproduce.
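
For readers without the attachment, the access pattern named in the bug title can be sketched roughly as follows. This is NOT the attached timedrm.cpp, only a hypothetical illustration written against the standard semget/semtimedop/semctl calls through Python's ctypes. The helper names (waiter, race_once), the thread count and the timeout are assumptions, and the structure layouts assume 64-bit Linux. Several threads block in semtimedop() with a short timeout while the main thread removes the semaphore set with IPC_RMID at about the moment the timeouts expire. On an affected kernel, running something like this may crash the machine.

```python
import ctypes
import errno
import threading
import time

# Assumption: 64-bit Linux with a libc that exports semget/semtimedop/semctl.
libc = ctypes.CDLL(None, use_errno=True)

IPC_PRIVATE, IPC_CREAT, IPC_RMID = 0, 0o1000, 0
# errno values a waiter may legitimately see: timed out, set removed while
# sleeping, or set already gone before the call was made.
EXPECTED = {errno.EAGAIN, errno.EIDRM, errno.EINVAL}


class SemBuf(ctypes.Structure):
    _fields_ = [("sem_num", ctypes.c_ushort),
                ("sem_op", ctypes.c_short),
                ("sem_flg", ctypes.c_short)]


class TimeSpec(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long),
                ("tv_nsec", ctypes.c_long)]


libc.semtimedop.argtypes = [ctypes.c_int, ctypes.POINTER(SemBuf),
                            ctypes.c_size_t, ctypes.POINTER(TimeSpec)]
libc.semtimedop.restype = ctypes.c_int


def waiter(semid, timeout_ns, results, index):
    """Block on semaphore 0 until the timeout expires or the set is removed."""
    op = SemBuf(0, -1, 0)          # decrement a semaphore whose value is 0
    ts = TimeSpec(0, timeout_ns)
    rc = libc.semtimedop(semid, ctypes.byref(op), 1, ctypes.byref(ts))
    results[index] = ctypes.get_errno() if rc == -1 else 0


def race_once(nthreads=8, timeout_ns=1_000_000):
    """Run one round: sleepers with a timeout versus a concurrent IPC_RMID."""
    semid = libc.semget(IPC_PRIVATE, 1, IPC_CREAT | 0o600)
    if semid == -1:
        raise OSError(ctypes.get_errno(), "semget failed")
    results = [None] * nthreads
    threads = [threading.Thread(target=waiter,
                                args=(semid, timeout_ns, results, i))
               for i in range(nthreads)]
    for t in threads:
        t.start()
    # Aim the removal at roughly the moment the timeouts fire.
    time.sleep(timeout_ns / 1e9)
    libc.semctl(semid, 0, IPC_RMID)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for _ in range(100):
        outcome = race_once()
        assert set(outcome) <= EXPECTED, outcome
    print("completed 100 rounds without unexpected errors")
```

On a fixed kernel every waiter should return with EAGAIN, EIDRM or EINVAL. The real test case remains the attachment referenced above.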

---------------------------------------------------------------------------------------

When logged in I saw:

 unity kernel: [669168.472431] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
 unity kernel: [669168.475971] Stack:
 unity kernel: [669168.476634] Call Trace:
 unity kernel: [669168.477094] Code: 00 48 c7 c3 c0 3c 01 00 49 89 fc 49 89 f5 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 55 00 49 8b 44 24 08 49 89 de <8b> 40 18 4c 03 34 c5 00 4b ac 81 4c 89 f7 e8 03 36 58 00 49 8b
 unity kernel: [669168.479444] CR2: 00000000801f0f1d

In the log:

Mar 1 06:25:04 unity apache2[14216]: [Thu Mar 01 06:25:04 2012] [notice] SIGUSR1 received. Doing graceful restart
Mar 1 06:25:04 unity kernel: [669168.471999] BUG: unable to handle kernel paging request at 00000000801f0f1d
Mar 1 06:25:04 unity kernel: [669168.472131] IP: [<ffffffff81051aba>] task_rq_lock+0x4a/0xa0
Mar 1 06:25:04 unity kernel: [669168.472229] PGD 0
Mar 1 06:25:04 unity kernel: [669168.472312] Oops: 0000 [#1] SMP
Mar 1 06:25:04 unity kernel: [669168.472431] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Mar 1 06:25:04 unity kernel: [669168.472508] CPU 7
Mar 1 06:25:04 unity kernel: [669168.472545] Modules linked in: ipt_MASQUERADE iptable_nat kvm_intel kvm ip6t_LOG xt_hl nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT ipt_LOG xt_limit xt_tcpudp ipt_addrtype xt_state ip6table_filter ip6_tables radeon nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 ipmi_devintf nf_defrag_ipv4 ipmi_watchdog nf_conntrack_ftp psmouse nf_conntrack ttm drm_kms_helper ipmi_si drm iptable_filter serio_raw joydev i5400_edac edac_core ipmi_poweroff ip_tables ioatdma ipmi_msghandler i5k_amb lp i2c_algo_bit x_tables bridge stp parport shpchp usbhid hid usb_storage uas igb arcmsr dca
Mar 1 06:25:04 unity kernel: [669168.474703]
Mar 1 06:25:04 unity kernel: [669168.474756] Pid: 1832, comm: apache2 Not tainted 2.6.38-10-server #46~lucid1-Ubuntu Supermicro X7DWU/X7DWU
Mar 1 06:25:04 unity kernel: [669168.475004] RIP: 0010:[<ffffffff81051aba>] [<ffffffff81051aba>] task_rq_lock+0x4a/0xa0
Mar 1 06:25:04 unity kernel: [669168.475114] RSP: 0018:ffff88040c10fdc8 EFLAGS: 00010082
Mar 1 06:25:04 unity kernel: [669168.475171] RAX: 00000000801f0f05 RBX: 0000000000013cc0 RCX: 0000000000000002
Mar 1 06:25:04 unity kernel: [669168.475245] RDX: 0000000000000282 RSI: ffff88040c10fe20 RDI: 00007f558925f8f0
Mar 1 06:25:04 unity kernel: [669168.475320] RBP: ffff88040c10fde8 R08: 0000000000989680 R09: 000000000000028b
Mar 1 06:25:04 unity kernel: [669168.475393] R10: 0000000000007bea R11: 0000000000000001 R12: 00007f558925f8f0
Mar 1 06:25:04 unity kernel: [669168.475467] R13: ffff88040c10fe20 R14: 0000000000013cc0 R15: 0000000000000007
Mar 1 06:25:04 unity kernel: [669168.475542] FS: 00007f5589d03740(0000) GS:ffff8800cfdc0000(0000) knlGS:0000000000000000
Mar 1 06:25:04 unity kernel: [669168.475617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 1 06:25:04 unity kernel: [669168.475674] CR2: 00000000801f0f1d CR3: 000000040eb35000 CR4: 00000000000026e0
Mar 1 06:25:04 unity kernel: [669168.475748] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 1 06:25:04 unity kernel: [669168.475821] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 1 06:25:04 unity kernel: [669168.475895] Process apache2 (pid: 1832, threadinfo ffff88040c10e000, task ffff88040c1f2dc0)
Mar 1 06:25:04 unity kernel: [669168.475971] Stack:
Mar 1 06:25:04 unity kernel: [669168.476022] 00007f558925f8f0 ffff88040f155ec8 000000000000000f 0000000000000000
Mar 1 06:25:04 unity kernel: [669168.476225] ffff88040c10fe58 ffffffff8105f6dc ffff88040c10fe28 0000000700000286
Mar 1 06:25:04 unity kernel: [669168.476429] 0000000000000003 0000000181a4d7f0 ffff8804015d9850 0000000000000282
Mar 1 06:25:04 unity kernel: [669168.476634] Call Trace:
Mar 1 06:25:04 unity kernel: [669168.476689] [<ffffffff8105f6dc>] try_to_wake_up+0x3c/0x410
Mar 1 06:25:04 unity kernel: [669168.476747] [<ffffffff8105fb05>] wake_up_process+0x15/0x20
Mar 1 06:25:04 unity kernel: [669168.476806] [<ffffffff8126a590>] freeary+0x1e0/0x220
Mar 1 06:25:04 unity kernel: [669168.476863] [<ffffffff8126b610>] T.607+0xb0/0x100
Mar 1 06:25:04 unity kernel: [669168.476921] [<ffffffff81164055>] ? vfs_write+0x125/0x190
Mar 1 06:25:04 unity kernel: [669168.476979] [<ffffffff8126b6c9>] sys_semctl+0x69/0xa0
Mar 1 06:25:04 unity kernel: [669168.477036] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
Mar 1 06:25:04 unity kernel: [669168.477094] Code: 00 48 c7 c3 c0 3c 01 00 49 89 fc 49 89 f5 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 55 00 49 8b 44 24 08 49 89 de <8b> 40 18 4c 03 34 c5 00 4b ac 81 4c 89 f7 e8 03 36 58 00 49 8b
Mar 1 06:25:04 unity kernel: [669168.479300] RIP [<ffffffff81051aba>] task_rq_lock+0x4a/0xa0
Mar 1 06:25:04 unity kernel: [669168.479391] RSP <ffff88040c10fdc8>
Mar 1 06:25:04 unity kernel: [669168.479444] CR2: 00000000801f0f1d
Mar 1 06:25:04 unity kernel: [669168.479497] ---[ end trace b2b87cfb63915f6c ]---
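
As a reading aid for the trace above: the bytes at the faulting instruction, <8b> 40 18, appear to decode on x86-64 to mov 0x18(%rax),%eax, a load from RAX + 0x18. The small sketch below (the helper names registers and faulting_bytes are made up for illustration, and the excerpt is copied from the log lines above) checks that this address matches CR2. It suggests the crash is a read through a bogus pointer held in RAX, which would be consistent with a stale task pointer being passed to wake_up_process, though that interpretation is an assumption rather than something the log states.

```python
import re

# Excerpt copied from the oops above (register dump and tail of the Code line).
OOPS = """\
RAX: 00000000801f0f05 RBX: 0000000000013cc0 RCX: 0000000000000002
CR2: 00000000801f0f1d CR3: 000000040eb35000 CR4: 00000000000026e0
Code: 49 8b 44 24 08 49 89 de <8b> 40 18 4c 03 34 c5 00 4b ac 81
"""


def registers(text):
    """Map register names such as RAX or CR2 to their integer values."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the code bytes starting at the one marked with <..>."""
    code = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, b in enumerate(code) if b.startswith("<"))
    return [int(b.strip("<>"), 16) for b in code[start:]]


regs = registers(OOPS)
opcode, modrm, disp8 = faulting_bytes(OOPS)[:3]

# 0x8b is "mov r32, r/m32"; ModRM 0x40 means mod=01 (8-bit displacement),
# reg=000 (EAX), rm=000 (base register RAX).
assert opcode == 0x8B and modrm >> 6 == 1 and modrm & 7 == 0

fault_address = regs["RAX"] + disp8
print(hex(fault_address), fault_address == regs["CR2"])
```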

This happens QUITE OFTEN. The only solution is to sync the filesystems and power cycle. I can't reboot normally; I have to pull the plug (well, push the reset button, or do the equivalent via the Magic SysRq key).

Furthermore, Apache no longer answers in this state and cannot be stopped; it becomes a zombie.
The system is otherwise still accessible, but Apache cannot be brought back to life.

I can't say whether this is a memory issue, but note that this is a server with ECC FB-DIMM memory. I will have to run a memory check at some point, but nothing of the kind has shown up in the logs of the daughter board.

Some System info:

Distributor ID: Ubuntu
Description: Ubuntu 10.04.4 LTS
Release: 10.04
Codename: lucid

     *-memory
          description: System Memory
          physical id: 16
          slot: System board or motherboard
          size: 16GiB
        *-bank:0
             description: DIMM DDR2 FB-DIMM Synchronous 800 MHz (1.2 ns)
             vendor: 9801
             physical id: 0
             serial: 182194CD
             slot: DIMM1A
             size: 4GiB
             width: 64 bits
             clock: 800MHz (1.2ns)
        *-bank:1
             description: DIMM DDR2 FB-DIMM Synchronous 800 MHz (1.2 ns) [empty]
             physical id: 1
             slot: DIMM1B
             clock: 800MHz (1.2ns)
        *-bank:2
             description: DIMM DDR2 FB-DIMM Synchronous 800 MHz (1.2 ns)
             vendor: 9801
             physical id: 2
             serial: 1921B1FA
             slot: DIMM2A
             size: 4GiB
             width: 64 bits
             clock: 800MHz (1.2ns)
        *-bank:3
             description: DIMM DDR2 FB-DIMM Synchronous 800 MHz (1.2 ns) [empty]
             physical id: 3
             slot: DIMM2B
             clock: 800MHz (1.2ns)
        *-bank:4
             description: DIMM DDR2 FB-DIMM Synchronous 800 MHz (1.2 ns)
             vendor: 9801
             physical id: 4
             serial: 16213E76
             slot: DIMM3A
             size: 4GiB
             width: 64 bits
             clock: 800MHz (1.2ns)
        *-bank:5
             description: DIMM DDR2 FB-DIMM Synchronous 800 MHz (1.2 ns) [empty]
             physical id: 5
             slot: DIMM3B
             clock: 800MHz (1.2ns)
        *-bank:6
             description: DIMM DDR2 FB-DIMM Synchronous 800 MHz (1.2 ns)
             vendor: 9801
             physical id: 6
             serial: 1721CBB7
             slot: DIMM4A
             size: 4GiB
             width: 64 bits
             clock: 800MHz (1.2ns)
        *-bank:7
             description: DIMM DDR2 FB-DIMM Synchronous 800 MHz (1.2 ns) [empty]
             physical id: 7
             slot: DIMM4B
             clock: 800MHz (1.2ns)

     *-cpu:0
          description: CPU
          product: Intel(R) Xeon(R) CPU E5472 @ 3.00GHz
          vendor: Intel Corp.
          physical id: 4
          bus info: cpu@0
          version: Intel(R) Xeon(R) CPU E5472 @ 3.00GHz
          slot: LGA771/CPU1
          size: 3GHz
          width: 64 bits
          clock: 1600MHz
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx x86-64 constant_tsc arch_perfmon pebs bts
 rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
        *-cache:0
             description: L1 cache
             physical id: 6
             slot: L1 Cache
             size: 16KiB
             capacity: 16KiB
             capabilities: asynchronous internal write-back
        *-cache:1
             description: L2 cache
             physical id: 7
             slot: L2 Cache
             size: 12MiB
             capabilities: burst internal write-back
     *-cpu:1
          description: CPU
          product: Intel(R) Xeon(R) CPU E5472 @ 3.00GHz
          vendor: Intel Corp.
          physical id: 5
          bus info: cpu@1
          version: Intel(R) Xeon(R) CPU E5472 @ 3.00GHz
          slot: LGA771/CPU2
          size: 3GHz
          width: 64 bits
          clock: 1600MHz
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx x86-64 constant_tsc arch_perfmon pebs bts
 rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
        *-cache:0
             description: L1 cache
             physical id: 8
             slot: L1 Cache
             size: 16KiB
             capacity: 16KiB
             capabilities: asynchronous internal write-back
        *-cache:1
             description: L2 cache
             physical id: 9
             slot: L2 Cache
             size: 12MiB
             capabilities: burst internal write-back

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 943815

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: natty

AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 10.04
Frequency: Once every few weeks.
MachineType: Supermicro X7DWU
Package: linux (not installed)
PciMultimedia:

ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.38-10-server root=UUID=cf7be043-0ec2-47de-a781-97c50fa36843 ro nomodeset quiet splash
ProcEnviron:
 LC_CTYPE=en_US.UTF-8
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.38-10.46~lucid1-server 2.6.38.7
Regression: Yes
Reproducible: No
Tags: lucid regression-update needs-upstream-testing
Uname: Linux 2.6.38-10-server x86_64
UserGroups:

dmi.bios.date: 02/04/2008
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: 6.00
dmi.board.name: X7DWU
dmi.board.vendor: Supermicro
dmi.board.version: PCB Version
dmi.chassis.type: 1
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd02/04/2008:svnSupermicro:pnX7DWU:pvr0123456789:rvnSupermicro:rnX7DWU:rvrPCBVersion:cvnSupermicro:ct1:cvr0123456789:
dmi.product.name: X7DWU
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

tags: added: apport-collected

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.3 kernel[1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed by the mainline kernel, please add the following tag 'kernel-fixed-upstream-KERNEL-VERSION'. For example, if kernel version 3.3-rc5 fixed the issue, the tag would be: 'kernel-fixed-upstream-v3.3-rc5'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc5-precise/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: lucid needs-upstream-testing
removed: natty
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Phoenix (phoenix-dominion) wrote :

I installed the following kernel

Linux unity 3.3.0-030300rc5-generic #201202251535 SMP Sat Feb 25 20:36:29 UTC 2012 x86_64 GNU/Linux

Phoenix (phoenix-dominion) wrote :

I was able to capture why the server does not gracefully reboot.

Herton R. Krzesinski (herton) wrote :

This is almost certainly the same as https://bugzilla.kernel.org/show_bug.cgi?id=27142
I reproduced the same issue here with the testcase there.

summary: - Slow System Crash due to Kernel Problem
+ panic in task_rq_lock (race with concurrent semtimedop() timeouts and
+ IPC_RMID)
Changed in linux (Ubuntu Natty):
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → Herton R. Krzesinski (herton)
Changed in linux (Ubuntu):
status: Incomplete → Fix Released
Herton R. Krzesinski (herton) wrote :
description: updated
description: updated
Tim Gardner (timg-tpi) on 2012-03-02
Changed in linux (Ubuntu Natty):
status: In Progress → Fix Committed
Herton R. Krzesinski (herton) wrote :

This bug is awaiting verification that the kernel for Natty in -proposed (2.6.38-13.57) solves the problem. A linux-lts-backport-natty package for Lucid, based on the same version, will also be available in -proposed soon. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-natty' to 'verification-done-natty'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-natty
Herton R. Krzesinski (herton) wrote :

No problems running the test case with the -proposed kernel.

tags: added: verification-done-natty
tags: removed: verification-needed-natty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.38-13.57

---------------
linux (2.6.38-13.57) natty-proposed; urgency=low

  [Herton R. Krzesinski]

  * Release Tracking Bug
    - LP: #947254

  [ Upstream Kernel Changes ]

  * KVM: Device assignment permission checks
    - LP: #897812
    - CVE-2011-4347
  * HID: hid-apple: add device ID of another wireless aluminium
    - LP: #942184
  * eCryptfs: Extend array bounds for all filename chars
    - LP: #944990
  * eCryptfs: Remove extra d_delete in ecryptfs_rmdir
    - LP: #723518
  * eCryptfs: Clear i_nlink in rmdir
    - LP: #723518
  * ipc/sem.c: fix race with concurrent semtimedop() timeouts and IPC_RMID
    - LP: #943815
  * eCryptfs: Sanitize write counts of /dev/ecryptfs
    - LP: #947075
  * eCryptfs: Infinite loop due to overflow in ecryptfs_write()
    - LP: #947143
 -- Herton Ronaldo Krzesinski <email address hidden> Mon, 05 Mar 2012 13:28:11 -0300

Changed in linux (Ubuntu Natty):
status: Fix Committed → Fix Released