[SRU] openafs-modules segfault after stop

Bug #333197 reported by amaxtt on 2009-02-23
This bug affects 1 person
Affects: openafs (Ubuntu); Importance: Undecided; Assigned to: Unassigned
Affects: openafs (Ubuntu Intrepid); Importance: Undecided; Assigned to: Unassigned

Bug Description

Impact: This bug causes kernel oopses and hangs at shutdown.

Development: The two deltas being incorporated have been committed to the upstream AFS tree, and have also been included in openafs 1.4.8.dfsg1-3, which was just synced into Jaunty.

Patch: Attached at http://launchpadlibrarian.net/25828644/openafs_1.4.7.dfsg1-6%2Bubuntu0.2.debdiff - please see the comments for an explanation of the version number.

Steps to reproduce: Assuming that AFS isn't in use when you reboot, rebooting with a 1.4.7 or 1.4.8 client that doesn't include these patches should consistently trigger the bugs behind them.

Regression potential: For both of these deltas, the changes are limited to the shutdown code, i.e. the functionality that's affected by the bugs, so I find it unlikely that they'll make anything worse, and empirically they seem to fix the oopses and hangs.

This happened after running openafs-client stop on a server running Ubuntu Hardy.

[ 99.016655] Starting AFS cache scan...found 45 non-empty cache files (2%).
[ 99.346761] NET: Registered protocol family 17
[ 101.010753] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 101.071133] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 101.403948] eth1: no IPv6 routers present
[ 102.199452] vlan11: no IPv6 routers present
[ 106.468767] tap0: no IPv6 routers present
[73434.848255] EXT3-fs: cannot change data mode on remount
[75198.320131] WARM shutting down of: CB... afs... BkG... CTrunc... AFSDB... RxEvent... UnmaskRxkSignals... RxListener...
[75198.833031] WARNING: not all blocks freed: large 1 small 4
[75198.833041] ALL allocated tables
[75219.895067] kjournald starting. Commit interval 120 seconds
[75219.915815] EXT3 FS on dm-3, internal journal
[75219.915823] EXT3-fs: mounted filesystem with writeback data mode.
[75253.358769] Found system call table at 0xc033a680 (pattern scan)
[75253.358773] Address 0xc033a680 is not writable.
[75253.358774] System call hooks will not be installed; proceeding anyway
[75253.398880] Starting AFS cache scan...found 347 non-empty cache files (22%).
[76028.437373] AFS isn't unmounted yet! Call aborted
[76034.981943] AFS isn't unmounted yet! Call aborted
[76056.414781] AFS isn't unmounted yet! Call aborted
[76100.504630] COLD shutting down of: CB... afs... BkG... CTrunc... AFSDB... RxEvent... UnmaskRxkSignals... RxListener...
[76101.000366] osi_linux_free: failed to remove chunk from hashtable
    (repeated about 300 times)
[76101.000952] BUG: unable to handle kernel paging request at virtual address f8f5a020
[76101.001039] printing eip: f8e09364 *pdpt = 0000000000004001 *pde = 0000000035836067 *pte = 0000000000000000
[76101.001140] Oops: 0000 [#1] SMP
[76101.001186] Modules linked in: openafs(P) ipt_REDIRECT ipt_REJECT xt_limit xt_state xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_mangle iptable_filter ip_tables x_tables af_packet nfsd auth_rpcgss exportfs tun container battery ac video output sbs sbshc dock nfs lockd nfs_acl sunrpc 8021q tcp_bic parport_pc lp parport loop ipv6 usbhid hid iTCO_wdt iTCO_vendor_support button shpchp pci_hotplug evdev pcspkr ext3 jbd mbcache ata_generic sg sd_mod ata_piix pata_acpi libata ehci_hcd uhci_hcd usbcore tg3 mptsas mptscsih mptbase scsi_transport_sas scsi_mod dm_mirror dm_snapshot dm_mod thermal processor fan fbcon tileblit font bitblit softcursor fuse
[76101.001836]
[76101.001872] Pid: 17524, comm: umount Tainted: P (2.6.24-23-server #1)
[76101.001926] EIP: 0060:[<f8e09364>] EFLAGS: 00010282 CPU: 0
[76101.001997] EIP is at shutdown_vcache+0xe4/0x140 [openafs]
[76101.002045] EAX: f8f5a01c EBX: f8f5a01c ECX: f8a0a0b0 EDX: f8a0a218
[76101.002096] ESI: 00000400 EDI: f8e6f080 EBP: df9fce00 ESP: d686bee8
[76101.002147] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[76101.002195] Process umount (pid: 17524, ti=d686a000 task=c249f140 task.ti=d686a000)
[76101.002250] Stack: df9fce00 f8e67380 f8dff329 d69db400 f8e43cc4 f8e3e706 df9fce00 f8e67380
[76101.002355] d69db400 df9fce00 c019c3e5 c01b088b 00000000 00000017 f8e67360 c019c4a9
[76101.002461] df9fce00 c019c55d 00000000 d686bf40 c01b0d36 00000000 ecc11908 d69db400
[76101.002567] Call Trace:
[76101.002637] [<f8dff329>] shutdown_cache+0x39/0xd0 [openafs]
[76101.002702] [<f8e43cc4>] afs_shutdown+0x204/0x2a0 [openafs]
[76101.002769] [<f8e3e706>] afs_put_super+0x66/0xe0 [openafs]
[76101.002836] [<c019c3e5>] generic_shutdown_super+0x55/0xf0
[76101.002888] [<c01b088b>] mntput_no_expire+0x3b/0x70
[76101.002938] [<c019c4a9>] kill_anon_super+0x9/0x40
[76101.002987] [<c019c55d>] deactivate_super+0x5d/0x80
[76101.003036] [<c01b0d36>] sys_umount+0x46/0x250
[76101.003086] [<c019e08f>] sys_stat64+0xf/0x30
[76101.003133] [<c0185fd9>] remove_vma+0x39/0x50
[76101.003181] [<c0186b70>] do_munmap+0x180/0x1f0
[76101.003232] [<c01b0f57>] sys_oldumount+0x17/0x20
[76101.003280] [<c010838a>] sysenter_past_esp+0x6b/0xa1
[76101.003332] [<c0330000>] rt_mutex_slowunlock+0x60/0x1c0
[76101.003384] =======================
[76101.003426] Code: fe ff 8b 9b 54 01 00 00 85 db 75 ab c7 04 b5 a0 ff e6 f8 00 00 00 00 83 c6 01 81 fe 00 04 00 00 75 88 a1 00 30 e7 f8 85 c0 74 13 <8b> 58 04 ba d0 20 00 00 e8 5f 85 ff ff 85 db 89 d8 75 ed b8 bc
[76101.003770] EIP: [<f8e09364>] shutdown_vcache+0xe4/0x140 [openafs] SS:ESP 0068:d686bee8
[76101.004203] ---[ end trace 965514c177c6dca1 ]---
[76101.004292] WARNING: at /build/buildd/linux-2.6.24/kernel/exit.c:917 do_exit()
[76101.004416] Pid: 17524, comm: umount Tainted: P D 2.6.24-23-server #1
[76101.004514] [<c013552b>] do_exit+0x6eb/0x860
[76101.004652] [<c013242b>] printk+0x1b/0x20
[76101.004788] [<c01099f7>] die+0x277/0x280
[76101.004924] [<c03327ae>] do_page_fault+0x4fe/0x900
[76101.005066] [<c03322b0>] do_page_fault+0x0/0x900
[76101.005202] [<c0330aaa>] error_code+0x72/0x78
[76101.005339] [<f8e09364>] shutdown_vcache+0xe4/0x140 [openafs]
[76101.005496] [<f8dff329>] shutdown_cache+0x39/0xd0 [openafs]
[76101.005649] [<f8e43cc4>] afs_shutdown+0x204/0x2a0 [openafs]
[76101.005805] [<f8e3e706>] afs_put_super+0x66/0xe0 [openafs]
[76101.005963] [<c019c3e5>] generic_shutdown_super+0x55/0xf0
[76101.006102] [<c01b088b>] mntput_no_expire+0x3b/0x70
[76101.006240] [<c019c4a9>] kill_anon_super+0x9/0x40
[76101.006376] [<c019c55d>] deactivate_super+0x5d/0x80
[76101.006514] [<c01b0d36>] sys_umount+0x46/0x250
[76101.006652] [<c019e08f>] sys_stat64+0xf/0x30
[76101.006788] [<c0185fd9>] remove_vma+0x39/0x50
[76101.006924] [<c0186b70>] do_munmap+0x180/0x1f0
[76101.007062] [<c01b0f57>] sys_oldumount+0x17/0x20
[76101.007198] [<c010838a>] sysenter_past_esp+0x6b/0xa1
[76101.007337] [<c0330000>] rt_mutex_slowunlock+0x60/0x1c0
[76101.007477] =======================

Launchpad Bug Tracker <email address hidden> writes:

> was after openafs-client stop on server
> ubuntu hardy

I assume you grabbed the 1.4.8 source package from a later version and
rebuilt it on hardy?

Could you attach to the bug your openafs.ko kernel module? That should
help us track down which data structure is hosed and causing the oops.

--
Russ Allbery (<email address hidden>) <http://www.eyrie.org/~eagle/>

We've seen what looks to be the same oops on 1.4.7 in Intrepid, using
the stock package and kernels built with no additional packages. The
last bit of our stack trace is:

(I don't have a full stack trace because the oops tends to happen at
reboot and doesn't make it to disk or anywhere else we can access it.
This was copied down manually from the console.)

=====================================================================
BUG: unable to handle kernel paging request at f89e003c
IP: [<f90735d8>] :openafs:shutdown_vcache+0xf8/0x160
...
... EFLAGS: 00010282
EAX: f89e0038 EBX: f89e0038 ECX: f8f400b0 EDX: 00000246
ESI: 00000400 EDI: f90da280 EBP: f57ebf14 ESP: f57ebf0c
...
Stack: f4d84800 f9d09c0 f57ebf20 f906815c f90d09a0 f57ebf28
...
Call Trace:
  [<f906815c>] ? shutdown_cache+0x3c/0xd0 [openafs]
  [<f90b0304>] ? afs_shutdown+0x204/0x310 [openafs]
  [<f90a9d3b>] ? afs_put_super+0x5b/0xf0 [openafs]
[followed by some nonmodule unmounting functions]
...
Code: ... c0 74 17 8d 74 26 00 <8b> 58 04 ba d0 20 00 00 ...
=====================================================================

We also see a WARNING at kernel/exit.c:1001 do_exit+0x353/0x360() from
something called by shutdown_vcache -> error_code -> do_page_fault ->
afs_lhash_address -> -> oops_end -> -> etc.
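
The faulting instruction in both reports can be cross-checked against the register dumps. The following is a minimal sketch, assuming 32-bit x86 instruction encoding: the bytes marked `<8b> 58 04` in both Code: lines decode to a load from offset 4 of the pointer in EAX into EBX, and in both reports EAX + 4 equals the address in the BUG line. The helper name `decode_disp8_load` is made up for this illustration and handles only this one addressing form; the register values and fault addresses are copied from the two traces in this bug.

```python
# x86 32-bit general-purpose registers in ModRM encoding order.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_disp8_load(insn: bytes):
    """Decode 'mov r32, [base + disp8]' (opcode 0x8b, ModRM mod=01).

    Hypothetical helper for this note only; it rejects every other form.
    Returns (destination register, base register, signed displacement).
    """
    opcode, modrm, disp = insn
    if opcode != 0x8B:
        raise ValueError("not a MOV r32, r/m32 instruction")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only the plain base + disp8 form is handled")
    if disp >= 0x80:  # disp8 is a signed byte
        disp -= 0x100
    return REGS[reg], REGS[rm], disp


# The three bytes flagged with <8b> in both Code: lines.
FAULTING_INSN = bytes([0x8B, 0x58, 0x04])

# EAX and the faulting address as printed in the two reports.
REPORTS = {
    "hardy, 2.6.24-23-server": {"eax": 0xF8F5A01C, "fault": 0xF8F5A020},
    "intrepid, 1.4.7": {"eax": 0xF89E0038, "fault": 0xF89E003C},
}

dest, base, disp = decode_disp8_load(FAULTING_INSN)
print(f"mov {disp:#x}(%{base}),%{dest}")

for name, regs in REPORTS.items():
    computed = (regs[base] + disp) & 0xFFFFFFFF
    match = "matches" if computed == regs["fault"] else "differs from"
    print(f"{name}: {base}+{disp} = {computed:#010x} {match} the BUG address")
```

In both cases the fault is a read of a field 4 bytes into whatever EAX points at inside shutdown_vcache, so the structure behind EAX is the one to look for when inspecting openafs.ko.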

William Cattey (wdc-mit) wrote :

This just happened to me in a VM. Alas, cut/paste wasn't available. But I WAS able to
scroll back and get the ENTIRE back trace as a series of images.

Start with afs-oops-9.png and work your way forward.

William Cattey (wdc-mit) wrote :

Phew. Sorry for all the noise.

Evan Broder (broder) wrote :

One of the OpenAFS developers (Chaskiel) gave us this patch to try. We'll be testing it over the next few days on some of our machines to see if it fixes the problem.

I should warn anyone interested in trying the patch that it has absolutely not been tested yet, or even built.

We'll report back once we've had some time to test it.

Evan Broder (broder) on 2009-03-25
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openafs - 1.4.8.dfsg1-3

---------------
openafs (1.4.8.dfsg1-3) unstable; urgency=low

  * Apply upstream CVS deltas:
    - STABLE14-cbr-free-what-you-alloc-20090325: dequeue items in the same
      way they were allocated.
    - STABLE14-shutdown-vcache-avoid-null-deref-20090324: avoid oops on
      shutdown. (LP: #333197)
    - STABLE14-uphys-invalidate-returns-void-20081130: fix apparent Ubik
      synchronization errors due to incorrect use of a void return value.
  * Update package sections for the new archive organization.

 -- Evan Broder <email address hidden> Mon, 30 Mar 2009 11:14:46 +0100

Changed in openafs:
status: New → Fix Released
Evan Broder (broder) wrote :

Here's a patch for an SRU that includes STABLE14-cbr-free-what-you-alloc-20090325 and STABLE14-shutdown-vcache-avoid-null-deref-20090324. The bugs they fix seem to cause similar symptoms at shutdown, and it seems that both fixes are sometimes needed.

The version number in the SRU (1.4.7.dfsg1-6+ubuntu0.1) is intentionally off from the standard SRU version numbering scheme for the sake of the OpenAFS kernel modules. Without the plus, kernel modules built from the SRU would fail to have a higher version number than the current version:

priscus:~ evan$ dpkg --compare-versions '1.4.7.dfsg1-6+2.6.27-11.27' lt '1.4.7.dfsg1-6ubuntu0.1+2.6.27-11.27' && echo "Yes" || echo "No"
No
priscus:~ evan$ dpkg --compare-versions '1.4.7.dfsg1-6+2.6.27-11.27' lt '1.4.7.dfsg1-6+ubuntu0.1+2.6.27-11.27' && echo "Yes" || echo "No"
Yes
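
The two results above follow from how Debian version strings are ordered: within a non-digit run, letters sort before other characters such as "+", and the end of a run sorts before any letter. Below is a minimal sketch of that ordering written from the rules in the Debian Policy Manual. It is not dpkg's own code, the function names are made up for this illustration, and it assumes well-formed ASCII version strings.

```python
DIGITS = "0123456789"
LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def _order(c):
    """Sort weight of one character inside a non-digit run."""
    if c == "~":
        return -1            # tilde sorts before everything, even the end
    if c == "" or c in DIGITS:
        return 0             # end of the non-digit run
    if c in LETTERS:
        return ord(c)        # letters sort before all other characters
    return ord(c) + 256      # '+', '.', '-' and so on


def compare_fragment(a, b):
    """Compare two upstream or revision strings; returns -1, 0 or 1."""
    i = j = 0
    while i < len(a) or j < len(b):
        # Non-digit run: compare character by character.
        while (i < len(a) and a[i] not in DIGITS) or \
              (j < len(b) and b[j] not in DIGITS):
            ac = _order(a[i] if i < len(a) else "")
            bc = _order(b[j] if j < len(b) else "")
            if ac != bc:
                return -1 if ac < bc else 1
            i += 1
            j += 1
        # Digit run: compare numerically.
        si = i
        while i < len(a) and a[i] in DIGITS:
            i += 1
        sj = j
        while j < len(b) and b[j] in DIGITS:
            j += 1
        na, nb = int(a[si:i] or "0"), int(b[sj:j] or "0")
        if na != nb:
            return -1 if na < nb else 1
    return 0


def split_version(v):
    """Split into (epoch, upstream, revision); revision follows the last '-'."""
    epoch, rest = v.split(":", 1) if ":" in v else ("0", v)
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a, b):
    ea, ua, ra = split_version(a)
    eb, ub, rb = split_version(b)
    if ea != eb:
        return -1 if ea < eb else 1
    return compare_fragment(ua, ub) or compare_fragment(ra, rb)


current = "1.4.7.dfsg1-6+2.6.27-11.27"
for sru in ("1.4.7.dfsg1-6ubuntu0.1+2.6.27-11.27",
            "1.4.7.dfsg1-6+ubuntu0.1+2.6.27-11.27"):
    print(sru, "Yes" if compare_versions(current, sru) < 0 else "No")
```

Because the kernel version is appended after the package version, the last hyphen splits these strings at "-11.27", so "6+2.6.27" and "6ubuntu0.1+2.6.27" are compared as part of the upstream version. After the shared "6", "+" is compared against "u" and loses to it only when the "+" comes first in both strings.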

I'll update the bug description in a bit for the SRU request.

Evan Broder (broder) wrote :

Whoops - here's a patch that includes the LP closer.

description: updated
Evan Broder (broder) wrote :

(Marking as confirmed so it doesn't get ignored - I can't retarget it correctly :-P)

Changed in openafs:
status: Fix Released → Confirmed
description: updated
Evan Broder (broder) wrote :

Here's a new version of the patch rebased on top of the recent security update.

Evan Broder (broder) on 2009-05-05
description: updated
Evan Broder (broder) on 2009-11-30
Changed in openafs (Ubuntu Intrepid):
status: New → Confirmed
Changed in openafs (Ubuntu):
status: Confirmed → Fix Released
John Dong (jdong) wrote :

ACK from MOTU-SRU

Accepted openafs into intrepid-proposed; the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in openafs (Ubuntu Intrepid):
status: Confirmed → Fix Committed
tags: added: verification-needed
Martin Pitt (pitti) wrote :

This intrepid-proposed SRU has not been verified in the last three months or longer. Intrepid will go out of support in less than two months, so it is not worth pursuing this SRU any further.

I removed the intrepid-proposed version from the archive.

Changed in openafs (Ubuntu Intrepid):
status: Fix Committed → Won't Fix