kernel bug rpc nfs client

Bug #212485 reported by Cassus
52
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Critical
Unassigned
Hardy
Fix Released
High
Stefan Bader

Bug Description

SRU justification:

Impact: Having NFS mounted filesystems with autobinding in use there seems
to be a good chance that cloned clients might call rpcb_getport_async()
and thus trigger a Bug() statement in the code.

Fix: A cherry pick from upstream to walk to the parent client to find out
who "owns" the port. Fix was verified to be working.

Testcase: see below

---

After viewing a film on an nfs4 share for about 20 minutes:

Apr 5 22:19:46 lena kernel: [10870.179278] ------------[ cut here ]------------
Apr 5 22:19:46 lena kernel: [10870.179287] kernel BUG at /build/buildd/linux-2.6.24/net/sunrpc/rpcb_clnt.c:322!
Apr 5 22:19:46 lena kernel: [10870.179292] invalid opcode: 0000 [#1] SMP
Apr 5 22:19:46 lena kernel: [10870.179297] Modules linked in: des_generic cbc af_packet binfmt_misc rfcomm l2cap bluetooth ppdev autofs4 ipv6 acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative sbs sbshc bay dock rpcsec_gss_krb5 auth_rpcgss nfs lockd nfs_acl sunrpc iptable_filter ip_tables x_tables ipw2200 ieee80211 ieee80211_crypt sbp2 lp arc4 ecb blkcipher pcmcia snd_hda_intel snd_hwdep snd_pcm_oss snd_pcm snd_page_alloc snd_mixer_oss snd_seq_dummy nvidia(P) snd_seq_oss iwl3945 iwlwifi_mac80211 snd_seq_midi psmouse tifm_7xx1 snd_rawmidi serio_raw snd_seq_midi_event tifm_core i2c_core yenta_socket rsrc_nonstatic pcmcia_core cfg80211 snd_seq snd_timer snd_seq_device container snd video output intel_agp agpgart irda ac acer_acpi led_class button pcspkr evdev iTCO_wdt iTCO_vendor_support parport_pc parport crc_ccitt battery shpchp pci_hotplug wmi_acer soundcore ext3 jbd mbcache sg sr_mod sd_mod cdrom ata_piix pata_acpi ata_generic libata scsi_mod ohci
Apr 5 22:19:46 lena kernel: 394 ieee1394 ehci_hcd uhci_hcd usbcore tg3 thermal processor fan fbcon tileblit font bitblit softcursor fuse
Apr 5 22:19:46 lena kernel: [10870.179436]
Apr 5 22:19:46 lena kernel: [10870.179440] Pid: 3393, comm: mplayer Tainted: P (2.6.24-14-generic #1)
Apr 5 22:19:46 lena kernel: [10870.179445] EIP: 0060:[<f8e7f540>] EFLAGS: 00010206 CPU: 1
Apr 5 22:19:46 lena kernel: [10870.179483] EIP is at rpcb_getport_async+0x290/0x430 [sunrpc]
Apr 5 22:19:46 lena kernel: [10870.179487] EAX: c2a97c00 EBX: c2a97c00 ECX: c2a97c80 EDX: f8e98b40
Apr 5 22:19:46 lena kernel: [10870.179491] ESI: eb35ef00 EDI: eb35e900 EBP: c2b47000 ESP: c298fcc8
Apr 5 22:19:46 lena kernel: [10870.179495] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Apr 5 22:19:46 lena kernel: [10870.179500] Process mplayer (pid: 3393, ti=c298e000 task=c29685a0 task.ti=c298e000)
Apr 5 22:19:46 lena kernel: [10870.179503] Stack: c299d448 c298fd00 00000000 c298fd00 00000000 c298fd08 c0318059 c2a97c00
Apr 5 22:19:46 lena kernel: [10870.179514] c2a97c80 00000001 f8e770a0 c2a97c80 c03180cb c2a97c00 c2b47000 c298fd80
Apr 5 22:19:46 lena kernel: [10870.179525] c2a97c80 f8e70195 f8e71cc8 c298fd14 c298fd14 00000000 00000000 c2a97c00
Apr 5 22:19:46 lena kernel: [10870.179535] Call Trace:
Apr 5 22:19:46 lena kernel: [10870.179565] [__wait_on_bit+0x59/0x70] __wait_on_bit+0x59/0x70
Apr 5 22:19:46 lena kernel: [10870.179587] [<f8e770a0>] rpc_wait_bit_interruptible+0x0/0x20 [sunrpc]
Apr 5 22:19:46 lena kernel: [10870.179633] [nfs:out_of_line_wait_on_bit+0x5b/0x70] out_of_line_wait_on_bit+0x5b/0x70
Apr 5 22:19:46 lena kernel: [10870.179656] [<f8e70195>] call_bind+0x45/0x80 [sunrpc]
Apr 5 22:19:46 lena kernel: [10870.179688] [<f8e71cc8>] call_timeout+0x18/0x120 [sunrpc]
Apr 5 22:19:46 lena kernel: [10870.179739] [<f8e77542>] __rpc_execute+0x62/0x280 [sunrpc]
Apr 5 22:19:46 lena kernel: [10870.179787] [<f8e76bf4>] rpc_set_active+0x44/0x70 [sunrpc]
Apr 5 22:19:46 lena kernel: [10870.179835] [<f8e70f38>] rpc_do_run_task+0x68/0xc0 [sunrpc]
Apr 5 22:19:46 lena kernel: [10870.179888] [<f8e7100d>] rpc_call_sync+0x1d/0x40 [sunrpc]
Apr 5 22:19:46 lena kernel: [10870.179935] [<f8f03c65>] _nfs4_proc_getattr+0x55/0x60 [nfs]
Apr 5 22:19:46 lena kernel: [10870.179994] [<f8f0515b>] nfs4_proc_getattr+0x2b/0x50 [nfs]
Apr 5 22:19:46 lena kernel: [10870.180049] [<f8ef47c2>] __nfs_revalidate_inode+0x192/0x2d0 [nfs]
Apr 5 22:19:46 lena kernel: [10870.180111] [enqueue_entity+0x5b/0x60] enqueue_entity+0x5b/0x60
Apr 5 22:19:46 lena kernel: [10870.180126] [enqueue_task_fair+0x27/0x30] enqueue_task_fair+0x27/0x30
Apr 5 22:19:46 lena kernel: [10870.180135] [enqueue_task+0x12/0x30] enqueue_task+0x12/0x30
Apr 5 22:19:46 lena kernel: [10870.180148] [nfs:__next_cpu+0x12/0x60] __next_cpu+0x12/0x20
Apr 5 22:19:46 lena kernel: [10870.180156] [find_busiest_group+0x1c4/0x780] find_busiest_group+0x1c4/0x780
Apr 5 22:19:46 lena kernel: [10870.180165] [balance_tasks+0x6a/0x100] balance_tasks+0x6a/0x100
Apr 5 22:19:46 lena kernel: [10870.180199] [<f8ef5653>] nfs_revalidate_mapping+0x43/0xb0 [nfs]
Apr 5 22:19:46 lena kernel: [10870.180247] [<f8ef31b5>] nfs_file_read+0x85/0x120 [nfs]
Apr 5 22:19:46 lena kernel: [10870.180303] [ext3:do_sync_read+0xd5/0xba0] do_sync_read+0xd5/0x120
Apr 5 22:19:46 lena kernel: [10870.180313] [nfs:schedule+0x20a/0x650] schedule+0x20a/0x600
Apr 5 22:19:46 lena kernel: [10870.180341] [<c0141b70>] autoremove_wake_function+0x0/0x40
Apr 5 22:19:46 lena kernel: [10870.180355] [do_nanosleep+0x59/0x70] do_nanosleep+0x59/0x70
Apr 5 22:19:46 lena kernel: [10870.180367] [hrtimer_nanosleep+0x5c/0xd0] hrtimer_nanosleep+0x5c/0xd0
Apr 5 22:19:46 lena kernel: [10870.180384] [hrtimer_wakeup+0x0/0x20] hrtimer_wakeup+0x0/0x20
Apr 5 22:19:46 lena kernel: [10870.180397] [<c018df30>] do_sync_read+0x0/0x120
Apr 5 22:19:46 lena kernel: [10870.180405] [vfs_read+0xb7/0x170] vfs_read+0xb7/0x170
Apr 5 22:19:46 lena kernel: [10870.180419] [sys_read+0x41/0x70] sys_read+0x41/0x70
Apr 5 22:19:46 lena kernel: [10870.180433] [sysenter_past_esp+0x6b/0xa9] sysenter_past_esp+0x6b/0xa9
Apr 5 22:19:46 lena kernel: [10870.180463] =======================
Apr 5 22:19:46 lena kernel: [10870.180465] Code: ac 00 00 00 c7 44 24 08 04 4a e8 f8 c7 04 24 30 81 e8 f8 89 44 24 04 e8 2f e9 2a c7 c7 44 24 20 a0 ff ff ff e9 0c fe ff ff 66 90 <0f> 0b eb fe 0f b7 b0 ac 00 00 00 8b 57 24 8b 4f 20 8b 5f 2c 8b
Apr 5 22:19:46 lena kernel: [10870.180521] EIP: [<f8e7f540>] rpcb_getport_async+0x290/0x430 [sunrpc] SS:ESP 0068:c298fcc8
Apr 5 22:19:46 lena kernel: [10870.180571] ---[ end trace f4581af82034a623 ]---

and the video stream stops. This bug is more-or-less reliably reproducable on my computer.

Revision history for this message
Cassus (cassus) wrote :

The error didn't occurre on 2.6.24-12-generic, but did occurred on 2.6.24-14-generic. So I think this is a regression introduced by -13 or -14.

Revision history for this message
Cassus (cassus) wrote :
Download full text (7.5 KiB)

Also found it in 2.6.24-16-generic #1 SMP Thu Apr 10 13:23:42 UTC 2008 i686 GNU/Linux

Apr 14 20:37:40 lena kernel: [ 461.053638] ------------[ cut here ]------------
Apr 14 20:37:40 lena kernel: [ 461.053644] kernel BUG at /build/buildd/linux-2.6.24/net/sunrpc/rpcb_clnt.c:322!
Apr 14 20:37:40 lena kernel: [ 461.053647] invalid opcode: 0000 [#12] SMP
Apr 14 20:37:40 lena kernel: [ 461.053650] Modules linked in: des_generic cbc tg3 iwl3945 iwlwifi_mac80211 cfg80211 af_packet binfmt_misc rfcomm l2cap bluetooth ppdev autofs4 ipv6 acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative sbs sbshc bay dock rpcsec_gss_krb5 auth_rpcgss nfs lockd nfs_acl sunrpc iptable_filter ip_tables x_tables ipw2200 ieee80211 ieee80211_crypt sbp2 lp snd_hda_intel snd_hwdep snd_pcm_oss arc4 snd_pcm ecb snd_page_alloc snd_mixer_oss blkcipher snd_seq_dummy snd_seq_oss pcmcia snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device yenta_socket video irda nvidia(P) snd psmouse rsrc_nonstatic tifm_7xx1 output container iTCO_wdt crc_ccitt i2c_core serio_raw pcmcia_core tifm_core acer_acpi led_class button intel_agp agpgart battery ac iTCO_vendor_support parport_pc parport pcspkr evdev shpchp pci_hotplug wmi_acer soundcore ext3 jbd mbcache sg sr_mod cdrom sd_mod ata_piix pata_acpi ata_generic ohci1394 ehci_hc
Apr 14 20:37:40 lena kernel: uhci_hcd ieee1394 libata scsi_mod usbcore thermal processor fan fbcon tileblit font bitblit softcursor fuse
Apr 14 20:37:40 lena kernel: [ 461.053739]
Apr 14 20:37:40 lena kernel: [ 461.053742] Pid: 25325, comm: mount.nfs4 Tainted: P D (2.6.24-16-generic #1)
Apr 14 20:37:40 lena kernel: [ 461.053746] EIP: 0060:[<f8e6f540>] EFLAGS: 00010206 CPU: 1
Apr 14 20:37:40 lena kernel: [ 461.053773] EIP is at rpcb_getport_async+0x290/0x430 [sunrpc]
Apr 14 20:37:40 lena kernel: [ 461.053776] EAX: f1b9ba80 EBX: f1b9ba80 ECX: 00000000 EDX: f8e88b40
Apr 14 20:37:40 lena kernel: [ 461.053778] ESI: f755a400 EDI: f723b400 EBP: da2b2400 ESP: da33798c
Apr 14 20:37:40 lena kernel: [ 461.053780] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Apr 14 20:37:40 lena kernel: [ 461.053783] Process mount.nfs4 (pid: 25325, ti=da336000 task=deed6b40 task.ti=da336000)
Apr 14 20:37:40 lena kernel: [ 461.053786] Stack: 00004000 00000e78 f1b9ba80 00000e78 f1b9ba80 0000018a f8e66fe9 f1b9ba80
Apr 14 20:37:40 lena kernel: [ 461.053793] da3379c4 da152f60 00000000 f1b9ba80 da2b2400 f1b9ba80 da2b2400 da337a44
Apr 14 20:37:40 lena kernel: [ 461.053799] f1b9bb00 f8e60195 f8e60266 da2b2400 00000000 f1b9ba80 00000000 f1b9ba80
Apr 14 20:37:40 lena kernel: [ 461.053805] Call Trace:
Apr 14 20:37:40 lena kernel: [ 461.053828] [<f8e66fe9>] rpc_malloc+0x69/0x90 [sunrpc]
Apr 14 20:37:40 lena kernel: [ 461.053871] [<f8e60195>] call_bind+0x45/0x80 [sunrpc]
Apr 14 20:37:40 lena kernel: [ 461.053891] [<f8e60266>] call_reserve+0x16/0x70 [sunrpc]
Apr 14 20:37:40 lena kernel: [ 461.053925] [<f8e67542>] __rpc_execute+0x62/0x280 [sunrpc]
Apr 14 20:37:40 lena kernel: [ 461.053958] [<f8e66bf4>] rpc_set_active+0x44/0x70 [sunrpc]
Apr 14 20:37:40 lena kernel: [ 461.053990] [<f8...

Read more...

Revision history for this message
Josef Boleininger (iosif69) wrote :

kernel 2.6.24-16, nfs4 client (home directory):

Apr 25 13:41:04 install02 kernel: [ 2132.820187] kernel BUG at /build/buildd/linux-2.6.24/net/sunrpc/rpcb_clnt.c:322!
Apr 25 13:41:04 install02 kernel: [ 2132.820190] invalid opcode: 0000 [#1] SMP

This bug is more-or-less reliably reproducable on my computer.

Josef

Revision history for this message
Timo Aaltonen (tjaalton) wrote :

I've bumped into the same bug, the trace is a bit different but the cause could be the same:

Call Trace:
 [sunrpc:queue_delayed_work+0x51/0x70] queue_delayed_work+0x51/0x70
 [lock_timer_base+0x27/0x60] lock_timer_base+0x27/0x60
 [<f8e65195>] call_bind+0x45/0x80 [sunrpc]
 [shpchp:del_timer_sync+0xe/0x20] del_timer_sync+0xe/0x20
 [<f8e6c07c>] rpc_delete_timer+0x2c/0x50 [sunrpc]
 [<f8e6c542>] __rpc_execute+0x62/0x280 [sunrpc]
 [10295.051592] [<f8e660c2>] rpc_release_client+0x32/0x60 [sunrpc]
 [<f8e6c760>] rpc_async_schedule+0x0/0x10 [sunrpc]
 [run_workqueue+0xbf/0x160] run_workqueue+0xbf/0x160
 [worker_thread+0x0/0xe0] worker_thread+0x0/0xe0
 [worker_thread+0x84/0xe0] worker_thread+0x84/0xe0
 [<c0140b70>] autoremove_wake_function+0x0/0x40
 [worker_thread+0x0/0xe0] worker_thread+0x0/0xe0
 [kthread+0x42/0x70] kthread+0x42/0x70
 [kthread+0x0/0x70] kthread+0x0/0x70
 [kernel_thread_helper+0x7/0x10] kernel_thread_helper+0x7/0x10

 =======================
Code: ac 00 00 00 c7 44 24 08 04 9a e7 f8 c7 04 24 30 d1 e7 f8 89 44 24 04 e8 2f 89 2b c7 c7 44 24 20 a0 ff ff ff e9 0c fe ff ff 66 90 <0f> 0b eb fe 0f b7 b0 ac 00 00 00 8b 57 24 8b 4f 20 8b 5f 2c 8b
 EIP: [<f8e74540>] rpcb_getport_async+0x290/0x430 [sunrpc] SS:ESP 0068:dfb99f00
 ---[ end trace d8ca4a218e8bc80c ]---

Revision history for this message
AZ (m-dev) wrote :

it occured for me too.

Revision history for this message
jepler (jepler) wrote :

Is this bug the same as #224750 ?

Revision history for this message
Daniel J Blueman (danielblueman) wrote :

I experience this every ~5th shutdown, however the stack trace I see is:

rpcb_getport_async
rpc_malloc
call_bind
call_reserve
__rpc_execute
...

which corresponds to rpcb_clnt.c:322:
        BUG_ON(clnt->cl_parent != clnt);

This is called from a kernel thread context and I only see the bug_on trigger when shutting down, where NetworkManager cutts the ethernet connection before NFS is unmounted (yikes).

Revision history for this message
Daniel J Blueman (danielblueman) wrote :

There are only two sites where cl_parent is assigned in the kernel tree, and it's assigned to 'clnt', so the only time this BUG_ON will fire is when there is memory corruption or use-after-free (less likely).

Revision history for this message
Daniel J Blueman (danielblueman) wrote :

I have found this reproduces when you have an active NFSv4 mount (eg /home) and bring the ethernet interface on the client down, eg unplug the cable.

Revision history for this message
Cassus (cassus) wrote :

Interesting. I'll try to set up a test machine, so i can gather more empiric information.

What is your oppinion, what should we do next, to get this bug fixed?

Revision history for this message
Daniel J Blueman (danielblueman) wrote :

I have found this BUG_ON doesn't occur with upstream 2.6.25, so finding a point in the past (equivalent to the gutsy kernel, perhaps 2.6.24) and git-bisecting is the best chance.

I'll try to reproduce it with 2.6.24 and kick this off.

Revision history for this message
Daniel J Blueman (danielblueman) wrote :
Revision history for this message
Daniel J Blueman (danielblueman) wrote :
Download full text (3.8 KiB)

Apologies - I am mistaken! This is not fixed in LP#224750.

Unplugging and replugging the ethernet cable with my homedir mounted over NFSv4 reproduced this in 8.04.1 LTS with 2.6.24-19-generic and all updates as of late July:

[ 708.319893] ------------[ cut here ]------------
[ 708.319962] kernel BUG at /build/buildd/linux-2.6.24/net/sunrpc/rpcb_clnt.c:322!
[ 708.320034] invalid opcode: 0000 [#1] SMP
[ 708.320207] Modules linked in: sonypi ppdev ipv6 acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservat
[ 708.325740]
[ 708.325800] Pid: 6536, comm: gnome-panel Tainted: P (2.6.24-19-generic #1)
[ 708.325903] EIP: 0060:[<f8ce5491>] EFLAGS: 00010287 CPU: 1
[ 708.326019] EIP is at rpcb_getport_async+0x251/0x3d0 [sunrpc]
[ 708.326102] EAX: e8589f00 EBX: e8589f00 ECX: 00000286 EDX: f8cfea20
[ 708.326187] ESI: dfbe1f00 EDI: f3ffd100 EBP: f3efb800 ESP: f2e09c2c
[ 708.326271] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 708.326355] Process gnome-panel (pid: 6536, ti=f2e08000 task=f582c5c0 task.ti=f2e08000)
[ 708.326441] Stack: f2e09c60 00000000 f2e09c60 00000000 f2e09c68 c031b309 f8cdd0a0 e8589f00
[ 708.327031] c01361d7 e8589f48 ffffffff f2e09d70 e8589f00 f3efb800 f2e09d70 e8589f80
[ 708.327620] f8cd6195 c013626e e8589f00 f8cdd07c 00000000 00000000 e8589f00 f8cdd542
[ 708.328222] Call Trace:
[ 708.328384] [<c031b309>] __wait_on_bit+0x59/0x70
[ 708.328528] [<f8cdd0a0>] rpc_wait_bit_interruptible+0x0/0x20 [sunrpc]
[ 708.328705] [<c01361d7>] lock_timer_base+0x27/0x60
[ 708.328851] [<f8cd6195>] call_bind+0x45/0x80 [sunrpc]
[ 708.329011] [<c013626e>] del_timer_sync+0xe/0x20
[ 708.329160] [<f8cdd07c>] rpc_delete_timer+0x2c/0x50 [sunrpc]
[ 708.329337] [<f8cdd542>] __rpc_execute+0x62/0x280 [sunrpc]
[ 708.329509] [<c0137665>] sigprocmask+0x65/0x110
[ 708.329649] [<f8cdcbf4>] rpc_set_active+0x44/0x70 [sunrpc]
[ 708.329832] [<f8cd6f38>] rpc_do_run_task+0x68/0xc0 [sunrpc]
[ 708.330010] [<f8cd700d>] rpc_call_sync+0x1d/0x40 [sunrpc]
[ 708.330185] [<f8d6b633>] nfs4_proc_access+0x73/0x1b0 [nfs]
[ 708.330351] [<c0120000>] __is_prefetch+0x20/0x240
[ 708.330506] [<c0125f20>] default_wake_function+0x0/0x10
[ 708.330652] [<c0125f20>] default_wake_function+0x0/0x10
[ 708.330796] [<c0125f20>] default_wake_function+0x0/0x10
[ 708.330942] [<c0125f20>] default_wake_function+0x0/0x10
[ 708.331101] [<f8d5655d>] nfs_do_access+0xbd/0x2e0 [nfs]
[ 708.331276] [<f8cde5b8>] rpcauth_lookupcred+0x58/0xa0 [sunrpc]
[ 708.331459] [<f8d56834>] nfs_permission+0xb4/0x160 [nfs]
[ 708.331618] [<c01a3f95>] dput+0x85/0x100
[ 708.331771] [<c019916b>] __follow_mount+0x1b/0x70
[ 708.331913] [<f8d56780>] nfs_permission+0x0/0x160 [nfs]
[ 708.332083] [<c0198fe7>] permission+0x97/0x120
[ 708.332227] [<c019acb9>] __link_path_walk+0x79/0xe10
[ 708.332373] [<c0125f20>] default_wake_function+0x0/0x10
[ 708.332522] [<c019ba95>] link_path_walk+0x45/0xc0
[ 708.332670] [<c0146034>] do_gettimeofday+0x34/0xe0
[ 708.332813] [<c021adc0>] copy_to_user+0x30/0x60
[ 708.332956] [<c019bce7>] do_path_lookup+0x77/0x200
[ 708.333095] [<c019aa3a>] getname+0xaa/0...

Read more...

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi Daniel,

Thanks for all of your updates and feedback. You mentioned you do not witness this bug with a 2.6.25 kernel. I just wanted to let you know that the Intrepid Ibex 8.10 Alpah3 release most recently came out and it is based on a 2.6.26 kernel - http://www.ubuntu.com/testing . You may want to give that a test just to verify we will not see this in the upcoming release. You also mentioned trying to perform a git bisect to narrow down the offending patch which caused this regression. Were you ever able to find which patch is the culprit? Thanks.

Revision history for this message
noopy.org (noopy-org) wrote :

I'm doing a git (as we speak) between 2.6.24-12.18 and 2.6.24-14-25 on my Ubuntu Hardy system. I'm assuming for the moment that 2.6.24-12.18 is a known "good" version of rpcb_clnt.c. In addition, I'll be installing a 2.6.26 kernel + patches from source on my system and retesting.

In the meantime are we certain that there is not another issue in play? For instance, in my case, I'm running on a Sun x4100 so this means I have a nForce NIC and use the forcedeth driver. Periodically my NIC will go vegetative and instance is denoted with messages like:

  Jul 31 12:53:16 prod-unix-shell04 kernel: [ 596.950188] nv_stop_tx: TransmitterStatus remained busy<6>eth0: link up.

When this happens my system will fall of the network until it is power cycled -- as using ifup, ifconfig, or even a reboot doesn't set the NIC straight. Sometimes this happens when I'm using NFSv4 and sometimes this happens when I'm not. It always happens when I'm doing lots of I/O (apt, bonnie++, SQL queries, dd, etc.).

Given what's been described here earlier (RE: pulling network cable and watching nfsv4 go haywire) I'm curious if the problem is not limited to changes to rpcb_clnt.c.

Does anyone have thoughts on this?

Revision history for this message
noopy.org (noopy-org) wrote :

I'm backtracking w/my last statement a bit. Looks like the 2.6.24-11 kernel doesn't exhibit these problems as well and on top of that I've had zero problems (knock on wood) with the forcedeth driver and this kernel. Yesterday I was able to reliably generate the kernel "Oops" noted above AND crash my NIC by doing 2-3 large dd's plus a bonnie++ in less than 15 minutes. I've been running the same regression against -11 for almost 2 hours without any problems.

I'm still working on git for -12 through -14 but just a FYI about -11.

Revision history for this message
noopy.org (noopy-org) wrote :

It appears (see attached git bisect log) that this version works: f180b030afa2df41482b990ac502c64d51a7492c .

My (starting) good revision is f180b030afa2df41482b990ac502c64d51a7492c. My bad revision is marked as 2ce4b2c2c29f32060580bf37654269a0c15e7845 although I haven't gotten there yet in my testing. Will let you know once complete.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi noopy.org,

Thanks for all of you testing and feedback. Were you ever able to finish with your testing? Thanks.

Revision history for this message
noopy.org (noopy-org) wrote :

Leann,

Looks like -12 has the issue as well but happens completely randomly as well as when I load the system with high load.

Went as far back as -11 and as far forward as -19 and those both have the same issue.

Revision history for this message
Daniel J Blueman (danielblueman) wrote :

void rpcb_getport_async(struct rpc_task *task)
{
        struct rpc_clnt *clnt = task->tk_client;
        int bind_version;
        struct rpc_xprt *xprt = task->tk_xprt;
        struct rpc_clnt *rpcb_clnt;
        static struct rpcbind_args *map;
        struct rpc_task *child;
        struct sockaddr addr;
        int status;
        struct rpcb_info *info;

        dprintk("RPC: %5u %s(%s, %u, %u, %d)\n",
                task->tk_pid, __FUNCTION__,
                clnt->cl_server, clnt->cl_prog, clnt->cl_vers, xprt->prot);

        /* Autobind on cloned rpc clients is discouraged */
        BUG_ON(clnt->cl_parent != clnt); <------

This is triggering I believe due to memory corruption. Confirming with NFS developers any patches which address memory corruption related to this, eg:

http://groups.google.co.uk/group/linux.kernel/browse_thread/thread/599db16a8d9d84ed#

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are one of two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreicate your help and feedback.

Revision history for this message
Florian Hackenberger (f-hackenberger) wrote :

I will test the upcoming alpha5, but please bear in mind that this bug should be fixed in Hardy, as it is an LTS release and this bug affects basic functionality.

Revision history for this message
Daniel J Blueman (danielblueman) wrote :

This bug is not an issue in 2.6.26-based kernels (ie with patches that were added in 2.6.25-rc kernels; I have been running 2.6.26 or later and can confirm this issue has not occurred from testing; we need to identify the patch(es) and release an SRU ultimately for Hardy.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi Daniel,

I agree this is SRU worthy if patches can be isolated. I'll go ahead and open the nomination for Hardy. From your last comment it appears this should not be an issue for Intrepid so I'll tentatively mark this "Fix Released" for Intrepid. Thanks.

Changed in linux:
status: Confirmed → Fix Released
Revision history for this message
hpdoerr (doerr-kis-deactivatedaccount) wrote :

I can reproduce this bug with the 2.6.24-21-server and 2.6.24-19-server on hardy/x68_64.

Did you progress with this problem? I am currently trying to setup a Sun Fire X4600 with Hardy/Server and this is actually the latest and most important of three show-stopper Kernel-related bugs I've encountered so far.
All of them are known for some month now, and all of them seem to be related to 2.6.24 only.
Honestly, this release (hardy) looks like beeing far away from the proffessionality you guys claim to provide. I mean this is a LTS release, isn't it?

Revision history for this message
Martin (lodp) wrote :
Download full text (6.0 KiB)

I've been getting something similar when changing network connections while having nfs stuff mounted:

[27025.825196] ------------[ cut here ]------------
[27025.825206] kernel BUG at /build/buildd/linux-2.6.27/net/sunrpc/rpcb_clnt.c:498!
[27025.825212] invalid opcode: 0000 [#10] SMP
[27025.825218] Modules linked in: aes_i586 crypto_blkcipher aes_generic ieee80211_crypt_ccmp ipv6 af_packet binfmt_misc rfcomm bridge stp bnep sco l2cap bluetooth ppdev acpi_cpufreq cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_ondemand cpufreq_conservative freq_table container sbs pci_slot sbshc nfs lockd nfs_acl sunrpc iptable_filter ip_tables x_tables ext3 jbd mbcache eeprom i2c_i801 i2c_core parport_pc lp parport joydev pcmcia led_class wmi psmouse serio_raw pcspkr evdev snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy video ipw2200 output ieee80211 ieee80211_crypt yenta_socket rsrc_nonstatic pcmcia_core snd_seq_oss fglrx(P) snd_seq_midi snd_rawmidi battery snd_seq_midi_event snd_seq ac button snd_timer snd_seq_device iTCO_wdt iTCO_vendor_support snd soundcore snd_page_alloc shpchp pci_hotplug intel_agp agpgart reiserfs usbhid hid sr_mod cdrom sd_mod crc_t10dif sg ata_piix ata_generic 8139cp pata_acpi libata 8139too mii scsi_mod dock ehci_hcd uhci_hcd usbcore thermal processor fan fbcon tileblit font bitblit softcursor fuse
[27025.825378]
[27025.825384] Pid: 18298, comm: mount.nfs Tainted: P D (2.6.27-9-generic #1)
[27025.825390] EIP: 0060:[<f909a5c8>] EFLAGS: 00010287 CPU: 0
[27025.825428] EIP is at rpcb_getport_async+0x478/0x4e0 [sunrpc]
[27025.825433] EAX: c2548840 EBX: eaf0ea00 ECX: eaf0e200 EDX: f90b4920
[27025.825438] ESI: d50f8000 EDI: 00000000 EBP: c26e9ca8 ESP: c26e9bbc
[27025.825443] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[27025.825449] Process mount.nfs (pid: 18298, ti=c26e8000 task=dd503240 task.ti=c26e8000)
[27025.825454] Stack: 00000000 00000046 714c9ccf 00001894 f750f110 c26e9bf0 c012a571 c2548840
[27025.825468] eaf0e200 c190cd00 dd503240 f750f110 c26e9c14 c0102df6 dd503...

Read more...

Revision history for this message
Martin (lodp) wrote :

OH, and I'm running Intrepid with kernel 2.6.27-9-generic

Revision history for this message
Daniel J Blueman (danielblueman) wrote :

I've sent a quick mail to a couple of core NFS maintainers asking for how we can help progress tracking this down. I believed it was fixed in 2.6.25, but your bug report does show it affecting 2.6.27, which is unexpected and why I've escalated it further. Hope it helps!

Revision history for this message
Daniel J Blueman (danielblueman) wrote :

This is apparently fixed in 2.6.28 by Trond Myklebust's patch at:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=9a4bd29fe8f6d3f015fe1c8e5450eb62cfebfcc9

Martin - is it possible you can try with this?

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Thanks for the note Daniel. I just wanted to add that this patch is also in the upcoming Jaunty kernel.

ogasawara@emiko:~$ apt-cache policy linux-image-2.6.28-3-generic
linux-image-2.6.28-3-generic:
  Installed: 2.6.28-3.4
  Candidate: 2.6.28-3.4
  Version table:
 *** 2.6.28-3.4 0
        500 http://archive.ubuntu.com jaunty/main Packages
        100 /var/lib/dpkg/status

ogasawara@yoji:~/ubuntu-jaunty$ git log 9a4bd29fe8f6d3f015fe1c8e5450eb62cfebfcc9
commit 9a4bd29fe8f6d3f015fe1c8e5450eb62cfebfcc9
Author: Trond Myklebust <email address hidden>
Date: Fri Oct 3 16:48:34 2008 -0400

    SUNRPC: Fix autobind on cloned rpc clients

    Despite the fact that cloned rpc clients won't have the cl_autobind flag
    set, they may still find themselves calling rpcb_getport_async(). For this
    to happen, it suffices for a _parent_ rpc_clnt to use autobinding, in which
    case any clone may find itself triggering the !xprt_bound() case in
    call_bind().

    The correct fix for this is to walk back up the tree of cloned rpc clients,
    in order to find the parent that 'owns' the transport, either because it
    has clnt->cl_autobind set, or because it originally created the
    transport...

    Signed-off-by: Trond Myklebust <email address hidden>

Revision history for this message
Daniel J Blueman (danielblueman) wrote :
Download full text (3.8 KiB)

I've just been shot-down by this bug again! Kernel is '2.6.24-23-server' (ubuntu-server 8.04.2 hardy LTS x86-64).

Guys, we badly need to bury this hatchet - this appears to be the only bug preventing LTS release being a solid fileserver.

The Linux NFS maintainer squarely cited this patch to fix this failure:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=9a4bd29fe8f6d3f015fe1c8e5450eb62cfebfcc9

------------[ cut here ]------------
kernel BUG at /build/buildd/linux-2.6.24/net/sunrpc/rpcb_clnt.c:322!
invalid opcode: 0000 [1] SMP
CPU 1
Modules linked in: eip ep elan4 elan qsnet myri10ge dca inet_lro tcp_cubic af_packet nfsd auth_rpcgss exportfs acpi_cpufreq cpufreq_powersave cpufreq_ondemand cpufreq_stats freq_table cpufreq_conservative cpufreq_userspace video output ac battery sbs sbshc dock container microcode msr iptable_filter ip_tables x_tables xfs nfs lockd nfs_acl sunrpc parport_pc lp parport loop iTCO_wdt iTCO_vendor_support serio_raw i5000_edac edac_core psmouse button pcspkr shpchp evdev pci_hotplug ext3 jbd mbcache sg sd_mod ahci ehci_hcd libata uhci_hcd scsi_mod e1000 usbcore raid10 raid456 async_xor async_memcpy async_tx xor raid1 raid0 multipath linear md_mod thermal processor fan fbcon tileblit font bitblit softcursor fuse
Pid: 26534, comm: atftpd Not tainted 2.6.24-23-server #1
RIP: 0010:[<ffffffff882968b2>] [<ffffffff882968b2>] :sunrpc:rpcb_getport_async+0x272/0x3c0
RSP: 0018:ffff8100058837c8 EFLAGS: 00010206
RAX: ffffffff882b45c0 RBX: ffff810154c4fe00 RCX: ffffffff80288483
RDX: ffffffff882b45c0 RSI: 0000000000011200 RDI: ffff810151574300
RBP: ffff810155090000 R08: 0000000000000000 R09: 000000000000006b
R10: 0000000000000001 R11: ffffffff88296640 R12: ffff810159d13600
R13: ffff810151574300 R14: ffff810005883898 R15: ffff8100058839c8
FS: 000000004310e950(0063) GS:ffff81015ac01800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc95fb60000 CR3: 0000000151554000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process atftpd (pid: 26534, threadinfo ffff810005882000, task ffff810158806000)
Stack: ffff810151574300 ffffffff8831e198 ffff8101574c4a80 ffffffff882878c0
 0000000000000000 ffff810151574300 ffffffff8829c390 ffff8101515743f0
 ffff810005883898 ffffffff8828db3b ffffffff8829c390 ffff810151574300
Call Trace:
 [<ffffffff882878c0>] :sunrpc:call_allocate+0xc0/0x1b0
 [<ffffffff8828db3b>] :sunrpc:__rpc_execute+0x6b/0x290
 [<ffffffff88286f86>] :sunrpc:rpc_do_run_task+0x76/0xd0
 [<ffffffff88287045>] :sunrpc:rpc_call_sync+0x15/0x40
 [<ffffffff882f1fff>] :nfs:nfs4_proc_access+0x8f/0x1e0
 [<ffffffff802934ad>] zone_statistics+0x7d/0x80
 [<ffffffff8028bdfb>] get_page_from_freelist+0x47b/0x6c0
 [<ffffffff882db84a>] :nfs:nfs_do_access+0xda/0x350
 [<ffffffff882dbb98>] :nfs:nfs_permission+0xd8/0x1a0
 [<ffffffff802bd430>] permission+0xb0/0x160
 [<ffffffff802bf767>] __link_path_walk+0x87/0xe90
 [<ffffffff802c05cb>] link_path_walk+0x5b/0x100
 [<ffffffff802b3337>] get_unused_fd_flags+0x77/0x120
 [<ffffffff802cbcb8>] touch_atime+0x88/0x160
 [<ffffffff802bfd4a>] __lin...

Read more...

Revision history for this message
Wouter Hanegraaff (wouter-openoffice) wrote :

Hi all,

I've been hit with this bug also, and I've found a relatively reliable
way to reproduce it.
- desktop system mounts /home/ with nfs4
- login with gdm, start firefox and do some browsing, by entering some
urls in the location bar
- issue some find /home & commands
- unplug the network cable

This wil crash the kernel within a few minutes, for me, on two different systems.

Maybe not all steps are needed, logging in and using firefox certainly is necessary. It seems that the file locking that firefox uses for a lot of database files helps in reproducing the problem.

I've also recompiled the ubuntu kernel with the patch mentioned by Daniel Blueman. This was a bit harder then I imagined, I got some rejects and had to fix that manually. However, with the patched kernel I was unable to reproduce the kernel crash.

Finally, I've installed the Jaunty kernel, and haven't been able to reproduce the bug with it. The jaunty kernel also has an updated rpcb_clnt.c file.

Please, fix this bug!

Stefan Bader (smb)
Changed in linux:
assignee: nobody → stefan-bader-canonical
importance: Undecided → High
status: New → In Progress
Revision history for this message
Stefan Bader (smb) wrote :

i placed Hardy kernel that include the proposed bug-fix on http://people.ubuntu.com/~smb/bug212485/
Could you please verify whether this works as expected and shows no apparent new problems with NFS? Thanks.

Revision history for this message
Wouter Hanegraaff (wouter-openoffice) wrote :

Hi,

I'm using this kernel on my workstation now for 2 days, and I haven't seen the crash since. Disconnecting the network cable doesn't seem to crash my system, I get "nfs server timed out, still trying" messages and after plugging the cable back in, the system comes back online after some time. So, it seems to work fine for me.

Below are the last lines from dmesg as i plugged the cable back in. Confusingly, my nfs server is called "server", but the log message displays it as "serve", just like it did with previous kernels. It doesn't seem to cause problems however, and later kernels have a fix for this.
[113053.138733] nfs: server serve not responding, still trying
[113078.132905] nfs: server serve not responding, still trying
[113078.132910] nfs: server serve not responding, timed out
[113082.383909] nfs: server serve not responding, still trying
[113109.125922] nfs: server serve OK
[113109.125976] nfs: server serve OK
[113109.126003] nfs: server serve OK
[113109.126050] nfs: server serve OK
[113109.126072] nfs: server serve OK
[113109.126507] nfs: server serve OK

Stefan Bader (smb)
Changed in linux:
status: In Progress → Fix Committed
description: updated
Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Just setting this to Fix Released as 2.6.24-23.53 is now available as an update and should contain this patch. Thanks.

Changed in linux (Ubuntu Hardy):
status: Fix Committed → Fix Released
Revision history for this message
Stefan Bader (smb) wrote :

Setting back to committed. This seems to be in the portion of changes I am about to upload.

Changed in linux (Ubuntu Hardy):
status: Fix Released → Fix Committed
Revision history for this message
Martin Pitt (pitti) wrote :

Accepted linux into hardy-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

tags: added: verification-needed
Revision history for this message
Neil Hoggarth (neil-hoggarth) wrote :

For the record, I have been running linux-image-2.6.24-24-generic version 2.6.24-24.54 from
hardy-proposed on my primary x86-64 workstation for the last two weeks. In that time I haven't
experienced any of the NFSv4 hang-ups that I was previously observing, and no instances of the
kernel BUG backtrace appear in the logs.

Martin Pitt (pitti)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.24-24.56

---------------
linux (2.6.24-24.56) hardy-proposed; urgency=low

  [Stefan Bader]

  * Rebuild of 2.6.24-24.54 with 2.6.24-24.55 security release applied

linux (2.6.24-24.54) hardy-proposed; urgency=low

  [Andy Whitcroft]

  * SAUCE: do not make sysdev links for processors which are not booted
    - LP: #295091

  [Brad Figg]

  * SAUCE: Add information to recognize Toshiba Satellite Pro M10 Alps Touchpad
    - LP: #330885
  * SAUCE: Add signatures to airprime driver to support newer Novatel devices
    - LP: #365291

  [Stefan Bader]

  * SAUCE: vgacon: Return the upper half of 512 character fonts
    - LP: #355057

  [Upstream Kernel Changes]

  * SUNRPC: Fix autobind on cloned rpc clients
    - LP: #341783, #212485
  * Input: atkbd - mark keyboard as disabled when suspending/unloading
    - LP: #213988
  * x86: mtrr: don't modify RdDram/WrDram bits of fixed MTRRs
    - LP: #292619
  * sis190: add identifier for Atheros AR8021 PHY
    - LP: #247889
  * bluetooth hid: enable quirk handling for Apple Wireless Keyboards in
    2.6.24
    - LP: #227501
  * nfsd: move callback rpc_client creation into separate thread
    - LP: #253004
  * nfsd4: probe callback channel only once
    - LP: #253004

 -- Stefan Bader <email address hidden> Sat, 20 Jun 2009 00:14:36 +0200

Changed in linux (Ubuntu Hardy):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.