invalid opcode xdr_buf_read_netobj on nfs4+krb5i directory
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Disco | Won't Fix | Medium | Po-Hsu Lin |
Bug Description
== SRU Justification ==
The xdr_shrink_
handling of GSS MIC without slack), which was applied to the Disco tree via
the stable update process, will sometimes raise the following kernel trace
when the number of bytes to remove from buf->pages is larger than buf->page_len:
[ 49.420081] ------------[ cut here ]------------
[ 49.420084] kernel BUG at /build/
[ 49.420092] invalid opcode: 0000 [#1] SMP NOPTI
[ 49.420095] CPU: 16 PID: 469 Comm: kworker/u64:13 Tainted: P OE 5.0.0-37-generic #40~18.04.1-Ubuntu
[ 49.420096] Hardware name: System manufacturer System Product Name/ROG CROSSHAIR VII HERO (WI-FI), BIOS 3004 12/16/2019
[ 49.420109] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 49.420123] RIP: 0010:xdr_
[ 49.420124] Code: 29 ea e8 85 f4 ff ff 44 8b 63 34 8b 43 3c 45 29 ec 44 29 e8 3b 43 40 44 89 63 34 89 43 3c 73 03 89 43 40 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 44 00 00 4c 8d 54 24 08 48 83 e4 f0 b9 04 00 00 00 41
[ 49.420126] RSP: 0018:ffffb93787
[ 49.420128] RAX: 000000000000000c RBX: 000000000000006c RCX: 000000000000001c
[ 49.420129] RDX: 000000000000005c RSI: 0000000000000010 RDI: ffff8e1a87c56e50
[ 49.420130] RBP: ffffb93787be7b50 R08: ffff8e1b06999700 R09: 0000000000000000
[ 49.420131] R10: 00000000ffffffff R11: ffff8e1b0ecd1cd0 R12: ffff8e1a87c56e50
[ 49.420132] R13: ffffb93787be7c00 R14: 0000000000000058 R15: ffffffffc228e8c0
[ 49.420134] FS: 000000000000000
[ 49.420135] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 49.420136] CR2: 00007ffa1faeb000 CR3: 0000000f19abe000 CR4: 0000000000340ee0
[ 49.420137] Call Trace:
[ 49.420150] xdr_buf_
[ 49.420154] ? kzfree+0x2d/0x40
[ 49.420158] ? crypto_
[ 49.420162] gss_unwrap_
[ 49.420164] ? gss_unwrap_
[ 49.420167] gss_unwrap_
[ 49.420170] ? gss_unwrap_
[ 49.420172] ? gss_validate+
[ 49.420184] ? nfs4_xdr_
[ 49.420194] rpcauth_
[ 49.420204] ? nfs4_xdr_
[ 49.420213] call_decode+
[ 49.420216] ? __switch_
[ 49.420224] ? rpc_check_
[ 49.420233] __rpc_execute+
[ 49.420242] rpc_async_
[ 49.420245] process_
[ 49.420247] worker_
[ 49.420249] kthread+0x121/0x140
[ 49.420250] ? process_
[ 49.420252] ? kthread_
[ 49.420254] ret_from_
== Fixes ==
* e8d70b32 (SUNRPC: Fix another issue with MIC buffer space)
Instead of calling BUG_ON, this patch will just cap the number of bytes
that xdr_shrink_
Only the Disco kernel needs this patch: Bionic and earlier do not have
5f1bc39, and this fix has already been applied to Eoan and onward.
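The difference between asserting and capping can be illustrated with a minimal user-space sketch. This is not the kernel patch itself: the struct `toy_buf` and the functions `shrink_strict` and `shrink_capped` are hypothetical names invented for illustration, and only the page-section length is modelled. It assumes the behaviour described above, namely that the old code treats "bytes to remove > page_len" as a fatal condition while the fix clamps the request instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the real buffer structure:
 * only the length of the page section is modelled here. */
struct toy_buf {
    size_t page_len;   /* bytes currently held in the page section */
};

/* Strict variant: an oversized request is treated as a fatal error,
 * similar in spirit to a BUG_ON(len > page_len) check. */
static size_t shrink_strict(struct toy_buf *buf, size_t len)
{
    assert(len <= buf->page_len);
    buf->page_len -= len;
    return len;
}

/* Capped variant: an oversized request is clamped to what is available.
 * Returns the number of bytes actually removed. */
static size_t shrink_capped(struct toy_buf *buf, size_t len)
{
    if (len > buf->page_len)
        len = buf->page_len;
    buf->page_len -= len;
    return len;
}
```

With `page_len` at 12, asking `shrink_capped` to remove 16 bytes removes 12 and leaves the buffer empty, whereas the same call to `shrink_strict` would abort the program.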
== Test ==
Test kernel can be found here:
https:/
It has been stress-tested by the bug reporter, Michael, and the issue
can no longer be reproduced.
== Regression Potential ==
Low. It only changes the number of bytes to shrink; the change is limited
to a single driver and has a positive test result.
== Original Bug Report ==
RELEASE=19.3
CODENAME=tricia
EDITION="Cinnamon"
DESCRIPTION="Linux Mint 19.3 Tricia"
DESKTOP=Gnome
TOOLKIT=GTK
NEW_FEATURES_URL=https:/
RELEASE_NOTES_URL=https:/
USER_GUIDE_URL=https:/
GRUB_TITLE=Linux Mint 19.3 Cinnamon
My home dir is mounted through nfs on a local server via nfs4 and krb5i.
When stressing the mounted directory or its sub-directories (sometimes starting Firefox, sometimes starting Thunderbird, nearly guaranteed when compiling, sometimes the login itself), it eventually leads to the following stack trace. The corresponding process is then stuck, and
accessing the mounted directory (for example by calling ls) easily yields further, similar stack traces and causes that process to get stuck as well.
Currently I am running an AMD 3950x on an ASUS Crosshair VII Hero Wifi (chipset x470), but I had the same issues with an Intel 6700K on an ASUS Crosshair VIII Hero in fall of 2019. I couldn't be bothered to report the bug back then, so I just kept running a working kernel (~5.0.0-15 I think) without updating it. After Christmas I replaced said Intel machine with the AMD machine, re-installed Linux Mint, installed all updates and therefore ran into this issue again.
[ 49.420081] ------------[ cut here ]------------
[ 49.420084] kernel BUG at /build/
[ 49.420092] invalid opcode: 0000 [#1] SMP NOPTI
[ 49.420095] CPU: 16 PID: 469 Comm: kworker/u64:13 Tainted: P OE 5.0.0-37-generic #40~18.04.1-Ubuntu
[ 49.420096] Hardware name: System manufacturer System Product Name/ROG CROSSHAIR VII HERO (WI-FI), BIOS 3004 12/16/2019
[ 49.420109] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 49.420123] RIP: 0010:xdr_
[ 49.420124] Code: 29 ea e8 85 f4 ff ff 44 8b 63 34 8b 43 3c 45 29 ec 44 29 e8 3b 43 40 44 89 63 34 89 43 3c 73 03 89 43 40 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 44 00 00 4c 8d 54 24 08 48 83 e4 f0 b9 04 00 00 00 41
[ 49.420126] RSP: 0018:ffffb93787
[ 49.420128] RAX: 000000000000000c RBX: 000000000000006c RCX: 000000000000001c
[ 49.420129] RDX: 000000000000005c RSI: 0000000000000010 RDI: ffff8e1a87c56e50
[ 49.420130] RBP: ffffb93787be7b50 R08: ffff8e1b06999700 R09: 0000000000000000
[ 49.420131] R10: 00000000ffffffff R11: ffff8e1b0ecd1cd0 R12: ffff8e1a87c56e50
[ 49.420132] R13: ffffb93787be7c00 R14: 0000000000000058 R15: ffffffffc228e8c0
[ 49.420134] FS: 000000000000000
[ 49.420135] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 49.420136] CR2: 00007ffa1faeb000 CR3: 0000000f19abe000 CR4: 0000000000340ee0
[ 49.420137] Call Trace:
[ 49.420150] xdr_buf_
[ 49.420154] ? kzfree+0x2d/0x40
[ 49.420158] ? crypto_
[ 49.420162] gss_unwrap_
[ 49.420164] ? gss_unwrap_
[ 49.420167] gss_unwrap_
[ 49.420170] ? gss_unwrap_
[ 49.420172] ? gss_validate+
[ 49.420184] ? nfs4_xdr_
[ 49.420194] rpcauth_
[ 49.420204] ? nfs4_xdr_
[ 49.420213] call_decode+
[ 49.420216] ? __switch_
[ 49.420224] ? rpc_check_
[ 49.420233] __rpc_execute+
[ 49.420242] rpc_async_
[ 49.420245] process_
[ 49.420247] worker_
[ 49.420249] kthread+0x121/0x140
[ 49.420250] ? process_
[ 49.420252] ? kthread_
[ 49.420254] ret_from_
[ 49.420255] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache edac_mce_amd snd_hda_codec_hdmi joydev kvm hid_roccat_koneplus hid_roccat irqbypass hid_roccat_common nvidia_uvm(OE) nvidia_drm(POE) nvidia_modeset(POE) snd_hda_
[ 49.420282] hid_generic usbhid hid igb i2c_piix4 nvme dca ahci i2c_algo_bit nvme_core libahci gpio_amdpt wmi gpio_generic
[ 49.420293] ---[ end trace 75bda976d7f1c02d ]---
[ 49.420305] RIP: 0010:xdr_
[ 49.420306] Code: 29 ea e8 85 f4 ff ff 44 8b 63 34 8b 43 3c 45 29 ec 44 29 e8 3b 43 40 44 89 63 34 89 43 3c 73 03 89 43 40 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 44 00 00 4c 8d 54 24 08 48 83 e4 f0 b9 04 00 00 00 41
[ 49.420307] RSP: 0018:ffffb93787
[ 49.420309] RAX: 000000000000000c RBX: 000000000000006c RCX: 000000000000001c
[ 49.420310] RDX: 000000000000005c RSI: 0000000000000010 RDI: ffff8e1a87c56e50
[ 49.420311] RBP: ffffb93787be7b50 R08: ffff8e1b06999700 R09: 0000000000000000
[ 49.420312] R10: 00000000ffffffff R11: ffff8e1b0ecd1cd0 R12: ffff8e1a87c56e50
[ 49.420312] R13: ffffb93787be7c00 R14: 0000000000000058 R15: ffffffffc228e8c0
[ 49.420314] FS: 000000000000000
[ 49.420315] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 49.420316] CR2: 00007ffa1faeb000 CR3: 0000000f19abe000 CR4: 0000000000340ee0
.
[Jan 1 03:45] ------------[ cut here ]------------
[ +0,000002] kernel BUG at /build/
[ +0,000006] invalid opcode: 0000 [#1] SMP NOPTI
[ +0,000002] CPU: 4 PID: 28219 Comm: kworker/u64:2 Tainted: P OE 5.0.0-35-generic #38~18.04.1-Ubuntu
[ +0,000001] Hardware name: System manufacturer System Product Name/ROG CROSSHAIR VII HERO (WI-FI), BIOS 3004 12/16/2019
[ +0,000011] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ +0,000010] RIP: 0010:xdr_
[ +0,000001] Code: 29 ea e8 85 f4 ff ff 44 8b 63 34 8b 43 3c 45 29 ec 44 29 e8 3b 43 40 44 89 63 34 89 43 3c 73 03 89 43 40 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 44 00 00 4c 8d 54 24 08 48 83 e4 f0 b9 04 00 00 00 41
[ +0,000001] RSP: 0018:ffffa2dd18
[ +0,000001] RAX: 0000000000000010 RBX: 0000000000000070 RCX: 000000000000001c
[ +0,000001] RDX: 000000000000005c RSI: 0000000000000014 RDI: ffff8b96c0856650
[ +0,000001] RBP: ffffa2dd18117b40 R08: ffff8b97d1f82e00 R09: 0000000000000000
[ +0,000000] R10: 1d1cc51b00000000 R11: ffff8b97cf00e520 R12: ffff8b96c0856650
[ +0,000001] R13: ffffa2dd18117bf0 R14: 0000000000000058 R15: ffffffffc0eb8920
[ +0,000001] FS: 000000000000000
[ +0,000001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0,000001] CR2: 0000191e985bac88 CR3: 0000000fd656c000 CR4: 0000000000340ee0
[ +0,000001] Call Trace:
[ +0,000009] xdr_buf_
[ +0,000003] ? kzfree+0x2d/0x40
[ +0,000002] ? crypto_
[ +0,000003] gss_unwrap_
[ +0,000002] ? gss_unwrap_
[ +0,000002] gss_unwrap_
[ +0,000002] ? kmem_cache_
[ +0,000002] ? gss_unwrap_
[ +0,000002] ? gss_validate+
[ +0,000008] ? nfs4_xdr_
[ +0,000008] rpcauth_
[ +0,000007] ? nfs4_xdr_
[ +0,000007] call_decode+
[ +0,000002] ? __switch_
[ +0,000006] ? call_refreshres
[ +0,000006] __rpc_execute+
[ +0,000007] rpc_async_
[ +0,000002] process_
[ +0,000002] worker_
[ +0,000001] kthread+0x121/0x140
[ +0,000001] ? process_
[ +0,000002] ? kthread_
[ +0,000001] ret_from_
[ +0,000001] Modules linked in: nls_utf8 udf crc_itu_t rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache edac_mce_amd snd_hda_codec_hdmi kvm irqbypass joydev crct10dif_pclmul nvidia_uvm(OE) crc32_pclmul hid_roccat_koneplus nvidia_drm(POE) hid_roccat ghash_clmulni_intel hid_roccat_common nvidia_modeset(POE) nvidia(POE) snd_usb_audio snd_hda_
snd_usbmidi_lib snd_hda_
fb_sys_fops syscopyarea sysfillrect snd_seq_device sysimgblt snd_timer k10temp ccp snd soundcore mac_hid sch_fq_codel asus_wmi_
[ +0,000019] hid_plantronics hid_generic usbhid hid igb i2c_piix4 dca i2c_algo_bit ahci nvme libahci nvme_core wmi gpio_amdpt gpio_generic
[ +0,000008] ---[ end trace 4314523bc923f697 ]---
[ +0,000007] RIP: 0010:xdr_
[ +0,000001] Code: 29 ea e8 85 f4 ff ff 44 8b 63 34 8b 43 3c 45 29 ec 44 29 e8 3b 43 40 44 89 63 34 89 43 3c 73 03 89 43 40 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 44 00 00 4c 8d 54 24 08 48 83 e4 f0 b9 04 00 00 00 41
[ +0,000001] RSP: 0018:ffffa2dd18
[ +0,000001] RAX: 0000000000000010 RBX: 0000000000000070 RCX: 000000000000001c
[ +0,000001] RDX: 000000000000005c RSI: 0000000000000014 RDI: ffff8b96c0856650
[ +0,000000] RBP: ffffa2dd18117b40 R08: ffff8b97d1f82e00 R09: 0000000000000000
[ +0,000001] R10: 1d1cc51b00000000 R11: ffff8b97cf00e520 R12: ffff8b96c0856650
[ +0,000001] R13: ffffa2dd18117bf0 R14: 0000000000000058 R15: ffffffffc0eb8920
[ +0,000001] FS: 000000000000000
[ +0,000001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0,000000] CR2: 0000191e985bac88 CR3: 0000000fd656c000 CR4: 0000000000340ee0
.
With a little compile-
* 4.15.0-69
* 4.15.0-70
* 4.15.0-72
* 5.0.0-32 (current daily driver, runs without a hassle, max test length 2d 4h 33m - I am writing this bug report on it)
But the following kernels do not run stable:
* 5.0.0-35 (second stack-trace from above)
* 5.0.0-37 (first stack-trace from above; as you can see, it already throws the error 49s after boot)
* 5.3.0-24
$ lspci | grep -i ether
06:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03
$ mount | grep filer
filer:/ on /share type nfs4 (rw,noatime,
filer:/home/michael on /share/home/michael type nfs4 (rw,noatime,
$ cat /etc/fstab | grep -i filer
filer:/ /share/ nfs4 nfsvers=
summary:
- invalid opcode xdr_buf_read_netobj on >= 5.0.0-35
+ invalid opcode xdr_buf_read_netobj on > 5.0.0-32
summary:
- invalid opcode xdr_buf_read_netobj on > 5.0.0-32
+ invalid opcode xdr_buf_read_netobj
summary:
- invalid opcode xdr_buf_read_netobj
+ invalid opcode xdr_buf_read_netobj on nfs4+krb5i directory
description: | updated |
Changed in linux (Ubuntu Disco):
importance: Undecided → Medium
Changed in linux (Ubuntu Disco):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Disco):
status: Fix Committed → Won't Fix
I am a total noob at this, but looking at the source code (on this random website) at the line from the stack trace, https://elixir.bootlin.com/linux/v5.0/source/net/sunrpc/xdr.c#L434, there is a BUG_ON macro(?) up until kernel 5.4, and it has then been rewritten in kernel 5.5: https://elixir.bootlin.com/linux/v5.5-rc1/source/net/sunrpc/xdr.c#L447
https:/