NULL deference in nfs4_get_valid_delegation

Bug #1885010 reported by Elvis Stansvik
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   High         Unassigned
Eoan             Fix Released   High         Unassigned
Focal            Fix Released   High         Unassigned

Bug Description

We are getting the following on an NFSv4 client running focal (kernel 5.4.0-33.37):

[296787.347971] BUG: unable to handle page fault for address: ffffffffffffffb0
[296787.350255] #PF: supervisor read access in kernel mode
[296787.352315] #PF: error_code(0x0000) - not-present page
[296787.354137] PGD 15bf00e067 P4D 15bf00e067 PUD 15bf010067 PMD 0
[296787.355798] Oops: 0000 [#2] SMP NOPTI
[296787.357271] CPU: 49 PID: 605315 Comm: kworker/u131:3 Tainted: P D OE 5.4.0-33-generic #37-Ubuntu
[296787.358756] Hardware name: GIGABYTE G291-Z20-00/MZ21-G20-00, BIOS F06 10/04/2019
[296787.360274] Workqueue: rpciod rpc_async_schedule [sunrpc]
[296787.361790] RIP: 0010:nfs4_get_valid_delegation+0xd/0x30 [nfsv4]
[296787.363281] Code: 89 ef e8 06 c0 f9 ff e9 ec fd ff ff 90 0f 1f 44 00 00 55 48 89 e5 f0 80 4f 48 08 5d c3 0f 1f 44 00 00 55 31 f6 48 89 e5 41 54 <4c> 8b 67 b0 4c 89 e7 e8 07 f9 ff ff 84 c0 b8 00 00 00 00 4c 0f 44
[296787.366780] RSP: 0018:ffffb7b1634a7d98 EFLAGS: 00010246
[296787.368740] RAX: ffff9ef2958e9b00 RBX: ffff9ef59f910000 RCX: 0000000000000000
[296787.370648] RDX: 0000000000008000 RSI: 0000000000000000 RDI: 0000000000000000
[296787.372559] RBP: ffffb7b1634a7da0 R08: 0000000000000000 R09: 8080808080808080
[296787.374441] R10: ffff9ef731e9d26c R11: 0000000000000018 R12: ffff9ef781f22600
[296787.376330] R13: 0000000000000000 R14: ffff9efe1db4bc00 R15: ffffffffc0cc2950
[296787.378220] FS: 0000000000000000(0000) GS:ffff9ef78fc40000(0000) knlGS:0000000000000000
[296787.380165] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[296787.382076] CR2: ffffffffffffffb0 CR3: 0000001a2799c000 CR4: 00000000003406e0
[296787.384031] Call Trace:
[296787.385985] nfs4_open_prepare+0x89/0x1e0 [nfsv4]
[296787.387973] rpc_prepare_task+0x1f/0x30 [sunrpc]
[296787.389971] __rpc_execute+0x8c/0x3a0 [sunrpc]
[296787.391903] rpc_async_schedule+0x30/0x50 [sunrpc]
[296787.393787] process_one_work+0x1eb/0x3b0
[296787.395617] worker_thread+0x4d/0x400
[296787.397431] kthread+0x104/0x140
[296787.399166] ? process_one_work+0x3b0/0x3b0
[296787.400868] ? kthread_park+0x90/0x90
[296787.402518] ret_from_fork+0x1f/0x40
[296787.404158] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc aufs overlay md4 cmac nls_utf8 cifs libarc4 fscache libdes binfmt_misc snd_hda_codec_hdmi amd64_edac_mod edac_mce_amd ipmi_ssif nls_iso8859_1 kvm_amd kvm snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_seq_midi snd_seq_midi_event snd_hda_core snd_rawmidi snd_hwdep snd_pcm snd_seq snd_seq_device snd_timer ucsi_ccg snd typec_ucsi typec soundcore k10temp ccp ipmi_si mac_hid nvidia_uvm(OE) sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 mlx5_ib nvidia_drm(POE) nvidia_modeset(POE) ib_uverbs ib_core nvidia(POE) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ast drm_vram_helper aesni_intel i2c_algo_bit crypto_simd ixgbe cryptd ttm glue_helper xfrm_algo
[296787.404232] drm_kms_helper mlx5_core dca mdio syscopyarea sysfillrect sysimgblt fb_sys_fops nvme pci_hyperv_intf drm tls nvme_core mlxfw ahci ipmi_devintf i2c_piix4 libahci ipmi_msghandler i2c_nvidia_gpu
[296787.421858] CR2: ffffffffffffffb0
[296787.423680] ---[ end trace 2cf3edda87955a36 ]---
[296787.425547] RIP: 0010:nfs4_get_valid_delegation+0xd/0x30 [nfsv4]
[296787.427389] Code: 89 ef e8 06 c0 f9 ff e9 ec fd ff ff 90 0f 1f 44 00 00 55 48 89 e5 f0 80 4f 48 08 5d c3 0f 1f 44 00 00 55 31 f6 48 89 e5 41 54 <4c> 8b 67 b0 4c 89 e7 e8 07 f9 ff ff 84 c0 b8 00 00 00 00 4c 0f 44
[296787.431172] RSP: 0018:ffffb7b1615e3d98 EFLAGS: 00010246
[296787.433050] RAX: ffff9ee9faf45ec0 RBX: ffff9ef16c5dd000 RCX: 0000000000000000
[296787.434922] RDX: 0000000000008000 RSI: 0000000000000000 RDI: 0000000000000000
[296787.436810] RBP: ffffb7b1615e3da0 R08: 0000000000000000 R09: 8080808080808080
[296787.438673] R10: ffff9ef26a0b8c6c R11: 0000000000000018 R12: ffff9ef7817cfa00
[296787.440539] R13: 0000000000000004 R14: ffff9ef8bdeb0400 R15: ffffffffc0cc2950
[296787.442400] FS: 0000000000000000(0000) GS:ffff9ef78fc40000(0000) knlGS:0000000000000000
[296787.444289] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[296787.446126] CR2: ffffffffffffffb0 CR3: 0000001a2799c000 CR4: 00000000003406e0

The problem is a known issue which has been fixed upstream:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=29fe839976266bc7c55b927360a1daae57477723

The patch is a simple two-line fix.

Would be great if you could do an SRU and add that upstream patch.
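
For readers following the oops above, here is a minimal sketch of how the fault address lines up with the register dump. It decodes only the four bytes marked in the "Code:" line (`4c 8b 67 b0`) and combines them with the RDI value from the trace. The helper names are hypothetical, and reading the -0x50 offset as a step back from a NULL pointer into an enclosing structure is an interpretation, not something the log states.

```python
MASK64 = (1 << 64) - 1  # wrap results to 64 bits, like the CPU does


def decode_disp8(byte):
    """Sign-extend an 8-bit x86 displacement to a Python int."""
    return byte - 0x100 if byte & 0x80 else byte


def effective_address(base, disp):
    """Compute base + displacement modulo 2**64."""
    return (base + disp) & MASK64


# Faulting instruction from the "Code:" line, starting at the <4c> marker.
# 4c = REX prefix (W=1, R=1), 8b = MOV r64, r/m64,
# 67 = ModRM (mod=01 disp8, reg=r12 via REX.R, rm=rdi), b0 = disp8
insn = bytes.fromhex("4c8b67b0")

disp = decode_disp8(insn[3])  # displacement relative to RDI
rdi = 0x0                     # RDI from the register dump in the oops

fault = effective_address(rdi, disp)
print(hex(disp), hex(fault))
```

The computed address matches the CR2 value in the trace, which is consistent with the function being entered with a NULL pointer in RDI.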

Tags: focal
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1885010

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: focal
Guilherme G. Piccoli (gpiccoli) wrote :

Hi Elvis, thanks for your report! I've verified, and this fix is present on Focal kernel tag "Ubuntu-5.4.0-38.42" - can you give it a try with our proposed kernel (currently it's version 5.4.0-39)?

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.
Cheers,

Guilherme

Changed in linux (Ubuntu):
status: Incomplete → Fix Committed
importance: Undecided → Medium
Elvis Stansvik (elvstone) wrote :

Thanks Guilherme, shame on me for not even checking that :)

I will try with 5.4.0-39 and report back. It takes a while for our workload to run before it starts hitting the issue, but I can give a somewhat confident answer in an hour or so. By tomorrow I will be very sure. I'm optimistic.

Guilherme G. Piccoli (gpiccoli) wrote :

Thanks Elvis, no hurry. And no need to be ashamed, there's a bunch of tags/trees, and this is not released yet (just on proposed), so we're here to help =)

Cheers,

Guilherme

Elvis Stansvik (elvstone) wrote :

I'm sad to say that it did not fix our problem. The crash happened again after running for about an hour:

[ 4914.661585] BUG: unable to handle page fault for address: ffffffffffffffb0
[ 4914.661620] #PF: supervisor read access in kernel mode
[ 4914.661638] #PF: error_code(0x0000) - not-present page
[ 4914.661656] PGD 1b6580e067 P4D 1b6580e067 PUD 1b65810067 PMD 0
[ 4914.661680] Oops: 0000 [#1] SMP NOPTI
[ 4914.661695] CPU: 12 PID: 4840 Comm: kworker/u130:5 Tainted: P OE 5.4.0-39-generic #43-Ubuntu
[ 4914.661725] Hardware name: GIGABYTE G291-Z20-00/MZ21-G20-00, BIOS F06 10/04/2019
[ 4914.661770] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 4914.661803] RIP: 0010:nfs4_get_valid_delegation+0xd/0x30 [nfsv4]
[ 4914.661824] Code: 89 ef e8 76 0f ce ff e9 ec fd ff ff 90 0f 1f 44 00 00 55 48 89 e5 f0 80 4f 48 08 5d c3 0f 1f 44 00 00 55 31 f6 48 89 e5 41 54 <4c> 8b 67 b0 4c 89 e7 e8 07 f9 ff ff 84 c0 b8 00 00 00 00 4c 0f 44
[ 4914.661879] RSP: 0018:ffffa0c1e0193d98 EFLAGS: 00010246
[ 4914.661898] RAX: ffff8b35bd8400c0 RBX: ffff8b3dc849f000 RCX: 0000000000000000
[ 4914.661920] RDX: 0000000000008000 RSI: 0000000000000000 RDI: 0000000000000000
[ 4914.661942] RBP: ffffa0c1e0193da0 R08: 0000000000000000 R09: 8080808080808080
[ 4914.661965] R10: ffff8b3da8b397ac R11: 0000000000000018 R12: ffff8b3d93417300
[ 4914.661988] R13: 0000000000000000 R14: ffff8b45ccf6e000 R15: ffffffffc0696950
[ 4914.662011] FS: 0000000000000000(0000) GS:ffff8b3dcf900000(0000) knlGS:0000000000000000
[ 4914.662036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4914.662056] CR2: ffffffffffffffb0 CR3: 00000007df20e000 CR4: 00000000003406e0
[ 4914.662078] Call Trace:
[ 4914.662101] nfs4_open_prepare+0x89/0x1e0 [nfsv4]
[ 4914.662128] rpc_prepare_task+0x1f/0x30 [sunrpc]
[ 4914.662154] __rpc_execute+0x8c/0x3a0 [sunrpc]
[ 4914.662179] rpc_async_schedule+0x30/0x50 [sunrpc]
[ 4914.662199] process_one_work+0x1eb/0x3b0
[ 4914.662215] worker_thread+0x4d/0x400
[ 4914.662230] kthread+0x104/0x140
[ 4914.662244] ? process_one_work+0x3b0/0x3b0
[ 4914.662259] ? kthread_park+0x90/0x90
[ 4914.662275] ret_from_fork+0x1f/0x40
[ 4914.662290] Modules linked in: rpcsec_gss_krb5 auth_rpcgss md4 cmac nfsv4 nls_utf8 cifs nfs libarc4 lockd grace libdes fscache aufs overlay ipmi_ssif binfmt_misc snd_hda_codec_hdmi amd64_edac_mod edac_mce_amd kvm_amd nls_iso8859_1 kvm snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi ucsi_ccg snd_seq_midi_event typec_ucsi typec snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore k10temp ccp ipmi_si mac_hid nvidia_uvm(OE) sch_fq_codel parport_pc ppdev lp parport sunrpc ip_tables x_tables autofs4 mlx5_ib nvidia_drm(POE) nvidia_modeset(POE) ib_uverbs ib_core nvidia(POE) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ast drm_vram_helper i2c_algo_bit ttm aesni_intel mlx5_core drm_kms_helper ixgbe syscopyarea crypto_simd sysfillrect nvme sysimgblt fb_sys_fops pci_hyperv_intf cryptd xfrm_algo dca glue_helper mdio drm tls nvme_core ahci mlxfw libahci ipmi_devintf i2c_piix4 ipmi_msghandler i2c_nvidia_gpu
[ 4914.662566] CR2: ffffffffffffffb0
[ 4914.662581] ---[ end trace cdb67bb8c51af6b1 ]---
[ 4914...


Matthew Ruffell (mruffell) wrote :

Hi Elvis, Guilherme

Guilherme, I went and had a look at what was built into 5.4.0-39-generic, and the commit "nfs: fix NULL deference in nfs4_get_valid_delegation" wasn't there.

From what I can see looking at the git tree located at:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/?h=master-next

The commit is there:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/?h=master-next&id=4cb9edb12735c41f48dff1741bf17e99384f55e3

But it is currently untagged, on master-next. From the changelog of 5.4.0-40-generic, we can see it is listed there:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/commit/?h=master-next&id=96bef2d870f538dd0849a0f8d7e343a01ce6bbf0

So I am expecting the fix to land in 5.4.0-40-generic which should be tagged shortly, and likely built and pushed to -proposed early next week as a part of the next SRU cycle.

Elvis, this is why 5.4.0-39-generic is still broken, since the fix will likely arrive in 5.4.0-40-generic.
If you like, we can make you a test kernel with the fix so you can double check the patch really fixes your problems. It seems 5.4.0-39-generic is in -updates now, but it is untagged. If it is tagged tomorrow, either myself or Guilherme will build you a test kernel.

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Eoan):
status: New → Fix Committed
Changed in linux (Ubuntu Focal):
status: New → Fix Committed
Elvis Stansvik (elvstone) wrote :

Hi Matthew, thanks a lot for looking into this. Like you, I looked at Git to double-check, and thought the fix was in there, but it was next on my list to actually double-double-check, since I thought it so strange the issue wasn't gone.

Thanks for offering to build a test kernel! I will have a go at it myself right away. I was actually on my way to doing that before, but got a silly error at the end of the build because I had been sloppy with installing the build depends and was missing gawk (so some final step failed where gawk was used in the build machinery). But just as I got to that error, I was informed that the fix was probably in 5.4.0-39 and canceled my building efforts.

I'll report back how it goes.

Guilherme G. Piccoli (gpiccoli) wrote :

Hi Matthew, thanks for looking into that! I was re-checking the Git tree, and the commit _is_ present on 5.4.0-38 but... not in 5.4.0-39, heheh.
And it's back on 5.4.0-40. What happened is that 5.4.0-39 was a security release, based on 5.4.0-37! So, it's basically 5.4.0-37 + 1 security patch.

My mistake was to assume -39 contained everything in -38! I've discussed that with kernel team and they mentioned it's common to have that kind of situation due to security fixes jumping in and forcing a respin.
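
To make the situation above concrete, here is a small sketch that models each kernel as a parent release plus the patches added on top. The -37/-38/-39 relationships follow the description in this comment. The parent of -40 and the placeholder patch name are assumptions made only for illustration.

```python
FIX = "nfs: fix NULL deference in nfs4_get_valid_delegation"

# Each release: the release it was based on, plus the patches added on top.
RELEASES = {
    "5.4.0-37": {"parent": None, "added": set()},
    "5.4.0-38": {"parent": "5.4.0-37", "added": {FIX}},
    # Security respin: based on -37, not on -38.
    "5.4.0-39": {"parent": "5.4.0-37", "added": {"security patch (placeholder)"}},
    # Assumption: -40 builds on -38 and re-applies the security patch.
    "5.4.0-40": {"parent": "5.4.0-38", "added": {"security patch (placeholder)"}},
}


def patches_in(version):
    """Collect all patches reachable by walking the parent chain."""
    patches = set()
    while version is not None:
        entry = RELEASES[version]
        patches |= entry["added"]
        version = entry["parent"]
    return patches


for version in sorted(RELEASES):
    print(version, "has fix:", FIX in patches_in(version))
```

The point of the sketch is that a higher version number does not by itself imply that everything from the previous number is included; only the actual lineage does.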

So Elvis, the plan is to have 5.4.0-40 released around next week. You could use a mainline kernel (available at [0]) meanwhile, and we'll let you know when 5.4.0-40 gets released. If it's urgent for you to have the 5.4 series with the fix, as Matthew said, we could build you a test kernel with this fix.
Cheers,

Guilherme

[0] https://kernel.ubuntu.com/~kernel-ppa/mainline/

Elvis Stansvik (elvstone) wrote :

No worries Guilherme, that explains it.

I've now built my own kernel from the Ubuntu-5.4.0-38.42 Git tag, which I've verified includes the fix. I'm running our workload again with this kernel and will know within a few hours whether it's looking good. The job as a whole is going to take the full weekend to finish. If it works fine, we can run on this custom kernel until 5.4.0-40 is out around next week.

I'm both testing and trying to get actual work done here (getting our job to run to completion). Your fast support here was much appreciated.

Guilherme G. Piccoli (gpiccoli) wrote :

Oh, good news Elvis: just checked and 5.4.0-40 is available in -proposed. Can you give it a try?
Thanks,

Guilherme

Elvis Stansvik (elvstone) wrote :

Ah, I would really like to get this job that is running done first. It's processing some data that we ideally should deliver on Monday. It's looking good so far, so I'm quite confident the fix is good.

When it's done processing, I can give 5.4.0-40 from proposed a try and restart the job again. But I'm pretty sure that it's going to be good.

Guilherme G. Piccoli (gpiccoli) wrote :

That's great for me Elvis, whenever you can =)
I'm glad you could build the kernel with the fix!
Cheers,

Guilherme

Elvis Stansvik (elvstone) wrote :

The long production job with the custom built kernel finished successfully. I'm very confident this fixes our issue. Now running with 5.4.0-40.44 from focal-proposed as a test. I'll let it run for a few hours. It usually hits the crash in an hour or so.

Elvis Stansvik (elvstone) wrote :

The test on 5.4.0-40.44 from focal-proposed was successful, so I would say that the fix did it. Thanks for including it!

Matthew Ruffell (mruffell) wrote :

Hi Elvis, excellent to hear your long production job finished successfully on your test kernel and on 5.4.0-40 from -proposed.

Looking at the schedule, 5.4.0-40 should be released to -updates in the next day or two, and it will be exactly the same package that was in -proposed. I'll mark this bug as Fix Released when the kernel is released.

Elvis Stansvik (elvstone) wrote :

👍

Guilherme G. Piccoli (gpiccoli) wrote :

Hi Elvis, both kernels (for Eoan and Focal) are officially released; the versions are 5.3.0-62 (Eoan) and 5.4.0-40 (Focal). Thanks for the report!
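
As a rough aid for checking a machine against the released versions named above, here is a minimal sketch that compares a kernel release string (as printed by `uname -r`) with those two thresholds. The function name and the threshold table are illustrative. The check covers only the two series mentioned here, and it would report a proposed-only build such as 5.4.0-38 as unfixed even though that tag carried the patch.

```python
import re

# Released versions carrying the fix, as stated in this thread:
# 5.3.0-62 (Eoan) and 5.4.0-40 (Focal).
FIXED_ABI = {"5.3.0": 62, "5.4.0": 40}


def has_fix(release):
    """Return True/False for known series, or None if it cannot be judged."""
    match = re.match(r"^(\d+\.\d+\.\d+)-(\d+)", release)
    if not match:
        return None
    base, abi = match.group(1), int(match.group(2))
    if base not in FIXED_ABI:
        return None  # series not covered by this bug report
    return abi >= FIXED_ABI[base]


print(has_fix("5.4.0-33-generic"))
print(has_fix("5.4.0-40-generic"))
```

On a live system, the string from `platform.release()` in the Python standard library can be passed in instead of a literal.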
Cheers,

Guilherme

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Eoan):
importance: Undecided → High
Changed in linux (Ubuntu Focal):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Medium → High