kernel panic using CIFS share in smb2_push_mandatory_locks()

Bug #1795659 reported by Stefan Stranz
30
This bug affects 5 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
High
Guilherme G. Piccoli
Xenial
Invalid
Undecided
Guilherme G. Piccoli
Bionic
Fix Released
High
Guilherme G. Piccoli
Cosmic
Won't Fix
High
Guilherme G. Piccoli
Disco
Fix Released
High
Guilherme G. Piccoli
linux-azure (Ubuntu)
Fix Released
Critical
Unassigned
Xenial
Fix Released
Critical
Unassigned
Bionic
Invalid
Undecided
Unassigned
Cosmic
Invalid
Undecided
Unassigned
Disco
Invalid
Undecided
Unassigned

Bug Description

NOTICE: The new patch merge is being worked on https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1856949 - if you face this issue, please report there!

[Impact]

* We got reports of a kernel crash in cifs module with the following signature:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: smb2_push_mandatory_locks+0x10e/0x3b0 [cifs]
PGD 0 P4D 0
RIP: 0010:smb2_push_mandatory_locks+0x10e/0x3b0 [cifs]
Call Trace:
 cifs_oplock_break+0x12f/0x3d0 [cifs]
 process_one_work+0x14d/0x410
 worker_thread+0x4b/0x460
 kthread+0x105/0x140
[...]

* Low-level analysis (decodecode script output and the objdump of the function) revealed that we are crashing in a NULL ptr dereference when trying to access "cfile->tlink"; below, a snippet of the objdump at function smb2_push_mandatory_locks():

[...]
mov 0x10(%r14),%r15 # %r15 = cifsFileInfo *cfile
mov 0x18(%r14),%rbx # %rbx = cifsLockInfo *li = (fdlocks->locks)
lea 0x18(%r14),%r12
mov 0x90(%r15),%rax # %rax = struct tcon_link *tlink (cfile->tlink)
cmp %r12,%rbx
mov 0x38(%rax),%rax # <--- TRAP [trying to get cifs_tcon *tl_tcon]
[...]

* After discussing the issue with CIFS maintainers (Steve French and Pavel Shilovsky) they suggested commit b98749cac4a6 ("CIFS: keep FileInfo handle live during oplock break") [http://git.kernel.org/linus/b98749cac4a6] as a fix for multiple reports of this kind of crash.

* The fix was sent to stable kernels and is present in Ubuntu kernels 5.0 and newer. We are requesting the SRU for this patch here in order to fix the crashes, after reports of successful testing with the patch (see below section) and since the patch is restricted to the cifs module scope and accepted on linux stable.

* Alternatively the issue is known to be avoided when oplocks are disabled using "cifs.enable_oplocks=N" module parameter.

[Test case]

* Unfortunately we cannot reproduce the issue. The patch proposed here was
validated by us with xfstests (instructions followed from
https://wiki.samba.org/index.php/Xfstesting-cifs) and fio. Also, we
have a user report of test validation using LISA (https://github.com/LIS/LISAv2).

* Using xfstest with the exclusions proposed in the link above we managed to get the same results as a non-patched kernel, i.e., the same tests failed in both kernels, we didn't get worse results with the patch. Fio also didn't show noticeable performance regression with the patch.

[Regression potential]

* The patch was validated by the cifs filesystem maintainers (in fact they suggested its inclusion in Ubuntu) and by the aforementioned tests; also, the scope is restricted to cifs only so the likelihood of regressions is considered low.

* Due to the nature of the code modification (add a new reference of a file handler and manipulate it in different places), I consider that if we have a regression it'll manifest as deadlock/blocked tasks, not something more serious like crashes or data corruption.

Revision history for this message
Stefan Stranz (sfs1988) wrote :
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.19 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc6

affects: linux-hwe (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
tags: added: kernel-da-key
Changed in linux (Ubuntu):
status: Triaged → Incomplete
Revision history for this message
Stefan Stranz (sfs1988) wrote :

I'm unable to fully test the upstream bugs as I can't install linux-headers-4.19.0-041900rc6-generic due to libssl1.1 not being available for 4.15. System does boot without the generic package on the new version though.

While i'm looking at upgrading the boxes to 18.04 from the 16.04 install, these are production servers and I can't just swap them over.

I can see the random reboots happening as far back as February, but I don't have a log from those so it could be a different issue then. I know both -34 and -36 kernel builds had the issue.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Thanks for reporting that Stefan! There's a Debian bug report (although the kernel
version is different, seems the same issue): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=830771 .

Also, recently I had a report of this issue happening in Azure instance, kernel version 4.15.0-1041-azure.

Cheers,

Guilherme

Changed in linux (Ubuntu):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
status: Expired → Confirmed
Changed in linux (Ubuntu Bionic):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
tags: added: sts
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Hi, have some news in this one. I've emailed Microsoft CIFS maintainers, they recommended the following patch as a potential fix:
b98749cac4a6 ("CIFS: keep FileInfo handle live during oplock break").

There's a PPA with a kernel built including this patch, for those interested in testing: https://launchpad.net/~dragan-s/+archive/ubuntu/lp1795659 .

Cheers,

Guilherme

Changed in linux (Ubuntu Cosmic):
status: New → Confirmed
Changed in linux (Ubuntu Disco):
status: New → Confirmed
Changed in linux (Ubuntu Cosmic):
importance: Undecided → High
Changed in linux (Ubuntu Disco):
importance: Undecided → High
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in linux (Ubuntu Disco):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in linux (Ubuntu Disco):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Cosmic):
status: Confirmed → Won't Fix
Changed in linux (Ubuntu Bionic):
status: Confirmed → In Progress
summary: - kernel panic using CIFS share smb2_push_mandatory_locks
+ kernel panic using CIFS share in smb2_push_mandatory_locks()
description: updated
description: updated
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

SRU request sent to kernel-team mailing list: https://lists.ubuntu.com/archives/kernel-team/2019-July/102354.html

Cheers,

Guilherme

Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Xenial):
status: New → Invalid
Changed in linux-azure (Ubuntu Bionic):
status: New → Invalid
Changed in linux-azure (Ubuntu Cosmic):
status: New → Invalid
Changed in linux-azure (Ubuntu Disco):
status: New → Invalid
Changed in linux-azure (Ubuntu):
importance: Undecided → Critical
status: New → Confirmed
Changed in linux-azure (Ubuntu Xenial):
importance: Undecided → Critical
Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Xenial):
status: New → Fix Committed
Brad Figg (brad-figg)
tags: added: cscc
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.15.0-1052.57

---------------
linux-azure (4.15.0-1052.57) xenial; urgency=medium

  * xenial/linux-azure: 4.15.0-1052.57 -proposed tracker (LP: #1837632)

  * kernel panic using CIFS share in smb2_push_mandatory_locks() (LP: #1795659)
    - CIFS: keep FileInfo handle live during oplock break

 -- Marcelo Henrique Cerri <email address hidden> Tue, 23 Jul 2019 13:19:53 -0300

Changed in linux-azure (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in linux-azure (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Xenial):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

I've validated the -proposed kernel for Bionic (4.15.0-56) using the xfstests suite mentioned in the description. Interestingly, the patch seems to fix the test generic/504, which failed both in smb2 and smb3 only in kernel 4.15.0-55. Other than that, the same amount of tests failed in both cases, and no significant performance impact was noticed.

Cheers,

Guilherme

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (171.3 KiB)

This bug was fixed in the package linux - 4.15.0-58.64

---------------
linux (4.15.0-58.64) bionic; urgency=medium

  * unable to handle kernel NULL pointer dereference at 000000000000002c (IP:
    iget5_locked+0x9e/0x1f0) (LP: #1838982)
    - Revert "ovl: set I_CREATING on inode being created"
    - Revert "new primitive: discard_new_inode()"

linux (4.15.0-57.63) bionic; urgency=medium

  * CVE-2019-1125
    - x86/cpufeatures: Carve out CQM features retrieval
    - x86/cpufeatures: Combine word 11 and 12 into a new scattered features word
    - x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
    - x86/speculation: Enable Spectre v1 swapgs mitigations
    - x86/entry/64: Use JMP instead of JMPQ
    - x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS

  * Packaging resync (LP: #1786013)
    - update dkms package versions

linux (4.15.0-56.62) bionic; urgency=medium

  * bionic/linux: 4.15.0-56.62 -proposed tracker (LP: #1837626)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] update helper scripts

  * CVE-2019-2101
    - media: uvcvideo: Fix 'type' check leading to overflow

  * hibmc-drm Causes Unreadable Display for Huawei amd64 Servers (LP: #1762940)
    - [Config] Set CONFIG_DRM_HISI_HIBMC to arm64 only
    - SAUCE: Make CONFIG_DRM_HISI_HIBMC depend on ARM64

  * Bionic: support for Solarflare X2542 network adapter (sfc driver)
    (LP: #1836635)
    - sfc: make mem_bar a function rather than a constant
    - sfc: support VI strides other than 8k
    - sfc: add Medford2 (SFC9250) PCI Device IDs
    - sfc: improve PTP error reporting
    - sfc: update EF10 register definitions
    - sfc: populate the timer reload field
    - sfc: update MCDI protocol headers
    - sfc: support variable number of MAC stats
    - sfc: expose FEC stats on Medford2
    - sfc: expose CTPIO stats on NICs that support them
    - sfc: basic MCDI mapping of 25/50/100G link speeds
    - sfc: support the ethtool ksettings API properly so that 25/50/100G works
    - sfc: add bits for 25/50/100G supported/advertised speeds
    - sfc: remove tx and MCDI handling from NAPI budget consideration
    - sfc: handle TX timestamps in the normal data path
    - sfc: add function to determine which TX timestamping method to use
    - sfc: use main datapath for HW timestamps if available
    - sfc: only enable TX timestamping if the adapter is licensed for it
    - sfc: MAC TX timestamp handling on the 8000 series
    - sfc: on 8000 series use TX queues for TX timestamps
    - sfc: only advertise TX timestamping if we have the license for it
    - sfc: simplify RX datapath timestamping
    - sfc: support separate PTP and general timestamping
    - sfc: support second + quarter ns time format for receive datapath
    - sfc: support Medford2 frequency adjustment format
    - sfc: add suffix to large constant in ptp
    - sfc: mark some unexported symbols as static
    - sfc: update MCDI protocol headers
    - sfc: support FEC configuration through ethtool
    - sfc: remove ctpio_dmabuf_start from stats
    - sfc: stop the TX queue before pushing new buffers

  * [18.04 FEAT] zKVM: Add hardwar...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Guillaume Penin (guillaume-penin) wrote :
Download full text (3.8 KiB)

Hi,

Unfortunately, the patch does not seem to solve the problem. Running Ubuntu 18.04 LTS with kernel 4.15.0-64, the crash still occurs with the same signature :

Oct 14 08:00:01 uzorldsp01 kernel: [109699.415372] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
Oct 14 08:00:01 uzorldsp01 kernel: [109699.420042] IP: smb2_push_mandatory_locks+0x10d/0x3c0 [cifs]
Oct 14 08:00:01 uzorldsp01 kernel: [109699.420042] PGD 0 P4D 0
Oct 14 08:00:01 uzorldsp01 kernel: [109699.420042] Oops: 0000 [#1] SMP PTI
Oct 14 08:00:01 uzorldsp01 kernel: [109699.420042] Modules linked in: btrfs zstd_compress xor raid6_pq ufs qnx4 minix ntfs msdos jfs xfs dm_snapshot dm_bufio cmac arc4 md4 nls_utf8 cifs ccm fscache nf_conntr
ack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack libcrc32c iptable_security sb_edac kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd
glue_helper cryptd intel_rapl_perf input_leds serio_raw hyperv_fb hv_balloon joydev mac_hid sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 hid_generic hid_hyperv hv_util
s hv_storvsc ptp hyperv_keyboard hv_netvsc hid scsi_transport_fc pps_core psmouse i2c_piix4 pata_acpi hv_vmbus floppy
Oct 14 08:00:01 uzorldsp01 kernel: [109699.472261] CPU: 0 PID: 50766 Comm: kworker/0:0 Not tainted 4.15.0-64-generic #73-Ubuntu
Oct 14 08:00:01 uzorldsp01 kernel: [109699.472261] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
Oct 14 08:00:01 uzorldsp01 kernel: [109699.472261] Workqueue: cifsoplockd cifs_oplock_break [cifs]
Oct 14 08:00:01 uzorldsp01 kernel: [109699.472261] RIP: 0010:smb2_push_mandatory_locks+0x10d/0x3c0 [cifs]
Oct 14 08:00:01 uzorldsp01 kernel: [109699.472261] RSP: 0000:ffffab360da2bdd8 EFLAGS: 00010246
Oct 14 08:00:01 uzorldsp01 kernel: [109699.472261] RAX: 0000000000000000 RBX: ffff9887646617d8 RCX: 0000000000000000
Oct 14 08:00:01 uzorldsp01 kernel: [109699.472261] RDX: 0000000000001000 RSI: 0000000000000000 RDI: ffff98876d006b80
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] RBP: ffffab360da2be28 R08: ffff988568596000 R09: ffff98876d006b80
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] R10: ffffab360da2bd98 R11: ffff988568596000 R12: 00000000000000aa
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] R13: ffff9887646617d8 R14: ffff9887646617c0 R15: ffff988764d7f200
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] FS: 0000000000000000(0000) GS:ffff98876d600000(0000) knlGS:0000000000000000
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] CR2: 0000000000000038 CR3: 000000046d80a003 CR4: 00000000001606f0
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] Call Trace:
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] cifs_oplock_break+0x131/0x410 [cifs]
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] process_one_work+0x1de/0x420
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] worker_thread+0x32/0x410
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472261] kthread+0x121/0x140
Oct 14 08:00:02 uzorldsp01 kernel: [109699.472...

Read more...

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Hi Guillaume, thanks for your report! Do you have more details on how did you reproduce it? It was "organic", or can you trigger that using some workload or test pattern? Also, based on the log output that you pasted, seems it took about 30 hours to reproduce, correct? Let me know if possible the topology of the cifs volumes in your system (how many mount points you have, how many targets - and where are they -, etc).

Do you think you could run a debug kernel in order to collect more data? I'll take a look in the code and see if I can instrument the cifs code to narrow how this race might be happening, let me know if you are able to test that.

Thanks,

Guilherme

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Hi Guillaume and all involved, it seems this bug still can occur even with the backported fix [0]. I found an upstream new fix that is quite promising, it addresses this specific oops. In the past it was thought by maintainers that other fix [0] could reduce the likelihood of those crashes in smb2_push_mandatory_locks (and it may worked, reducing the occurrence), but the fact is a proper fix was never worked until kernel 5.5.

The commit is 6f582b273ec2 ("CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks") [1]. The reasoning about the fix is that the struct cifsFileInfo is initialized and ready for usage before all members are initialized, like cifs->tlink (the one being dereferenced in most oops reports). The maintainer then enforced full struct initialization before it gets used.

I've built a 4.15 Bionic kernel with this fix, available in the following PPA:
launchpad.net/~gpiccoli/+archive/ubuntu/test1795659

To use this kernel, one just needs to run:
sudo add-apt-repository ppa:gpiccoli/test1795659
sudo apt-get update
sudo apt install linux-image-unsigned-4.15.0-74-generic linux-modules-4.15.0-74-generic linux-modules-extra-4.15.0-74-generic

Then reboot the machine and check if the right kernel is running; to verify that,
just run "uname -rv" and the output should be:
4.15.0-74-generic #84+TEST256303v20191229b1-Ubuntu <...>

In case anybody reproducing this issue can test the PPA kernel, I'd strongly appreciate it.
Cheers,

Guilherme

[0] http://git.kernel.org/linus/b98749cac4a6
[1] http://git.kernel.org/linus/6f582b273ec2

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

The new patch merge is being work on https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1856949 (thanks Juergh and Marcelo for this work). So in case you face this issue in the future, please use the LP #1856949 for reporting.

Thanks,

Guilherme

description: updated
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.