FSCache kernel panic on presumed CIFS share disconnect

Bug #1805062 reported by Tomasz J Góralczyk on 2018-11-26
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
High
Unassigned

Bug Description

What Happened
-------------

I have a setup with my home NAS server mounted via CIFS to an Ubuntu 18.04 VPS over OpenVPN. The CIFS share is served by Nginx as an open file directory, with fscache/cachefilesd being used to cache the files locally, to reduce the load on my home connection.

I was using this setup for web development by editing files directly on the CIFS share from another machine (mounted via LAN, independently of the OpenVPN server). After making changes to an HTML file, I noticed that the Nginx server was no longer responding.

I connected to the server via SSH to diagnose the issue and found that trying to perform a file listing (using ls) on the fileshare mount froze my shell, so I opened another SSH connection. I ran dmesg, expecting to find a CIFS timeout error, but instead there was an fscache error trace. Thinking it would help resolve the issue, I ran `systemd restart cachefilesd` to restart cachefilesd. My SSH connection froze again, but this time the server was no longer responding to SSH connections and pings. I was forced to perform a hard reset to restore operation.

Upon further inspection of the logs, I noticed that the NAS reconnected to the OpenVPN server between my first and second SSH connection, so it's possible that a network issue caused the initial fscache error. I have no way of confirming this, that's why I put "presumed" in the title.

I suspect the issue here might be a race condition when invalidating a cache object (the updated HTML file) during a network issue.

What I expected to happen
-------------------------

Accessing an updated file shouldn't cause an fscache error, even when a network error occurs.

Restarting cachefilesd shouldn't cause a fatal panic, even when a network error occurs.

Machine information
-------------------

/proc/version_signature: Ubuntu 4.15.0-20.21-generic 4.15.17

lsb_release:
Description: Ubuntu 18.04.1 LTS
Release: 18.04

apt-cache policy linux-image-4.15.0-20-generic:
linux-image-4.15.0-20-generic:
  Installed: 4.15.0-20.21
  Candidate: 4.15.0-20.21
  Version table:
 *** 4.15.0-20.21 500
        500 http://gb.archive.ubuntu.com/ubuntu bionic/main amd64 Packages
        100 /var/lib/dpkg/status

apt-cache policy cachefilesd cachefilesd:
cachefilesd:
  Installed: 0.10.10-0.1
  Candidate: 0.10.10-0.1
  Version table:
 *** 0.10.10-0.1 500
        500 http://gb.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages
        100 /var/lib/dpkg/status

apt-cache policy libsmbclient:
libsmbclient:
  Installed: (none)
  Candidate: 2:4.7.6+dfsg~ubuntu-0ubuntu2.4
  Version table:
     2:4.7.6+dfsg~ubuntu-0ubuntu2.4 500
        500 http://gb.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
     2:4.7.6+dfsg~ubuntu-0ubuntu2.2 500
        500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     2:4.7.6+dfsg~ubuntu-0ubuntu2 500
        500 http://gb.archive.ubuntu.com/ubuntu bionic/main amd64 Packages

apt-cache policy cifs-utils:
cifs-utils:
  Installed: 2:6.8-1
  Candidate: 2:6.8-1
  Version table:
 *** 2:6.8-1 500
        500 http://gb.archive.ubuntu.com/ubuntu bionic/main amd64 Packages
        100 /var/lib/dpkg/status

The VPS provider is UK2, I believe they use KVM.

journalctl output
-----------------

Private and irrelevant information removed.

Nov 26 00:30:31 hostname kernel: [kworke] preemptive burial: OBJ66 [DROP_OBJECT] 000000004b2987a1

--- [snip] (CacheFiles error + kernel trace) ---

Nov 26 00:34:01 hostname kernel: CacheFiles: Error: Object already active
Nov 26 00:34:01 hostname kernel: CacheFiles: object: OBJ6d
Nov 26 00:34:01 hostname kernel: CacheFiles: objstate=LOOK_UP_OBJECT fl=18 wbusy=2 ev=0[0]
Nov 26 00:34:01 hostname kernel: CacheFiles: ops=0 inp=0 exc=0
Nov 26 00:34:01 hostname kernel: CacheFiles: parent=00000000316fbe48
Nov 26 00:34:01 hostname kernel: CacheFiles: cookie=000000001fa898cd [pr=00000000f0382671 nd=00000000ba69f109 fl=22]
Nov 26 00:34:01 hostname kernel: CacheFiles: key=[8] 'dd05000000000000'
Nov 26 00:34:01 hostname kernel: ------------[ cut here ]------------
Nov 26 00:34:01 hostname kernel: kernel BUG at /build/linux-5s7Xkn/linux-4.15.0/fs/cachefiles/namei.c:164!
Nov 26 00:34:01 hostname kernel: invalid opcode: 0000 [#1] SMP PTI
Nov 26 00:34:01 hostname kernel: Modules linked in: binfmt_misc cmac arc4 md4 nls_utf8 cifs ccm cachefiles fscache ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_addrtype ipt_MASQUERADE nf_nat_masquerade_ip
Nov 26 00:34:01 hostname kernel: async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid cirrus ttm psmouse drm_kms_helper syscopyarea sysfillrect sysimgblt virtio_net fb_sys_fops virtio_blk drm i2c_piix4 pata_acpi floppy
Nov 26 00:34:01 hostname kernel: CPU: 0 PID: 22679 Comm: kworker/u64:2 Not tainted 4.15.0-20-generic #21-Ubuntu
Nov 26 00:34:01 hostname kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Nov 26 00:34:01 hostname kernel: Workqueue: fscache_object fscache_object_work_func [fscache]
Nov 26 00:34:01 hostname kernel: RIP: 0010:cachefiles_walk_to_object+0x820/0x8b0 [cachefiles]
Nov 26 00:34:01 hostname kernel: RSP: 0018:ffffa36bc5a87d40 EFLAGS: 00010202
Nov 26 00:34:01 hostname kernel: RAX: ffff96b739444801 RBX: ffff96b732f5e3a8 RCX: 0000000000847bda
Nov 26 00:34:01 hostname kernel: RDX: 0000000000847bd9 RSI: ffff96b73be27140 RDI: ffff96b73b807000
Nov 26 00:34:01 hostname kernel: RBP: ffffa36bc5a87de8 R08: 0000000000027140 R09: ffffffffc0709f4b
Nov 26 00:34:01 hostname kernel: R10: ffffd9cc40e51100 R11: 0000000000010101 R12: ffff96b732f5d200
Nov 26 00:34:01 hostname kernel: R13: ffff96b732f5d310 R14: ffff96b70730ec00 R15: ffff96b732f5e3a8
Nov 26 00:34:01 hostname kernel: FS: 0000000000000000(0000) GS:ffff96b73be00000(0000) knlGS:0000000000000000
Nov 26 00:34:01 hostname kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 26 00:34:01 hostname kernel: CR2: 00007ff8ad9d9f60 CR3: 000000002d522000 CR4: 00000000000006f0
Nov 26 00:34:01 hostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 26 00:34:01 hostname kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 26 00:34:01 hostname kernel: Call Trace:
Nov 26 00:34:01 hostname kernel: ? wait_woken+0x80/0x80
Nov 26 00:34:01 hostname kernel: cachefiles_lookup_object+0x4d/0xe0 [cachefiles]
Nov 26 00:34:01 hostname kernel: fscache_look_up_object+0xbd/0x1c0 [fscache]
Nov 26 00:34:01 hostname kernel: fscache_object_work_func+0x63/0x1f0 [fscache]
Nov 26 00:34:01 hostname kernel: process_one_work+0x1de/0x410
Nov 26 00:34:01 hostname kernel: worker_thread+0x32/0x410
Nov 26 00:34:01 hostname kernel: kthread+0x121/0x140
Nov 26 00:34:01 hostname kernel: ? process_one_work+0x410/0x410
Nov 26 00:34:01 hostname kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Nov 26 00:34:01 hostname kernel: ? do_syscall_64+0x73/0x130
Nov 26 00:34:01 hostname kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Nov 26 00:34:01 hostname kernel: ret_from_fork+0x35/0x40
Nov 26 00:34:01 hostname kernel: Code: 1e ca 48 c7 c7 70 b1 70 c0 e8 ed 58 1e ca 0f 0b e8 d6 3a 18 ca 48 c7 c7 30 b2 70 c0 e8 da 58 1e ca 31 f6 4c 89 e7 e8 e8 2a 00 00 <0f> 0b 80 bb 1c 01 00 00 00 0f 85 6c f9 ff ff 3d 00 00 20 00 0f
Nov 26 00:34:01 hostname kernel: RIP: cachefiles_walk_to_object+0x820/0x8b0 [cachefiles] RSP: ffffa36bc5a87d40
Nov 26 00:34:01 hostname kernel: ---[ end trace 788b2ad4e37538d5 ]---

--- [snip] (fileshare server reconnects to OpenVPN) ---

Nov 26 00:34:43 hostname openvpn[15768]: [snip]:[snip] TLS: tls_process: killed expiring key
Nov 26 00:34:46 hostname openvpn[15768]: [snip]:[snip] TLS: soft reset sec=0
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] VERIFY OK: depth=1, CN=[snip]
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] VERIFY KU OK
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] Validating certificate extended key usage
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] ++ Certificate has EKU (str) TLS Web Client Authentication, expects TLS Web Client Authentication
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] VERIFY EKU OK
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] VERIFY OK: depth=0, CN=[snip]
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] peer info: IV_VER=2.3.11
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] peer info: IV_PLAT=linux
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] peer info: IV_PROTO=2
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] Outgoing Data Channel: Cipher 'AES-256-CBC' initialized with 256 bit key
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] Outgoing Data Channel: Using 256 bit message hash 'SHA256' for HMAC authentication
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] Incoming Data Channel: Cipher 'AES-256-CBC' initialized with 256 bit key
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] Incoming Data Channel: Using 256 bit message hash 'SHA256' for HMAC authentication
Nov 26 00:34:48 hostname openvpn[15768]: [snip]:[snip] Control Channel: TLSv1.2, cipher TLSv1.2 DHE-RSA-AES256-GCM-SHA384, 2048 bit RSA

--- [snip] (i connect via ssh) ---

Nov 26 00:35:25 hostname sshd[23061]: Connection from [snip] port [snip] on [snip] port [snip]
Nov 26 00:35:25 hostname sshd[23061]: Postponed publickey for tjg from [snip] port [snip] ssh2 [preauth]
Nov 26 00:35:26 hostname sshd[23061]: Accepted publickey for tjg from [snip] port [snip] ssh2: RSA SHA256:/[snip]
Nov 26 00:35:26 hostname sshd[23061]: pam_unix(sshd:session): session opened for user tjg by (uid=0)
Nov 26 00:35:26 hostname systemd-logind[604]: New session 214 of user tjg.
Nov 26 00:35:26 hostname systemd[1]: Started Session 214 of user tjg.
Nov 26 00:35:27 hostname sshd[23061]: User child is on pid 23141
Nov 26 00:35:27 hostname sshd[23141]: Starting session: shell on pts/1 for tjg from [snip] port [snip] id 0

-- [snip] (i try to restart cachefilesd) --

Nov 26 00:35:51 hostname sudo[23179]: tjg : TTY=pts/1 ; PWD=/home/tjg ; USER=root ; COMMAND=/bin/systemctl restart cachefilesd.service
Nov 26 00:35:51 hostname sudo[23179]: pam_unix(sudo:session): session opened for user root by tjg(uid=0)
Nov 26 00:35:51 hostname systemd[1]: Stopping LSB: CacheFiles daemon...
Nov 26 00:35:51 hostname cachefilesd[23183]: * Stopping FilesCache daemon cachefilesd
Nov 26 00:35:51 hostname cachefilesd[903]: Daemon Terminated
Nov 26 00:35:51 hostname kernel: CacheFiles: File cache on vda1 unregistering
Nov 26 00:35:51 hostname kernel: FS-Cache: Withdrawing cache "mycache"

--- [crash] (no more logs after this point until hard reset 5 minutes later) ---

Tomasz J Góralczyk (tjg1) wrote :

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1805062

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.20 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc4

Changed in linux (Ubuntu):
importance: Undecided → High
Tomasz J Góralczyk (tjg1) wrote :

I'm yet to find a way to reliably reproduce this.

Including apport report. Not much more information in there, as none was collected during the crash.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers