cifsd deadlocks / CIFS related Oopses

Bug #1888936 reported by Jürgen Kreileder
This bug affects 1 person
Affects: linux-aws (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

We're running a server at AWS that collects data from machines over CIFS. This involves a lot of mounting and unmounting of CIFS shares (about 100 targets with 2 shares each, with a delay of 10 in between). The targets sometimes become unavailable when they are turned off for the weekend or rebooted.
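
For anyone trying to reproduce this, the mount/unmount cycle described above might look roughly like the sketch below. It is not the actual collector: the targets file, the MNT_BASE directory, the RUN switch and the reduced option list are assumptions made for illustration, credentials are left out, and the delay is assumed to be in seconds. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the mount/umount cycle; prints commands unless RUN is set to empty.
# The targets file, MNT_BASE and RUN are hypothetical names, not from the report.

cycle_targets() {
    targets_file=$1                  # one server address per line
    delay=$2                         # pause between targets (seconds assumed)
    mnt_base=${MNT_BASE:-/mnt/cifs}  # hypothetical mount point root
    run=${RUN-echo}                  # RUN unset: print only; RUN="": execute

    while IFS= read -r host; do
        [ -n "$host" ] || continue
        for share in Meldung Wartung; do
            mp="$mnt_base/$host/$share"
            $run mount -t cifs "//$host/$share" "$mp" -o ro,vers=1.0,soft
            # ... data collection would happen here ...
            $run umount "$mp"
        done
        $run sleep "$delay"
    done < "$targets_file"
}
```

Calling `cycle_targets targets.txt 10` prints the command sequence; running it for real requires root and existing mount point directories.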

The server doing this has to be rebooted every few hours because CIFS connections start to hang and don't recover. The usual symptom is:

Jul 24 10:12:59 connector kernel: [ 7765.705409] CIFS: Attempting to mount //172.22.2.112/Meldung
Jul 24 10:13:01 connector kernel: [ 7767.689258] CIFS: Attempting to mount //172.22.2.112/Wartung
Jul 24 10:13:06 connector kernel: [ 7772.758283] CIFS: Attempting to mount //172.30.113.108/Meldung
Jul 24 10:13:06 connector kernel: [ 7773.300475] CIFS: Attempting to mount //172.30.113.108/Wartung
Jul 24 10:13:09 connector kernel: [ 7776.364516] CIFS: Attempting to mount //172.30.99.55/Meldung
Jul 24 10:13:11 connector kernel: [ 7777.978731] CIFS: Attempting to mount //172.30.99.55/Wartung
[...]
Jul 24 10:16:13 connector kernel: [ 7960.390529] CIFS VFS: \\172.30.113.108 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:15 connector kernel: [ 7962.468649] CIFS VFS: \\172.30.93.171 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:18 connector kernel: [ 7964.999037] CIFS VFS: \\172.30.99.55 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:31 connector kernel: [ 7977.798821] INFO: task cifsd:26252 blocked for more than 120 seconds.
Jul 24 10:16:31 connector kernel: [ 7977.803730] Not tainted 5.4.0-1020-aws #20-Ubuntu
Jul 24 10:16:31 connector kernel: [ 7977.808526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 24 10:16:31 connector kernel: [ 7977.820291] cifsd D 0 26252 2 0x80004000
Jul 24 10:16:31 connector kernel: [ 7977.820298] Call Trace:
Jul 24 10:16:31 connector kernel: [ 7977.820307] __schedule+0x2e3/0x740
Jul 24 10:16:31 connector kernel: [ 7977.820310] ? __switch_to_asm+0x40/0x70
Jul 24 10:16:31 connector kernel: [ 7977.820313] ? __switch_to_asm+0x34/0x70
Jul 24 10:16:31 connector kernel: [ 7977.820315] schedule+0x42/0xb0
Jul 24 10:16:31 connector kernel: [ 7977.820318] rwsem_down_read_slowpath+0x16c/0x4a0
Jul 24 10:16:31 connector kernel: [ 7977.820321] down_read+0x85/0xa0
Jul 24 10:16:31 connector kernel: [ 7977.820324] iterate_supers_type+0x70/0xf0
Jul 24 10:16:31 connector kernel: [ 7977.820411] ? cifs_set_cifscreds.isra.0+0x800/0x800 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820429] cifs_reconnect+0x8a/0xdc0 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820433] ? vprintk_func+0x4c/0xbc
Jul 24 10:16:31 connector kernel: [ 7977.820449] cifs_readv_from_socket+0x17a/0x260 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820465] cifs_read_from_socket+0x4c/0x70 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820482] ? allocate_buffers+0x43/0x130 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820497] cifs_demultiplex_thread+0xe1/0xcc0 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820500] kthread+0x104/0x140
Jul 24 10:16:31 connector kernel: [ 7977.820516] ? cifs_handle_standard+0x1b0/0x1b0 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820518] ? kthread_park+0x90/0x90
Jul 24 10:16:31 connector kernel: [ 7977.820520] ret_from_fork+0x22/0x40
Jul 24 10:16:31 connector kernel: [ 7977.820524] INFO: task cifsd:26328 blocked for more than 120 seconds.
Jul 24 10:16:31 connector kernel: [ 7977.827503] Not tainted 5.4.0-1020-aws #20-Ubuntu

That is, cifsd gets stuck fetching credentials for the reconnect. I'm attaching the full syslog with stack traces from all hung cifsd tasks (I don't see where the deadlock is there).
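
To get a quick list of the affected tasks out of a long syslog, something like the following could be used. It is a minimal sketch that relies only on the hung-task message format shown in the excerpt above; the function name is made up for illustration.

```shell
# Print the PID of every cifsd task reported by the hung-task detector.
# Reads the named files, or standard input when no file is given.
hung_cifsd_pids() {
    sed -n 's/.*INFO: task cifsd:\([0-9][0-9]*\) blocked for more than.*/\1/p' "$@" |
        sort -un
}
```

For example, `hung_cifsd_pids /var/log/syslog` prints one PID per line, each PID once.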

The mounting/unmounting is done in a privileged Docker container. If we restart that, we usually run into an Oops:

Jul 25 07:43:29 connector kernel: [64677.164367] Oops: 0000 [#1] SMP NOPTI
Jul 25 07:43:29 connector kernel: [64677.164370] CPU: 0 PID: 265452 Comm: cifsd Not tainted 5.4.0-1020-aws #20-Ubuntu
Jul 25 07:43:29 connector kernel: [64677.164370] Hardware name: Amazon EC2 t3a.large/, BIOS 1.0 10/16/2017
Jul 25 07:43:29 connector kernel: [64677.164400] RIP: 0010:cifs_reconnect+0x9be/0xdc0 [cifs]
Jul 25 07:43:29 connector kernel: [64677.164403] Code: e8 bb 43 0c d5 66 90 48 8b 45 c0 48 8d 55 c0 4c 8d 6d b8 48 39 c2 74 62 49 be 00 01 00 00 00 00 ad de 48 8b 45 c0 4c 8d 78 f8 <48> 8b 00 48 8d 58 f8 4d 39 ef 74 3d 49 8b 57 10 48 89 50 08 48 89
Jul 25 07:43:29 connector kernel: [64677.218175] RSP: 0018:ffffbf25c0b27cf8 EFLAGS: 00010286
Jul 25 07:43:29 connector kernel: [64677.222539] RAX: 0000000000000000 RBX: ffff9cdef66f0800 RCX: ffffffff95cd8510
Jul 25 07:43:29 connector kernel: [64677.227607] RDX: ffffbf25c0b27d30 RSI: ffffbf25c0b27d18 RDI: ffffffffc0aeec18
Jul 25 07:43:29 connector kernel: [64677.232638] RBP: ffffbf25c0b27d70 R08: 0000000000000180 R09: 0000000000000000
Jul 25 07:43:29 connector kernel: [64677.237666] R10: ffff9cdf32a173c8 R11: 0000000000000000 R12: 00000000fffffffe
Jul 25 07:43:29 connector kernel: [64677.242789] R13: ffffbf25c0b27d28 R14: dead000000000100 R15: fffffffffffffff8
Jul 25 07:43:29 connector kernel: [64677.247874] FS: 0000000000000000(0000) GS:ffff9cdf32a00000(0000) knlGS:0000000000000000
Jul 25 07:43:29 connector kernel: [64677.254956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 25 07:43:29 connector kernel: [64677.259348] CR2: 0000000000000000 CR3: 00000001cddce000 CR4: 00000000003406f0
Jul 25 07:43:29 connector kernel: [64677.264439] Call Trace:
Jul 25 07:43:29 connector kernel: [64677.267345] ? vprintk_func+0x4c/0xbc
Jul 25 07:43:29 connector kernel: [64677.270720] cifs_readv_from_socket+0x17a/0x260 [cifs]
Jul 25 07:43:29 connector kernel: [64677.274889] cifs_read_from_socket+0x4c/0x70 [cifs]
Jul 25 07:43:29 connector kernel: [64677.278914] ? cifs_add_credits+0x56/0x60 [cifs]
Jul 25 07:43:29 connector kernel: [64677.282722] ? allocate_buffers+0x6d/0x130 [cifs]
Jul 25 07:43:29 connector kernel: [64677.286453] cifs_demultiplex_thread+0xe1/0xcc0 [cifs]
Jul 25 07:43:29 connector kernel: [64677.290566] kthread+0x104/0x140
Jul 25 07:43:29 connector kernel: [64677.293969] ? cifs_handle_standard+0x1b0/0x1b0 [cifs]
Jul 25 07:43:29 connector kernel: [64677.298096] ? kthread_park+0x90/0x90
Jul 25 07:43:29 connector kernel: [64677.301535] ret_from_fork+0x22/0x40
Jul 25 07:43:29 connector kernel: [64677.304799] Modules linked in: md4 nls_utf8 cifs libarc4 libdes rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy iptable_mangle xt_mark xt_u32 xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc aufs overlay dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ena serio_raw parport_pc parport sch_fq_codel drm i2c_core sunrpc ip_tables x_tables autofs4
Jul 25 07:43:29 connector kernel: [64677.387761] CR2: 0000000000000000
Jul 25 07:43:29 connector kernel: [64677.391027] ---[ end trace b498d70d7111f607 ]---

The mount options used are:
ro,relatime,vers=1.0,cache=strict,username=xxx,domain=xxx,uid=0,noforceuid,gid=0,noforcegid,addr=172.30.2.138,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=61440,wsize=65536,bsize=1048576,echo_interval=60,actimeo=1
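
The echo_interval=60 option lines up with the "has not responded in 180 seconds" messages, assuming the client on this kernel declares a server unresponsive after three echo intervals (my reading of the 5.4-era code, not something stated in the logs). A small sketch for pulling a value out of an option string and doing that arithmetic; the helper name is made up, and the option string is a shortened subset of the one above:

```shell
# Print the value of one key from a comma-separated mount option string.
mount_opt() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p"
}

opts='ro,vers=1.0,soft,echo_interval=60,actimeo=1'
interval=$(mount_opt "$opts" echo_interval)
# Assumption: the reconnect is triggered after 3 echo intervals of silence.
echo "reconnect after $((3 * interval)) seconds without a response"
```

With the options used here this prints "reconnect after 180 seconds without a response".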

The attached log files also contain some CIFS debug messages generated with:
  echo 'module cifs +p' > /sys/kernel/debug/dynamic_debug/control
  echo 'file fs/cifs/* +p' > /sys/kernel/debug/dynamic_debug/control
  echo 1 > /proc/fs/cifs/cifsFYI
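
Since this logging is very verbose, it may be worth switching it off again after capturing a hang. The sketch below simply reverses the three commands above (-p instead of +p, and 0 for cifsFYI). The function name and the SYS_ROOT prefix are assumptions added so it can be tried against a scratch directory; on a real system leave SYS_ROOT unset and run as root.

```shell
# Turn the CIFS debug output off again.
# SYS_ROOT is a hypothetical path prefix for dry runs; unset means the real /sys and /proc.
cifs_debug_off() {
    root=${SYS_ROOT-}
    echo 'module cifs -p' > "$root/sys/kernel/debug/dynamic_debug/control"
    echo 'file fs/cifs/* -p' > "$root/sys/kernel/debug/dynamic_debug/control"
    echo 0 > "$root/proc/fs/cifs/cifsFYI"
}
```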

Is there any way of trying a newer kernel? https://github.com/torvalds/linux/commits/master/fs/cifs suggests some of the problems (at least the Oops) might have been fixed.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-1020-aws 5.4.0-1020.20
ProcVersionSignature: User Name 5.4.0-1020.20-aws 5.4.44
Uname: Linux 5.4.0-1020-aws x86_64
ApportVersion: 2.20.11-0ubuntu27.4
Architecture: amd64
CasperMD5CheckResult: skip
Date: Sat Jul 25 11:55:47 2020
Ec2AMI: ami-07d14b5d47292e022
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: eu-central-1a
Ec2InstanceType: t3a.large
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/usr/bin/zsh
SourcePackage: linux-aws
UpgradeStatus: No upgrade log present (probably fresh install)
