Ubuntu
nfs-utils package

kernel soft lockup on nfs server when using a kerberos mount

Bug #1466654 reported by Keith Johnson on 2015-06-18

This bug affects 8 people

Affects		Status	Importance	Assigned to	Milestone
	nfs-utils (Ubuntu)	Confirmed	Medium	Unassigned

Bug Description

The kernel seems to lock up when accessing a kerberos mounted nfs share with a user holding a kerberos ticket. This is in a virtualbox vm(run by vagrant), but it also occurs in a VMware vm from a standard ubuntu 14.04 server install disk.

To reproduce:
Join machine to windows active directory domain using sssd
Install nfs-kernel server, enable NEED_SVCGSSD
Enable NEED_GSSD for the client(can be same host as the server)
$ sudo mount -t nfs4 <fqdn>:/ /mnt/nfs -o sec=krb5
$ sudo ls -l /mnt/nfs # this works ok
$ kinit <a_domain_user>
$ ls /mnt/nfs
Permission denied error # I don't recall the exact wording
# wait a few moments, and the kernel starts reporting a soft lockup.

I think there are a few other things that trigger this error as well. It basically makes nfs+kerberos unusable.

I have the kernel crash dump, and the debugsyms installed so if there is any other information I can provide please let me know.

lsb_release -rd:
Description: Ubuntu 14.04.2 LTS
Release: 14.04

$ apt-cache policy nfs-common
nfs-common:
Installed: 1:1.2.8-6ubuntu1.1

Info from crash:
      KERNEL: /usr/lib/debug/boot/vmlinux-3.13.0-53-generic
    DUMPFILE: dump.201506181954 [PARTIAL DUMP]
        CPUS: 2
        DATE: Thu Jun 18 19:54:08 2015
      UPTIME: 00:05:48
LOAD AVERAGE: 1.50, 0.45, 0.18
       TASKS: 120
    NODENAME: t-fileserver
     RELEASE: 3.13.0-53-generic
     VERSION: #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015
     MACHINE: x86_64 (3581 Mhz)
      MEMORY: 511.6 MB
       PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
         PID: 1353
     COMMAND: "rpc.svcgssd"
        TASK: ffff880014dce000 [THREAD_INFO: ffff88001514e000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

Backtrace:
PID: 1353 TASK: ffff880014dce000 CPU: 0 COMMAND: "rpc.svcgssd"
#0 [ffff88001fc03d18] machine_kexec at ffffffff8104ace2
#1 [ffff88001fc03d68] crash_kexec at ffffffff810e7423
#2 [ffff88001fc03e30] panic at ffffffff8171bcc4
#3 [ffff88001fc03ea8] watchdog_timer_fn at ffffffff8110dc85
#4 [ffff88001fc03ed8] __run_hrtimer at ffffffff8108e8c7
#5 [ffff88001fc03f18] hrtimer_interrupt at ffffffff8108f08f
#6 [ffff88001fc03f80] local_apic_timer_interrupt at ffffffff81043617
#7 [ffff88001fc03f98] smp_apic_timer_interrupt at ffffffff81735ccf
#8 [ffff88001fc03fb0] apic_timer_interrupt at ffffffff8173465d
--- <IRQ stack> ---
#9 [ffff88001514fd58] apic_timer_interrupt at ffffffff8173465d
    [exception RIP: qword_addhex+176]
    RIP: ffffffffa01c2df0 RSP: ffff88001514fe08 RFLAGS: 00000206
    RAX: 0000000000000001 RBX: 0000000000000006 RCX: 00000000000001f6
    RDX: ffff880015e54678 RSI: ffff88001514fe84 RDI: ffff88001514fe88
    RBP: ffff88001514fe18 R8: ffff880015e57cf5 R9: 000000000000030b
    R10: 0000000000000039 R11: 0000000000000027 R12: 0000000000000006
    R13: ffffea0000422420 R14: ffffea00003e46e0 R15: ffff88001514fe98
    ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff88001514fe20] rsi_request at ffffffffa01f61bb [auth_rpcgss]
#11 [ffff88001514fe48] cache_read at ffffffffa01c5045 [sunrpc]
#12 [ffff88001514fec0] cache_read_procfs at ffffffffa01c51a1 [sunrpc]
#13 [ffff88001514fee8] proc_reg_read at ffffffff81224a6d
#14 [ffff88001514ff08] vfs_read at ffffffff811bdf55
#15 [ffff88001514ff40] sys_read at ffffffff811bea69
#16 [ffff88001514ff80] system_call_fastpath at ffffffff8173391d
    RIP: 00007f85cf9f3810 RSP: 00007ffdba7dfe78 RFLAGS: 00000206
    RAX: 0000000000000000 RBX: ffffffff8173391d RCX: ffffffffffffffff
    RDX: 0000000000001000 RSI: 00000000008746f0 RDI: 0000000000000004
    RBP: 00000000006083f8 R8: 0000000000000000 R9: 0000000000878820
    R10: 00007f85cfcc67b8 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000004 R14: 0000000000608400 R15: 00000000008744b0
    ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b

Log:
[ 348.084011] BUG: soft lockup - CPU#0 stuck for 23s! [rpc.svcgssd:1353]
[ 348.084011] Modules linked in: cts vboxsf(OX) nfsv4 rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache dm_crypt ip6t_REJECT ppdev xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_comment xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter parport_pc ip6_tables parport nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack iptable_filter serio_raw ip_tables x_tables vboxguest(OX) psmouse ahci libahci e1000
[ 348.084011] CPU: 0 PID: 1353 Comm: rpc.svcgssd Tainted: G OX 3.13.0-53-generic #89-Ubuntu
[ 348.084011] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 348.084011] task: ffff880014dce000 ti: ffff88001514e000 task.ti: ffff88001514e000
[ 348.084011] RIP: 0010:[<ffffffffa01c2df0>] [<ffffffffa01c2df0>] qword_addhex+0xb0/0x100 [sunrpc]
[ 348.084011] RSP: 0018:ffff88001514fe08 EFLAGS: 00000206
[ 348.084011] RAX: 0000000000000001 RBX: 0000000000000006 RCX: 00000000000001f6
[ 348.084011] RDX: ffff880015e54678 RSI: ffff88001514fe84 RDI: ffff88001514fe88
[ 348.084011] RBP: ffff88001514fe18 R08: ffff880015e57cf5 R09: 000000000000030b
[ 348.084011] R10: 0000000000000039 R11: 0000000000000027 R12: 0000000000000006
[ 348.084011] R13: ffffea0000422420 R14: ffffea00003e46e0 R15: ffff88001514fe98
[ 348.084011] FS: 00007f85d02f3740(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[ 348.084011] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 348.084011] CR2: 0000000000408000 CR3: 00000000152f7000 CR4: 00000000000006f0
[ 348.084011] Stack:
[ 348.084011] ffff88001478d580 ffff88001514fe88 ffff88001514fe40 ffffffffa01f61bb
[ 348.084011] ffff8800150b2840 0000000000001000 ffff88001f48f500 ffff88001514feb8
[ 348.084011] ffffffffa01c5045 0002000000000001 ffff88000c084a60 ffff880015c2bb60
[ 348.084011] Call Trace:
[ 348.084011] [<ffffffffa01f61bb>] rsi_request+0x3b/0x50 [auth_rpcgss]
[ 348.084011] [<ffffffffa01c5045>] cache_read.isra.19+0x275/0x3a0 [sunrpc]
[ 348.084011] [<ffffffffa01c51a1>] cache_read_procfs+0x31/0x40 [sunrpc]
[ 348.084011] [<ffffffff81224a6d>] proc_reg_read+0x3d/0x80
[ 348.084011] [<ffffffff811bdf55>] vfs_read+0x95/0x160
[ 348.084011] [<ffffffff811bea69>] SyS_read+0x49/0xa0
[ 348.084011] [<ffffffff811d35e5>] ? SyS_poll+0x65/0x100
[ 348.084011] [<ffffffff8173391d>] system_call_fastpath+0x1a/0x1f
[ 348.084011] Code: e0 27 42 8d 44 20 30 41 88 40 fe 41 0f b6 c2 83 f8 0a 44 89 d8 0f 4c c3 41 83 e9 02 83 e9 01 46 8d 54 10 30 0f 95 c0 41 83 f9 01 <45> 88 50 ff 7f aa 45 85 c9 7f 1f 5b 41 5c 5d 41 b9 ff ff ff ff
[ 348.084011] Kernel panic - not syncing: softlockup: hung tasks
[ 348.084011] CPU: 0 PID: 1353 Comm: rpc.svcgssd Tainted: G OX 3.13.0-53-generic #89-Ubuntu
[ 348.084011] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 348.084011] 000000000000012d ffff88001fc03e28 ffffffff81722e1e ffffffff81a62b16
[ 348.084011] ffff88001fc03ea0 ffffffff8171bcbd 0000000000000008 ffff88001fc03eb0
[ 348.084011] ffff88001fc03e50 0000000000000086 0000000000000046 0000000000000007
[ 348.084011] Call Trace:
[ 348.084011] <IRQ> [<ffffffff81722e1e>] dump_stack+0x45/0x56
[ 348.084011] [<ffffffff8171bcbd>] panic+0xc8/0x1d7
[ 348.084011] [<ffffffff8110dc85>] watchdog_timer_fn+0x165/0x170
[ 348.084011] [<ffffffff8108e8c7>] __run_hrtimer+0x77/0x1d0
[ 348.084011] [<ffffffff8110db20>] ? watchdog_cleanup+0x10/0x10
[ 348.084011] [<ffffffff8108f08f>] hrtimer_interrupt+0xef/0x230
[ 348.084011] [<ffffffff81043617>] local_apic_timer_interrupt+0x37/0x60
[ 348.084011] [<ffffffff81735ccf>] smp_apic_timer_interrupt+0x3f/0x60
[ 348.084011] [<ffffffff8173465d>] apic_timer_interrupt+0x6d/0x80
[ 348.084011] <EOI> [<ffffffffa01c2df0>] ? qword_addhex+0xb0/0x100 [sunrpc]
[ 348.084011] [<ffffffffa01f61bb>] rsi_request+0x3b/0x50 [auth_rpcgss]
[ 348.084011] [<ffffffffa01c5045>] cache_read.isra.19+0x275/0x3a0 [sunrpc]
[ 348.084011] [<ffffffffa01c51a1>] cache_read_procfs+0x31/0x40 [sunrpc]
[ 348.084011] [<ffffffff81224a6d>] proc_reg_read+0x3d/0x80
[ 348.084011] [<ffffffff811bdf55>] vfs_read+0x95/0x160
[ 348.084011] [<ffffffff811bea69>] SyS_read+0x49/0xa0
[ 348.084011] [<ffffffff811d35e5>] ? SyS_poll+0x65/0x100
[ 348.084011] [<ffffffff8173391d>] system_call_fastpath+0x1a/0x1f

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: nfs-common 1:1.2.8-6ubuntu1.1
ProcVersionSignature: Ubuntu 3.13.0-53.89-generic 3.13.11-ckt19
Uname: Linux 3.13.0-53-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.11
Architecture: amd64
Date: Thu Jun 18 20:46:05 2015
ProcEnviron:
TERM=xterm-256color
PATH=(custom, no user)
XDG_RUNTIME_DIR=<set>
LANG=en_US.UTF-8
SHELL=/bin/bash
SourcePackage: nfs-utils
UpgradeStatus: No upgrade log present (probably fresh install)

Tags:

Revision history for this message

Keith Johnson (kj42) wrote on 2015-06-18:

Dependencies.txt Edit (4.4 KiB, text/plain; charset="utf-8")

Revision history for this message

Keith Johnson (kj42) wrote on 2015-06-23:

I worked around this by setting NO_AUTH_DATA_REQUIRED on the userAccountControl attribute in ldap for the server account to prevent the PAC from being added to the kerberos ticket. I guess maybe when svcgssd gets a kerberos ticket that is too large it gets unhappy and stuck in a loop?

A few references:
https://lists.samba.org/archive/samba/2013-June/174045.html
http://blog.evad.io/2014/11/04/kerberos-protected-nfs-with-active-directory-and-the-pac/

Revision history for this message

Launchpad Janitor (janitor) wrote on 2016-01-18:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nfs-utils (Ubuntu):
status:	New → Confirmed

Revision history for this message

Jeremy BATTISTON (jeremy-battiston) wrote on 2016-01-20:

Hello, I have a server in 12.04.5 LTS with kernel 3.13.0-74 (HWE) and packages nfs-common in version 1:1.2.5-3ubuntu3.2 and nfs-kernel-server in version 1:1.2.5-3ubuntu3.2
I have the same problem. I tested today with the new version of kernel 3.13.0-76, I have always the problem of CPU stuck and service rpc.svcgssd

Revision history for this message

Benoit Gschwind (gschwind) wrote on 2016-03-08:

hello,

Similar bug here with

Linux xxx 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb 26 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Best regards

Revision history for this message

Jeremy BATTISTON (jeremy-battiston) wrote on 2016-03-08:

Hello,

For me, the problem is solved with downgrade my kernel to 12.04 "standard" kernel (Linux xxx 3.2.0-99-generic #139-Ubuntu) and delete HWE. The problem is not reappeared since my change 3 weeks ago with 700 users.

Best regards

Revision history for this message

Jochen (jradmacher) wrote on 2016-09-05:

This problem is still present in Ubuntu 16.04:
# Linux version 4.4.0-36-generic (buildd@lcy01-01) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2) ) #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016

The workaround from comment #2 (disable PAC) still works

dmesg output:
[349643.292059] [<ffffffff8182dfb2>] ? entry_SYSCALL_64_fastpath+0x16/0x71
[349668.156063] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [rpc.svcgssd:1342]
[349668.156063] Modules linked in: cts rpcsec_gss_krb5 intel_rapl x86_pkg_temp_thermal coretemp nfsd nfs_acl lockd auth_rpcgss grace sunrpc xenfs xen_privcmd autofs4
[349668.156063] CPU: 1 PID: 1342 Comm: rpc.svcgssd Tainted: G L 4.4.0-36-generic #55-Ubuntu
[349668.156063] task: ffff880006ce2580 ti: ffff88007bb78000 task.ti: ffff88007bb78000
[349668.156063] RIP: e030:[<ffffffffc0036fcc>] [<ffffffffc0036fcc>] qword_addhex+0x6c/0xe0 [sunrpc]
[349668.156063] RSP: e02b:ffff88007bb7bdc8 EFLAGS: 00000202
[349668.156063] RAX: ffff8800074089c3 RBX: ffff88003ed81980 RCX: 000000000000033f
[349668.156063] RDX: ffff88000740f4de RSI: ffff88007bb7be34 RDI: ffff88007bb7be38
[349668.156063] RBP: ffff88007bb7bdc8 R08: 0000000000000000 R09: 000000000000063d
[349668.156063] R10: 0000000000000001 R11: 0000000000000246 R12: ffff88007bb7be38
[349668.156063] R13: ffff88007bb7be34 R14: ffff8800794356a8 R15: 00000000fffffff5
[349668.156063] FS: 00007fa8c93ed740(0000) GS:ffff88007d300000(0000) knlGS:0000000000000000
[349668.156063] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[349668.156063] CR2: 00007f261bf81000 CR3: 000000007813a000 CR4: 0000000000002660
[349668.156063] Stack:
[349668.156063] ffff88007bb7bdf0 ffffffffc009045b 0000000000001000 ffff8800793ce940
[349668.156063] ffff88007b8c3320 ffff88007bb7be70 ffffffffc0039101 000000008c6f5a6d
[349668.156063] 0000000100000000 ffff88007b1d7e10 ffff880007235bb0 ffff880079435600
[349668.156063] Call Trace:
[349668.156063] [<ffffffffc009045b>] rsi_request+0x3b/0x50 [auth_rpcgss]
[349668.156063] [<ffffffffc0039101>] cache_read.isra.19+0x2b1/0x400 [sunrpc]
[349668.156063] [<ffffffffc0039281>] cache_read_procfs+0x31/0x40 [sunrpc]
[349668.156063] [<ffffffff8127a132>] proc_reg_read+0x42/0x70
[349668.156063] [<ffffffff8120cc28>] __vfs_read+0x18/0x40
[349668.156063] [<ffffffff8120d1f6>] vfs_read+0x86/0x130
[349668.156063] [<ffffffff8120df45>] SyS_read+0x55/0xc0
[349668.156063] [<ffffffff8182dfb2>] entry_SYSCALL_64_fastpath+0x16/0x71
[349668.156063] Code: 00 00 84 c0 0f 84 84 00 00 00 4c 89 d0 eb 05 45 84 c0 74 5d 44 0f b6 02 48 83 c0 02 41 83 e9 02 45 89 c2 41 83 e0 0f 41 c0 ea 04 <45> 0f b6 80 80 4a a5 81 41 83 e2 0f 45 0f b6 92 80 4a a5 81 44

The workaround from comment #2 (disable PAC) still works

dmesg output:
[349643.292059]  [<ffffffff8182dfb2>] ? entry_SYSCALL_64_fastpath+0x16/0x71
[349668.156063] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [rpc.svcgssd:1342]
[349668.156063] Modules linked in: cts rpcsec_gss_krb5 intel_rapl x86_pkg_temp_thermal coretemp nfsd nfs_acl lockd auth_rpcgss grace sunrpc xenfs xen_privcmd autofs4
[349668.156063] CPU: 1 PID: 1342 Comm: rpc.svcgssd Tainted: G             L  4.4.0-36-generic #55-Ubuntu
[349668.156063] task: ffff880006ce2580 ti: ffff88007bb78000 task.ti: ffff88007bb78000
[349668.156063] RIP: e030:[<ffffffffc0036fcc>]  [<ffffffffc0036fcc>] qword_addhex+0x6c/0xe0 [sunrpc]
[349668.156063] RSP: e02b:ffff88007bb7bdc8  EFLAGS: 00000202
[349668.156063] RAX: ffff8800074089c3 RBX: ffff88003ed81980 RCX: 000000000000033f
[349668.156063] RDX: ffff88000740f4de RSI: ffff88007bb7be34 RDI: ffff88007bb7be38
[349668.156063] RBP: ffff88007bb7bdc8 R08: 0000000000000000 R09: 000000000000063d
[349668.156063] R10: 0000000000000001 R11: 0000000000000246 R12: ffff88007bb7be38
[349668.156063] R13: ffff88007bb7be34 R14: ffff8800794356a8 R15: 00000000fffffff5
[349668.156063] FS:  00007fa8c93ed740(0000) GS:ffff88007d300000(0000) knlGS:0000000000000000
[349668.156063] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[349668.156063] CR2: 00007f261bf81000 CR3: 000000007813a000 CR4: 0000000000002660
[349668.156063] Stack:
[349668.156063]  ffff88007bb7bdf0 ffffffffc009045b 0000000000001000 ffff8800793ce940
[349668.156063]  ffff88007b8c3320 ffff88007bb7be70 ffffffffc0039101 000000008c6f5a6d
[349668.156063]  0000000100000000 ffff88007b1d7e10 ffff880007235bb0 ffff880079435600
[349668.156063] Call Trace:
[349668.156063]  [<ffffffffc009045b>] rsi_request+0x3b/0x50 [auth_rpcgss]
[349668.156063]  [<ffffffffc0039101>] cache_read.isra.19+0x2b1/0x400 [sunrpc]
[349668.156063]  [<ffffffffc0039281>] cache_read_procfs+0x31/0x40 [sunrpc]
[349668.156063]  [<ffffffff8127a132>] proc_reg_read+0x42/0x70
[349668.156063]  [<ffffffff8120cc28>] __vfs_read+0x18/0x40
[349668.156063]  [<ffffffff8120d1f6>] vfs_read+0x86/0x130
[349668.156063]  [<ffffffff8120df45>] SyS_read+0x55/0xc0
[349668.156063]  [<ffffffff8182dfb2>] entry_SYSCALL_64_fastpath+0x16/0x71
[349668.156063] Code: 00 00 84 c0 0f 84 84 00 00 00 4c 89 d0 eb 05 45 84 c0 74 5d 44 0f b6 02 48 83 c0 02 41 83 e9 02 45 89 c2 41 83 e0 0f 41 c0 ea 04 <45> 0f b6 80 80 4a a5 81 41 83 e2 0f 45 0f b6 92 80 4a a5 81 44

tags:

added: xenial

Alberto Salvia Novella (es20490446e) on 2016-09-05

Changed in nfs-utils (Ubuntu):
importance:	Undecided → Medium

Revision history for this message

urraca (urraca) wrote on 2017-02-20:

Download full text (6.1 KiB)

Not 100% sure if we're seeing the same bug (we use plain Kerberos, no AD or Samba involved). However, since we started rolling out 14.04 and 16.04 in bigger numbers, we get literally hundreds of BUG messages (see below) and a few kernel panics a week (console output showing among other "oops_end" and "rpc_pipe_read", so we're pretty sure there's a direct connection between what we have in the logs and the panics).

The kernel BUGs _all_ look like this:

NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:59963]
NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:59963]
Modules linked in: cpuid 8021q garp mrp stp llc cts nfsv4 ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_conntrack nf_conntrack xt_multiport iptable_filter ip_tables x_tables autofs4 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass ipmi_devintf crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac dcdbas ipmi_si wmi mei_me ipmi_msghandler edac_core mei shpchp 8250_fintek lpc_ich acpi_power_meter mac_hid rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl lp nfs parport lockd grace sunrpc fscache tg3 ahci megaraid_sas ptp libahci pps_core fjes
CPU: 5 PID: 59963 Comm: stat Tainted: G D W 4.4.0-53-generic #74~14.04.1-Ubuntu
Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.2.10 03/09/2015
task: ffff880b2b0b6040 ti: ffff8809e1d00000 task.ti: ffff8809e1d00000
RIP: 0010:[<ffffffff810c4fa0>] [<ffffffff810c4fa0>] native_queued_spin_lock_slowpath+0x160/0x170
RSP: 0018:ffff8809e1d03960 EFLAGS: 00000202
RAX: 0000000000000101 RBX: ffff8808d6319600 RCX: 0000000000000001
RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff881056f679c8
RBP: ffff8809e1d03960 R08: 0000000000000101 R09: 000000000000ffff
R10: 0000000000000000 R11: ffffea00415b4a00 R12: ffff881056f67900
R13: ffff881056f679c8 R14: ffff8808d631976b R15: ffff8808d6319600
FS: 00007fdcb0f39840(0000) GS:ffff88105e480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdcb0604160 CR3: 00000008dd211000 CR4: 00000000001406e0
Stack:
ffff8809e1d03970 ffffffff81180e47 ffff8809e1d03980 ffffffff817fe4d0
ffff8809e1d039d0 ffffffffc0114d50 ffff8808d6319740 0000000000000000
ffffffffc0129060 ffff8810560eaf00 0000000000000001 ffffffff81ef6bc0
Call Trace:
[<ffffffff81180e47>] queued_spin_lock_slowpath+0xb/0xf
[<ffffffff817fe4d0>] _raw_spin_lock+0x20/0x30
[<ffffffffc0114d50>] gss_setup_upcall+0x160/0x390 [auth_rpcgss]
[<ffffffffc0115b8e>] gss_cred_init+0xce/0x350 [auth_rpcgss]
[<ffffffff810bde50>] ? prepare_to_wait_event+0xf0/0xf0
[<ffffffffc00d1473>] rpcauth_lookup_credcache+0x1e3/0x280 [sunrpc]
[<ffffffffc011338e>] gss_lookup_cred+0xe/0x10 [auth_rpcgss]
[<ffffffffc00d0c7c>] rpcauth_lookupcred+0x7c/0xb0 [sunrpc]
[<ffffffffc00d1c6a>] rpcauth_refreshcred+0x12a/0x1a0 [sunrpc]
[<ffffffffc00c1650>] ? call_bc_transmit+0x1a0/0x1a0 [sunrpc]
[<ffffffffc00c1650>] ? call_bc_transmit+0x1a0/0x1a0 [sunrpc]
[<ffffffffc00c1b30>] ? call_retry_reserve+0x60/0...

The kernel BUGs _all_ look like this:

NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:59963]
NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:59963]
Modules linked in: cpuid 8021q garp mrp stp llc cts nfsv4 ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_conntrack nf_conntrack xt_multiport iptable_filter ip_tables x_tables autofs4 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass ipmi_devintf crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac dcdbas ipmi_si wmi mei_me ipmi_msghandler edac_core mei shpchp 8250_fintek lpc_ich acpi_power_meter mac_hid rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl lp nfs parport lockd grace sunrpc fscache tg3 ahci megaraid_sas ptp libahci pps_core fjes
CPU: 5 PID: 59963 Comm: stat Tainted: G      D W       4.4.0-53-generic #74~14.04.1-Ubuntu
Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.2.10 03/09/2015
task: ffff880b2b0b6040 ti: ffff8809e1d00000 task.ti: ffff8809e1d00000
RIP: 0010:[<ffffffff810c4fa0>]  [<ffffffff810c4fa0>] native_queued_spin_lock_slowpath+0x160/0x170
RSP: 0018:ffff8809e1d03960  EFLAGS: 00000202
RAX: 0000000000000101 RBX: ffff8808d6319600 RCX: 0000000000000001
RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff881056f679c8
RBP: ffff8809e1d03960 R08: 0000000000000101 R09: 000000000000ffff
R10: 0000000000000000 R11: ffffea00415b4a00 R12: ffff881056f67900
R13: ffff881056f679c8 R14: ffff8808d631976b R15: ffff8808d6319600
FS:  00007fdcb0f39840(0000) GS:ffff88105e480000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdcb0604160 CR3: 00000008dd211000 CR4: 00000000001406e0
Stack:
 ffff8809e1d03970 ffffffff81180e47 ffff8809e1d03980 ffffffff817fe4d0
 ffff8809e1d039d0 ffffffffc0114d50 ffff8808d6319740 0000000000000000
 ffffffffc0129060 ffff8810560eaf00 0000000000000001 ffffffff81ef6bc0
Call Trace:
 [<ffffffff81180e47>] queued_spin_lock_slowpath+0xb/0xf
 [<ffffffff817fe4d0>] _raw_spin_lock+0x20/0x30
 [<ffffffffc0114d50>] gss_setup_upcall+0x160/0x390 [auth_rpcgss]
 [<ffffffffc0115b8e>] gss_cred_init+0xce/0x350 [auth_rpcgss]
 [<ffffffff810bde50>] ? prepare_to_wait_event+0xf0/0xf0
 [<ffffffffc00d1473>] rpcauth_lookup_credcache+0x1e3/0x280 [sunrpc]
 [<ffffffffc011338e>] gss_lookup_cred+0xe/0x10 [auth_rpcgss]
 [<ffffffffc00d0c7c>] rpcauth_lookupcred+0x7c/0xb0 [sunrpc]
 [<ffffffffc00d1c6a>] rpcauth_refreshcred+0x12a/0x1a0 [sunrpc]
 [<ffffffffc00c1650>] ? call_bc_transmit+0x1a0/0x1a0 [sunrpc]
 [<ffffffffc00c1650>] ? call_bc_transmit+0x1a0/0x1a0 [sunrpc]
 [<ffffffffc00c1b30>] ? call_retry_reserve+0x60/0x60 [sunrpc]
 [<ffffffffc00c1b30>] ? call_retry_reserve+0x60/0x60 [sunrpc]
 [<ffffffffc00c1b6c>] call_refresh+0x3c/0x70 [sunrpc]
 [<ffffffffc00cd496>] __rpc_execute+0x86/0x440 [sunrpc]
 [<ffffffffc00d057e>] rpc_execute+0x5e/0xb0 [sunrpc]
 [<ffffffffc00c4210>] rpc_run_task+0x70/0x90 [sunrpc]
 [<ffffffffc0484176>] nfs4_call_sync_sequence+0x56/0x80 [nfsv4]
 [<ffffffffc0484d78>] _nfs4_proc_statfs+0xb8/0xd0 [nfsv4]
 [<ffffffffc048efb9>] nfs4_proc_statfs+0x49/0x70 [nfsv4]
 [<ffffffffc014a479>] nfs_statfs+0x59/0x170 [nfs]
 [<ffffffff8123189b>] statfs_by_dentry+0x9b/0x120
 [<ffffffff8123193b>] vfs_statfs+0x1b/0xb0
 [<ffffffff81231a19>] user_statfs+0x49/0x80
 [<ffffffff81231a65>] SYSC_statfs+0x15/0x30
 [<ffffffff81231b9e>] SyS_statfs+0xe/0x10
 [<ffffffff817fe876>] entry_SYSCALL_64_fastpath+0x16/0x75
Code: 8b 01 48 85 c0 75 0a f3 90 48 8b 01 48 85 c0 74 f6 c7 40 08 01 00 00 00 e9 61 ff ff ff 83 fa 01 75 07 e9 c2 fe ff ff f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 66 90 0f 1f 44 00 00

This is from yesterday, Sunday. Nobody was logged in to the machine (according to wtmp & wtmp.1, not since at least Feb 01, but those could of course have been borked by the panic). 
Nevertheless there are 676 "gss_setup_upcall" lines between 2017-02-19T09:06:46.535855+01:00 and 2017-02-19T09:41:02.900460+01:00.
And sure enough, after the last occurence the server PANIC'd.

In this paricular case it is 14.04.5, Kernel 4.4.0-53-generic, but we've seen this with basically every 4.4.0-* flavour on 14.04 and 16.04.
This config has basically been in use (on 12.04) for a few years:

*****
/etc/krb5.conf (excerpt):
[libdefaults]
        dns_lookup_realm = true
        dns_lookup_kdc = true
        kdc_timesync = 1
        ccache_type = 4
        forwardable = true
        proxiable = true
*****

*****
autofs used for NFS directories, with options
-fstype=nfs,intr,hard,fg,rsize=16384,wsize=16384,proto=tcp,timeo=600,retrans=3,port=2049,nfsvers=4,sec=krb5p,nodev,nosuid
*****

*****
# grep -v '^#' /etc/default/autofs 
MASTER_MAP_NAME=/etc/auto.master
TIMEOUT=300
BROWSE_MODE=yes
LOGGING=none
USE_MISC_DEVICE=yes
*****

*****
# grep -v '^#' /etc/default/nfs-common 
NEED_STATD=no
STATDOPTS=
NEED_GSSD=yes
NEED_IDMAPD=yes
*****

*****
/etc/idmapd.conf:
[General]

Verbosity = 0
Pipefs-Directory = /run/rpc_pipefs
# set your own domain here, if id differs from FQDN minus hostname
Domain = our.domain

[Mapping]

Nobody-User = nobody
Nobody-Group = nogroup

[Translation]
Method = nsswitch
*****

What we do have is ~ 23 UNIX groups assigned to most of the admin users. We're working on reducing those at the moment, but nastily we did not find a way to reproduce the panic, so we have no reliable way to tell right now if that fixes things (the procedure in the initial report does not trigger the BUG nor panic for us). Also, the servers that we equipped with kexec did not crash again since.

We'd appreciate to get the importance of this raised significantly, as our only _stable_ platform (i.e., doesn't randomly crash) is 12.04 right now, which will be EOL pretty soon.

Revision history for this message

urraca (urraca) wrote on 2017-03-07:

Any update on this?
We have reduced the # of assigned groups per user to a sane level (~7) and still see random crashes (4 since last Saturday).

Revision history for this message

urraca (urraca) wrote on 2017-03-28:

#10

Well, we were hoping that 4.4.0-66 would fix the issue, as https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1650336 looked pretty similar (and would have made sense).

Nastily, also patched and rebooted servers crash on us randomly.

This is getting urgent, as 12.04 is running out of support really soon!

Revision history for this message

Seth Forshee (sforshee) wrote on 2017-03-28:

#11

@urraca: I suspect that what you're seeing is not the same problem as the original report. Please file a new bug against the linux package and include the data from comment #8.

Revision history for this message

urraca (urraca) wrote on 2017-03-28:

#12

N.B. that according to the changelog of the 4.4.0-70 kernel package, the patch has only been applied to -67 (thus effectively to -70)!
We'll have to re-try if that one makes our environment stable... :/

Revision history for this message

urraca (urraca) wrote on 2017-04-05:

#13

@sforshee: before I open another bug, could you have a quick look at 1650336 (comment #10) and check if that might be related (it does smell like it big time really...)?

Revision history for this message

Seth Forshee (sforshee) wrote on 2017-04-05:

#14

@urraca: I don't believe they're directly related. There were no BUGs reported on that bug, and the issues on that bug are already fixed.

Revision history for this message

Cristiano Cumer (cristianomaria-cumer) wrote on 2018-10-02:

#15

Why not use gss-proxy instead of rpc.svcgssd for NFS also in Ubuntu, like other distress do?

Revision history for this message

Koen Dierckx (dierckxk) wrote on 2021-07-02:

#16

The workaround as suggested it to use gssproxy
Unfortunately that also has a bug
https://bugs.launchpad.net/ubuntu/+source/gssproxy/+bug/1788459

Revision history for this message

Koen Dierckx (dierckxk) wrote on 2021-07-02:

#17

Manually compiling and installing the latest version of gssproxy did solve the issue for me

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Dependencies.txt Edit

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntunfs-utils package

kernel soft lockup on nfs server when using a kerberos mount

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
nfs-utils package