Comment 0 for bug 1873074

Mauricio Faria de Oliveira (mfo) wrote :

Problem Report:
--------------

A user reported that several nodes in their Kubernetes clusters
hit a kernel panic at about the same time, and periodically
(usually at about 35 days of uptime, and in the same order in
which the nodes booted).

The kernel panic message and stack trace are consistent across
nodes: in __fput(), in iptables-save/restore run from kube-proxy.

Example:

"""
[3016161.866702] kernel BUG at .../include/linux/fs.h:2583!
[3016161.866704] invalid opcode: 0000 [#1] SMP
...
[3016161.866780] CPU: 40 PID: 33068 Comm: iptables-restor Tainted: P OE 4.4.0-133-generic #159-Ubuntu
...
[3016161.866786] RIP: 0010:[...] [...] __fput+0x223/0x230
...
[3016161.866818] Call Trace:
[3016161.866823] [...] ____fput+0xe/0x10
[3016161.866827] [...] task_work_run+0x86/0xb0
[3016161.866831] [...] exit_to_usermode_loop+0xc2/0xd0
[3016161.866833] [...] syscall_return_slowpath+0x4e/0x60
[3016161.866839] [...] int_ret_from_sys_call+0x25/0x9f
"""

(uptime: 3016161 seconds / (24*60*60) = 34.91 days)
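
The same conversion can be applied to any line of a kernel log when
comparing nodes. A minimal sketch in Python, assuming the bracketed
printk timestamp is seconds since boot (so it only approximates
uptime); the helper name uptime_days is made up for illustration:

```python
import re

SECONDS_PER_DAY = 24 * 60 * 60


def uptime_days(log_line):
    """Convert a leading '[seconds.microseconds]' kernel timestamp to days.

    Assumption: the timestamp counts seconds since boot, so the result
    approximates the uptime at the moment the message was logged.
    """
    match = re.match(r"\[\s*(\d+\.\d+)\]", log_line)
    if match is None:
        raise ValueError("no kernel timestamp found in line")
    return float(match.group(1)) / SECONDS_PER_DAY


# First line of the example trace in this report.
line = "[3016161.866702] kernel BUG at .../include/linux/fs.h:2583!"
print(f"{uptime_days(line):.2f} days")
```

Running it over the first panic line of each node's log gives a quick
way to check whether the panics cluster around the same uptime.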

They have provided a crashdump (privately available), which is
used for analysis later in this bug report.

Note: the root cause turns out to be independent of K8s,
as explained in the Root Cause section.

Related Report:
--------------

This behavior matches a public bug reported by another user:
https://github.com/kubernetes/kubernetes/issues/70229

"""
I have several machines happen kernel panic,and these
machine have same dump trace like below:

KERNEL: /usr/lib/debug/boot/vmlinux-4.4.0-104-generic
...
PANIC: "kernel BUG at .../include/linux/fs.h:2582!"
...
COMMAND: "iptables-restor"
...
crash> bt
...
[exception RIP: __fput+541]
...
#8 [ffff880199f33e60] __fput at ffffffff812125ac
#9 [ffff880199f33ea8] ____fput at ffffffff812126ee
#10 [ffff880199f33eb8] task_work_run at ffffffff8109f101
#11 [ffff880199f33ef8] exit_to_usermode_loop at ffffffff81003242
#12 [ffff880199f33f30] syscall_return_slowpath at ffffffff81003c6e
#13 [ffff880199f33f50] int_ret_from_sys_call at ffffffff818449d0
...

The above showed command "iptables-restor" cause the kernel
panic and its pid is 16884,its parent process is kube-proxy.

Sometimes the process of kernel panic is "iptables-save" and
the dump trace are same.

The kernel panic always happens every 26 days(machine uptime)
"""

<< Adding further sections as comments to keep page short. >>