Problem Report:
--------------
A user reported that several nodes in their Kubernetes clusters
hit a kernel panic at about the same time, and periodically
(usually at 35 days of uptime, and in the same order the nodes booted).
The kernel panic message/stack trace is consistent across
nodes: in __fput(), called by iptables-save/restore from kube-proxy.
Example:
"""
[3016161.866702] kernel BUG at .../include/linux/fs.h:2583!
[3016161.866704] invalid opcode: 0000 [#1] SMP
...
[3016161.866780] CPU: 40 PID: 33068 Comm: iptables-restor Tainted: P OE 4.4.0-133-generic #159-Ubuntu
...
[3016161.866786] RIP: 0010:[...] [...] __fput+0x223/0x230
...
[3016161.866818] Call Trace:
[3016161.866823] [...] ____fput+0xe/0x10
[3016161.866827] [...] task_work_run+0x86/0xb0
[3016161.866831] [...] exit_to_usermode_loop+0xc2/0xd0
[3016161.866833] [...] syscall_return_slowpath+0x4e/0x60
[3016161.866839] [...] int_ret_from_sys_call+0x25/0x9f
"""
(uptime: 3016161 seconds / (24*60*60) = 34.91 days)
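The uptime conversion above can be repeated for any panic line with a
small script. This is only a sketch: it assumes the bracketed printk
timestamp counts seconds since boot, and uptime_days() is a hypothetical
helper name, not part of any existing tool.

```python
import re

# Leading "[seconds.microseconds]" timestamp that printk prepends.
TIMESTAMP = re.compile(r"^\[\s*(\d+)\.(\d+)\]")

SECONDS_PER_DAY = 24 * 60 * 60


def uptime_days(dmesg_line):
    """Return the uptime, in days, encoded in a dmesg line's timestamp."""
    match = TIMESTAMP.match(dmesg_line)
    if match is None:
        raise ValueError("no printk timestamp found")
    seconds = float(match.group(1) + "." + match.group(2))
    return seconds / SECONDS_PER_DAY


# First line of the panic message quoted in this report.
line = "[3016161.866702] kernel BUG at .../include/linux/fs.h:2583!"
print(f"{uptime_days(line):.1f} days")
```

Running the same helper over the panic lines from several nodes would make
it easy to see whether they all crash at a similar uptime.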
They have provided a crashdump (privately available) used
for analysis later in this bug report.
Note: the root cause turns out to be independent of K8s,
as explained in the Root Cause section.
Related Report:
--------------
This behavior matches this public bug of another user:
https://github.com/kubernetes/kubernetes/issues/70229
"""
I have several machines happen kernel panic,and these
machine have same dump trace like below:
KERNEL: /usr/lib/debug/boot/vmlinux-4.4.0-104-generic
...
PANIC: "kernel BUG at .../include/linux/fs.h:2582!"
...
COMMAND: "iptables-restor"
...
crash> bt
...
[exception RIP: __fput+541]
...
#8 [ffff880199f33e60] __fput at ffffffff812125ac
#9 [ffff880199f33ea8] ____fput at ffffffff812126ee
#10 [ffff880199f33eb8] task_work_run at ffffffff8109f101
#11 [ffff880199f33ef8] exit_to_usermode_loop at ffffffff81003242
#12 [ffff880199f33f30] syscall_return_slowpath at ffffffff81003c6e
#13 [ffff880199f33f50] int_ret_from_sys_call at ffffffff818449d0
...
The above showed command "iptables-restor" cause the kernel
panic and its pid is 16884,its parent process is kube-proxy.
Sometimes the process of kernel panic is "iptables-save" and
the dump trace are same.
The kernel panic always happens every 26 days(machine uptime)
"""
<< Adding further sections as comments to keep page short. >>
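To check that the dmesg call trace in the Problem Report and the crash
"bt" output in the Related Report describe the same call path, the
function names can be extracted from both formats and compared. This is
a rough sketch: the two regular expressions only cover the line shapes
quoted in this report, and frames() is a hypothetical helper name.

```python
import re

# dmesg style:    "[3016161.866827] [...] task_work_run+0x86/0xb0"
DMESG_FRAME = re.compile(r"\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")
# crash bt style: "#10 [ffff880199f33eb8] task_work_run at ffffffff8109f101"
CRASH_FRAME = re.compile(r"^#\d+\s+\[[0-9a-f]+\]\s+(\w+)\s+at\s+[0-9a-f]+\s*$")


def frames(lines, pattern):
    """Return the function names of all lines that match the frame pattern."""
    names = []
    for raw in lines:
        match = pattern.search(raw.strip())
        if match:
            names.append(match.group(1))
    return names


# Call trace from the Problem Report (dmesg format).
dmesg_trace = [
    "[3016161.866823] [...] ____fput+0xe/0x10",
    "[3016161.866827] [...] task_work_run+0x86/0xb0",
    "[3016161.866831] [...] exit_to_usermode_loop+0xc2/0xd0",
    "[3016161.866833] [...] syscall_return_slowpath+0x4e/0x60",
    "[3016161.866839] [...] int_ret_from_sys_call+0x25/0x9f",
]

# Frames #9..#13 from the Related Report (crash "bt" format).
crash_trace = [
    "#9 [ffff880199f33ea8] ____fput at ffffffff812126ee",
    "#10 [ffff880199f33eb8] task_work_run at ffffffff8109f101",
    "#11 [ffff880199f33ef8] exit_to_usermode_loop at ffffffff81003242",
    "#12 [ffff880199f33f30] syscall_return_slowpath at ffffffff81003c6e",
    "#13 [ffff880199f33f50] int_ret_from_sys_call at ffffffff818449d0",
]

same_path = frames(dmesg_trace, DMESG_FRAME) == frames(crash_trace, CRASH_FRAME)
print("same call path:", same_path)
```

Only function names are compared, since offsets and addresses differ
between kernel builds (4.4.0-133 vs. 4.4.0-104 here).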