Server reboots every 4.1 weeks

Bug #1653498 reported by emmecerre
This bug affects 5 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  High        Unassigned
Xenial          Confirmed  High        Unassigned

Bug Description

After roughly 4.1 weeks of uptime, each of our 34 servers reboots with the logs below.

Description: Ubuntu 16.04.1 LTS
Release: 16.04

The servers host the following stack: flanneld (0.5.5), docker (1.11.2, build b9f10c9), kubernetes (v1.3.6), and etcd (2.3.7).

Jan 02 06:40:32 prd-node021 kernel: ------------[ cut here ]------------
Jan 02 06:40:32 prd-node021 kernel: kernel BUG at /build/linux-xHzv4a/linux-4.4.0/include/linux/fs.h:2569!
Jan 02 06:40:32 prd-node021 kernel: invalid opcode: 0000 [#1] SMP
Jan 02 06:40:32 prd-node021 kernel: Modules linked in: nf_conntrack_netlink nfnetlink veth xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp tcp_diag inet_diag
Jan 02 06:40:32 prd-node021 kernel: raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc
Jan 02 06:40:32 prd-node021 kernel: CPU: 46 PID: 22749 Comm: iptables-restor Not tainted 4.4.0-47-generic #68-Ubuntu
Jan 02 06:40:32 prd-node021 kernel: Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
Jan 02 06:40:32 prd-node021 kernel: task: ffff882f7ccb44c0 ti: ffff882fcb810000 task.ti: ffff882fcb810000
Jan 02 06:40:32 prd-node021 kernel: RIP: 0010:[<ffffffff8120f9ed>] [<ffffffff8120f9ed>] __fput+0x21d/0x220
Jan 02 06:40:32 prd-node021 kernel: RSP: 0018:ffff882fcb813e68 EFLAGS: 00010246
Jan 02 06:40:32 prd-node021 kernel: RAX: 0000000000000000 RBX: ffff8829dbd92700 RCX: 00000000308c6396
Jan 02 06:40:32 prd-node021 kernel: RDX: 0000000000000001 RSI: ffff88301f299f60 RDI: 0000000000000000
Jan 02 06:40:32 prd-node021 kernel: RBP: ffff882fcb813ea0 R08: 0000000000019f60 R09: ffffffff811b3a1d
Jan 02 06:40:32 prd-node021 kernel: R10: ffffea0064756680 R11: ffff8829dbd92710 R12: 0000000000000010
Jan 02 06:40:32 prd-node021 kernel: R13: ffff8817d3520518 R14: ffff881021a34da0 R15: ffff8817d3542540
Jan 02 06:40:32 prd-node021 kernel: FS: 00007fcdaa753700(0000) GS:ffff88301f280000(0000) knlGS:0000000000000000
Jan 02 06:40:32 prd-node021 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 02 06:40:32 prd-node021 kernel: CR2: 00007fcdaa758000 CR3: 0000001917993000 CR4: 00000000003406e0
Jan 02 06:40:32 prd-node021 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 02 06:40:32 prd-node021 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 02 06:40:32 prd-node021 kernel: Stack:
Jan 02 06:40:32 prd-node021 kernel: ffff8817d3520518 ffff8829dbd92710 ffff882f7ccb44c0 ffffffff82103a30
Jan 02 06:40:32 prd-node021 kernel: ffff8829dbd92700 0000000000000000 ffff882f7ccb4b38 ffff882fcb813eb0
Jan 02 06:40:32 prd-node021 kernel: ffffffff8120fa2e ffff882fcb813ef0 ffffffff8109ee01 ffff882f7ccb4b6c
Jan 02 06:40:32 prd-node021 kernel: Call Trace:
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff8120fa2e>] ____fput+0xe/0x10
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff8109ee01>] task_work_run+0x81/0xa0
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff81003242>] exit_to_usermode_loop+0xc2/0xd0
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff81835150>] int_ret_from_sys_call+0x25/0x8f
Jan 02 06:40:32 prd-node021 kernel: Code: 0f 84 cf fe ff ff 48 8b 43 28 48 8b 80 80 00 00 00 48 85 c0 0f 84 bb fe ff ff 31 d2 48 89 de bf ff ff ff ff ff d0 e9 aa fe ff ff <0f>
Jan 02 06:40:32 prd-node021 kernel: RIP [<ffffffff8120f9ed>] __fput+0x21d/0x220
-- Reboot --
Jan 02 06:42:56 prd-node021 systemd-journald[819]: Runtime journal (/run/log/journal/) is 8.0M, max 1.8G, 1.8G free.
Jan 02 06:42:56 prd-node021 kernel: Initializing cgroup subsys cpuset

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-47-generic 4.4.0-47.68
ProcVersionSignature: Ubuntu 4.4.0-47.68-generic 4.4.24
Uname: Linux 4.4.0-47-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 2 06:42 seq
 crw-rw---- 1 root audio 116, 33 Jan 2 06:42 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.1
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Mon Jan 2 10:49:15 2017
HibernationDevice: RESUME=/dev/mapper/vg00-swap
InstallationDate: Installed on 2016-09-13 (110 days ago)
InstallationMedia: Ubuntu-Server 16.04.1 LTS "Xenial Xerus" - Beta amd64 (20160803)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: HP ProLiant DL360 Gen9
PciMultimedia:

ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-47-generic root=/dev/mapper/vg00-root ro cgroup_enable=memory swapaccount=1
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-47-generic N/A
 linux-backports-modules-4.4.0-47-generic N/A
 linux-firmware 1.157.5
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 09/13/2016
dmi.bios.vendor: HP
dmi.bios.version: P89
dmi.board.name: ProLiant DL360 Gen9
dmi.board.vendor: HP
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP89:bd09/13/2016:svnHP:pnProLiantDL360Gen9:pvr:rvnHP:rnProLiantDL360Gen9:rvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL360 Gen9
dmi.sys.vendor: HP

Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Colin Ian King (colin-king) wrote :

A possibly interesting data point: the CAL (function call interrupts) counters in /proc/interrupts are approaching a 32-bit wraparound.
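
The wraparound idea can be checked with a small script. The following is only a sketch: it assumes each per-CPU CAL counter is a 32-bit value (as the comment above implies), and the SAMPLE text is a hypothetical two-CPU excerpt, not data from the affected servers. It also shows the average interrupt rate a CPU would need for a counter to reach 2^32 in 4.1 weeks, which works out to about 1,732 per second.

```python
WRAP_32 = 2 ** 32  # assumed counter width, per the comment above
SECONDS_PER_WEEK = 7 * 24 * 3600

# Hypothetical two-CPU excerpt in the /proc/interrupts layout.
SAMPLE = """\
           CPU0       CPU1
CAL: 4294960000       1200   Function call interrupts
"""


def cal_counts(interrupts_text):
    """Return the per-CPU CAL counters found in /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "CAL:":
            # Keep only the numeric columns; the trailing words are the label.
            return [int(f) for f in fields[1:] if f.isdigit()]
    return []


def rate_to_wrap(weeks):
    """Average interrupts per second needed to reach 2^32 in `weeks` of uptime."""
    return WRAP_32 / (weeks * SECONDS_PER_WEEK)


for cpu, count in enumerate(cal_counts(SAMPLE)):
    print(f"CPU{cpu}: {count} ({count / WRAP_32:.1%} of 2^32)")
print(f"rate for a wrap in 4.1 weeks: {rate_to_wrap(4.1):.0f}/s")
```

On a real node, pass the contents of /proc/interrupts to cal_counts() instead of SAMPLE. A counter close to 2^32 shortly before a crash would support the hypothesis; it would not by itself identify the faulty code path.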

emmecerre (manuelcarlo-ranieri) wrote :

I thought about this, but I have no idea which buffer is filling up.

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.10 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10-rc2

emmecerre (manuelcarlo-ranieri) wrote :

Hi Joseph,
I installed the upstream kernel, but it lacks the AUFS support that our stack needs.
I also tried to compile the kernel with the upstream AUFS, but the AUFS available online is too old: the patches do not apply cleanly and the build fails.

I don't know whether it makes sense to try an older kernel such as v4.8 or v4.9 (if the build works...).

emmecerre (manuelcarlo-ranieri) wrote :

In the meantime, I compiled the upstream 4.9.2 kernel with the Ubuntu patches and AUFS.
Let me know if it makes sense to test with this kernel version.

From-Nibly (thatoneemail) wrote :

What is the latest on this? We are running into a similar problem with Kubernetes, except that our most recent occurrence came after 13 days. Did upgrading the kernel to 4.9.2 work?

emmecerre (manuelcarlo-ranieri) wrote :

We installed the 4.9.2 kernel a few days ago. We are still waiting for a crash :)
For now, we have more or less 3.1 weeks of uptime.
I'll keep you posted.

emmecerre (manuelcarlo-ranieri) wrote :

The server running kernel 4.9.2 rebooted after 2.874 weeks.

emmecerre (manuelcarlo-ranieri) wrote :

kernel-bug-exists-upstream

tags: added: kernel-bug-exists-upstream
tags: added: confirmed
Andrii (angr) wrote :

Hello,

We have the same issue on our k8s cluster.

Description: Ubuntu 16.04.3 LTS
Release: 16.04

Docker: 17.03.2-ce
Kubernetes: v1.10.11
Kernel: 4.4.0-145-generic, 4.4.0-109-generic

[5493817.706210] ------------[ cut here ]------------
[5493817.707602] kernel BUG at /build/linux-6VmqmP/linux-4.4.0/include/linux/fs.h:2585!
[5493817.708284] invalid opcode: 0000 [#1] SMP
[5493817.708935] Modules linked in: xt_set xt_multiport iptable_raw iptable_mangle ip_set_hash_net veth nf_conntrack_netlink xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set nfnetlink ip_vs ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs vmw_balloon vmw_vsock_vmci_transport vsock joydev input_leds serio_raw shpchp i2c_piix4 vmw_vmci mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq
[5493817.714295] async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel vmwgfx aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ttm psmouse drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops vmxnet3 drm vmw_pvscsi ahci libahci pata_acpi fjes
[5493817.716942] CPU: 5 PID: 13487 Comm: iptables-save Not tainted 4.4.0-145-generic #171-Ubuntu
[5493817.717597] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[5493817.718885] task: ffff8800bb16d400 ti: ffff88045f38c000 task.ti: ffff88045f38c000
[5493817.719627] RIP: 0010:[<ffffffff8121df13>] [<ffffffff8121df13>] __fput+0x223/0x230
[5493817.720314] RSP: 0018:ffff88045f38fe78 EFLAGS: 00010246
[5493817.721030] RAX: 0000000000000000 RBX: ffff88005f20e200 RCX: 00000000b86bfbc9
[5493817.721714] RDX: 0000000000000001 RSI: ffffe8ffffd51350 RDI: 0000000000000000
[5493817.722383] RBP: ffff88045f38feb0 R08: 000060f7c0011350 R09: ffffffff811beded
[5493817.723038] R10: ffffea000a9f3f00 R11: ffff88005f20e210 R12: 0000000000000010
[5493817.723695] R13: ffff880816bd11a8 R14: ffff8808130a1520 R15: ffff880816aaed80
[5493817.724371] FS: 00007f6087f8b700(0000) GS:ffff88083fd40000(0000) knlGS:0000000000000000
[5493817.725147] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[5493817.725835] CR2: 00007f6085f6d1f0 CR3: 00000004621e2000 CR4: 0000000000060670
[5493817.726518] Stack:
[5493817.727156] ffff880816bd11a8 ffff88005f20e210 ffffffff82113f30 ffff8800bb16daa8
[5493817.727806] ffff8800bb16d400 0000000000000000 ffff8800bb16d400 ffff88045f38fec0
[5493817.728508] ffffffff8121df5e ffff88045f38fef0 ffffffff810a4895 0000000000000002
[5493817.729138] Call Trace:
[5493817.729845] [<ffffffff8121df5e>] ____fput+0xe/0x10
[5493817.730511] [<ffffffff810a4895>] task_work_run+0x95/0xb0
[5493817.731222] [<ffffffff81003532>] exit_to_usermode_loop+0xc2/0xd0
[5493817.731875] [<ffffffff81003c7e>...


Andrii (angr) wrote :

A similar problem is described on GitHub: https://github.com/kubernetes/kubernetes/issues/70229

Brad Figg (brad-figg)
tags: added: ubuntu-certified
chudihuang (chudihuang) wrote :

Hi,

We have the same issue on our k8s cluster.

Description: Ubuntu 16.04.1 LTS x86_64
Release: 16.04.1

Kernel: 4.4.0-104-generic

The dump file can be downloaded as follows:
wget http://129.226.115.161/dump.202006231820.tar.gz

I did some analysis, but I still have not found the root cause:

Load the vmcore in crash (please refer to the hyperlink above). Crash should present details similar to the following:
crash> bt
PID: 11388 TASK: ffff880eb1f79e00 CPU: 29 COMMAND: "heartbeat"
 #0 [ffff8809131a7b08] machine_kexec at ffffffff8105c22b
 #1 [ffff8809131a7b68] crash_kexec at ffffffff8110e852
 #2 [ffff8809131a7c38] oops_end at ffffffff81031c49
 #3 [ffff8809131a7c60] die at ffffffff810320fb
 #4 [ffff8809131a7c90] do_trap at ffffffff8102f121
 #5 [ffff8809131a7ce0] do_error_trap at ffffffff8102f4a9
 #6 [ffff8809131a7da0] do_invalid_op at ffffffff8102fa10
 #7 [ffff8809131a7db0] invalid_op at ffffffff8184638e
    [exception RIP: __fput+541]
    RIP: ffffffff812126ad RSP: ffff8809131a7e68 RFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff880ef6915700 RCX: 0000000365fb1705
    RDX: 0000000000000001 RSI: ffff880fff55a020 RDI: 0000000000000000
    RBP: ffff8809131a7ea0 R8: 000000000001a020 R9: ffffffff811b591d
    R10: ffffea002b69b300 R11: ffff880ef6915710 R12: 0000000000000010
    R13: ffff880ed152aef8 R14: ffff8800bba18aa0 R15: ffff880ed1513a40
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #8 [ffff8809131a7e60] __fput at ffffffff812125ac
 #9 [ffff8809131a7ea8] ____fput at ffffffff812126ee
#10 [ffff8809131a7eb8] task_work_run at ffffffff8109f101
#11 [ffff8809131a7ef8] exit_to_usermode_loop at ffffffff81003242
#12 [ffff8809131a7f30] syscall_return_slowpath at ffffffff81003c6e
#13 [ffff8809131a7f50] int_ret_from_sys_call at ffffffff818449d0
    RIP: 000000000047f704 RSP: 000000c423b77c98 RFLAGS: 00000246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000047f704
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000ca
    RBP: 000000c423b77ce0 R8: 0000000000000000 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 000000c423b78ee0 R15: 0000000000000008
    ORIG_RAX: 0000000000000003 CS: 0033 SS: 002b
crash>
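
The backtrace above gives the faulting address as `__fput+541` (decimal), while the journal in the original report gives `__fput+0x21d/0x220` (hexadecimal). A quick conversion shows these are the same offset. The sketch below uses a hypothetical helper, parse_symbol_offset, written only for this comparison; it is not part of crash or any kernel tooling.

```python
def parse_symbol_offset(text):
    """Split 'symbol+offset[/size]' into (symbol, offset as int).

    The offset may be hexadecimal (0x...) or decimal; int(..., 0) accepts both.
    """
    symbol, _, rest = text.partition("+")
    offset_text = rest.split("/")[0]
    return symbol, int(offset_text, 0)


journal_rip = parse_symbol_offset("__fput+0x21d/0x220")  # original report, 4.4.0-47
crash_rip = parse_symbol_offset("__fput+541")            # crash bt, 4.4.0-104
other_rip = parse_symbol_offset("__fput+0x223/0x230")    # Andrii's log, 4.4.0-145

print(journal_rip, crash_rip, other_rip)
print("same offset as original report:", journal_rip == crash_rip)
```

A matching offset in two different kernel builds is consistent with both machines hitting the same check inside __fput, but it is not proof: the compiled layout can differ between builds, as the 0x223 offset on 4.4.0-145 suggests.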

crash> log
[19156101.592212] ------------[ cut here ]------------
[19156101.593103] kernel BUG at /build/linux-SwhOyu/linux-4.4.0/include/linux/fs.h:2582!
[19156101.594385] invalid opcode: 0000 [#1] SMP
[19156101.595083] Modules linked in: binfmt_misc af_packet_diag netlink_diag dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag veth br_netfilter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_set xt_mark ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport ip_set dummy xt_comment xt_addrtype iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_tcpudp bridge stp llc nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack aufs isofs ppdev crct10dif_pclmul parport_pc crc32_pclmul input_leds joydev ghash_clmulni_intel parport serio_raw ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ...
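
The bracketed values in these kernel logs can be turned into an uptime at the moment of the crash. This is a sketch that assumes they are the usual printk timestamps, that is, seconds since boot; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 3600


def uptime_from_dmesg(stamp):
    """Convert a '[seconds.micros]' kernel log stamp into (days, weeks) of uptime."""
    seconds = float(stamp.strip("[] "))
    days = seconds / SECONDS_PER_DAY
    return days, days / 7


for label, stamp in [
    ("4.4.0-145 log", "[5493817.706210]"),
    ("4.4.0-104 log", "[19156101.592212]"),
]:
    days, weeks = uptime_from_dmesg(stamp)
    print(f"{label}: {days:.1f} days ({weeks:.1f} weeks) of uptime")
```

Under that assumption, the two later reports crashed after roughly 9 and 32 weeks of uptime respectively, so the 4.1-week interval seen on the original servers may not hold for every affected system.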
