watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]

Bug #1772264 reported by Sriharsha BS
This bug affects 1 person
Affects: linux-azure (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hello Team,

I have a customer who hits this issue roughly once every 2 days. Here are the details of the bug:

May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]
May 14 05:24:21 localhost kernel: [6006808.160055] Modules linked in: ufs msdos xfs ip6table_filter ip6_tables iptable_filter nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security ip_tables x_tables udf crc_itu_t i2c_piix4 hv_balloon joydev i2c_core serio_raw ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper hid_hyperv cryptd hyperv_fb pata_acpi hv_utils cfbfillrect cfbimgblt ptp cfbcopyarea hid hyperv_keyboard hv_netvsc pps_core
May 14 05:24:21 localhost kernel: [6006808.160055] CPU: 5 PID: 5783 Comm: java Not tainted 4.13.0-1011-azure #14-Ubuntu
May 14 05:24:21 localhost kernel: [6006808.160055] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
May 14 05:24:21 localhost kernel: [6006808.160055] task: ffff8b91a48fc5c0 task.stack: ffffb5c4cd014000
May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0010:fsnotify+0x1f9/0x4f0
May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 0018:ffffb5c4cd017e08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
May 14 05:24:21 localhost kernel: [6006808.160055] RAX: 0000000000000001 RBX: ffff8ba0f6246020 RCX: 00000000ffffffff
May 14 05:24:21 localhost kernel: [6006808.160055] RDX: ffff8ba0f6246048 RSI: 0000000000000000 RDI: ffffffff9bc57020
May 14 05:24:21 localhost kernel: [6006808.160055] RBP: ffffb5c4cd017ea8 R08: 0000000000000000 R09: 0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] R10: ffffe93042d21080 R11: 0000000000000000 R12: 0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] R13: ffff8ba0f6246048 R14: 0000000000000000 R15: 0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] FS: 00007f154838a700(0000) GS:ffff8ba0fd740000(0000) knlGS:0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 14 05:24:21 localhost kernel: [6006808.160055] CR2: 00007f3fcc254000 CR3: 0000000165c38000 CR4: 00000000001406e0
May 14 05:24:21 localhost kernel: [6006808.160055] Call Trace:
May 14 05:24:21 localhost kernel: [6006808.160055] ? new_sync_write+0xe5/0x140
May 14 05:24:21 localhost kernel: [6006808.160055] vfs_write+0x15a/0x1b0
May 14 05:24:21 localhost kernel: [6006808.160055] ? syscall_trace_enter+0xcd/0x2f0
May 14 05:24:21 localhost kernel: [6006808.160055] SyS_write+0x55/0xc0
May 14 05:24:21 localhost kernel: [6006808.160055] do_syscall_64+0x61/0xd0
May 14 05:24:21 localhost kernel: [6006808.160055] entry_SYSCALL64_slow_path+0x25/0x25
May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0033:0x7f489076a2dd    # user-space instruction pointer (CS 0x33) saved at syscall entry; the kernel itself was stuck at fsnotify+0x1f9 (see the "RIP: 0010" line above)
May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 002b:00007f15483872a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
May 14 05:24:21 localhost kernel: [6006808.160055] RAX: ffffffffffffffda RBX: 00007f1548389380 RCX: 00007f489076a2dd
May 14 05:24:21 localhost kernel: [6006808.160055] RDX: 0000000000001740 RSI: 00007f1548387310 RDI: 000000000000063c
May 14 05:24:21 localhost kernel: [6006808.160055] RBP: 00007f15483872d0 R08: 00007f1548387310 R09: 00007f418a55b0b8
May 14 05:24:21 localhost kernel: [6006808.160055] R10: 00000000005b31ee R11: 0000000000000293 R12: 0000000000001740
May 14 05:24:21 localhost kernel: [6006808.160055] R13: 00007f1548387310 R14: 000000000000063c R15: 00007f40000051e0
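When the same report has to be collected from many servers, the key fields of a trace like the one above can be pulled out of the kernel log automatically. The following is a minimal sketch, not a tool used in this bug: the function name `summarize_soft_lockups` and the regular expressions are illustrative assumptions, written against the log format shown above.

```python
import re

# Matches the watchdog header, e.g.
# "watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# Matches a kernel-mode RIP line (code segment 0010), e.g.
# "RIP: 0010:fsnotify+0x1f9/0x4f0". User-mode RIP lines (0033) are ignored.
KERNEL_RIP_RE = re.compile(
    r"RIP: 0010:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_soft_lockups(lines):
    """Return one dict per soft lockup report found in the given log lines."""
    reports = []
    for line in lines:
        header = LOCKUP_RE.search(line)
        if header:
            reports.append({
                "cpu": int(header["cpu"]),
                "seconds": int(header["secs"]),
                "comm": header["comm"],
                "pid": int(header["pid"]),
                "kernel_symbol": None,  # filled in by the next kernel RIP line
            })
            continue
        rip = KERNEL_RIP_RE.search(line)
        if rip and reports and reports[-1]["kernel_symbol"] is None:
            reports[-1]["kernel_symbol"] = rip["symbol"]
    return reports
```

Fed the lines of the trace above, this would report CPU 5, 23 seconds, process java (PID 5783), and the kernel symbol fsnotify.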

The customer runs Elasticsearch and therefore opened an issue on the Elasticsearch GitHub repository. The maintainers there replied that this is a kernel issue, not an Elasticsearch issue:

GitHub issue for reference:
https://github.com/elastic/elasticsearch/issues/30667

For now, I have asked him to increase the kernel.watchdog_thresh parameter from 10 to 20 as a temporary mitigation.
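For context on that setting: the kernel documents the soft lockup threshold as twice kernel.watchdog_thresh, so the default of 10 gives a 20 second threshold and a value of 20 gives 40 seconds. The sketch below only illustrates that arithmetic; the function names are hypothetical, and raising the threshold hides shorter stalls from the watchdog rather than removing their cause.

```python
from pathlib import Path

# Standard procfs location of the sysctl kernel.watchdog_thresh.
WATCHDOG_THRESH_PATH = Path("/proc/sys/kernel/watchdog_thresh")


def soft_lockup_threshold(watchdog_thresh):
    """Soft lockup threshold in seconds, i.e. 2 * kernel.watchdog_thresh."""
    return 2 * watchdog_thresh


def would_be_reported(stuck_seconds, watchdog_thresh):
    """True if a stall of `stuck_seconds` exceeds the soft lockup threshold."""
    return stuck_seconds > soft_lockup_threshold(watchdog_thresh)


def read_watchdog_thresh(path=WATCHDOG_THRESH_PATH):
    """Read the current kernel.watchdog_thresh value from procfs (Linux only)."""
    return int(Path(path).read_text().strip())
```

With these helpers, the 23 second stall from the trace exceeds the 20 second threshold that applies at watchdog_thresh=10, but not the 40 second threshold at watchdog_thresh=20.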

The customer wants to know for certain whether this is a kernel bug. I have also asked him to update the kernel. He is willing to do so on all 65 servers, but only once it is confirmed that this is a bug in the current kernel.

The customer also filed a bug with the team responsible for the Java process that appears to trigger the issue. Their reply was that it is a kernel issue, and they pointed to the following Launchpad bug, although I personally do not think it applies here. I may be wrong, however:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717

Resource usage of the Java process on the customer's machine:

Avg. Load: Avg=3, max=9
CPU: Avg=29, max=73
MEM: Avg=18, max=23

CGROUP:
ubuntu@prod-elasticsearch-data-008:~$ cat /proc/1399/cgroup
12:rdma:/
11:devices:/system.slice/elasticsearch.service
10:pids:/system.slice/elasticsearch.service
9:cpuset:/
8:blkio:/system.slice/elasticsearch.service
7:memory:/system.slice/elasticsearch.service
6:perf_event:/
5:cpu,cpuacct:/system.slice/elasticsearch.service
4:net_cls,net_prio:/
3:freezer:/
2:hugetlb:/
1:name=systemd:/system.slice/elasticsearch.service
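Each line of that listing has the form hierarchy-ID:controller-list:cgroup-path. To compare cgroup membership across many hosts, the output could be parsed into a mapping, for example as in this minimal sketch (the function name `parse_cgroup` is an illustrative assumption, and only the cgroup v1 layout shown above is considered):

```python
def parse_cgroup(text):
    """Map each controller name to its cgroup path.

    `text` is the content of /proc/<pid>/cgroup in the cgroup v1 layout,
    one "hierarchy-ID:controller-list:cgroup-path" entry per line.
    """
    membership = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice so that a path containing ':' stays intact.
        _hierarchy_id, controllers, path = line.split(":", 2)
        # Co-mounted controllers such as "cpu,cpuacct" share one path.
        for controller in controllers.split(","):
            membership[controller] = path
    return membership
```

For the listing above, this maps the cpu, cpuacct, memory, blkio, pids and devices controllers to /system.slice/elasticsearch.service, and controllers such as cpuset to the root cgroup /.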

/etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"

Kernel Version : 4.13.0-1011-azure #14-Ubuntu

Please let me know your thoughts on the information above. If more information is required, I will be happy to gather and provide it.

Regards,
Sriharsha B S,
Microsoft Azure Linux Team

Joshua R. Poulson (jrp) wrote :

Indeed, this problem should be fixed by the 4.13.0-1017 that's currently in -proposed on its way to -updates. The bug for the race condition should be https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1765564

Joshua R. Poulson (jrp) wrote :

FYI, the "4.13.0-38.43" set of fixes referenced in the bug you mentioned has been in Linux-azure since 4.13.0-1013. Looking at your trace, I think the fsnotify/VFS race condition I referenced in the previous comment may be more applicable.

Oded David (dudidu) wrote :

Hi,

We are currently running kernel version 4.13.0-1011-azure #14-Ubuntu in production, and we are about to update to kernel version 4.13.0-1016-azure #14-Ubuntu.

Is there a workaround for this issue, or any update on when the fixed kernel version will be released?

The issue affects our production environment: the machines freeze and fail every couple of days.

Oded David (dudidu) wrote :

Hi,

Can you please confirm that the issue is fixed in 4.13.0-1018-azure?

Joshua R. Poulson (jrp) wrote :

The upstream race in fsnotify was fixed in 4.13.0-1018 (linux-azure). I think it is the same issue that you are seeing, but I do not have a repro of your exact issue.
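To check a fleet of machines against that version, the kernel release string (as printed by `uname -r`, for example 4.13.0-1011-azure) can be compared numerically. This is a minimal sketch under the assumption that the fix first shipped in 4.13.0-1018, as stated in this thread; the names `parse_azure_release` and `has_fsnotify_fix` are hypothetical, and the comparison is only meaningful within the linux-azure 4.13 series.

```python
import re

# Release strings such as "4.13.0-1018-azure" (the "-azure" suffix is optional).
AZURE_RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)(?:-azure)?$")


def parse_azure_release(release):
    """Turn '4.13.0-1011-azure' into the comparable tuple (4, 13, 0, 1011)."""
    match = AZURE_RELEASE_RE.match(release.strip())
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


# Assumption taken from this thread: the fsnotify race fix is in 4.13.0-1018.
FIXED_IN = parse_azure_release("4.13.0-1018")


def has_fsnotify_fix(release):
    """True if `release` is at or after the first build reported as fixed."""
    return parse_azure_release(release) >= FIXED_IN
```

Under that assumption, both 4.13.0-1011-azure and 4.13.0-1016-azure are older than the fixed build, while 4.13.0-1018-azure includes it.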
