Multiple netstat checks in OCF scripts should be fixed

Bug #1385755 reported by Vasilios Tzanoudakis
This bug affects 3 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Bogdan Dobrelya

Bug Description

5.1.1
1 fuel node
3 controller nodes + GRE on Dell 6220 chassis
2 Compute Nodes Dell R620.

I think the kernel dump happened during a mass creation of 90 instances at the same time; the creation completed successfully and the instances are still running just fine.

/var/log/kern.log of node-1.
The error appears many, many times in kern.log with the same CPU (19) and PID:
=================================================================================
[13359.264100] ------------[ cut here ]------------
[13359.264104] WARNING: CPU: 19 PID: 55147 at /build/buildd/linux-lts-saucy-3.11.0/fs/dcache.c:2561 prepend_path+0x1d4/0x1e0()
[13359.264104] Root dentry has weird name <>
[13359.264105] Modules linked in: xt_nat xt_conntrack xt_REDIRECT ipt_REJECT xt_tcpudp ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat iptable_mangle xt_mark veth xt_state xt_CT iptable_raw xt_comment xt_multiport ip6table_filter ip6_tables iptable_filter ip_tables x_tables openvswitch(OF) ttm drm_kms_helper drm sysimgblt mei_me sysfillrect syscopyarea mei wmi joydev dcdbas shpchp ioatdma lpc_ich mac_hid nf_conntrack_ipv6 acpi_pad nf_defrag_ipv6 nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 lp parport hid_generic usbhid hid ixgbe mpt2sas igb scsi_transport_sas raid_class ahci dca libahci i2c_algo_bit ptp mdio pps_core
[13359.264137] CPU: 19 PID: 55147 Comm: lsof Tainted: GF W O 3.11.0-18-generic #32~precise1-Ubuntu
[13359.264138] Hardware name: Dell Inc. PowerEdge C6220 II/09N44V, BIOS 2.4.2 04/15/2014
[13359.264139] 0000000000000a01 ffff880849167d48 ffffffff8173d60f 0000000000000007
[13359.264142] ffff880849167d98 ffff880849167d88 ffffffff8106540c ffff880849167df8
[13359.264144] ffff880849167eb8 0000000000000000 ffff880848ea9340 ffff880465469320
[13359.264147] Call Trace:
[13359.264150] [<ffffffff8173d60f>] dump_stack+0x46/0x58
[13359.264152] [<ffffffff8106540c>] warn_slowpath_common+0x8c/0xc0
[13359.264154] [<ffffffff810654f6>] warn_slowpath_fmt+0x46/0x50
[13359.264156] [<ffffffff811c9c24>] prepend_path+0x1d4/0x1e0
[13359.264158] [<ffffffff811cb5a6>] d_path+0xf6/0x180
[13359.264160] [<ffffffff8121c6cb>] proc_pid_readlink+0x9b/0xf0
[13359.264162] [<ffffffff811ba1d6>] SyS_readlinkat+0xf6/0x120
[13359.264164] [<ffffffff811ba21b>] SyS_readlink+0x1b/0x20
[13359.264167] [<ffffffff817521dd>] system_call_fastpath+0x1a/0x1f
[13359.264168] ---[ end trace 4c606bbadaee0c69 ]---
[13359.264870] ------------[ cut here ]------------

root@node-1:~# crm status
Last updated: Sat Oct 25 21:17:14 2014
Last change: Sat Oct 25 21:17:01 2014 via crm_attribute on node-5
Stack: classic openais (with plugin)
Current DC: node-1 - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured, 3 expected votes
22 Resources configured

Online: [ node-1 node-3 node-5 ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-1
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-1
 p_ceilometer-alarm-evaluator (ocf::mirantis:ceilometer-alarm-evaluator): Started node-5
 p_ceilometer-agent-central (ocf::mirantis:ceilometer-agent-central): Started node-1
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1 ]
     Slaves: [ node-3 node-5 ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1 node-3 node-5 ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1 node-3 node-5 ]
 p_heat-engine (ocf::mirantis:heat-engine): Started node-1
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-1 node-3 node-5 ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1 node-3 node-5 ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-5
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-1

controller specs:
===============
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x424
cpu MHz : 2600.065
cache size : 15360 KB
physical id : 1
siblings : 12
core id : 5
cpu cores : 6
apicid : 43
initial apicid : 43
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 5201.75
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Memory: 32 GB

Tags: scale
Revision history for this message
Andrey Korolyov (xdeller) wrote :

Would you mind attaching the output of the mount command on this exact node?

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

root@node-1:~# mount
/dev/sda3 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
/dev/mapper/mongo-mongodb on /var/lib/mongo type ext4 (rw)
/dev/sda2 on /boot type ext3 (rw)

root@node-1:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 47G 11G 34G 24% /
udev 16G 12K 16G 1% /dev
tmpfs 3.2G 352K 3.2G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 16G 37M 16G 1% /run/shm
/dev/mapper/mongo-mongodb 484G 34G 426G 8% /var/lib/mongo
/dev/sda2 185M 58M 119M 33% /boot

Revision history for this message
Andrey Korolyov (xdeller) wrote :

This chunk of code may fire on reaching some (high enough) nested directory depth while trying to resolve a path to which a descriptor of the target pid (for lsof) belongs. If you are sure the filesystems are not corrupted yet (no poweroffs/resets with a mounted fs), it would be interesting to find a deterministic way to reproduce this. If you are not sure the filesystems are healthy, re-checking both (/ and mongo's) may eliminate the issue.

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

Yes, I am pretty sure that the filesystems are clean. Looking closely at the timestamps of the logs, I think it happened during the puppet deployment of the node as a controller, after the initial installation of Ubuntu on the node. I will try to reproduce it.

Changed in fuel:
milestone: none → 6.0
importance: Undecided → Medium
status: New → Confirmed
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

From dup https://bugs.launchpad.net/mos/+bug/1387087:

https://bugs.launchpad.net/mos/+bug/1387087/+attachment/4247979/+files/atop-node-44.raw

As far as I can see from live monitoring with atop,
LVM | image-glance | busy 94% | read 0 | write 242 | KiB/r 0 | | KiB/w 3 | MBr/s 0.00 | MBw/s 0.38 | avq 1.01 | avio 7.59 ms |
DSK | sdb | busy 94% | read 0 | write 345 | KiB/r 0 | | KiB/w 2 | MBr/s 0.00 | MBw/s 0.38 | avq 1.01 | avio 5.32 ms |

There is no read/write load rate at all, yet sdb and the image-glance LVM are *extremely* busy...
According to the atop man page, "Disk: A busy percentage of 70% or higher is considered 'critical'."

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Linux (mos-linux)
importance: Medium → High
tags: added: scale
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

According to the kernel source, this function builds a buffer with the whole path. This stack trace indicates that the path was too deep, so the trace happened on a buffer overflow. Additionally, this stack trace happens on controllers with mongodb. To mitigate it, I suggest:

1. Try to deploy 100 nodes without mongo
2. Upgrade mongodb to 2.6.5, where the file system algorithms have been redesigned. It also has many bug fixes related to the file system

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote : Re: [Bug 1385755] Re: ubuntu kernel cpu dump for node-1 only

> This stack trace indicates that path was too deep, so trace happened on buffer overflow.
This warning <http://lxr.free-electrons.com/source/fs/dcache.c?v=3.11#L2557> happens only if path->dentry references the root of the
filesystem:

2558         if (IS_ROOT(dentry) &&
2559             (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) {
2560                 WARN(1, "Root dentry has weird name <%.*s>\n",
2561                      (int) dentry->d_name.len, dentry->d_name.name);
2562         }

The warning might indicate a filesystem problem (say, NFS failed to read
its dentries due to the network being down).

On the other hand, there are no warnings about the buffer being too short
(which is a pretty normal situation); instead the corresponding error code
is returned.


Changed in fuel:
assignee: MOS Linux (mos-linux) → Alexei Sheplyakov (asheplyakov)
Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote : Re: ubuntu kernel cpu dump for node-1 only

Just a clarification regarding the connection of the compute nodes to the storage unit.

The 2 compute nodes were both connected via dual 8 Gbit Fibre Channel using multipath (active-active) to an EMC VNX 5100 storage.

I have created one OSD for each compute node and have assigned 1 RAID group to each OSD.

Is there any upgrade package for mongo available? I am running the nightly version of 20 Oct.

Revision history for this message
Sergey Galkin (sgalkin) wrote :

Reproduced without MongoDB on a 100-node cluster.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> This chunk of code may fire reaching some (high enough) nested catalog depth trying to resolve some path to which descriptor for a destination pid (for lsof) belongs.

I don't think resolving a (possibly recursive) symlink can yield such a trace. The above-mentioned warning is issued
for root dentries only:

2553 global_root:
2554         /*
2555          * Filesystems needing to implement special "root names"
2556          * should do so with ->d_dname()
2557          */
2558         if (IS_ROOT(dentry) &&
2559             (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) {
2560                 WARN(1, "Root dentry has weird name <%.*s>\n",
2561                      (int) dentry->d_name.len, dentry->d_name.name);
2562         }

On the other hand, there are no warnings about the buffer being too short, or the symlink nesting level being too big
(which is a pretty normal situation); instead the corresponding error code is returned.

The warning might indicate a filesystem problem (say, NFS failed to read its dentries due to the network being down).
That said, I don't think the trace indicates root filesystem corruption (typically that results in a lot of other easily
visible problems).

However, many synthetic filesystems (anonpipefs, sockfs, inotifyfd, eventfd, signalfd, the internal tmpfs which implements Sys V
shared memory, you name it) have no mount point (so they qualify as a "global root"), and their root dentries might have no sensible names.

Most likely lsof attempts to find out the path corresponding to such a file descriptor. Also, stat might race with process exit
(so even if the filesystem in question implements a custom d_name for the root dentry, it might give meaningless values after
the process has exited).

The bottom line is that the above kernel trace is harmless, and MongoDB has very little to do with that trace.
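
Such path-less descriptors are easy to observe directly (a sketch, assuming a Linux /proc; the pipeline below is illustrative and not taken from the bug):

```shell
# fd 0 of the downstream process in a pipeline is backed by pipefs,
# a kernel-internal filesystem with no mount point. readlink reports
# "pipe:[inode]" rather than a real path -- the same nameless-root
# case that prepend_path() complains about when lsof walks every fd.
sleep 2 | cat &
pid=$!                                  # pid of 'cat'
target=$(readlink "/proc/$pid/fd/0")
echo "$target"                          # e.g. pipe:[123456]
kill "$pid" 2>/dev/null
```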

Revision history for this message
Andrey Korolyov (xdeller) wrote :

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3451538a114d738a6528b1da58e19e7d8964c647

It looks like there is a fix (although it is old, it did not make its way into 3.11), so nobody guessed right.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Andrey: the patch you've mentioned provides a sensible d_name for internal tmpfs (which implements sys V shared memory).

> Nobody was good in a guessing.

That's not exactly correct:

> many synthetic filesystems (anonpipefs, sockfs, inotifyfd, eventfd, signalfd, internal tmpfs which implements sys V
> shared memory, you name it) have no mount point (so they qualify as a "global root"), and their root dentries might
> have no sensible names.

Apparently this hypothesis is correct (and the trace is harmless and can be ignored).

Revision history for this message
Andrey Korolyov (xdeller) wrote :

That may be true, but looking at the last half of the patch (which *was* incorporated into 3.11, sorry for giving wrong info at a glance), it seems the error should never appear after the pseudofs behaviour was fixed (in the way you are proposing). Therefore the current cause is still unclear to me.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

$ apt-get source linux-lts-saucy-3.11.0
$ cd linux-lts-saucy-3.11.0
$ grep -e 'shmem_dname' -r mm
[nothing]

I don't think the "fix" has been included into 3.11 kernel (shipped by Ubuntu).

Revision history for this message
Andrey Korolyov (xdeller) wrote :

I meant the fact that the last two chunks were included, since d_alloc_pseudo was added at the same time.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

Guys, are there any updates on this issue?

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Roman, the issue is moot.
Basically, lsof tries to figure out the path corresponding to a file descriptor which represents a Sys V shared memory segment.
The kernel complains about the root dentry of shmfs (an in-kernel tmpfs which implements Sys V shared memory;
this tmpfs is not exposed to userspace directly) having no expected d_name.
The warning is harmless and can be ignored.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> The warning is harmless and can be ignored.

Marking the bug as invalid.

Changed in fuel:
status: Confirmed → Invalid
Changed in fuel:
status: Invalid → Confirmed
Revision history for this message
Tomasz 'Zen' Napierala (tzn) wrote :

We've just reproduced this bug in the scale lab.
It is caused by the "netstat -punt" command launched by corosync:
16762- 1 8850 8850 8850 ? -1 Ssl 0 4:51 /usr/sbin/corosync
16838- 8850 8875 8850 8850 ? -1 S 107 5:18 \_ /usr/lib/pacemaker/cib
16922- 8850 8876 8850 8850 ? -1 S 0 0:38 \_ /usr/lib/pacemaker/stonithd
17011- 8850 8877 8850 8850 ? -1 S 0 1:34 \_ /usr/lib/pacemaker/lrmd
17096- 8877 31182 31182 8850 ? -1 S 0 0:00 | \_ /bin/sh /usr/lib/ocf/resource.d/mirantis/heat-engine start
17220-31182 446 31182 8850 ? -1 S 0 0:00 | | \_ /bin/sh /usr/lib/ocf/resource.d/mirantis/heat-engine start
17348: 446 447 31182 8850 ? -1 R 0 0:00 | | \_ netstat -punt

netstat -p also causes the same behaviour.

I can reproduce it with 100% reliability.

It all starts after a massive-scale VM creation.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Tomasz,

> I can reproduce it with 100% effectiveness.

Just because it's reproducible doesn't mean it's a bug.

lsof/netstat inspect the file descriptors (found under /proc/$pid/fd) and try to figure out
the path corresponding to each of them. However, some of the file descriptors belong to
internal kernel filesystems/mounts (for instance, shmfs, anonpipefs, and the file descriptors
created by eventfd(), epoll_create(), inotify_init(), etc). The kernel complains about the root
dentry of the corresponding (pseudo)filesystem having no proper d_name.
The warning is harmless and can be ignored.
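
What these tools do per process can be mimicked in a couple of lines (an illustrative sketch, assuming a Linux /proc):

```shell
# Mimic the per-process part of lsof: enumerate /proc/<pid>/fd and
# resolve each link. Regular files resolve to real paths; descriptors
# from pipes, sockets, eventfd(), epoll_create(), inotify_init(), etc.
# resolve to "pipe:[...]", "socket:[...]" or "anon_inode:..." -- no
# filesystem path exists for them.
pid=$$
for fd in /proc/$pid/fd/*; do
    printf '%s -> %s\n' "$fd" "$(readlink "$fd")"
done
fd_count=$(ls /proc/$pid/fd | wc -l)
echo "fds inspected: $fd_count"
```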

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

We have a loop in the heat-engine OCF script's start action. It causes multiple invocations of the netstat -punt command while doing the AMQP connection check for heat-engine (the monitor action).
The solution is not to invoke AMQP checks in the start action.

Changed in fuel:
status: Invalid → Confirmed
assignee: Alexei Sheplyakov (asheplyakov) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Actually, the same applies to the neutron-dhcp, ceilometer-agent-central, and VMware nova-compute/network OCF scripts.

summary: - ubuntu kernel cpu dump for node-1 only
+ Multiple netstat checks in OCF scripts should be fixed
Revision history for this message
Andrey Korolyov (xdeller) wrote :

What exact problem are you fixing in these late proposals?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/139053

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The fix significantly decreases the number of issued netstat commands.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/139053
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=cb75bea0d0c9474c9d25502130f735399e8a94e5
Submitter: Jenkins
Branch: master

commit cb75bea0d0c9474c9d25502130f735399e8a94e5
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Dec 4 14:37:09 2014 +0100

    Do additional checks in OCF only for action monitor.

    W/o this change, the *monitor() functions of the OCF scripts
    do additional AMQP connection checks unconditionally, every
    time they are invoked. That could cause *too many* netstat
    invocations for the host OS to handle without issues.

    By design, we want these additional checks performed only when
    the OCF script is invoked as a monitor action, but not as a start
    or stop action, for example.

    The solution is:
    * check if $__OCF_ACTION = monitor and do AMQP checks only then.
    * align the sleep (3 sec) of the action start loop amongst all of
      the OCF scripts
    * replace netstat with the lsof tool, because its /proc FS based
      approach to parsing sockets fits better for highly loaded systems
      with many connections. Also, 'lsof -p $pid' will not search through
      all of the existing sockets - which could be tens of thousands -
      but only the ones from /proc/$pid/, which is very fast and puts
      no additional load on the host OS.

    Closes-bug: #1385755

    Change-Id: I4eba9fd5e3eaa6448590808ea85bdf152b2ef176
    Signed-off-by: Bogdan Dobrelya <email address hidden>
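
The gating the commit describes can be sketched roughly like this (a simplified illustration, not the actual fuel-library code; the function names and the AMQP port 5672 are assumptions):

```shell
# Run the expensive connectivity check only for the monitor action.
# OCF resource agents receive the invoked action in $__OCF_ACTION.
check_amqp_connection() {
    # 'lsof -p' reads only /proc/<pid>/fd, so it stays cheap even on a
    # host with tens of thousands of open sockets (unlike 'netstat -p').
    lsof -nP -p "$1" 2>/dev/null | grep -q ":5672"   # 5672: assumed AMQP port
}

extra_checks() {
    pid=$1
    if [ "${__OCF_ACTION}" = "monitor" ]; then
        check_amqp_connection "$pid" || return 1
    fi
    # start/stop loops skip the AMQP check entirely
    return 0
}
```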

Changed in fuel:
status: In Progress → Fix Committed