Action monitor for ns_haproxy OCF takes too long under load

Bug #1424959 reported by Leontiy Istomin on 2015-02-24
This bug affects 2 people
Affects              Importance   Assigned to
Mirantis OpenStack   High         Bogdan Dobrelya
  5.0.x              High         Bogdan Dobrelya
  5.1.x              High         Bogdan Dobrelya
  6.0.x              High         Bogdan Dobrelya
  6.1.x              High         Bogdan Dobrelya

Bug Description

[root@fuel ~]# fuel --fuel-version
api: '1.0'
astute_sha: 1f87a9b9a47de7498b4061d15a8c7fb9435709d5
auth_required: true
build_id: 2015-02-13_16-48-42
build_number: '120'
feature_groups:
- mirantis
fuellib_sha: 7f8d4382abfcd4338964182ebfea1d539f963e66
fuelmain_sha: 4b6bcf60d712b74a3a53252bfbb5ca34aa7497c4
nailgun_sha: 1e3a40dd8a17abe1d38f42da1e0dc1a6d4572666
ostf_sha: f9c37d0876141e1550eb4e703a8e500cd463282f
production: docker
python-fuelclient_sha: 61431ed16fc00039a269424bdbaa410277eff609
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: 1f87a9b9a47de7498b4061d15a8c7fb9435709d5
      build_id: 2015-02-13_16-48-42
      build_number: '120'
      feature_groups:
      - mirantis
      fuellib_sha: 7f8d4382abfcd4338964182ebfea1d539f963e66
      fuelmain_sha: 4b6bcf60d712b74a3a53252bfbb5ca34aa7497c4
      nailgun_sha: 1e3a40dd8a17abe1d38f42da1e0dc1a6d4572666
      ostf_sha: f9c37d0876141e1550eb4e703a8e500cd463282f
      production: docker
      python-fuelclient_sha: 61431ed16fc00039a269424bdbaa410277eff609
      release: '6.1'
Baremetal, CentOS, HA, Neutron-GRE, Ceph-all, Debug, nova-quotas, 6.1_120
Controllers: 3, Computes: 47

Deployment was successful.
I've added a Glance image due to bug https://bugs.launchpad.net/mos/+bug/1415954:
glance image-create --name 'TestVM' --is-public true --container-format='bare' --disk-format='qcow2' --min-ram=64 --property murano_image_info='{"title": "Murano Demo", "type": "cirros.demo"}' --file '/opt/vm/cirros-x86_64-disk.img'
I've added the following line to /usr/bin/keystone-all due to bug https://bugs.launchpad.net/mos/+bug/1413341:
147 config.setup_logging()

I hit the following exception during a Rally test:

2015-02-20 22:30:52.125 20140 ERROR rally.benchmark.runners.base [-] Failed to get the resource <Volume: 9abbe8cd-12ae-45a9-aa57-1b18bced545d>: Unable to establish connection: ('Connection aborted.', BadStatusLine("''",))
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base Traceback (most recent call last):
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/runners/base.py", line 77, in _run_scenario_once
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base method_name)(**kwargs) or scenario_output
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/cinder/volumes.py", line 105, in create_and_delete_volume
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base self._delete_volume(volume)
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/base.py", line 254, in func_atomic_actions
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base f = func(self, *args, **kwargs)
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/cinder/utils.py", line 107, in _delete_volume
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base check_interval=CONF.benchmark.cinder_volume_delete_poll_interval
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/utils.py", line 130, in wait_for_delete
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base resource = update_resource(resource)
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/utils.py", line 54, in _get_from_manager
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base raise exceptions.GetResourceFailure(resource=resource, err=e)
2015-02-20 22:30:52.125 20140 TRACE rally.benchmark.runners.base GetResourceFailure: Failed to get the resource <Volume: 9abbe8cd-12ae-45a9-aa57-1b18bced545d>: Unable to establish connection: ('Connection aborted.', BadStatusLine("''",))

At the same time, I found the following in the messages log on the primary controller node:
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy Stats started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy horizon started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy keystone-1 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy keystone-2 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-api-1 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-api-2 started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-metadata-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy cinder-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy glance-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy neutron started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy glance-registry started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy rabbitmq started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy mysqld started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy radosgw started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api-cfn started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy heat-api-cloudwatch started.
<133>Feb 20 22:30:52 node-1 haproxy[13322]: Proxy nova-novncproxy started.

From the syslog log file:
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-1 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-44 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Feb 20 22:30:53 node-1 haproxy[13334]: Server radosgw/node-49 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Feb 20 22:30:53 node-1 haproxy[13334]: proxy radosgw has no server available!

From the pacemaker log:
Feb 20 22:30:54 [23083] node-1.domain.tld lrmd: info: log_finished: finished - rsc:p_haproxy action:start call_id:241 pid:13266 exit-code:0 exec-time:1963ms queue-time:0ms

[root@node-1 ~]# crm status
Last updated: Tue Feb 24 07:49:38 2015
Last change: Thu Feb 19 23:39:25 2015
Stack: corosync
Current DC: node-1.domain.tld (1) - partition with quorum
Version: 1.1.11-97629de
3 Nodes configured
27 Resources configured

Online: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]

 vip__public (ocf::fuel:ns_IPaddr2): Started node-1.domain.tld
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-44.domain.tld
 Clone Set: clone_p_openstack-heat-engine [p_openstack-heat-engine]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 p_neutron-dhcp-agent (ocf::fuel:ocf-neutron-dhcp-agent): Started node-49.domain.tld
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     p_neutron-l3-agent (ocf::fuel:ocf-neutron-l3-agent): FAILED node-1.domain.tld (unmanaged)
     p_neutron-l3-agent (ocf::fuel:ocf-neutron-l3-agent): FAILED node-49.domain.tld (unmanaged)
     p_neutron-l3-agent (ocf::fuel:ocf-neutron-l3-agent): FAILED node-44.domain.tld (unmanaged)
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1.domain.tld ]
     Slaves: [ node-44.domain.tld node-49.domain.tld ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.domain.tld node-44.domain.tld node-49.domain.tld ]

Failed actions:
    ping_vip__public_monitor_20000 on node-1.domain.tld 'unknown error' (1): call=232, status=Timed Out, last-rc-change='Tue Feb 24 02:10:48 2015', queued=0ms, exec=0ms
    p_neutron-l3-agent_stop_0 on node-1.domain.tld 'unknown error' (1): call=192, status=Timed Out, last-rc-change='Fri Feb 20 18:10:43 2015', queued=0ms, exec=60000ms
    p_neutron-l3-agent_stop_0 on node-1.domain.tld 'unknown error' (1): call=192, status=Timed Out, last-rc-change='Fri Feb 20 18:10:43 2015', queued=0ms, exec=60000ms
    p_neutron-metadata-agent_monitor_60000 on node-1.domain.tld 'unknown error' (1): call=648, status=Timed Out, last-rc-change='Tue Feb 24 02:10:48 2015', queued=0ms, exec=0ms
    p_neutron-openvswitch-agent_monitor_20000 on node-1.domain.tld 'unknown error' (1): call=717, status=Timed Out, last-rc-change='Tue Feb 24 06:45:48 2015', queued=0ms, exec=0ms
    p_neutron-dhcp-agent_monitor_20000 on node-1.domain.tld 'unknown error' (1): call=129, status=Timed Out, last-rc-change='Fri Feb 20 18:10:42 2015', queued=0ms, exec=0ms
    ping_vip__public_monitor_20000 on node-49.domain.tld 'unknown error' (1): call=70, status=Timed Out, last-rc-change='Sun Feb 22 03:30:41 2015', queued=0ms, exec=0ms
    p_neutron-openvswitch-agent_monitor_20000 on node-49.domain.tld 'unknown error' (1): call=774, status=Timed Out, last-rc-change='Tue Feb 24 06:48:14 2015', queued=0ms, exec=0ms
    p_neutron-dhcp-agent_monitor_20000 on node-49.domain.tld 'unknown error' (1): call=776, status=Timed Out, last-rc-change='Tue Feb 24 06:30:50 2015', queued=0ms, exec=0ms
    p_neutron-l3-agent_stop_0 on node-49.domain.tld 'unknown error' (1): call=164, status=Timed Out, last-rc-change='Fri Feb 20 15:40:32 2015', queued=0ms, exec=60000ms
    p_neutron-l3-agent_stop_0 on node-49.domain.tld 'unknown error' (1): call=164, status=Timed Out, last-rc-change='Fri Feb 20 15:40:32 2015', queued=0ms, exec=60000ms
    p_neutron-metadata-agent_monitor_60000 on node-49.domain.tld 'unknown error' (1): call=640, status=Timed Out, last-rc-change='Tue Feb 24 06:48:14 2015', queued=0ms, exec=0ms
    ping_vip__public_monitor_20000 on node-44.domain.tld 'unknown error' (1): call=557, status=Timed Out, last-rc-change='Mon Feb 23 15:40:47 2015', queued=0ms, exec=0ms
    p_neutron-openvswitch-agent_monitor_20000 on node-44.domain.tld 'unknown error' (1): call=732, status=Timed Out, last-rc-change='Tue Feb 24 05:40:51 2015', queued=0ms, exec=0ms
    p_neutron-dhcp-agent_monitor_20000 on node-44.domain.tld 'unknown error' (1): call=238, status=Timed Out, last-rc-change='Fri Feb 20 22:50:50 2015', queued=0ms, exec=0ms
    p_neutron-l3-agent_stop_0 on node-44.domain.tld 'unknown error' (1): call=164, status=Timed Out, last-rc-change='Fri Feb 20 10:40:19 2015', queued=0ms, exec=60001ms
    p_neutron-l3-agent_stop_0 on node-44.domain.tld 'unknown error' (1): call=164, status=Timed Out, last-rc-change='Fri Feb 20 10:40:19 2015', queued=0ms, exec=60001ms
    p_neutron-metadata-agent_monitor_60000 on node-44.domain.tld 'unknown error' (1): call=602, status=Timed Out, last-rc-change='Mon Feb 23 21:20:21 2015', queued=0ms, exec=0ms

Bogdan Dobrelya (bogdando) wrote :

According to the logs from the affected env, the haproxy instances on node-44 and node-49 weren't restarted, so the downtime was temporary and local to node-1. This should be considered a medium impact, I believe.

The instance of haproxy at node-1 was restarted due to timeout of monitor operation:
pacemaker.log:
Feb 20 22:30:50 [23083] node-1.domain.tld lrmd: warning: child_timeout_callback: p_haproxy_monitor_20000 process (PID 12917) timed out
Feb 20 22:30:51 [23083] node-1.domain.tld lrmd: warning: operation_finished: p_haproxy_monitor_20000:12917 - timed out after 10000ms
Feb 20 22:30:51 [23086] node-1.domain.tld crmd: error: process_lrm_event: Operation p_haproxy_monitor_20000: Timed Out (node=node-1.domain.tld, call=238, timeout=10000ms)

The root cause is not yet known, but it might be related to a critical level of I/O load:
atop 2015/02/20 22:30:33 - 2015/02/20 22:30:53:
LVM | os-root | busy 95% | read 1782 | write 432390 | KiB/r 35 | KiB/w 3 | MBr/s 3.10 | MBw/s 84.45 | avio 0.04 ms |
DSK | sda | busy 95% | read 1881 | write 3948 | KiB/r 34 | KiB/w 457 | MBr/s 3.16 | MBw/s 88.25 | avio 3.23 ms
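
For reference, a sketch of how this disk pressure can be replayed after the fact from the atop raw logs (the path below assumes atop's default /var/log/atop layout on the controller; adjust to where atop writes its raw files on your node):

# Replay the raw log for 2015-02-20, starting near the incident window
# (press 'd' inside atop for the disk-oriented process view).
atop -r /var/log/atop/atop_20150220 -b 22:30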

Changed in mos:
milestone: none → 6.1
importance: Undecided → Medium
status: New → Incomplete
Leontiy Istomin (listomin) wrote :

atop shows that the jbd2/dm-0-8 process uses 95% of the IOPS. This occurred on each controller node, but corosync restarted resources on only one of them.

Bogdan Dobrelya (bogdando) wrote :

Raised to High, as it looks like the haproxy RA is affected by performance issues, and this could impact cluster operations.

summary: - corosync restarts resources during rally tests
+ Action monitor for ns_haproxy OCF takes too long under load
Changed in mos:
importance: Medium → High
Bogdan Dobrelya (bogdando) wrote :

Attached are the lrmd and crmd logs from node-1. There are multiple timeouts for monitor actions, starting from ./node-1.domain.tld/lrmd.log:2015-02-20T11:40:26.774728+00:00, and there are multiple "PID will not die" events as well.

Also, the atop logs show that haproxy monitor actions can sometimes take more than 20 seconds.
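
A sketch of how those monitor durations can be pulled out of the attached lrmd log (the path follows the attachment layout mentioned above, and the grep pattern assumes the lrmd "finished" log format shown earlier in this report):

# List exec-time values of finished p_haproxy monitor actions, slowest last.
grep 'rsc:p_haproxy action:monitor' ./node-1.domain.tld/lrmd.log \
  | grep -o 'exec-time:[0-9]*ms' | sort -t: -k2 -n | tail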

Bogdan Dobrelya (bogdando) wrote :
Changed in mos:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: Incomplete → Confirmed
Changed in mos:
assignee: Fuel Library Team (fuel-library) → svasheka (svasheka)
Bogdan Dobrelya (bogdando) wrote :

The solution is to increase the monitor interval and timeout, add logging to the haproxy OCF script, and optimize its monitor action if possible.
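
A minimal sketch of the configuration part of that change, using crmsh as on the affected controllers. The 30s/60s values are illustrative assumptions, not the values from the actual fix, and the OCF logging/monitor optimization are script changes not shown here:

# The logs above show the current op as interval=20s, timeout=10s.
crm configure show p_haproxy
# Edit the primitive and raise the monitor op, e.g. to:
#   op monitor interval=30s timeout=60s
crm configure edit p_haproxy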

Changed in mos:
status: Confirmed → Triaged
assignee: svasheka (svasheka) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

"PID will not die" issue can be related to https://bugs.launchpad.net/fuel/+bug/1425579

Bogdan Dobrelya (bogdando) wrote :
Changed in mos:
status: Triaged → In Progress
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 5.0-updates and 5.1.1-updates: this is not a customer-found issue, and the change is too big to backport safely (77 LOC in the OCF script).
