Collector services are only running on one controller node in HA deployment

Bug #1593137 reported by Simon Pasquier
Affects: StackLight
Status: Fix Released
Importance: Critical
Assigned to: Simon Pasquier

Bug Description

Tested on an HA deployment with MOS9 (#486) and master branch for the plugins.

After the deployment, the hekad services are running on only one controller:

Last updated: Thu Jun 16 07:50:36 2016 Last change: Thu Jun 16 07:42:06 2016 by root via crm_resource on node-5.test.domain.local
Stack: corosync
Current DC: node-1.test.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 52 resources configured

Online: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-5.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-5.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-2.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 sysinfo_node-5.test.domain.local (ocf::pacemaker:SysInfo): Started node-5.test.domain.local
 sysinfo_node-2.test.domain.local (ocf::pacemaker:SysInfo): Started node-2.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-5.test.domain.local ]
     Slaves: [ node-1.test.domain.local node-2.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_neutron-openvswitch-agent [neutron-openvswitch-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_neutron-l3-agent [neutron-l3-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_neutron-metadata-agent [neutron-metadata-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_neutron-dhcp-agent [neutron-dhcp-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 sysinfo_node-1.test.domain.local (ocf::pacemaker:SysInfo): Started node-1.test.domain.local
 Clone Set: clone_metric_collector [metric_collector]
     Started: [ node-5.test.domain.local ]
 Clone Set: clone_log_collector [log_collector]
     Started: [ node-5.test.domain.local ]
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]

metric_collector and log_collector resources should be running on node-1, node-2 and node-5.
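The mismatch can also be spotted mechanically. The following is a minimal sketch, not part of the plugin: it assumes the status text has the same layout as the output pasted above (an `Online: [ ... ]` line, then `Clone Set:` lines each followed by a `Started: [ ... ]` line), and the function name `under_replicated_clones` is hypothetical. Clone sets that are not started anywhere have no `Started:` line and are not reported by this sketch.

```python
from __future__ import annotations

import re


def under_replicated_clones(status_text: str) -> dict[str, list[str]]:
    """Map each clone set to the online nodes it is NOT started on."""
    online: set[str] = set()
    incomplete: dict[str, list[str]] = {}
    current_clone = None
    for raw_line in status_text.splitlines():
        line = raw_line.strip()
        match = re.match(r"Online: \[(.*)\]$", line)
        if match:
            online = set(match.group(1).split())
            continue
        match = re.match(r"Clone Set: (\S+) \[", line)
        if match:
            current_clone = match.group(1)
            continue
        match = re.match(r"Started: \[(.*)\]$", line)
        if match and current_clone is not None:
            missing = sorted(online - set(match.group(1).split()))
            if missing:
                incomplete[current_clone] = missing
            current_clone = None
    return incomplete


# Excerpt of the status output from this report.
SAMPLE = """\
Online: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]

 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_metric_collector [metric_collector]
     Started: [ node-5.test.domain.local ]
 Clone Set: clone_log_collector [log_collector]
     Started: [ node-5.test.domain.local ]
"""

if __name__ == "__main__":
    for clone, nodes in under_replicated_clones(SAMPLE).items():
        print(clone, "missing on", ", ".join(nodes))
```

On the excerpt, only the two collector clone sets are reported, each missing on node-1 and node-2.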

Tags: mos9
Changed in lma-toolchain:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
milestone: none → 1.0.0
milestone: 1.0.0 → 0.10.0
tags: added: mos9
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-lma-collector (master)

Fix proposed to branch: master
Review: https://review.openstack.org/330393

Changed in lma-toolchain:
assignee: LMA-Toolchain Fuel Plugins (mos-lma-toolchain) → Simon Pasquier (simon-pasquier)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-lma-collector (master)

Reviewed: https://review.openstack.org/330393
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=1e578546a9189856f6b632f6a9e254494ee04962
Submitter: Jenkins
Branch: master

commit 1e578546a9189856f6b632f6a9e254494ee04962
Author: Simon Pasquier <email address hidden>
Date: Thu Jun 16 11:06:44 2016 +0200

    Fix the Pacemaker resources on MOS 9

    This change removes the migration-threshold and failure-timeout
    parameters for the collector resources. Otherwise Pacemaker will forbid
    the resource from the node if it fails too many times. And the
    declaration of the resources in Pacemaker needs to happen after the
    services have been installed too.

    Change-Id: Ia8d96ccce4a25e4a1919419cba9b415bd06c65d1
    Closes-Bug: #1593137

Changed in lma-toolchain:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-lma-collector (master)

Fix proposed to branch: master
Review: https://review.openstack.org/332864

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-lma-collector (master)

Reviewed: https://review.openstack.org/332864
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=97a3c6dfa957b70c00aeff9d8709d9c49c137b32
Submitter: Jenkins
Branch: master

commit 97a3c6dfa957b70c00aeff9d8709d9c49c137b32
Author: Simon Pasquier <email address hidden>
Date: Wed Jun 22 17:10:48 2016 +0200

    Don't return OCF_ERR_INSTALLED in the OCF script

    OCF_ERR_INSTALLED is considered by Pacemaker as a hard error meaning
    that it won't try to restart the resource. When deploying the
    collectors, the configuration may not be ready at the same time on all
    the nodes.

    Change-Id: I878d38747e9b84bfebb8e605cb01510c70cf8633
    Closes-Bug: #1593137
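
    As background for the commit message above, here is a minimal sketch of how the standard OCF exit codes map to Pacemaker's recovery types. The numeric values are the standard OCF ones; the soft/hard/fatal classification is reproduced from the Pacemaker documentation as best recalled and should be checked against the version in use. The helper names are hypothetical, and the sketch does not show what the plugin's OCF script returns instead of OCF_ERR_INSTALLED.

```python
from enum import IntEnum


class OcfRc(IntEnum):
    """Standard OCF resource agent exit codes (subset 0-7)."""
    OCF_SUCCESS = 0
    OCF_ERR_GENERIC = 1
    OCF_ERR_ARGS = 2
    OCF_ERR_UNIMPLEMENTED = 3
    OCF_ERR_PERM = 4
    OCF_ERR_INSTALLED = 5
    OCF_ERR_CONFIGURED = 6
    OCF_NOT_RUNNING = 7


# Assumed classification, to be verified against the Pacemaker docs:
# soft  = recovery may be attempted on the same node,
# hard  = the resource is moved away and not retried on this node,
# fatal = the resource is stopped cluster-wide.
_RECOVERY = {
    int(OcfRc.OCF_ERR_GENERIC): "soft",
    int(OcfRc.OCF_ERR_ARGS): "hard",
    int(OcfRc.OCF_ERR_UNIMPLEMENTED): "hard",
    int(OcfRc.OCF_ERR_PERM): "hard",
    int(OcfRc.OCF_ERR_INSTALLED): "hard",
    int(OcfRc.OCF_ERR_CONFIGURED): "fatal",
}


def recovery_type(rc: int) -> str:
    """Return 'soft', 'hard', 'fatal', or 'n/a' for codes that are not errors."""
    return _RECOVERY.get(int(rc), "n/a")


def retried_on_same_node(rc: int) -> bool:
    """True only for soft errors, which Pacemaker may recover in place."""
    return recovery_type(rc) == "soft"


if __name__ == "__main__":
    for code in OcfRc:
        print(f"{code.value} {code.name}: {recovery_type(code)}")
```

    Under this classification, a collector that reports OCF_ERR_INSTALLED while its configuration is not yet in place is not restarted on that node, which matches the symptom in this report.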

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/333915
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=295e76f4576f9796ca315307049a4b013e98406a
Submitter: Jenkins
Branch: master

commit 295e76f4576f9796ca315307049a4b013e98406a
Author: Simon Pasquier <email address hidden>
Date: Fri Jun 24 15:28:21 2016 +0200

    Restore failure-timeout parameter for collectors

    Without this parameter being set, Pacemaker doesn't cleanup and restart
    the resource if it fails too many times.

    Change-Id: I185e317969aec389e883e575c120d3a902d677e7
    Closes-Bug: #1593137
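
    The interaction between migration-threshold and failure-timeout can be illustrated with a deliberately simplified model. This is a sketch, not Pacemaker's implementation: the class name `NodeFailures` is hypothetical, the 120-second failure-timeout is an illustrative value that does not come from this report, and the threshold of 3 is the value set by a related change on this bug. Real Pacemaker only re-evaluates expired failures when the cluster is rechecked, which the model ignores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeFailures:
    """Toy model of one resource's failure bookkeeping on one node."""
    migration_threshold: Optional[int]  # None: the node is never banned
    failure_timeout: Optional[float]    # seconds; None: failures never expire
    fail_count: int = 0
    last_failure: Optional[float] = None

    def _expire(self, now: float) -> None:
        # Forget old failures once failure_timeout has elapsed since the last one.
        if (
            self.failure_timeout is not None
            and self.last_failure is not None
            and now - self.last_failure >= self.failure_timeout
        ):
            self.fail_count = 0
            self.last_failure = None

    def record_failure(self, now: float) -> None:
        self._expire(now)
        self.fail_count += 1
        self.last_failure = now

    def allowed(self, now: float) -> bool:
        """Whether the resource may still run on this node at time `now`."""
        self._expire(now)
        if self.migration_threshold is None:
            return True
        return self.fail_count < self.migration_threshold


if __name__ == "__main__":
    with_timeout = NodeFailures(migration_threshold=3, failure_timeout=120.0)
    without_timeout = NodeFailures(migration_threshold=3, failure_timeout=None)
    for t in (0.0, 10.0, 20.0):
        with_timeout.record_failure(t)
        without_timeout.record_failure(t)
    print(with_timeout.allowed(30.0), with_timeout.allowed(140.0))
    print(without_timeout.allowed(30.0), without_timeout.allowed(140.0))
```

    In the model, a node that hit the threshold becomes eligible again only when a failure-timeout is set; without one it stays banned until an operator cleans up the resource.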

Swann Croiset (swann-w) wrote :

Bug still occurs:

pacemaker log:

Jun 30 14:29:02 [17627] node-2.test.domain.local cib: info: cib_perform_op: Diff: --- 0.110.0 2
Jun 30 14:29:02 [17627] node-2.test.domain.local cib: info: cib_perform_op: Diff: +++ 0.110.1 (null)
Jun 30 14:29:02 [17627] node-2.test.domain.local cib: info: cib_perform_op: + /cib: @num_updates=1
Jun 30 14:29:02 [17627] node-2.test.domain.local cib: info: cib_perform_op: ++ /cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources: <lrm_resource id="metric_collector" type="ocf-metric_collector" class="ocf" provider="fuel"/>
Jun 30 14:29:02 [17627] node-2.test.domain.local cib: info: cib_perform_op: ++ <lrm_rsc_op id="metric_collector_last_0" operation_key="metric_collector_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="46:714:7:1993769d-31f0-48f1-b7ca-4a9c69a466f2" transition-magic="0:7;46:714:7:1993769d-31f0-48f1-b7ca-4a9c69a466f2" on_node="node-4.test.domain.local" call-id="160" rc-code="7
Jun 30 14:29:02 [17627] node-2.test.domain.local cib: info: cib_perform_op: ++ </lrm_resource>
Jun 30 14:29:02 [17629] node-2.test.domain.local lrmd: info: process_lrmd_get_rsc_info: Resource 'metric_collector' not found (18 active resources)
Jun 30 14:29:02 [17629] node-2.test.domain.local lrmd: info: process_lrmd_get_rsc_info: Resource 'metric_collector:0' not found (18 active resources)
Jun 30 14:29:02 [17629] node-2.test.domain.local lrmd: info: process_lrmd_rsc_register: Added 'metric_collector' to the rsc list (19 active resources)
Jun 30 14:29:02 [17632] node-2.test.domain.local crmd: info: do_lrm_rsc_op: Performing key=45:714:7:1993769d-31f0-48f1-b7ca-4a9c69a466f2 op=metric_collector_monitor_0
Jun 30 14:29:02 [17629] node-2.test.domain.local lrmd: warning: services_os_action_execute: Cannot execute '/usr/lib/ocf/resource.d/fuel/ocf-metric_collector': No such file or directory (2)
Jun 30 14:29:02 [17632] node-2.test.domain.local crmd: warning: services_os_action_execute: Cannot execute '/usr/lib/ocf/resource.d/fuel/ocf-metric_collector': No such file or directory (2)
Jun 30 14:29:02 [17632] node-2.test.domain.local crmd: error: generic_get_metadata: Failed to retrieve meta-data for ocf:fuel:ocf-metric_collector
<28>Jun 30 14:29:02 node-2 lrmd[17629]: warning: Cannot execute '/usr/lib/ocf/resource.d/fuel/ocf-metric_collector': No such file or directory (2)
Jun 30 14:29:02 [17632] node-2.test.domain.local crmd: warning: get_rsc_metadata: No metadata found for ocf-metric_collector::ocf:fuel: Input/output error (-5)
Jun 30 14:29:02 [17632] node-2.test.domain.local crmd: error: build_operation_update: No metadata for fuel::ocf:ocf-metric_collector
<28>Jun 30 14:29:02 node-2 crmd[17632]: warning: Cannot execute '/usr/lib/ocf/resource.d/fuel/ocf-metric_collector': No such file or directory (2)
<27>Jun 30 14:29:02 node-2 crmd[17632]: erro...
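
The relevant signal in the log above is the repeated "Cannot execute ... No such file or directory (2)" warning. A minimal sketch for pulling the missing agent paths out of such a log follows; it assumes each log entry sits on a single line and that the message format matches the lines quoted here. The function name `missing_agents` is hypothetical.

```python
import errno
import re

# Matches: Cannot execute '<path>': <reason> (<errno>)
CANNOT_EXECUTE = re.compile(r"Cannot execute '([^']+)': (.+?) \((\d+)\)")


def missing_agents(log_text: str) -> list:
    """Return the sorted, de-duplicated paths that failed with ENOENT."""
    return sorted(
        {
            match.group(1)
            for match in CANNOT_EXECUTE.finditer(log_text)
            if int(match.group(3)) == errno.ENOENT
        }
    )


# Two of the warnings from this report, one per line.
SAMPLE_LOG = (
    "Jun 30 14:29:02 [17629] node-2.test.domain.local lrmd: warning: "
    "services_os_action_execute: Cannot execute "
    "'/usr/lib/ocf/resource.d/fuel/ocf-metric_collector': "
    "No such file or directory (2)\n"
    "<28>Jun 30 14:29:02 node-2 crmd[17632]: warning: Cannot execute "
    "'/usr/lib/ocf/resource.d/fuel/ocf-metric_collector': "
    "No such file or directory (2)\n"
)

if __name__ == "__main__":
    for path in missing_agents(SAMPLE_LOG):
        print("missing OCF agent:", path)
```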


OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-lma-collector (master)

Fix proposed to branch: master
Review: https://review.openstack.org/336579

Swann Croiset (swann-w)
Changed in lma-toolchain:
status: Fix Committed → Confirmed
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/339062

Changed in lma-toolchain:
assignee: Simon Pasquier (simon-pasquier) → Swann Croiset (swann-w)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/339131

Changed in lma-toolchain:
assignee: Swann Croiset (swann-w) → Simon Pasquier (simon-pasquier)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-lma-collector (master)

Reviewed: https://review.openstack.org/339062
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=ef2fd45a3931c20d271b5c60e61de488b90a0303
Submitter: Jenkins
Branch: master

commit ef2fd45a3931c20d271b5c60e61de488b90a0303
Author: Swann Croiset <email address hidden>
Date: Thu Jul 7 17:20:51 2016 +0200

    Cleanup collector resources at the very end

    Dirty operator action to (re)start all collectors on Controller nodes.

    Change-Id: Ibe6b270f45ccd9b3add2f06bef21c58707116a15
    Closes-bug: #1593137

Changed in lma-toolchain:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-plugin-lma-collector (master)

Reviewed: https://review.openstack.org/336579
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=9ae5de5e432e4b9ced0f18a001157162920d49f5
Submitter: Jenkins
Branch: master

commit 9ae5de5e432e4b9ced0f18a001157162920d49f5
Author: Swann Croiset <email address hidden>
Date: Fri Jul 1 15:47:51 2016 +0200

    Set migration-threshold to 3 for collectors

    Related-bug: #1593137

    Change-Id: I7b3808afdfb43d0dcc74debb0333ae1d1942029f

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-lma-collector (master)

Reviewed: https://review.openstack.org/339131
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=882584043f7cc5a9d66847da0204a1f8a5749b73
Submitter: Jenkins
Branch: master

commit 882584043f7cc5a9d66847da0204a1f8a5749b73
Author: Simon Pasquier <email address hidden>
Date: Thu Jul 7 18:39:04 2016 +0200

    Install the OCF script beforehand

    This change also removes the previous hack that cleans up the collector
    resources at the end of the deployment.

    Change-Id: I1ca237181d30802035bf6a0526cdd41f83e39acd
    Closes-Bug: #1593137
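
    The ordering constraint behind this change (the OCF script must exist on a node before Pacemaker probes the resource there) can be expressed as a simple precondition check. This is a sketch under assumptions, not the plugin's code, which is not shown in this report: the helper names are hypothetical, and the path layout is taken from the "Cannot execute" warnings in the Pacemaker log quoted earlier.

```python
import os

# Directory layout as seen in the Pacemaker log on this report.
OCF_ROOT = "/usr/lib/ocf/resource.d"


def agent_path(provider: str, agent: str, root: str = OCF_ROOT) -> str:
    """Build the on-disk path of an OCF resource agent script."""
    return os.path.join(root, provider, agent)


def agent_is_installed(provider: str, agent: str, root: str = OCF_ROOT) -> bool:
    """True if the agent script exists as a regular file and is executable."""
    path = agent_path(provider, agent, root)
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Check this before the resource is declared in Pacemaker.
    if agent_is_installed("fuel", "ocf-metric_collector"):
        print("agent present, safe to declare the resource")
    else:
        print("agent missing:", agent_path("fuel", "ocf-metric_collector"))
```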

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-lma-collector (stable/0.10)

Fix proposed to branch: stable/0.10
Review: https://review.openstack.org/341594

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-lma-collector (stable/0.10)

Reviewed: https://review.openstack.org/341594
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=658c40e10c31fb8a8c9fec86df48c68206549219
Submitter: Jenkins
Branch: stable/0.10

commit 658c40e10c31fb8a8c9fec86df48c68206549219
Author: Simon Pasquier <email address hidden>
Date: Thu Jul 7 18:39:04 2016 +0200

    Install the OCF script beforehand

    This change also removes the previous hack that cleans up the collector
    resources at the end of the deployment.

    Change-Id: I1ca237181d30802035bf6a0526cdd41f83e39acd
    Closes-Bug: #1593137
    (cherry picked from commit 882584043f7cc5a9d66847da0204a1f8a5749b73)

Changed in lma-toolchain:
status: Fix Committed → Fix Released