Pacemaker service is unable to be started in time when zabbix plugins enabled

Bug #1493483 reported by Ksenia Svechnikova
This bug affects 1 person
Affects            | Status       | Importance | Assigned to        | Milestone
Fuel Plugins       | Invalid      | Undecided  | Unassigned         |
Fuel for OpenStack | Fix Released | Critical   | Ksenia Svechnikova |

Bug Description

build_number: "286"

Scenario:

TUN HW environment 3x(controller, mongo)+2x(cinder,compute) with EMC

1. Create cluster with neutron + TUN
2. Configure the EMC and Zabbix plugins (emc_vnx, zabbix_monitoring, zabbix_monitoring_emc)
3. Configure networks with bonds and create VLAN tagged interfaces
4. Start deploy

Deployment fails; the controllers are in the error state:
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|-------------|------------------|---------|---------------|-------------------|-------------------|---------------|--------|---------
4 | provisioned | Untitled (45:54) | 2 | 192.168.5.113 | ec:f4:bb:cd:45:54 | cinder, compute | | True | 2
1 | ready | Untitled (42:94) | 2 | 192.168.5.110 | ec:f4:bb:cd:42:94 | controller, mongo | | True | 2
5 | provisioned | Untitled (45:4c) | 2 | 192.168.5.114 | ec:f4:bb:cd:45:4c | cinder, compute | | True | 2
3 | error | Untitled (43:00) | 2 | 192.168.5.111 | ec:f4:bb:cd:43:00 | controller, mongo | | True | 2
2 | error | Untitled (41:20) | 2 | 192.168.5.112 | ec:f4:bb:cd:41:20 | controller, mongo | | True | 2

Deployment on the controllers fails with Puppet errors related to the pacemaker service:

pacemaker.log:
Sep 08 15:53:46 [104918] node-3.domain.tld pacemakerd: error: pcmk_child_exit: Child process cib (104920) exited: Network is down (100)
Sep 08 15:53:46 [104918] node-3.domain.tld pacemakerd: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
Sep 08 15:53:46 [104918] node-3.domain.tld pacemakerd: error: pcmk_child_exit: Child process attrd (104923) exited: Network is down (100)
Sep 08 15:53:46 [104918] node-3.domain.tld pacemakerd: warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.

...

2015-09-08 15:53:51 +0000 /Stage[main]/Corosync/Service[pacemaker]/ensure (notice): ensure changed 'stopped' to 'running'
2015-09-08 15:53:51 +0000 /Stage[main]/Corosync/Service[pacemaker] (debug): The container Class[Corosync] will propagate my refresh event
2015-09-08 15:53:51 +0000 /Stage[main]/Corosync/Service[pacemaker] (info): Unscheduling refresh on Service[pacemaker]
2015-09-08 15:53:51 +0000 /Stage[main]/Corosync/Service[pacemaker] (info): Evaluated in 5.63 seconds
2015-09-08 15:53:51 +0000 /Stage[main]/Main/Pcmk_nodes[pacemaker] (info): Starting to evaluate the resource
2015-09-08 15:53:51 +0000 Pcmk_nodes[pacemaker](provider=ruby) (debug): Call: corosync_nodes
2015-09-08 15:53:51 +0000 Puppet (debug): Waiting 600 seconds for Pacemaker to become online

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :
Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Regarding the "Network is down" message: manual checks show that ping on the mgmt network between the nodes actually works.

summary: - Pacemaker service is unable to started in time
+ Pacemaker service is unable to be started in time
Revision history for this message
Ksenia Svechnikova (kdemina) wrote : Re: Pacemaker service is unable to be started in time

The same env, but without mongo, cinder, and the plugins, was deployed successfully.

Revision history for this message
Andrey Maximov (maximov) wrote :

Well, moving to the plugins project if this is Zabbix-specific.

summary: - Pacemaker service is unable to be started in time
+ Pacemaker service is unable to be started in time when zabbix plugins
+ enabled
Revision history for this message
Dmitry Klenov (dklenov) wrote :

Andrey, it might not be only plugin-specific. Three components were disabled, and each of them could be the cause.

Revision history for this message
Andrey Maximov (maximov) wrote :

Dmitry, according to Nastya, mongo and cinder deployments work; they are part of the system tests. We will analyze the snapshot just in case.

Revision history for this message
Stanislav Makar (smakar) wrote :

It has just been reproduced again.

The problem is very interesting:
 - the root cause is that pacemaker could not start on node-13 and node-14;
   in the logs:
   <27>Sep 9 13:01:12 node-13 pacemakerd[132731]: error: mcp_read_config: Couldn't create logfile: /var/log/pacemaker.log
<27>Sep 9 13:01:12 node-13 pacemakerd[132731]: error: pcmk_child_exit: Child process cib (132733) exited: Network is down (100)
<27>Sep 9 13:01:12 node-13 pacemakerd[132731]: error: pcmk_child_exit: Child process attrd (132736) exited: Network is down (100)

 - but:
    node-12 is deployed without problems, pacemaker started
    connectivity between the nodes is present

Solution: go to the env and start it manually, and it works.

The bond is not the problem here, since br-mgmt is on an eth5 VLAN interface:

root@node-14:~# ip a s br-mgmt
12: br-mgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ec:f4:bb:cd:45:4d brd ff:ff:ff:ff:ff:ff
    inet 10.168.0.13/24 brd 10.168.0.255 scope global br-mgmt
       valid_lft forever preferred_lft forever
    inet6 fe80::7c0b:2fff:fe12:7aef/64 scope link
       valid_lft forever preferred_lft forever
root@node-14:~# brctl show br-mgmt
bridge name bridge id STP enabled interfaces
br-mgmt 8000.ecf4bbcd454d no eth5.3500

Trying to catch it again and debug.
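For reference, a manual recovery of that kind can be sketched as follows (the exact commands are an assumption and depend on the distro/init system; this is not a transcript from this env):

    # start pacemaker by hand on the affected controller
    service pacemaker start
    # check corosync ring status
    corosync-cfgtool -s
    # one-shot view of cluster membership and resources
    crm_mon -1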

Revision history for this message
Stanislav Makar (smakar) wrote :

Found the following in the corosync logs:

2015-09-09T12:28:02.559580+00:00 err: [MAIN ] cs_ipcs_connection_accept Denied connection attempt from 109:116
2015-09-09T12:28:02.559580+00:00 err: [QB ] handle_new_connection Invalid IPC credentials (106837-106967-2).
2015-09-09T12:28:02.560794+00:00 err: [MAIN ] cs_ipcs_connection_accept Denied connection attempt from 109:116
2015-09-09T12:28:02.560794+00:00 err: [QB ] handle_new_connection Invalid IPC credentials (106837-106964-2).
2015-09-09T12:28:02.657110+00:00 notice: [TOTEM ] memb_state_operational_enter A new membership (10.168.0.9:12) was formed. Members joined: 14
2015-09-09T12:28:02.657454+00:00 warning: [CPG ] message_handler_req_exec_cpg_downlist downlist left_list: 0 received in state 0
2015-09-09T12:28:02.660066+00:00 notice: [QUORUM] log_view_list Members[3]: 13 12 14
2015-09-09T12:28:02.660066+00:00 notice: [MAIN ] corosync_sync_completed Completed service synchronization, ready to provide service.
2015-09-09T12:44:38.500115+00:00 err: [MAIN ] cs_ipcs_connection_accept Denied connection attempt from 109:116
2015-09-09T12:44:38.500115+00:00 err: [QB ] handle_new_connection Invalid IPC credentials (106837-119699-2).
2015-09-09T12:44:38.500227+00:00 err: [MAIN ] cs_ipcs_connection_accept Denied connection attempt from 109:116
2015-09-09T12:44:38.500227+00:00 err: [QB ] handle_new_connection Invalid IPC credentials (106837-119702-2).
2015-09-09T13:01:12.245988+00:00 err: [MAIN ] cs_ipcs_connection_accept Denied connection attempt from 109:116
2015-09-09T13:01:12.247179+00:00 err: [QB ] handle_new_connection Invalid IPC credentials (106837-132733-2).
2015-09-09T13:01:12.247179+00:00 err: [MAIN ] cs_ipcs_connection_accept Denied connection attempt from 109:116

Sometimes this problem happens during the first pacemaker start because corosync and pacemaker are started under different users.

109:116 is the pacemaker uid/gid pair.
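
For context, corosync accepts IPC connections only from root and from uid/gid pairs listed in its configuration, while the pacemaker daemons (cib, attrd, etc.) run as the hacluster user / haclient group, so that pair has to be whitelisted. A minimal sketch of such a stanza (the actual fuel-library change is in the review below; whether it lands in corosync.conf or in a file under /etc/corosync/uidgid.d/ may differ):

    uidgid {
        uid: hacluster
        gid: haclient
    }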

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/221950

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Stanislav Makar (smakar) wrote :

The patch fixes the problem; deployment goes further but then fails on a connectivity problem
on the storage network between node-13 and node-12/node-14. node-12 and node-14 have
connectivity to each other, hence they are deployed successfully.

The env consists of 3 controllers with different bond2 configurations:

--------- node-12
root@node-12:~# brctl show br-storage
bridge name bridge id STP enabled interfaces
br-storage 8000.a0369f4e5ce6 no bond2.104
root@node-12:~# cat /proc/net/bonding/bond2 | grep "Mode\|Int"
Bonding Mode: fault-tolerance (active-backup)
MII Polling Interval (ms): 100
Slave Interface: eth5
Slave Interface: eth7

--------- node-13
root@node-13:~# brctl show br-storage
bridge name bridge id STP enabled interfaces
br-storage 8000.a0369f4e641e no bond2.104
root@node-13:~# cat /proc/net/bonding/bond2 | grep "Mode\|Int"
Bonding Mode: fault-tolerance (active-backup)
MII Polling Interval (ms): 100
Slave Interface: eth3
root@node-13:~#

-------- node-14:
root@node-14:~# brctl show br-storage
bridge name bridge id STP enabled interfaces
br-storage 8000.a0369f4e5e4a no bond2.104
root@node-14:~# cat /proc/net/bonding/bond2 | grep "Mode\|Int"
Bonding Mode: fault-tolerance (active-backup)
MII Polling Interval (ms): 100
Slave Interface: eth1
root@node-14:~#

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

The issue with the bond was caused by the previous reset of the env while debugging the current issue. The issue corresponding to the bond misconfiguration is https://bugs.launchpad.net/fuel/+bug/1493412.

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

Ksenia, please also take ISO #287, as 286 has a critical bug. I am marking this case as incomplete until it's reproduced on a more recent ISO.

no longer affects: fuel/7.0.x
Changed in fuel:
status: In Progress → Invalid
status: Invalid → Incomplete
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

It would be better to leave the issue In Progress, as the fix is already on review. This issue was reproduced before the bond issue occurred. Now we will apply the workaround for the PXE interface flapping and verify the fix.

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

Yep, we faced the issue from the https://bugs.launchpad.net/fuel/+bug/1493412 bug. But in our current case this bug leads to misconfiguration of ALL bond interfaces, not only the admin interface.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Folks

The proposed fix is not related to the issue at all, as I mentioned in my comment on the review. Ksenia and I agreed that she will try to reproduce the issue with the Zabbix plugins disabled, to figure out whether Zabbix somehow affects the deployment.

Changed in fuel:
assignee: Stanislav Makar (smakar) → Ksenia Demina (kdemina)
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

This issue is confirmed in Pacemaker: https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1439649

> What has been observed is that *some* of the nodes in the cluster have the pacemaker process successfully communicate with the corosync process, while others get this invalid credentials error that is seen.

Could you please re-review the fix in light of their workaround?

Reproducing the issue on a deployment without Zabbix is in progress.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/222356

Changed in fuel:
status: Incomplete → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/221950
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c46031c63853e837c24c81946ab4020bd4f4cd1b
Submitter: Jenkins
Branch: master

commit c46031c63853e837c24c81946ab4020bd4f4cd1b
Author: Stanislav Makar <email address hidden>
Date: Wed Sep 9 20:21:18 2015 +0000

    Fix the problem with first pacemaker start

    Sometimes during first start pacemaker can not connect to corosync via IPC.
    The root cause of this is that corosync is run under root user and pacemaker
    processes are run under hacluster/haclient user/group. Adding uidgid
    with names of pacemaker user/group helps to fix it.

    Change-Id: I1c103ec8e67f7af170dc2d548c3b49b5f1234b23
    Closes-bug: #1493483

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

The same error occurs with ISO #288 (which doesn't contain the fix) and without the Zabbix plugin.

I want to underline that this is HW with EMC VNX arrays in Cinder, and the EMC plugin was enabled in the last env settings.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/7.0)

Reviewed: https://review.openstack.org/222356
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ccfd57ddefe2d33b6b8fc9dbad7f8a070dbabcf3
Submitter: Jenkins
Branch: stable/7.0

commit ccfd57ddefe2d33b6b8fc9dbad7f8a070dbabcf3
Author: Stanislav Makar <email address hidden>
Date: Wed Sep 9 20:21:18 2015 +0000

    Fix the problem with first pacemaker start

    Sometimes during first start pacemaker can not connect to corosync via IPC.
    The root cause of this is that corosync is run under root user and pacemaker
    processes are run under hacluster/haclient user/group. Adding uidgid
    with names of pacemaker user/group helps to fix it.

    Change-Id: I1c103ec8e67f7af170dc2d548c3b49b5f1234b23
    Closes-bug: #1493483

tags: added: on-verification
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

ISO RC3
Verified on the same env with the same configuration. No errors during deployment.

Changed in fuel:
status: Fix Committed → Fix Released
Changed in fuel-plugins:
status: New → Invalid