Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down/split brain

Bug #1472230 reported by Tatyanka
This bug affects 6 people
Affects              Status         Importance   Assigned to        Milestone
Fuel for OpenStack   Fix Released   High         Bogdan Dobrelya
  7.0.x              Won't Fix      High         Denis Puchkin
  8.0.x              Fix Released   High         Bogdan Dobrelya
  Mitaka             Fix Released   High         Bogdan Dobrelya

Bug Description

http://jenkins-product.srt.mirantis.net:8080/job/7.0.system_test.ubuntu.cic_maintenance_mode/18/testReport/junit/(root)/auto_cic_maintenance_mode/auto_cic_maintenance_mode/

๐’๐ญ๐ž๐ฉ๐ฌ:
1. Create a cluster
2. Add 3 nodes with controller and mongo roles
3. Add 2 nodes with compute and cinder roles
4. Deploy the cluster
5. Run OSTF
6. Check that the umm feature is enabled (umm status)
7. Trigger an unexpected reboot: reboot --force >/dev/null &
8. Check that the node is rebooted, comes back online and enters auto mode (umm status)
9. Disable umm mode: umm off and wait until umm stops
10. Wait until the node is back online in pcs
11. Run OSTF HA tests
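
For reference, steps 6-10 boil down to roughly the following commands on the target controller (a condensed sketch, not the exact test code):

umm status                    # step 6: check that the umm feature is enabled
reboot --force >/dev/null &   # step 7: simulate an unexpected reboot
# ... after the node is back online and in auto maintenance mode ...
umm off                       # step 9: disable umm mode
pcs status                    # step 10: wait until the node is back online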

๐€๐œ๐ญ๐ฎ๐š๐ฅ ๐ซ๐ž๐ฌ๐ฎ๐ฅ๐ญ:
OSTF rabbit tests failed:

Failed 2 OSTF tests; should fail 0 tests. Names of failed tests: [{u'RabbitMQ availability (failure)': u'Number of controllers is not equal to number of cluster nodes.'}, {u'RabbitMQ replication (failure)': u'Failed to connect to 5673 port on host 10.109.20.4 Please refer to OpenStack logs for more details.'}]

The failed rabbit node is node-2:

๐‚๐ฅ๐ฎ๐ฌ๐ญ๐ž๐ซ ๐ฌ๐ญ๐š๐ญ๐ฎ๐ฌ ๐จ๐Ÿ ๐ง๐จ๐๐ž '๐ซ๐š๐›๐›๐ข๐ญ@๐ง๐จ๐๐ž-๐Ÿ' ...
Error: unable to connect to node 'rabbit@node-2': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@node-2']

rabbit@node-2:
  * connected to epmd (port 4369) on node-2
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on node-2
  * suggestion: start the node

current node details:
- node name: 'rabbitmqctl11900@node-2'
- home dir: /var/lib/rabbitmq
- cookie hash: soeIWU2jk2YNseTyDSlsEA==

And it seems rabbit is not running on it:
root@node-2:~# ps uuax| grep erla
rabbitmq 8081 0.0 0.0 8132 1088 ? S 09:01 0:01 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon
root 14648 0.0 0.0 10460 936 pts/0 S+ 10:23 0:00 grep --color=auto erla
root@node-2:~# ps uuax| grep beam
root 14729 0.0 0.0 10460 936 pts/0 S+ 10:24 0:00 grep --color=auto beam
root@node-2:~# ps uuax| grep rabb
rabbitmq 3438 0.0 0.4 90432 11844 ? Ss 09:00 0:00 /usr/bin/python /usr/bin/rabbit-fence.py
rabbitmq 8081 0.0 0.0 8132 1088 ? S 09:01 0:01 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon

And the OCF status check agrees (exit code 7 is OCF_NOT_RUNNING):
root@node-2:~# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/fuel/rabbitmq-server status ; echo $?
7

๐๐ฎ๐ญ ๐ฉ๐š๐œ๐ž๐ฆ๐š๐ค๐ž๐ซ ๐ฌ๐ก๐จ๐ฐ ๐ข๐ญ ๐š๐ฌ ๐ก๐ž๐š๐ฅ๐ญ๐ก๐ฒ ๐š๐ง๐ ๐จ๐ง๐ฅ๐ข๐ง๐ž ๐š๐ง๐ ๐ž๐ฏ๐ž๐ง ๐๐จ ๐ง๐จ ๐ญ๐ซ๐ฒ ๐ญ๐จ ๐ซ๐ž-๐ฎ๐ฉ:

Online: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]

Full list of resources:

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public_vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__management_vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 p_ceilometer-agent-central (ocf::fuel:ceilometer-agent-central): Started node-1.test.domain.local
 p_ceilometer-alarm-evaluator (ocf::fuel:ceilometer-alarm-evaluator): Started node-2.test.domain.local

 ๐‘ด๐’‚๐’”๐’•๐’†๐’“/๐‘บ๐’๐’‚๐’—๐’† ๐‘บ๐’†๐’•: ๐’Ž๐’‚๐’”๐’•๐’†๐’“_๐’‘_๐’“๐’‚๐’ƒ๐’ƒ๐’Š๐’•๐’Ž๐’’-๐’”๐’†๐’“๐’—๐’†๐’“ [๐’‘_๐’“๐’‚๐’ƒ๐’ƒ๐’Š๐’•๐’Ž๐’’-๐’”๐’†๐’“๐’—๐’†๐’“]
    ๐‘ด๐’‚๐’”๐’•๐’†๐’“๐’”: [ ๐’๐’๐’…๐’†-1.๐’•๐’†๐’”๐’•.๐’…๐’๐’Ž๐’‚๐’Š๐’.๐’๐’๐’„๐’‚๐’ ]
   ๐‘บ๐’๐’‚๐’—๐’†๐’”: [ ๐’๐’๐’…๐’†-2.๐’•๐’†๐’”๐’•.๐’…๐’๐’Ž๐’‚๐’Š๐’.๐’๐’๐’„๐’‚๐’ ๐’๐’๐’…๐’†-5.๐’•๐’†๐’”๐’•.๐’…๐’๐’Ž๐’‚๐’Š๐’.๐’๐’๐’„๐’‚๐’ ]

Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]

PCSD Status:
  10.109.22.4: Offline
  10.109.22.5: Offline
  10.109.22.8: Offline

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2014.2.2-7.0"
  api: "1.0"
  build_number: "26"
  build_id: "2015-07-06_18-08-24"
  nailgun_sha: "d040c5cebc9cdd24ef20cb7ecf0a337039baddec"
  python-fuelclient_sha: "315d8bf991fbe7e2ab91abfc1f59b2f24fd92f45"
  astute_sha: "9cbb8ae5adbe6e758b24b3c1021aac1b662344e8"
  fuel-library_sha: "251c54e8de2f41aacd260751e7a891e9fbffc45d"
  fuel-ostf_sha: "a752c857deafd2629baf646b1b3188f02ff38084"
  fuelmain_sha: "4f2dff3bdc327858fa45bcc2853cfbceae68a40c"

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
summary: - Pecemaker shows healty status fro rabbitmq meanwhile onde node of
+ Pecemaker shows healthy status fro rabbitmq meanwhile onу node of
rabbitmq failed
description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Pecemaker shows healthy status fro rabbitmq meanwhile onу node of rabbitmq failed

According to the logs, the monitor kept returning "not running" and pacemaker did not trigger any stop/start events, because this situation is considered OK (the resource may legitimately be not running after a graceful stop, for example). The solution is to return a generic error instead of "not running" when the script logic expects the resource to be restarted by pacemaker.
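
A minimal sketch of that return-code change (illustrative only, not the actual fuel-library OCF code; the beam check stands in for the agent's real get_status() logic):

#!/bin/sh
# Why the monitor action must report OCF_ERR_GENERIC rather than
# OCF_NOT_RUNNING when a restart by pacemaker is expected.
: ${OCF_ROOT:=/usr/lib/ocf}
. "${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs"

monitor() {
    if ! pgrep -f beam >/dev/null 2>&1; then
        # OCF_NOT_RUNNING (7) means "cleanly stopped", so pacemaker schedules
        # no recovery; OCF_ERR_GENERIC (1) marks the op as failed, making
        # pacemaker run stop + start on the resource.
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}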

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
status: New → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

https://review.openstack.org/199059
The patch should also be backported to the supported releases.

description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The patch https://review.openstack.org/190137 will address this issue as well: rabbitmqctl exit code 2 should be reported to pacemaker as a generic error.

summary: - Pecemaker shows healthy status fro rabbitmq meanwhile onу node of
- rabbitmq failed
+ Pacemaker shows healthy status for rabbitmq node meanwhile the node is
+ actually down
tags: added: rabbitmq
Revision history for this message
Mike Scherbakov (mihgen) wrote : Re: Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down

Bogdan - is it a regression?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

No, this is not a regression

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The patch https://review.openstack.org/190137 has merged. This issue is expected to be resolved as well.

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/199059
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5097d94f5d56fd6126ca9b7c1227961536c94399
Submitter: Jenkins
Branch: master

commit 5097d94f5d56fd6126ca9b7c1227961536c94399
Author: Bogdan Dobrelya <email address hidden>
Date: Tue Jul 7 13:32:25 2015 +0200

    Fix error return codes for rabbit OCF

    W/o this fix the situation is possible when
    rabbit OCF returns OCF_NOT_RUNNING in the hope of
    future restart of the resource by pacemaker.

    But in fact, pacemaker will not trigger restart action
    if monitor returns "not running". This is an issue
    as we want resource restarted.

    The solution is to return OCF_ERR_GENERIC instead of
    OCF_NOT_RUNNING when we expect the resource to be restarted
    (which is action stop plus action start).

    Closes-bug: #1472230

    Change-Id: I10c6e43d92cb23596636d86932674b36864d1595
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
Tatyanka (tatyana-leontovich) wrote : Re: Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down

Verified on ISO 140.

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This bugfix introduced a regression, please look into bug #1484280.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The issue was reproduced again. Node-6's crm_mon -fotAW -1 output for the rabbit resource indicates that the last monitor check of the resource happened a day earlier:
   p_rabbitmq-server: migration-threshold=1000000
    + (164) probe: last-rc-change='Wed Aug 26 21:06:49 2015' last-run='Wed Aug 26 21:06:49 2015' exec-time=4229ms queue-time=0ms rc=0 (ok)
    + (169) monitor: interval=103000ms last-rc-change='Wed Aug 26 21:07:23 2015' exec-time=6375ms queue-time=0ms rc=0 (ok)
    + (170) monitor: interval=30000ms last-rc-change='Wed Aug 26 21:07:30 2015' exec-time=5573ms queue-time=6361ms rc=0 (ok)

Meanwhile, the current date is Thu Aug 27 12:35:43 UTC 2015, and /var/log/remote/node-6.domain.tld/lrmd.log is full of generic errors returned by the monitor action:
2015-08-27T12:39:29.046644+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): rabbit node is running out of the cluster
2015-08-27T12:39:29.051209+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): get_status() returns generic error 1

A manually issued monitor check returns a generic error as well.

But something is definitely wrong with pacemaker, as it reports the status as OK and doesn't update the monitor statistics...
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1.domain.tld ]
     Slaves: [ node-6.domain.tld node-7.domain.tld ]
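
When pacemaker is stuck with stale monitor results like this, the per-resource operation history can be inspected and cleared so that the resource is re-probed; these are standard pacemaker commands, used here purely as a diagnostic aid:

# operation history (rc codes, last-run times) for the rabbit resource
crm_mon -fotAW -1 | grep -A 3 p_rabbitmq-server
# drop the cached operation results and force a fresh probe
crm_resource --cleanup --resource p_rabbitmq-server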

Changed in fuel:
status: Fix Released → Confirmed
summary: Pacemaker shows healthy status for rabbitmq node meanwhile the node is
- actually down
+ actually down/split brain
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The last records in the pacemaker log where the monitor action returned OK are:
 Aug 26 21:07:30 [14594] node-6.domain.tld crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: ok (node=node-6.domain.tld, call=169, rc=0, cib-update=136, confirmed=false)
 Aug 26 21:07:35 [14594] node-6.domain.tld crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: ok (node=node-6.domain.tld, call=170, rc=0, cib-update=137, confirmed=false)

After that, there were no more attempts to monitor the resource, and pacemaker thinks it is running OK. Looks like a pacemaker bug.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note, these two operations were not confirmed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This snippet shows all the logged rabbit monitor events and the pacemaker failure list reported by pcs status: http://pastebin.com/GSD3RamW

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the logs and the snippet above, something went really wrong after 2015-08-27T11:16:00. Here is a snippet of the suspicious Stonith/Shutdown events, along with the surrounding events: http://pastebin.com/GkMHkbeG

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I believe pacemaker reached the "broken" state because there was no STONITH configured to fence the bad node-6. Hence, I am returning the status of this bug back to Fix Released. The reproduced case seems unrelated to the original bug.

Changed in fuel:
status: Confirmed → Fix Released
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Yes, I can now see there were a few events after which something became very broken in the pacemaker cluster, although it looks, and is reported as, healthy:

/var/log/remote/node-6.domain.tld/crmd.log:2015-08-27T11:15:52.760185+00:00 notice: notice: peer_update_callback: Our peer on the DC (node-1.domain.tld) is dead
/var/log/remote/node-6.domain.tld/crmd.log:2015-08-27T11:16:22.346080+00:00 warning: warning: reap_dead_nodes: Our DC node (node-7.domain.tld) left the cluster

Without STONITH enabled, this situation can probably lead to this type of bug. We should probably address this in the ops guide.
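
To check whether fencing is enabled on such an environment at all, the cluster property and the list of fencing resources can be queried (generic pacemaker/pcs commands; the exact pcs syntax may differ between versions):

# cluster-wide fencing switch; "false" means a misbehaving node is never fenced
crm_attribute --type crm_config --query --name stonith-enabled
# any configured fencing resources (none are expected on this setup)
pcs stonith show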

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Related ops guide update on this topic https://review.openstack.org/218150

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-docs (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/218150
Reason: this is wrong info

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

With the patch https://review.openstack.org/#/c/223548, this bug may be valid again. Returning to Fix Committed; additional verification is required.

Changed in fuel:
status: Fix Released → Fix Committed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

As discussed with @Vladimir Kuklin, the test case itself may not be adequate in terms of the nodes' RAM configuration.

tags: added: on-verification
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

Verified on ISO #297 by the system test /fuelweb_test/tests/tests_strength/test_cic_maintenance_mode, test auto_cic_maintenance_mode

2015-09-17 19:17:25,132 - INFO decorators.py:46 -- Saving logs to "/home/agrechanichenko/fuel-qa/logs/pass_auto_cic_maintenance_mode-fuel-snapshot-2015-09-17_19-07-06.tar.xz" file
ok

----------------------------------------------------------------------
Ran 5 tests in 16307.860s

OK

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

@Artem,

this issue is floating and I've just hit it on a bare-metal lab after a primary controller shutdown:

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-29.mirantis.com ]
     Slaves: [ node-28.mirantis.com node-35.mirantis.com ]
     Stopped: [ node-30.mirantis.com ]

Pacemaker says that RabbitMQ is running on node-35, but it's actually down:

root@node-35:~# ps auxfw | grep [r]abbit
rabbitmq 7332 0.0 0.0 90832 12956 ? Ss 08:58 0:03 /usr/bin/python /usr/bin/rabbit-fence.py
root@node-35:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-35' ...
Error: unable to connect to node 'rabbit@node-35': nodedown

rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-29' ...
[{nodes,[{disc,['rabbit@node-28','rabbit@node-29','rabbit@node-35']}]},
 {running_nodes,['rabbit@node-29']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]

There are no issues with server resources (most of the controllers have 16+ GB RAM, 8 CPUs and SSD drives): http://paste.openstack.org/show/472715/

Also, the fix https://review.openstack.org/#/c/223548 was merged to master (8.0) only; the patch for 7.0, https://review.openstack.org/#/c/223552/, is still in review.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Increased to Critical: we have a patch for backport, and the issue has reproduced twice, on Tema's and Matt's environments.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

The user impact is as follows: after a failover of some controller node, the RabbitMQ cluster can be rebuilt and run without some of the live controllers, which means the high availability of AMQP can be broken. For example:

1) The cloud has 5 controller nodes
2) One controller node goes down
3) The RabbitMQ cluster re-assembles, but the service is running on only one controller
4) The controller node with the live RabbitMQ goes down

Result: AMQP messages are lost and some cloud operations fail

Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkanF2Z3cxYVljVVU/view?usp=sharing

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Nastya,

"we have patch for backport" << Which patch are you referring to?

Thanks,
Dims

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Nastya,

This one was already merged - https://review.openstack.org/#/c/223552/ - hence the question.

-- Dims

tags: removed: on-verification
tags: added: on-verification
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Fix released; the issue is not reproduced.

Checked on an environment with 3 controllers, 2 computes.

For each controller, in the order controller №3, controller №2, controller №1, and then controller №3 again:

1. Enable umm mode: `umm on` (the node will be rebooted automatically)
2. Wait until the node is rebooted
3. Disable umm mode: `umm off`
4. Wait until all required resources are started by pacemaker on the node: `pcs status`
5. Run OSTF HA tests
6. Repeat from step №1 for the next controller.

Result: OSTF HA tests pass successfully for each controller.
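
A condensed sketch of that verification loop (node names and the ssh wrapper are placeholders; OSTF itself is launched from the Fuel master):

# controllers are cycled in the order 3, 2, 1, and then 3 again
for node in node-3 node-2 node-1 node-3; do
    ssh "$node" umm on        # the node reboots into maintenance mode
    # ... wait for the node to come back ...
    ssh "$node" umm off
    ssh "$node" pcs status    # wait until all resources are Started
    # run the OSTF HA suite from the Fuel master before the next iteration
done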

[root@nailgun ~]# fuel --fuel-version
DEPRECATION WARNING: /etc/fuel/client/config.yaml exists and will be used as the source for settings. This behavior is deprecated. Please specify the path to your custom settings file in the FUELCLIENT_CUSTOM_SETTINGS environment variable.
api: '1.0'
astute_sha: 6c5b73f93e24cc781c809db9159927655ced5012
auth_required: true
build_id: '301'
build_number: '301'
feature_groups:
- mirantis
fuel-agent_sha: 50e90af6e3d560e9085ff71d2950cfbcca91af67
fuel-library_sha: 5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c
fuelmain_sha: a65d453215edb0284a2e4761be7a156bb5627677
nailgun_sha: 4162b0c15adb425b37608c787944d1983f543aa8
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 486bde57cda1badb68f915f66c61b544108606f3
release: '7.0'
release_versions:
  2015.1.0-7.0:
    VERSION:
      api: '1.0'
      astute_sha: 6c5b73f93e24cc781c809db9159927655ced5012
      build_id: '301'
      build_number: '301'
      feature_groups:
      - mirantis
      fuel-agent_sha: 50e90af6e3d560e9085ff71d2950cfbcca91af67
      fuel-library_sha: 5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
      fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
      fuel-ostf_sha: 2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c
      fuelmain_sha: a65d453215edb0284a2e4761be7a156bb5627677
      nailgun_sha: 4162b0c15adb425b37608c787944d1983f543aa8
      openstack_version: 2015.1.0-7.0
      production: docker
      python-fuelclient_sha: 486bde57cda1badb68f915f66c61b544108606f3
      release: '7.0'

tags: removed: on-verification
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
tags: added: rca-done
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Bad news, folks. As we know, the fix for this issue, https://review.openstack.org/#/c/199059, was undone by https://review.openstack.org/#/c/223548 (see comment https://bugs.launchpad.net/fuel/+bug/1472230/comments/21).

And now we have 2 or 3 bugs with the same issue being reproduced again. Raising to Critical and attaching them here as duplicates.

Changed in fuel:
status: Fix Released → Confirmed
importance: High → Critical
no longer affects: fuel/8.0.x
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The bug is tricky and I have to analyze all of the new duplicates carefully to find all of the root causes, as there are likely many of them.

Changed in fuel:
status: Confirmed → In Progress
tags: added: ha tricky
removed: rca-done
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Two new duplicates point to the same flow of events, e.g. https://bugs.launchpad.net/fuel/+bug/1529875/comments/6 and https://bugs.launchpad.net/fuel/+bug/1530228/comments/5: at some point, after the rabbit OCF monitor reported an error followed by several "not running" reports, pacemaker starts thinking everything is fine with the resource and shows it as running in the status. That is very strange; I have no idea why it happens.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This looks like a major bug either in pacemaker or in the OCF script. I'm still investigating, now using the dummy OCF; see details here: http://clusterlabs.org/pipermail/users/2016-January/002045.html

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The bug is likely in ocf-shellfuncs; see https://github.com/ClusterLabs/resource-agents/issues/734 for details and a workaround.

Changed in fuel:
status: In Progress → Triaged
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can confirm that the atop logs from both https://bugs.launchpad.net/fuel/+bug/1530228 and https://bugs.launchpad.net/fuel/+bug/1529875 contain the same pattern of misbehaving ocf-shellfuncs spawning 4-5 nested monitors. This should be a root cause of the issue, though I'm not sure it is the only one.
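
A quick way to spot that pattern on a live node is to look for monitor invocations whose parent is another monitor invocation rather than lrmd (an illustrative check matching the atop records above):

# normally there is a single monitor process parented by lrmd, not a chain
# of monitors parenting each other
ps -eo pid,ppid,args | grep '[r]abbitmq-server monitor'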

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Added the MOS packaging team to rebuild resource-agents for the MOS mirrors with the patch suggested by kskmori in https://github.com/ClusterLabs/resource-agents/issues/734

Changed in mos:
assignee: nobody → MOS Packaging Team (mos-packaging)
status: New → Triaged
Changed in fuel:
status: Triaged → In Progress
Changed in mos:
milestone: none → 8.0
importance: Undecided → Critical
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to fuel-infra/jeepyb-config (master)

Related fix proposed to branch: master
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/15971

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to fuel-infra/jeepyb-config (master)

Reviewed: https://review.fuel-infra.org/15971
Submitter: Mateusz Matuszkowiak <email address hidden>
Branch: master

Commit: 96b80af93620e7dc6672db594d4930cd813e5639
Author: Ivan Udovichenko <email address hidden>
Date: Tue Jan 5 15:32:47 2016

Add resource-agents project [MOS 8.0]

- resource-agents

Current version in Ubuntu Trusty repository
doesn't satisfy required needs:
http://packages.ubuntu.com/trusty/resource-agents
1:3.9.3+git20121009-3ubuntu2

We need version 3.9.5 with applied patch on-top of it.

Change-Id: I4feccdc6d5bbd44e1b66b7e73c4e371338416efb
Related-Bug: #1472230

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Cannot reproduce this issue with the resource-agents fix for the shell fork bomb

Changed in fuel:
status: In Progress → Invalid
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Moving to Invalid: there is nothing to fix on the Fuel side; the package will be patched instead.

Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

Link to a resource-agents source package with patch: https://review.fuel-infra.org/#/c/15974/
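
Once the rebuilt package lands on the mirrors, a node can be checked roughly like this (the ocf-shellfuncs path is the standard resource-agents location; compare the dpkg version with the one from the review above):

# installed resource-agents version
dpkg -l resource-agents
# the helper library that the fix patches
ls -l /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs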

Changed in mos:
status: Triaged → In Progress
Changed in mos:
assignee: MOS Packaging Team (mos-packaging) → Ivan Udovichenko (iudovichenko)
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/trusty/resource-agents (8.0)

Reviewed: https://review.fuel-infra.org/15974
Submitter: Pkgs Jenkins <email address hidden>
Branch: 8.0

Commit: 35a347684dfde00ef5746aaa3291d1a76dae7c7d
Author: Ivan Udovichenko <email address hidden>
Date: Wed Jan 6 15:40:40 2016

Update resource-agents package [MOS 8.0]

Version: 1:3.9.5+git+a626847-1 experimental (rc-buggy) [1]
Add MIRA0001-Check-Bash-shell-presence.patch patch [2]

[1] https://packages.debian.org/experimental/resource-agents
[2] https://github.com/ClusterLabs/resource-agents/issues/734

Related-Bug: #1472230

Change-Id: I6c1d547d4341a6f22491d94f24811fb48a9f204c

Changed in mos:
status: In Progress → Fix Committed
Changed in mos:
status: Fix Committed → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A bunch of fixes for related bugs were accepted, so I'd better set the Fuel status to Fix Committed. Invalid doesn't really look right :/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/resource-agents (9.0)

Related fix proposed to branch: 9.0
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/16099

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/resource-agents (7.0)

Related fix proposed to branch: 7.0
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/16123

Changed in mos:
status: Confirmed → Fix Committed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Reproduced on ISO 429 (the fix was not included in it), but the scenario differs, so adding the case here to be verified on an ISO with the fix:
1. Deploy 1 controller
2. Add 2 controllers - redeploy - run OSTF
3. Add 2 controllers + 1 compute + cinder - redeploy - run OSTF, leave the env for 24 h - run OSTF
The OSTF test failed; crm status shows that all resources look like:
Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-7.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-7.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-7.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-7.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-7.test.domain.local ]
     Slaves: [ node-10.test.domain.local node-11.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-7.test.domain.local ]
     Slaves: [ node-11.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
     Stopped: [ node-10.test.domain.local ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domai...


tags: added: on-verification
Revision history for this message
Alexander Zatserklyany (zatserklyany) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "464"

./utils/jenkins/system_tests.sh -t test -w $(pwd) -j fuelweb_test -i $ISO_PATH -o --group=auto_cic_maintenance_mode -V ${VENV_PATH} -K
...
----------------------------------------------------------------------
Ran 5 tests in 19547.357s

OK

crm status
Warning: Permanently added 'node-1' (ECDSA) to the list of known hosts.
Last updated: Thu Jan 21 12:56:23 2016
Last change: Thu Jan 21 10:35:05 2016
Stack: corosync
Current DC: node-4.test.domain.local (4) - partition with quorum
Version: 1.1.12-561c4cf
3 Nodes configured
48 Resources configured

Online: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]

 sysinfo_node-1.test.domain.local (ocf::pacemaker:SysInfo): Started node-1.test.domain.local
 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-4.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-4.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-4.test.domain.local ]
     Slaves: [ node-1.test.domain.local node-5.test.domain.local ]
 p_ceilometer-agent-central (ocf::fuel:ceilometer-agent-central): Started node-4.test.domain.local
 p_ceilometer-alarm-evaluator (ocf::fuel:ceilometer-alarm-evaluator): Started node-4.test.domain.local
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 sysinfo_node-5.test.domain.local (ocf::pacemaker:SysInfo): Started node-5.test.domain.local
 sysinfo_node-4.test.domain.local (ocf::pacemaker...


no longer affects: mos
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

One more case to verify - controller deletion.
Reproduced on ISO 429; scenario:
https://mirantis.testrail.com/index.php?/tests/view/2465653&group_by=tests:status_id&group_order=asc&group_id=8

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/resource-agents (master)

Related fix proposed to branch: master
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/16570

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/trusty/resource-agents (master)

Reviewed: https://review.fuel-infra.org/16570
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: c699ba7e0c4b413fb7bbfa87382081d9a5422c5d
Author: Ivan Udovichenko <email address hidden>
Date: Fri Jan 29 16:22:02 2016

Update resource-agents package [MOS 8.0]

Version: 1:3.9.5+git+a626847-1 experimental (rc-buggy) [1]
Add MIRA0001-Check-Bash-shell-presence.patch patch [2]

[1] https://packages.debian.org/experimental/resource-agents
[2] https://github.com/ClusterLabs/resource-agents/issues/734

Related-Bug: #1472230

Change-Id: I6c1d547d4341a6f22491d94f24811fb48a9f204c
(cherry picked from commit 35a347684dfde00ef5746aaa3291d1a76dae7c7d)

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

QA team, please note that the patch https://review.fuel-infra.org/#/c/16099/ for master (9.0) is only going to be merged today, so you can check the fix after that.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I'm not sure about the full path for the patch to get onto the 9.0 ISO, but you can check whether it is there via this link:
https://product-ci.infra.mirantis.net/view/9.0-liberty/job/9.0-liberty.all/lastSuccessfulBuild/artifact/listing.txt

For the 8.0 branch, for example, it is already there: https://product-ci.infra.mirantis.net/job/8.0.all/lastSuccessfulBuild/artifact/listing.txt

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/trusty/resource-agents (9.0)

Reviewed: https://review.fuel-infra.org/16099
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0

Commit: 2ffa57f7362a5b03153c8aa2c3bc2798532560e8
Author: Ivan Udovichenko <email address hidden>
Date: Thu Jan 14 10:08:48 2016

Update resource-agents package [MOS 8.0]

Version: 1:3.9.5+git+a626847-1 experimental (rc-buggy) [1]
Add MIRA0001-Check-Bash-shell-presence.patch patch [2]

[1] https://packages.debian.org/experimental/resource-agents
[2] https://github.com/ClusterLabs/resource-agents/issues/734

Related-Bug: #1472230

Change-Id: I6c1d547d4341a6f22491d94f24811fb48a9f204c
(cherry picked from commit 35a347684dfde00ef5746aaa3291d1a76dae7c7d)

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The nested monitors can still be reproduced; see the bug https://bugs.launchpad.net/fuel/+bug/1541029.

The node-2 atop.log contains up to 6 nested monitors, which might be a root cause of the segfaults and failures of the node:
PRG node-2 1454369647 2016/02/02 00:34:07 20 21797 (rabbitmq-server) S 0 0 21797 1 -2147483648 1454369645 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 12140 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369647 2016/02/02 00:34:07 20 22765 (rabbitmq-server) S 0 0 22765 1 -2147483648 1454369647 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 21797 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369647 2016/02/02 00:34:07 20 22768 (rabbitmq-server) S 0 0 22768 1 -2147483648 1454369647 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 22765 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369647 2016/02/02 00:34:07 20 22770 (rabbitmq-server) S 0 0 22770 1 -2147483648 1454369647 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 22768 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369727 2016/02/02 00:35:27 20 27082 (rabbitmq-server) S 0 0 27082 1 -2147483648 1454369727 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 12140 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369727 2016/02/02 00:35:27 20 27179 (rabbitmq-server) S 0 0 27179 1 -2147483648 1454369727 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 27082 0 1 0 0 0 0 0 0 0 0

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This bug seems indestructible

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

This bug doesn't meet 'critical' status. Moving it to high.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The nested monitors do not seem to be the root cause; we can consider this bug closed.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Bogdan, could the issue be a duplicate of #1559949?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Mikhail Samoylov (msamoylov) wrote :
Alexey Galkin (agalkin)
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Denis Puchkin (dpuchkin) wrote :

Won't Fix for 7.0-updates because this is too large a change to be accepted into the stable branch.
