Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down/split brain
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Released | High | Bogdan Dobrelya |
7.0.x | Won't Fix | High | Denis Puchkin |
8.0.x | Fix Released | High | Bogdan Dobrelya |
Mitaka | Fix Released | High | Bogdan Dobrelya |
Bug Description
Steps:
1. Create a cluster
2. Add 3 nodes with controller and mongo roles
3. Add 2 nodes with compute and cinder roles
4. Deploy the cluster
5. Run OSTF
6. Check that the umm feature is enabled (umm status)
7. Trigger an unexpected reboot: reboot --force >/dev/null &
8. Check that the node has rebooted, is back online, and has entered auto mode (umm status)
9. Disable umm mode: umm off, and wait until umm stops
10. Wait until the node is back online in pcs status
11. Run OSTF HA
Actual result:
The OSTF RabbitMQ tests failed:
Failed 2 OSTF tests; should fail 0 tests. Names of failed tests: [{u'RabbitMQ availability (failure)': u'Number of controllers is not equal to number of cluster nodes.'}, {u'RabbitMQ replication (failure)': u'Failed to connect to 5673 port on host 10.109.20.4 Please refer to OpenStack logs for more details.'}]
The node where RabbitMQ failed is node-2:
Cluster status of node 'rabbit@node-2' ...
Error: unable to connect to node 'rabbit@node-2': nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@node-2']
rabbit@node-2:
* connected to epmd (port 4369) on node-2
* epmd reports: node 'rabbit' not running at all
* suggestion: start the node
current node details:
- node name: 'rabbitmqctl119
- home dir: /var/lib/rabbitmq
- cookie hash: soeIWU2jk2YNseT
And it seems RabbitMQ is not running on it:
root@node-2:~# ps uuax| grep erla
rabbitmq 8081 0.0 0.0 8132 1088 ? S 09:01 0:01 /usr/lib/
root 14648 0.0 0.0 10460 936 pts/0 S+ 10:23 0:00 grep --color=auto erla
root@node-2:~# ps uuax| grep beam
root 14729 0.0 0.0 10460 936 pts/0 S+ 10:24 0:00 grep --color=auto beam
root@node-2:~# ps uuax| grep rabb
rabbitmq 3438 0.0 0.4 90432 11844 ? Ss 09:00 0:00 /usr/bin/python /usr/bin/
rabbitmq 8081 0.0 0.0 8132 1088 ? S 09:01 0:01 /usr/lib/
And OCF too:
root@node-2:~# OCF_ROOT=
7
But Pacemaker shows it as healthy and online, and does not even try to bring it back up:
Online: [ node-1.
Full list of resources:
Clone Set: clone_p_vrouter [p_vrouter]
Started: [ node-1.
vip__management (ocf::fuel:
vip__public_
vip__managemen
vip__public (ocf::fuel:
Master/Slave Set: master_p_conntrackd [p_conntrackd]
Masters: [ node-1.
Slaves: [ node-2.
Clone Set: clone_p_haproxy [p_haproxy]
Started: [ node-1.
Clone Set: clone_p_dns [p_dns]
Started: [ node-1.
Clone Set: clone_p_mysql [p_mysql]
Started: [ node-1.
p_ceilometer-
p_ceilometer-
Master/Slave Set: master_
Masters: [ node-1.
Slaves: [ node-2.
Clone Set: clone_p_heat-engine [p_heat-engine]
Started: [ node-1.
Clone Set: clone_p_ntp [p_ntp]
Started: [ node-1.
Clone Set: clone_ping_
Started: [ node-1.
PCSD Status:
10.109.22.4: Offline
10.109.22.5: Offline
10.109.22.8: Offline
VERSION:
feature_groups:
- mirantis
production: "docker"
release: "7.0"
openstack_
api: "1.0"
build_number: "26"
build_id: "2015-07-
nailgun_sha: "d040c5cebc9cdd
python-
astute_sha: "9cbb8ae5adbe6e
fuel-library_sha: "251c54e8de2f41
fuel-ostf_sha: "a752c857deafd2
fuelmain_sha: "4f2dff3bdc3278
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
status: New → In Progress
description: updated
summary:
- Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down
+ Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down/split brain
tags: added: on-verification
tags: removed: on-verification
tags: added: on-verification
tags: added: area-library
tags: added: rca-done
Changed in mos:
assignee: MOS Packaging Team (mos-packaging) → Ivan Udovichenko (iudovichenko)
Changed in mos:
status: In Progress → Fix Committed
Changed in mos:
status: Fix Committed → Confirmed
Changed in mos:
status: Confirmed → Fix Committed
tags: added: on-verification
no longer affects: mos
Changed in fuel:
status: Fix Committed → Fix Released
According to the logs, the monitor action had been returning "not running", and Pacemaker did not trigger any stop/start events because this situation is considered OK (for example, the resource may legitimately not be running after a graceful stop). The solution is to return a generic error instead of "not running" whenever the script logic expects the resource to be restarted by Pacemaker.
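
The idea behind the fix can be sketched roughly as follows. This is only an illustration in Python, not the actual resource agent code (the real agent is a shell script). The function name `monitor_exit_code` and its two boolean inputs are hypothetical. The exit codes are the standard OCF values: 0 (OCF_SUCCESS), 1 (OCF_ERR_GENERIC) and 7 (OCF_NOT_RUNNING); the 7 matches what the OCF check printed on node-2 above.

```python
# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor_exit_code(is_running: bool, expects_restart: bool) -> int:
    """Choose the exit code for a monitor action.

    Both parameters are hypothetical inputs used for illustration:
    is_running      -- whether the service process was found
    expects_restart -- whether the agent logic wants the cluster
                       manager to recover (restart) the resource
    """
    if is_running:
        return OCF_SUCCESS
    if expects_restart:
        # Report a failure rather than "not running", so the state
        # is not mistaken for a clean, intentional stop.
        return OCF_ERR_GENERIC
    # Resource is stopped and no recovery is expected.
    return OCF_NOT_RUNNING
```

In the scenario reported here, RabbitMQ was down after the forced reboot while the monitor returned 7. Under this sketch that case corresponds to `monitor_exit_code(False, True)`, which returns 1 instead.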