[swarm 9.2] dead rabbit node was not removed, result is ExecResult

Bug #1636538 reported by Dmitry Belyaninov
This bug affects 1 person
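Affects            | Status       | Importance | Assigned to   | Milestone
-------------------+--------------+------------+---------------+----------
Fuel for OpenStack | Invalid      | High       | Sergii Rizvan |
Mitaka             | Fix Released | High       | Sergii Rizvan |
Newton             | Invalid      | High       | Sergii Rizvan |
Ocata              | Invalid      | High       | Sergii Rizvan |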

Bug Description

Detailed bug description:
There are two failed test cases:

https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/107/testReport/(root)/ha_neutron_check_dead_rabbit/ha_neutron_check_dead_rabbit/

dead rabbit node was not removed, result is ExecResult(
 cmd=grep -P 'Forgetting cluster node rabbit@\S*\bnode-2\b' /var/log/remote/node-3.test.domain.local/rabbit-fence.log,
  stdout=
'',
 stderr=
'',
 exit_code=1
)
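
The check that fails here can be reproduced offline against a copy of rabbit-fence.log. The following is a minimal sketch, not the fuel-qa implementation: it reuses the pattern from the grep command above, while the function name and the sample log lines are hypothetical and only illustrate the matching behaviour.

```python
import re


def node_was_forgotten(log_text: str, node_name: str) -> bool:
    """Return True if the fence log records that `node_name` was forgotten."""
    # Same pattern the test passes to `grep -P`; the node name is escaped
    # so that it is matched literally.
    pattern = re.compile(
        r"Forgetting cluster node rabbit@\S*\b" + re.escape(node_name) + r"\b"
    )
    return any(pattern.search(line) for line in log_text.splitlines())


# Hypothetical log excerpts, for illustration only.
log_with_fencing = "INFO Forgetting cluster node rabbit@messaging-node-2\n"
log_without_fencing = "INFO Starting rabbit fence script main loop\n"

print(node_was_forgotten(log_with_fencing, "node-2"))
print(node_was_forgotten(log_without_fencing, "node-2"))
```

An empty stdout with exit_code=1, as in the ExecResult above, corresponds to this function returning False: the fence script never logged that it forgot the node.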

https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/107/testReport/(root)/ha_neutron_check_alive_rabbit/ha_neutron_check_alive_rabbit/
alive rabbit node was not ignored, result is ExecResult .....

It seems that there is some common problem with rabbit exclusion.

Steps to reproduce:
run the test(s)
Expected results:
pass
Actual result:
fail
Reproducibility:
 <put your information here>
Workaround:
 <put your information here>
Impact:
 <put your information here>
Description of the environment:
 Operation system: <put your information here>
 Versions of components: <put your information here>
 Reference architecture: <put your information here>
 Network model: <put your information here>
 Related projects installed: <put your information here>
Additional information:
 <put your information here>

Tags: swarm-fail
Changed in fuel:
milestone: 9.2 → 11.0
tags: added: area-library
Changed in fuel:
status: New → Confirmed
Bogdan Dobrelya (bogdando) wrote :

Please link logs

Changed in fuel:
status: Confirmed → Incomplete
Oleksiy Molchanov (omolchanov) wrote :
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Fuel QA Team (fuel-qa)
tags: added: area-qa
removed: area-library
Oleksandr (oivashchenko)
tags: added: on-verification
Oleksandr (oivashchenko) wrote :

https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/124/consoleFull
I checked it on a controller node and found the same problems.

[root@nailgun ~]# fuel node
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---+--------+---------------------+---------+------------+-------------------+------------+---------------+--------+---------
 3 | ready | slave-05_compute | 1 | 10.109.0.7 | 64:5a:53:d8:d6:ed | compute | | 1 | 1
 2 | ready | slave-01_controller | 1 | 10.109.0.3 | 64:01:71:f5:61:a3 | controller | | 1 | 1
 1 | ready | slave-02_controller | 1 | 10.109.0.4 | 64:4d:4a:0c:37:dd | controller | | 1 | 1
 5 | ready | slave-03_controller | 1 | 10.109.0.5 | 64:ae:f0:bc:0e:96 | controller | | 1 | 1
 4 | ready | slave-04_compute | 1 | 10.109.0.6 | 64:f6:a6:d7:ca:61 | compute | | 1 | 1
 6 | ready | slave-06_cinder | 1 | 10.109.0.8 | 64:c6:9c:80:88:90 | cinder | | 1 | 1
[root@nailgun ~]# cat /var/log/remote/node-5.test.domain.local/rabbit-fence.log
2016-11-10T22:35:21.873637+00:00 info: 2016-11-10 22:35:21,878 INFO Starting rabbit fence script main loop
2016-11-10T22:35:21.992104+00:00 info: 2016-11-10 22:35:21,997 INFO Caught SIGTERM, terminating...
2016-11-10T22:35:22.085073+00:00 info: 2016-11-10 22:35:22,090 INFO Starting rabbit fence script main loop
[root@nailgun ~]# ssh node-5
Last login: Fri Nov 11 10:25:58 2016 from 10.109.0.2
root@node-5:~# pcs status | grep rabbit
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server] (unmanaged)
     p_rabbitmq-server (ocf::fuel:rabbitmq-server): Started node-2.test.domain.local (unmanaged)
     p_rabbitmq-server (ocf::fuel:rabbitmq-server): Started node-5.test.domain.local (unmanaged)

tags: removed: on-verification
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Sustaining (fuel-sustaining-team)
tags: removed: area-qa
tags: added: swarm-fail
Sergii Rizvan (srizvan)
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Sergii Rizvan (srizvan)
Bogdan Dobrelya (bogdando) wrote :

The message that dbus sends when a node leaves has changed unnoticed, and I can't identify the patch that introduced the regression.

When the corosync service is killed, the command "dbus-monitor --system" shows:

signal sender=:1.0 -> dest=(null destination) serial=2107 path=/com/ubuntu/Upstart/jobs/startpar_2dbridge/corosync_2d_2dstarted; interface=com.ubuntu.Upstart0_6.Instance; member=StateChanged
   string "killed"

whereas something completely different is expected.
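
To sift through a longer "dbus-monitor --system" capture, the signal header lines can be parsed and filtered by interface. This is a rough sketch under assumptions: the regular expression is derived only from the single header line quoted above, and the helper names are hypothetical. It does not say what the expected signal looks like, only whether a given signal was emitted by Upstart.

```python
import re

# Derived from the one header line quoted in this comment; other
# dbus-monitor message types (method calls, errors) are not handled.
SIGNAL_RE = re.compile(
    r"^signal sender=(?P<sender>\S+) -> dest=(?P<dest>.+?) "
    r"serial=(?P<serial>\d+) path=(?P<path>\S+); "
    r"interface=(?P<interface>\S+); member=(?P<member>\S+)$"
)


def parse_signal_header(line: str) -> dict:
    """Split a dbus-monitor signal header into its named fields."""
    match = SIGNAL_RE.match(line.strip())
    if match is None:
        raise ValueError("not a dbus-monitor signal header")
    return match.groupdict()


def is_upstart_signal(fields: dict) -> bool:
    """True if the signal was emitted on an Upstart interface."""
    return fields["interface"].startswith("com.ubuntu.Upstart")


header = (
    "signal sender=:1.0 -> dest=(null destination) serial=2107 "
    "path=/com/ubuntu/Upstart/jobs/startpar_2dbridge/corosync_2d_2dstarted; "
    "interface=com.ubuntu.Upstart0_6.Instance; member=StateChanged"
)
fields = parse_signal_header(header)
print(fields["interface"], fields["member"], is_upstart_signal(fields))
```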

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/corosync (9.0)

Fix proposed to branch: 9.0
Change author: Sergii Rizvan <email address hidden>
Review: https://review.fuel-infra.org/29058

Sergii Rizvan (srizvan) wrote :

The root cause of the issue is as follows. After the upstart job for corosync-notifyd was added (Change-Id: I6b3abb5a65a0a73db1642800dfa01b1a46192197), /usr/sbin/corosync-notifyd was launched without the '-d' option. This option is needed in order to run corosync-notifyd as a daemon. So in fact corosync-notifyd wasn't running on corosync nodes, which is why no signals were sent to dbus and the rabbit-fence.py script was unable to react to such situations.
The proposed patch adds the '-d' option to the OPTIONS environment variable in /etc/default/corosync-notifyd.
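
A quick way to confirm that a node carries the fix is to check whether OPTIONS in /etc/default/corosync-notifyd contains '-d'. The sketch below assumes the file uses the usual shell-style `NAME="value"` assignments; the exact file contents are not shown in this report, so the sample text and the function names are hypothetical.

```python
import shlex


def notifyd_options(defaults_text: str) -> list:
    """Return the flags assigned to OPTIONS in a shell-style defaults file."""
    for raw in defaults_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == "OPTIONS":
            flags = []
            # shlex.split strips the quotes; a quoted value stays one
            # token, so split it again on whitespace.
            for token in shlex.split(value):
                flags.extend(token.split())
            return flags
    return []


def sends_dbus_signals(defaults_text: str) -> bool:
    """True if the '-d' flag is present in OPTIONS."""
    return "-d" in notifyd_options(defaults_text)


# Hypothetical file contents, for illustration only.
patched = '# defaults for corosync-notifyd\nOPTIONS="-d"\n'
unpatched = "# defaults for corosync-notifyd\n"

print(sends_dbus_signals(patched), sends_dbus_signals(unpatched))
```

On a real node the text would come from reading /etc/default/corosync-notifyd rather than from the inline samples.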

Changed in fuel:
status: Incomplete → Invalid
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/corosync (9.0)

Reviewed: https://review.fuel-infra.org/29058
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0

Commit: 34fd45a048e9a1a57ccf117515379f5978d41503
Author: Sergii Rizvan <email address hidden>
Date: Thu Dec 8 15:31:55 2016

Added OPTIONS environment variable for corosync-notifyd

For corosync-notifyd '-d' option is needed in order to send
DBUS signals on all events. That's why OPTIONS environment
variable has been added to /etc/default/corosync-notifyd

Change-Id: Iaf6675cd023849a89139453bdc7d98f03e983b37
Closes-Bug: #1636538

tags: added: on-verification
Dmitry Belyaninov (dbelyaninov) wrote :

Passed on #157 run

tags: removed: on-verification