Rabbit master re-election happens after failover of controller with rabbit slave

Bug #1536974 reported by Tatyanka
Affects            | Status    | Importance | Assigned to     | Milestone
-------------------|-----------|------------|-----------------|----------
Fuel for OpenStack | Invalid   | Medium     | Bogdan Dobrelya |
8.0.x              | Won't Fix | Medium     | MOS Oslo        |
Mitaka             | Invalid   | Medium     | Bogdan Dobrelya |

Bug Description

Steps to Reproduce:

Check 3 in 1 rabbit failover

Scenario:
1. SSH to a controller and get the rabbit master (see the sketch after this list)
2. Destroy a node that is not the rabbit master
3. Check that the rabbit master stays the same
4. Run OSTF HA
5. Power on the destroyed slave
6. Check that the rabbit master is still the same
7. Run OSTF HA
8. Destroy the rabbit master node
9. Check that a new rabbit master is elected
10. Run OSTF HA
11. Power on the destroyed node
12. Check that no new rabbit master is elected
13. Run OSTF HA
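
For step 1, a minimal sketch of finding the current rabbit master from a controller (assuming the Pacemaker-managed master_p_rabbitmq-server multi-state resource shown further below; the grep pattern is illustrative):

  # On any controller: print the RabbitMQ master/slave set and note
  # which node is listed under "Masters"
  pcs status | grep -A 2 'Master/Slave Set: master_p_rabbitmq-server'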
[root@nailgun ~]# fuel node
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|--------|---------------------|---------|--------------|-------------------|------------|---------------|--------|---------
1 | ready | slave-05_compute | 1 | 10.109.30.7 | 64:17:61:36:a1:00 | compute | | True | 1
4 | ready | slave-03_controller | 1 | 10.109.30.8 | 64:f1:8c:b6:e2:34 | controller | | False | 1
3 | ready | slave-02_controller | 1 | 10.109.30.4 | 64:f1:6b:07:1c:82 | controller | | True | 1
2 | ready | slave-04_compute | 1 | 10.109.30.9 | 64:94:54:46:44:dc | compute | | True | 1
6 | ready | slave-06_cinder | 1 | 10.109.30.10 | 64:cb:f7:d7:58:9b | cinder | | True | 1
5 | ready | slave-01_controller | 1 | 10.109.30.3 | 64:fa:d4:f6:da:45 | controller | | True | 1

Actual result:
The test failed at step 3 because the master was re-elected.

After deployment node-3 was elected as the rabbit master. In the test we destroy node-4 (which was running as a rabbit slave), and after this a master re-election happens and node-5 becomes the master.
At the same time the OSTF HA suite passes, so please clarify whether a requirement like "if failover happens on a node that is a rabbit slave, the rabbit master node should stay the same" is valid. If not, let's reassign this issue to the QA team for test updates.

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-5.test.domain.local ]
     Slaves: [ node-3.test.domain.local ]
[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "466"
  build_id: "466"
  fuel-nailgun_sha: "f81311bbd6fee2665e3f96dcac55f72889b2f38c"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "6823f1d4005a634b8436109ab741a2194e2d32e0"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "fe03d887361eb80232e9914eae5b8d54304df781"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "fac143f4dfa75785758e72afbdc029693e94ff2b"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "727f7076f04cb0caccc9f305b149a2b5b5c2af3a"

Tags: area-library
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
summary: - Rabbt master re-election happens after failover of controller with
+ Rabbit master re-election happens after failover of controller with
rabbit slave
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Set to Medium because the HA cluster is healthy after the master re-election, and according to the log it happens due to timeouts in rabbitmqctl.
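
To observe the condition manually on the affected controller, a sketch assuming the standard rabbitmqctl CLI and coreutils timeout (the 30s value is illustrative):

  # Run the same kind of check that times out for the OCF agent;
  # on a hung node this exceeds the timeout and exits non-zero
  timeout 30 rabbitmqctl list_channels || echo "list_channels timed out"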

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I want to investigate what caused the 'rabbitmqctl list_channels' timeout, but sadly there are no rabbitmq logs in the snapshot.
Is it possible to run this scenario with 'max_rabbitmqctl_timeouts' set to some high value and give me access to the env after 'list_channels' becomes unresponsive?
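
A sketch of raising the threshold on a controller, assuming max_rabbitmqctl_timeouts is exposed as a parameter of the p_rabbitmq-server primitive (the value 100 is illustrative):

  # Tolerate many more rabbitmqctl timeouts before the OCF agent
  # restarts the node, so the hung state is kept around for debugging
  pcs resource update p_rabbitmq-server max_rabbitmqctl_timeouts=100
  # Confirm the new parameter value
  pcs resource show p_rabbitmq-server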

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Alexey, the bug is not valid, as it is expected that a rabbit node is restarted when rabbitmqctl hangs. But it would be nice to apply your extended-debug patch and investigate *why* it hangs so often, as a separate effort please.

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

Yes, but 'list_channels' should not hang in the first place. The best way to investigate this condition would be to set 'max_rabbitmqctl_timeouts' to some big value and do some Erlang debugging. But it looks like this won't be possible within the 8.0 timeframe.
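
A possible starting point for the Erlang-side debugging once 'list_channels' hangs (a sketch; rabbit_diagnostics:maybe_stuck/0 is assumed to be available in the shipped RabbitMQ version):

  # Ask the node's Erlang VM to report processes that look stuck
  rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'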
