ha_ceph_neutron_rabbit_master_destroy test failed after controller destroy with Can not ping instance by floating ip 10.109.1.129

Bug #1516631 reported by Andrey Sledzinskiy
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: Fuel QA Team

Bug Description

The following test fails at different stages of pinging the instance by its floating IP - https://github.com/openstack/fuel-qa/blob/master/fuelweb_test/tests/tests_strength/test_failover_base.py#L673

Steps to reproduce:
1. Create the following cluster: Neutron VLAN, Ceph for volumes and images, 1 controller+ceph, 2 controllers, 1 compute, 1 compute+ceph
2. Deploy cluster
3. Run OSTF - everything is working
4. Create an instance and assign a floating IP to it
5. Find the controller that hosts the RabbitMQ master:
crm resource status master_p_rabbitmq-server
6. Shut down this controller
7. Wait 15 minutes
8. Try to ping the floating IP of the instance (see the example commands below)
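
For reference, steps 5 and 8 boil down to roughly the following commands (a sketch only; the floating IP is the one from this report and will differ per environment):

# find which controller currently hosts the RabbitMQ master (run on any controller)
crm resource status master_p_rabbitmq-server
# after shutting that controller down and waiting ~15 minutes,
# check whether the instance is still reachable via its floating IP
ping -c 5 10.109.1.129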

Actual result: the instance is not pingable because the neutron agents cannot connect to RabbitMQ.
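
A rough way to confirm the neutron/rabbit connectivity problem from a controller (a sketch; the standard neutron CLI and default log locations are assumed, paths may differ):

# agents that have lost their AMQP connection are reported as dead (xxx) here
neutron agent-list
# look for AMQP/RabbitMQ connection errors in the agent logs
grep -i 'AMQP server' /var/log/neutron/*.log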

Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
status: New → Confirmed
Andrey Sledzinskiy (asledzinskiy) wrote :

Reproduced locally:
after shutting down node-1, only one controller shows up in pacemaker:
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-5.test.domain.local ]

rabbit on node-5 isn't alive:
root@node-5:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-5' ...
Error: unable to connect to node 'rabbit@node-5': nodedown
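
For comparison, the pacemaker view and rabbit's own view can be cross-checked with roughly these commands (a sketch; node names are specific to this environment):

# pacemaker's view of the rabbitmq multi-state resource
crm resource status master_p_rabbitmq-server
# rabbit's own view, run on the node itself (node-5 here);
# 'nodedown' usually means the beam process is gone or unreachable
rabbitmqctl status
ps aux | grep [b]eam.smp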

description: updated
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → MOS Oslo (mos-oslo)
Andrey Sledzinskiy (asledzinskiy) wrote :

The env is available for investigation.

Dmitry Mescheryakov (dmitrymex) wrote :

Something strange happens in Pacemaker: according to 'pcs status', both node-2 and node-5 are online and in the cluster: http://paste.openstack.org/show/479577/

But 'pcs resource' does not show a RabbitMQ status for node-2; only node-5 is listed as a slave: http://paste.openstack.org/show/479576/

At the same time, lrmd.log for node-2 shows that the LRMD daemon calls the 'monitor' operation and it returns OCF_ERR_GENERIC, but Pacemaker just ignores that. Also, lrmd.log for node-5 shows that the OCF script constantly tries to join RabbitMQ on node-5 to the one on node-2, but fails since RabbitMQ on node-2 is stuck.

Still, the main problem here is that Pacemaker does not act on node-2, even though the OCF script returns OCF_ERR_GENERIC, which signals a problem.
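
For anyone retracing this, the relevant pieces can be pulled roughly as follows (a sketch; the lrmd.log location referenced above may differ between deployments):

# cluster membership vs. the per-resource view
pcs status
pcs resource
# monitor results reported by lrmd for the rabbitmq OCF agent on node-2
grep -i 'p_rabbitmq-server.*monitor' /var/log/lrmd.log
# OCF_ERR_GENERIC maps to exit code 1, so look for monitor operations returning a non-zero rc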

Alexey Lebedeff (alebedev-a) wrote :

I have looked only into the problems of RabbitMQ itself; there were two bugs there:
- One is already fixed upstream - https://github.com/rabbitmq/rabbitmq-common/pull/18
- Another is described at https://github.com/rabbitmq/rabbitmq-server/issues/349 ; I'll fix it in the near future.

tags: added: swarm-blocker
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
removed: area-qa
Nastya Urlapova (aurlapova) wrote :

Moved to High because we have to fix it in the 8.0 release.

Changed in fuel:
importance: Medium → High
Sergey Shevorakov (sshevorakov) wrote :

The swarm-blocker tag is set because this bug fails 3 test cases (1% of the total).

Alexey Lebedeff (alebedev-a) wrote :

I believe that this issue should be fixed by https://review.fuel-infra.org/#/c/14586/
Does it still reproduce with this fix applied?

Dmitry Mescheryakov (dmitrymex) wrote :

QA team, per Alexey's comment above, could you please check whether the issue is still reproducible and, if so, provide us with a new repro?

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Fuel QA Team (fuel-qa)
Vladimir Khlyunev (vkhlyunev) wrote :

https://product-ci.infra.mirantis.net/view/8.0_swarm/job/8.0.system_test.ubuntu.ha_destructive_ceph_neutron/ has passed the last 2 times, so the fix looks fine, but deeper verification is needed.

Changed in fuel:
status: Confirmed → Fix Committed
tags: added: on-verification
ElenaRossokhina (esolomina) wrote :

Verified (iso#427)
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "427"
  build_id: "427"
  fuel-nailgun_sha: "9ebbaa0473effafa5adee40270da96acf9c7d58a"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "fae42170a54b98d8e8c8db99b0fbb312633c693c"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "85de57080a18fda18e5325f06eaf654b1b931592"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "e8e36cff332644576d7853c80b8a53d5b955420a"

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released