[swarm] corosync stability check test fails

Bug #1636561 reported by Dmitry Belyaninov
This bug affects 2 people
Affects              Status      Importance  Assigned to    Milestone
Fuel for OpenStack   Incomplete  High        Fuel QA Team
  (nominated for Ocata by Oleksiy Molchanov)
Mitaka               Confirmed   High        Fuel QA Team
Newton               Incomplete  High        Fuel QA Team

Bug Description

Detailed bug description:
The following test case fails:

https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/107/testReport/(root)/ha_corosync_stability_check/ha_corosync_stability_check/

  File "/home/jenkins/workspace/9.x.system_test.ubuntu.ha_neutron_destructive/fuelweb_test/tests/tests_strength/test_failover_base.py", line 1193, in <lambda>
    ' count-{0}'.format(count)), timeout=20)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/asserts.py", line 55, in assert_equal
    raise ASSERTION_ERROR(message)
AssertionError: Corosync was not started, see debug log, count-0
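
The assertion comes from a 20-second polling wait for the corosync process to come back; "count-0" means no corosync process was ever observed. A minimal sketch of that kind of check, assuming a fuel-qa-style remote object whose execute() returns a dict with 'stdout' lines (the helper names here are illustrative, not the actual test code):

import time

def corosync_process_count(remote):
    # 'remote' is assumed to expose an execute() method returning a dict
    # with 'stdout' lines, similar to fuel-qa's SSH helpers.
    result = remote.execute('pgrep corosync | wc -l')
    return int(result['stdout'][0])

def wait_corosync_started(remote, timeout=20, interval=2):
    # Poll until at least one corosync process exists, mirroring the
    # timeout=20 wait seen in the traceback above.
    deadline = time.time() + timeout
    count = 0
    while time.time() < deadline:
        count = corosync_process_count(remote)
        if count > 0:
            return count
        time.sleep(interval)
    raise AssertionError(
        'Corosync was not started, see debug log, count-{0}'.format(count))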

Steps to reproduce:
 Run the test.
Expected results:
 The test passes.
Actual result:
 The test fails.
Reproducibility:
 <put your information here>
Workaround:
 <put your information here>
Impact:
 <put your information here>
Description of the environment:
 Operation system: <put your information here>
 Versions of components: <put your information here>
 Reference architecture: <put your information here>
 Network model: <put your information here>
 Related projects installed: <put your information here>
Additional information:
 <put your information here>

Changed in fuel:
milestone: 9.2 → 11.0
status: New → Confirmed
tags: added: area-library
Oleksiy Molchanov (omolchanov) wrote :

Logs for the failed deployment are no longer available, and the latest deployments do not contain this failure. Marking as Incomplete; please reopen as soon as this issue happens again.

Changed in fuel:
status: Confirmed → Incomplete
Vladimir Kuklin (vkuklin) wrote :

The failure logs do not contain any information about a corosync split-brain. Please create another bug and attach the logs there.

summary: - [swarm] 9.2 Test failed with "Corosync was not started"
+ [swarm] corosync stability check test fails
Oleksiy Molchanov (omolchanov) wrote :

Please reopen this as soon as this issue reproduces again.

Nastya Urlapova (aurlapova) wrote :

Reproduced on 9.2 snapshot #684:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/162/testReport/(root)/ha_corosync_stability_check/ha_corosync_stability_check/

 Scenario:
        1. Kill corosync on the first controller
        2. Check the "pcs status nodes" report on all controllers
        3. Start corosync on the first controller
        4. Repeat steps 1-3 500 times
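
For reference, the scenario above boils down to a loop like the following sketch. The remote objects with an execute() method and the service-start command are assumptions modeled on fuel-qa's SSH helpers, not the actual test code:

def offline_nodes(remote):
    # Parse the first 'Offline:' line (the Pacemaker Nodes section) from
    # 'pcs status nodes' output.
    for line in remote.execute('pcs status nodes')['stdout']:
        line = line.strip()
        if line.startswith('Offline:'):
            return line.split(':', 1)[1].split()
    return []

def corosync_stability_check(first_ctrl, first_fqdn, other_ctrls,
                             iterations=500):
    for _ in range(iterations):
        # Step 1: kill corosync on the first controller.
        first_ctrl.execute('killall -s TERM corosync')
        # Step 2: every other controller should report that node offline.
        for ctrl in other_ctrls:
            assert first_fqdn in offline_nodes(ctrl), (
                'node {0} not reported offline: possible split-brain'
                .format(first_fqdn))
        # Step 3: start corosync again so the node can rejoin the cluster.
        first_ctrl.execute('service corosync start')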

tags: added: swarm-fail
Stanislaw Bogatkin (sbogatkin) wrote :

I tried to reproduce this manually. No luck after several thousand corosync kill/restart cycles.

Stanislaw Bogatkin (sbogatkin) wrote :

Here is the root cause. Our test scenario works like this:

1. Get a controller
2. Call 'killall -s TERM corosync' on it
3. Get the controller's FQDN
4. Go to another controller
5. Call 'pcs status' on it
6. Get the offline nodes from the output
7. Check that our first controller is among the 'Offline' nodes in that output
8. If the check in step 7 fails, throw an exception about split-brain

The problem is that between the 'killall' call on the first controller and the 'pcs status' call on the second, corosync managed to sync the cluster back, so on the second node all nodes are online in one cluster again. And that is actually the desired behavior.
So we should fix the test itself, I believe. My proposal is to change the check at [0] so that it passes when either all nodes are online in one cluster or the first controller is offline.

[0] https://github.com/openstack/fuel-qa/blob/master/fuelweb_test/tests/tests_strength/test_failover_base.py#L1139-L1140
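
A minimal sketch of the proposed relaxed check, under the same assumptions about the remote object as in the earlier sketches (illustrative names, not the actual fuel-qa code):

def pcs_node_sets(remote):
    # Return (online, offline) node lists from the Pacemaker Nodes section
    # of 'pcs status nodes'; the later remote-nodes sections are skipped by
    # only taking the first non-empty match.
    online, offline = [], []
    for line in remote.execute('pcs status nodes')['stdout']:
        line = line.strip()
        if line.startswith('Online:') and not online:
            online = line.split(':', 1)[1].split()
        elif line.startswith('Offline:') and not offline:
            offline = line.split(':', 1)[1].split()
    return online, offline

def cluster_state_is_sane(remote, killed_fqdn, all_fqdns):
    online, offline = pcs_node_sets(remote)
    # Healthy outcomes: either corosync already rejoined and every node is
    # online in one cluster, or the killed node is correctly seen offline.
    return sorted(online) == sorted(all_fqdns) or killed_fqdn in offline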

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Fuel QA Team (fuel-qa)
Vladimir Kuklin (vkuklin) wrote :

@Dmitry

I cannot reproduce this issue manually. Here is the log: we can see that the node's Offline status is set within about one second. So I suppose there is some race condition. I would suggest checking the corosync process status on the first node, waiting until corosync has fully terminated, and only then checking the pcs node status (a sketch of such a wait follows the log below). It would also be really useful to have timestamps on the log steps to make debugging easier.
root@node-6:~# killall -TERM corosync; date +%s.%N
1484752636.103218131

root@node-7
Pacemaker Nodes:
 Online: node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local
 Standby:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
1484752636.759762323
Pacemaker Nodes:
 Online: node-5.test.domain.local node-7.test.domain.local
 Standby:
 Offline: node-6.test.domain.local
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
1484752637.373758520
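
A minimal sketch of the suggested wait, assuming the remote object's execute() result also exposes an 'exit_code' key (an assumption modeled on fuel-qa's SSH helpers, not the actual code):

import time

def wait_corosync_stopped(remote, timeout=60, interval=1):
    # Poll the node until no corosync process remains; pgrep exits with a
    # non-zero code when nothing matches.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if remote.execute('pgrep corosync')['exit_code'] != 0:
            return
        time.sleep(interval)
    raise AssertionError('corosync did not terminate within {0}s'
                         .format(timeout))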
