[swarm] corosync stability check test fails

Bug #1636561 reported by Dmitry Belyaninov
This bug affects 2 people
Affects              Status      Importance  Assigned to    Milestone
Fuel for OpenStack   Incomplete  High        Fuel QA Team
  (nominated for Ocata by Oleksiy Molchanov)
Mitaka               Confirmed   High        Fuel QA Team
Newton               Incomplete  High        Fuel QA Team

Bug Description

Detailed bug description:
The following test case fails:

https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/107/testReport/(root)/ha_corosync_stability_check/ha_corosync_stability_check/

  File "/home/jenkins/workspace/9.x.system_test.ubuntu.ha_neutron_destructive/fuelweb_test/tests/tests_strength/test_failover_base.py", line 1193, in <lambda>
    ' count-{0}'.format(count)), timeout=20)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/asserts.py", line 55, in assert_equal
    raise ASSERTION_ERROR(message)
AssertionError: Corosync was not started, see debug log, count-0
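
The assertion comes from a 20-second polling wait for the corosync process to come back; "count-0" means no corosync process was ever observed. A minimal sketch of that kind of check, assuming a fuel-qa-style remote object whose execute() returns a dict with 'stdout' lines (the helper names here are illustrative, not the actual test code):

import time

def corosync_process_count(remote):
    # 'remote' is assumed to expose an execute() method returning a dict
    # with 'stdout' lines, similar to fuel-qa's SSH helpers.
    result = remote.execute('pgrep corosync | wc -l')
    return int(result['stdout'][0])

def wait_corosync_started(remote, timeout=20, interval=2):
    # Poll until at least one corosync process exists, mirroring the
    # timeout=20 wait seen in the traceback above.
    deadline = time.time() + timeout
    count = 0
    while time.time() < deadline:
        count = corosync_process_count(remote)
        if count > 0:
            return count
        time.sleep(interval)
    raise AssertionError(
        'Corosync was not started, see debug log, count-{0}'.format(count))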

Steps to reproduce:
 Run the test.
Expected results:
 The test passes.
Actual result:
 The test fails.
Reproducibility:
 <put your information here>
Workaround:
 <put your information here>
Impact:
 <put your information here>
Description of the environment:
 Operation system: <put your information here>
 Versions of components: <put your information here>
 Reference architecture: <put your information here>
 Network model: <put your information here>
 Related projects installed: <put your information here>
Additional information:
 <put your information here>

Changed in fuel:
milestone: 9.2 → 11.0
status: New → Confirmed
tags: added: area-library
Oleksiy Molchanov (omolchanov) wrote :

Logs for the failed deployment are no longer available, and the latest deployments do not contain this failure. Marking as Incomplete; please reopen as soon as this issue happens again.

Changed in fuel:
status: Confirmed → Incomplete
Vladimir Kuklin (vkuklin) wrote :

The failure logs do not contain any information about a corosync split-brain. Please create another bug and attach the logs there.

summary: - [swarm] 9.2 Test failed with "Corosync was not started"
+ [swarm] corosync stability check test fails
Oleksiy Molchanov (omolchanov) wrote :

Please reopen this as soon as this issue reproduces again.

Nastya Urlapova (aurlapova) wrote :

Reproduced on 9.2 snapshot #684:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/162/testReport/(root)/ha_corosync_stability_check/ha_corosync_stability_check/

 Scenario:
        1. Kill corosync on the first controller
        2. Check the "pcs status nodes" report on all controllers
        3. Start corosync on the first controller
        4. Repeat steps 1-3 500 times
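
For reference, the scenario above boils down to a loop like the following sketch. The remote objects with an execute() method and the service-start command are assumptions modeled on fuel-qa's SSH helpers, not the actual test code:

def offline_nodes(remote):
    # Parse the first 'Offline:' line (the Pacemaker Nodes section) from
    # 'pcs status nodes' output.
    for line in remote.execute('pcs status nodes')['stdout']:
        line = line.strip()
        if line.startswith('Offline:'):
            return line.split(':', 1)[1].split()
    return []

def corosync_stability_check(first_ctrl, first_fqdn, other_ctrls,
                             iterations=500):
    for _ in range(iterations):
        # Step 1: kill corosync on the first controller.
        first_ctrl.execute('killall -s TERM corosync')
        # Step 2: every other controller should report that node offline.
        for ctrl in other_ctrls:
            assert first_fqdn in offline_nodes(ctrl), (
                'node {0} not reported offline: possible split-brain'
                .format(first_fqdn))
        # Step 3: start corosync again so the node can rejoin the cluster.
        first_ctrl.execute('service corosync start')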

tags: added: swarm-fail
Stanislaw Bogatkin (sbogatkin) wrote :

I tried to reproduce this manually. No luck after several thousand corosync kill/restart cycles.

Stanislaw Bogatkin (sbogatkin) wrote :

Here is the root cause. Our test scenario works like this:

1. Get a controller
2. Call 'killall -s TERM corosync' on it
3. Get the controller's FQDN
4. Go to another controller
5. Call 'pcs status' on it
6. Get the offline nodes from the output
7. Check that our first controller is among the 'Offline' nodes in that output
8. If the check in step 7 fails, throw an exception about split-brain

The problem is that between the 'killall' call on the first controller and the 'pcs status' call on the second, corosync managed to sync the cluster back, so on the second node all nodes are online in one cluster again. And that is actually the desired behavior.
So we should fix the test itself, I believe. My proposal is to change the check at [0] so that it passes when either all nodes are online in one cluster or the first controller is offline.

[0] https://github.com/openstack/fuel-qa/blob/master/fuelweb_test/tests/tests_strength/test_failover_base.py#L1139-L1140
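
A minimal sketch of the proposed relaxed check, under the same assumptions about the remote object as in the earlier sketches (illustrative names, not the actual fuel-qa code):

def pcs_node_sets(remote):
    # Return (online, offline) node lists from the Pacemaker Nodes section
    # of 'pcs status nodes'; the later remote-nodes sections are skipped by
    # only taking the first non-empty match.
    online, offline = [], []
    for line in remote.execute('pcs status nodes')['stdout']:
        line = line.strip()
        if line.startswith('Online:') and not online:
            online = line.split(':', 1)[1].split()
        elif line.startswith('Offline:') and not offline:
            offline = line.split(':', 1)[1].split()
    return online, offline

def cluster_state_is_sane(remote, killed_fqdn, all_fqdns):
    online, offline = pcs_node_sets(remote)
    # Healthy outcomes: either corosync already rejoined and every node is
    # online in one cluster, or the killed node is correctly seen offline.
    return sorted(online) == sorted(all_fqdns) or killed_fqdn in offline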

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Fuel QA Team (fuel-qa)
Vladimir Kuklin (vkuklin) wrote :

@Dmitry

I cannot reproduce this issue manually. Here is the log: we can see that the node's Offline status is set within about one second. So I suppose there is some race condition. I would suggest checking the corosync process status on the first node, waiting until corosync has fully terminated, and only then checking the pcs node status (a sketch of such a wait follows the log below). It would also be really useful to have timestamps on the log steps to make debugging easier.
root@node-6:~# killall -TERM corosync; date +%s.%N
1484752636.103218131

root@node-7
Pacemaker Nodes:
 Online: node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local
 Standby:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
1484752636.759762323
Pacemaker Nodes:
 Online: node-5.test.domain.local node-7.test.domain.local
 Standby:
 Offline: node-6.test.domain.local
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
1484752637.373758520
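
A minimal sketch of the suggested wait, assuming the remote object's execute() result also exposes an 'exit_code' key (an assumption modeled on fuel-qa's SSH helpers, not the actual code):

import time

def wait_corosync_stopped(remote, timeout=60, interval=1):
    # Poll the node until no corosync process remains; pgrep exits with a
    # non-zero code when nothing matches.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if remote.execute('pgrep corosync')['exit_code'] != 0:
            return
        time.sleep(interval)
    raise AssertionError('corosync did not terminate within {0}s'
                         .format(timeout))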
