HAProxy check fails after 'restart ceph-all'

Bug #1614914 reported by Yury Tregubov
Affects: Mirantis OpenStack
Status: Fix Released
Importance: High
Assigned to: Yury Tregubov
Milestone: 9.1

Bug Description

The issue is seen on MOS 9.1 somewhere after snapshot #107.
It's not 100% reproducible, but it is quite stable: we've caught it on roughly every third CI run during the last two weeks.

The problem itself is that the OSTF tests fail after 'restart ceph-all' is executed on all controllers and ceph nodes, in the following way:

Test "Check state of haproxy backends on controllers" status is failure; Some haproxy backend has down state.. Please refer to OpenStack logs for more details.

No errors were found in the logs.
A diagnostic snapshot is attached.

To reproduce the fault:

- deploy env with 3 controllers and 2 ceph+compute nodes.
- revert it
- run 'restart ceph-all' on each node in the env
- run OSTF tests

The root cause is that fuel-qa restarts the whole ceph cluster at once
and launches the OSTF tests immediately after restarting the cluster.
However, ceph is NOT designed to withstand a *whole cluster* outage,
so there is a time interval during which the (ceph) cluster can't serve clients' requests. fuel-qa should either

- tolerate a temporarily unavailable cluster (see the sketch below), or
- restart the ceph daemons one by one, giving each instance (monitor, OSD) enough time to join the cluster
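
A minimal sketch of the first option, assuming fuel-qa can run the ceph CLI on one of the controllers and simply polls for HEALTH_OK before launching OSTF (the timeout and poll interval below are illustrative, not values taken from fuel-qa):

#!/bin/bash
# Hypothetical helper: after 'restart ceph-all', poll the cluster until it
# reports HEALTH_OK (or the timeout expires); only then run OSTF.
TIMEOUT=${1:-600}   # seconds to wait for the cluster to recover (illustrative)
INTERVAL=10         # seconds between polls

deadline=$(( $(date +%s) + TIMEOUT ))
while [ "$(date +%s)" -lt "$deadline" ]; do
    if ceph health | grep -q HEALTH_OK; then
        echo "ceph cluster is healthy, proceeding with OSTF"
        exit 0
    fi
    sleep "$INTERVAL"
done

echo "ceph cluster did not recover within ${TIMEOUT}s:" >&2
ceph health detail >&2
exit 1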

Revision history for this message
Yury Tregubov (ytregubov) wrote :
Changed in mos:
assignee: nobody → MOS Ceph (mos-ceph)
status: New → Confirmed
milestone: none → 9.1
importance: Undecided → High
tags: added: area-mos
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> - deploy env with 3 controllers and 2 ceph+compute nodes
> - revert it

Could you please explain how one could possibly "revert" a ceph cluster?

> - run 'restart ceph-all' on each node in the env

Restarting ceph services on all nodes simultaneously is not a good idea; instead, one
should restart them node by node, giving ceph enough time to establish quorum and perform
peering (and possibly recovery) before proceeding to the next node.

> - run OSTF tests

What is the time interval between restarting ceph and running the tests?
(Establishing quorum and peering/recovery takes some time, depending on the number of
nodes, the number of placement groups, etc.)
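
For reference, a few standard ceph CLI calls (run on any controller/monitor node) show whether the monitors have re-established quorum and whether peering/recovery has finished; this is only meant to illustrate what "enough time" means here:

ceph quorum_status   # JSON output; "quorum_names" lists the monitors currently in quorum
ceph -s              # overall cluster status, including monmap and pgmap summaries
ceph pg stat         # e.g. "512 pgs: 512 active+clean; ..."; peering/recovery is
                     # finished once all placement groups are active+clean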

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> restarting ceph services on all nodes simultaneously is not a good idea

And the test does exactly that:

ceph-mon.node-1.log (@node-1):

2016-08-19 09:47:20.571625 7f57e392d700 -1 mon.node-1@1(peon) e3 *** Got Signal Terminated ***

ceph-mon.node-4.log (@node-4):

2016-08-19 09:47:08.387261 7f6c36390700 -1 mon.node-4@0(leader) e3 *** Got Signal Terminated ***

ceph-mon.node-5.log (@node-5):

2016-08-19 09:47:12.558032 7f68292a0700 -1 mon.node-5@2(peon) e3 *** Got Signal Terminated ***

ceph-osd.3.log (@node-2):

2016-08-19 09:47:09.011077 7f44e91e5700 -1 osd.3 51 *** Got signal Terminated ***

ceph-osd.0.log (@node-2):

2016-08-19 09:47:09.010648 7f348e780700 -1 osd.0 52 *** Got signal Terminated ***

ceph-osd.1.log (@node-3):

2016-08-19 09:47:13.110102 7f068618b700 -1 osd.1 54 *** Got signal Terminated ***

ceph-osd.2.log (@node-3):

2016-08-19 09:47:13.110372 7fdcbe48d700 -1 osd.2 54 *** Got signal Terminated ***

If one wants the (ceph) cluster to be always available, restarting ceph services should be properly
orchestrated:

1) restart the monitors one by one, giving each monitor enough time to join the quorum
2) restart the OSDs one by one, giving each OSD enough time to perform peering (and possibly recovery).
   It's also wise to set the noout, noscrub and nodeep-scrub OSD flags before restarting the OSDs (and clear
   them after all OSDs have been restarted) to prevent unnecessary data migration between OSDs.

Otherwise the cluster will be unavailable for some time (eventually it will recover). A sketch of such a restart sequence is given below.
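
The sketch below shows one possible way to script such an orchestrated restart. It assumes the Ubuntu upstart jobs used on MOS nodes ('restart ceph-mon id=<host>', 'restart ceph-osd id=<N>') and passwordless root SSH from the node running it; the node names, the OSD discovery method and all timeouts are illustrative:

#!/bin/bash
# Illustrative rolling restart; node names, the OSD discovery method and all
# timeouts are examples only.
MON_NODES="node-1 node-4 node-5"
OSD_NODES="node-2 node-3"

wait_for() {  # run the given command repeatedly until it succeeds or 300 s pass
    local deadline=$(( $(date +%s) + 300 ))
    until "$@"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 5
    done
}

# a monitor is listed after the word "quorum" in 'ceph mon stat' once it has rejoined
mon_in_quorum() { ceph mon stat | grep -Eq "quorum.*$1"; }
# peering/recovery is finished when no PG is peering, degraded, stale, etc.
pgs_settled()   { ! ceph pg stat | grep -Eq 'peering|degraded|recover|stale|down'; }

# prevent unnecessary data migration and scrubbing while the OSDs are bounced
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub

# 1) restart the monitors one by one, waiting for each to rejoin the quorum
for mon in $MON_NODES; do
    ssh "$mon" restart ceph-mon id="$mon"
    sleep 10                       # let the monmap notice the restart
    wait_for mon_in_quorum "$mon"
done

# 2) restart the OSDs one by one, waiting for peering (and recovery) to settle
for node in $OSD_NODES; do
    for osd in $(ssh "$node" ls /var/lib/ceph/osd | sed 's/^ceph-//'); do
        ssh "$node" restart ceph-osd id="$osd"
        sleep 10                   # let the OSD report in and start peering
        wait_for pgs_settled
    done
done

# clear the flags once all OSDs have been restarted
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub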


tags: added: blocker-for-qa
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

The test restarts the whole cluster at once, hence there is a time interval during which
the cluster can't serve clients' requests. This is not a bug. The test should be rewritten so that it either

- tolerates a temporarily unavailable cluster
- restarts the ceph daemons one by one, giving each instance (monitor, OSD) enough time to join the cluster

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

The Ceph cluster operates as designed; the ceph team is not responsible for a broken test in fuel-qa (or whatever it is).

Changed in mos:
assignee: MOS Ceph (mos-ceph) → nobody
Changed in mos:
assignee: nobody → Fuel QA Team (fuel-qa)
status: Confirmed → New
tags: added: area-qa
removed: area-mos
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Why do you think the issue is in fuel-qa? Unfortunately, from the description I couldn't tell whether the env was deployed with fuel-qa.

Changed in mos:
status: New → Incomplete
assignee: Fuel QA Team (fuel-qa) → Yury Tregubov (ytregubov)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Let's try to apply the same workaround here:
https://review.gerrithub.io/#/c/290013/

Changed in mos:
status: Incomplete → Confirmed
Revision history for this message
Alexander Nagovitsyn (gluk12189) wrote :
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Why did you think that issue in fuel-qa?

https://bugs.launchpad.net/mos/+bug/1614914/comments/4 (I've copied that text to the bug description)

description: updated
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :
Changed in mos:
status: Confirmed → In Progress
Changed in mos:
status: In Progress → Fix Committed
Changed in mos:
status: Fix Committed → Fix Released