HAProxy check fails after 'restart ceph-all'

Bug #1614914 reported by Yury Tregubov
Affects: Mirantis OpenStack
Status: Fix Released
Importance: High
Assigned to: Yury Tregubov
Milestone: 9.1

Bug Description

The issue is seen on MOS 9.1 somewhere after snapshot #107.
It's not 100% reproducible, but it is quite stable: we've caught it on roughly every third CI run during the last two weeks.

The problem itself is that the OSTF tests fail after 'restart ceph-all' is executed on all controllers and ceph nodes, in the following way:

Test "Check state of haproxy backends on controllers" status is failure; Some haproxy backend has down state.. Please refer to OpenStack logs for more details.

No errors were found in the logs.
A diagnostic snapshot is attached.

To reproduce the fault:

- deploy env with 3 controllers and 2 ceph+compute nodes.
- revert it
- run 'restart ceph-all' on each node in the env
- run OSTF tests

The root cause is that fuel-qa restarts the whole ceph cluster at once
and launches the OSTF tests immediately after restarting the cluster.
However, ceph is NOT designed to withstand a *whole cluster* outage,
so there is a time interval during which the (ceph) cluster can't serve clients' requests. fuel-qa should either

- tolerate a temporarily unavailable cluster (see the sketch below), or
- restart the ceph daemons one by one, giving each instance (monitor, OSD) enough time to join the cluster
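
A minimal sketch of the first option, assuming fuel-qa can run the ceph CLI on one of the controllers and simply polls for HEALTH_OK before launching OSTF (the timeout and poll interval below are illustrative, not values taken from fuel-qa):

#!/bin/bash
# Hypothetical helper: after 'restart ceph-all', poll the cluster until it
# reports HEALTH_OK (or the timeout expires); only then run OSTF.
TIMEOUT=${1:-600}   # seconds to wait for the cluster to recover (illustrative)
INTERVAL=10         # seconds between polls

deadline=$(( $(date +%s) + TIMEOUT ))
while [ "$(date +%s)" -lt "$deadline" ]; do
    if ceph health | grep -q HEALTH_OK; then
        echo "ceph cluster is healthy, proceeding with OSTF"
        exit 0
    fi
    sleep "$INTERVAL"
done

echo "ceph cluster did not recover within ${TIMEOUT}s:" >&2
ceph health detail >&2
exit 1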

Revision history for this message
Yury Tregubov (ytregubov) wrote :
Changed in mos:
assignee: nobody → MOS Ceph (mos-ceph)
status: New → Confirmed
milestone: none → 9.1
importance: Undecided → High
tags: added: area-mos
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> - deploy env with 3 controllers and 2 ceph+compute nodes
> - revert it

Could you please explain how one could possibly "revert" a ceph cluster?

> - run 'restart ceph-all' on each node in the env

Restarting ceph services on all nodes simultaneously is not a good idea; instead, one
should restart them node by node, giving ceph enough time to establish quorum and perform
peering (and possibly recovery) before proceeding to the next node.

> - run OSTF tests

What is the time interval between restarting ceph and running the tests?
(Establishing quorum and peering/recovery takes some time, depending on the number of
nodes, the number of placement groups, etc.)
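
For reference, a few standard ceph CLI calls (run on any controller/monitor node) show whether the monitors have re-established quorum and whether peering/recovery has finished; this is only meant to illustrate what "enough time" means here:

ceph quorum_status   # JSON output; "quorum_names" lists the monitors currently in quorum
ceph -s              # overall cluster status, including monmap and pgmap summaries
ceph pg stat         # e.g. "512 pgs: 512 active+clean; ..."; peering/recovery is
                     # finished once all placement groups are active+clean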

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> restarting ceph services on all nodes simultaneously is not a good idea

And the test does exactly that:

ceph-mon.node-1.log (@node-1):

2016-08-19 09:47:20.571625 7f57e392d700 -1 mon.node-1@1(peon) e3 *** Got Signal Terminated ***

ceph-mon.node-4.log (@node-4):

2016-08-19 09:47:08.387261 7f6c36390700 -1 mon.node-4@0(leader) e3 *** Got Signal Terminated ***

ceph-mon.node-5.log (@node-5):

2016-08-19 09:47:12.558032 7f68292a0700 -1 mon.node-5@2(peon) e3 *** Got Signal Terminated ***

ceph-osd.3.log (@node-2):

2016-08-19 09:47:09.011077 7f44e91e5700 -1 osd.3 51 *** Got signal Terminated ***

ceph-osd.0.log (@node-2):

2016-08-19 09:47:09.010648 7f348e780700 -1 osd.0 52 *** Got signal Terminated ***

ceph-osd.1.log (@node-3):

2016-08-19 09:47:13.110102 7f068618b700 -1 osd.1 54 *** Got signal Terminated ***

ceph-osd.2.log (@node-3):

2016-08-19 09:47:13.110372 7fdcbe48d700 -1 osd.2 54 *** Got signal Terminated ***

If one wants the (ceph) cluster to be always available, restarting ceph services should be properly
orchestrated:

1) restart the monitors one by one, giving each monitor enough time to join the quorum
2) restart the OSDs one by one, giving each OSD enough time to perform peering (and possibly recovery).
   It's also wise to set the noout, noscrub and nodeep-scrub OSD flags before restarting the OSDs (and clear
   them after all OSDs have been restarted) to prevent unnecessary data migration between OSDs.

Otherwise the cluster will be unavailable for some time (eventually it will recover). A sketch of such a restart sequence is given below.
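
The sketch below shows one possible way to script such an orchestrated restart. It assumes the Ubuntu upstart jobs used on MOS nodes ('restart ceph-mon id=<host>', 'restart ceph-osd id=<N>') and passwordless root SSH from the node running it; the node names, the OSD discovery method and all timeouts are illustrative:

#!/bin/bash
# Illustrative rolling restart; node names, the OSD discovery method and all
# timeouts are examples only.
MON_NODES="node-1 node-4 node-5"
OSD_NODES="node-2 node-3"

wait_for() {  # run the given command repeatedly until it succeeds or 300 s pass
    local deadline=$(( $(date +%s) + 300 ))
    until "$@"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 5
    done
}

# a monitor is listed after the word "quorum" in 'ceph mon stat' once it has rejoined
mon_in_quorum() { ceph mon stat | grep -Eq "quorum.*$1"; }
# peering/recovery is finished when no PG is peering, degraded, stale, etc.
pgs_settled()   { ! ceph pg stat | grep -Eq 'peering|degraded|recover|stale|down'; }

# prevent unnecessary data migration and scrubbing while the OSDs are bounced
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub

# 1) restart the monitors one by one, waiting for each to rejoin the quorum
for mon in $MON_NODES; do
    ssh "$mon" restart ceph-mon id="$mon"
    sleep 10                       # let the monmap notice the restart
    wait_for mon_in_quorum "$mon"
done

# 2) restart the OSDs one by one, waiting for peering (and recovery) to settle
for node in $OSD_NODES; do
    for osd in $(ssh "$node" ls /var/lib/ceph/osd | sed 's/^ceph-//'); do
        ssh "$node" restart ceph-osd id="$osd"
        sleep 10                   # let the OSD report in and start peering
        wait_for pgs_settled
    done
done

# clear the flags once all OSDs have been restarted
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub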


tags: added: blocker-for-qa
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

The test restarts the whole cluster at once, hence there is a time interval during which
the cluster can't serve clients' requests. This is not a bug. The test should be rewritten so that it either

- tolerates a temporarily unavailable cluster
- restarts the ceph daemons one by one, giving each instance (monitor, OSD) enough time to join the cluster

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

The Ceph cluster operates as designed; the ceph team is not responsible for a broken test in fuel-qa (or whatever it is).

Changed in mos:
assignee: MOS Ceph (mos-ceph) → nobody
Changed in mos:
assignee: nobody → Fuel QA Team (fuel-qa)
status: Confirmed → New
tags: added: area-qa
removed: area-mos
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Why do you think the issue is in fuel-qa? Unfortunately, from the description I couldn't tell whether the env was deployed with fuel-qa.

Changed in mos:
status: New → Incomplete
assignee: Fuel QA Team (fuel-qa) → Yury Tregubov (ytregubov)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Let's try to apply the same workaround here:
https://review.gerrithub.io/#/c/290013/

Changed in mos:
status: Incomplete → Confirmed
Revision history for this message
Alexander Nagovitsyn (gluk12189) wrote :
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Why did you think that issue in fuel-qa?

https://bugs.launchpad.net/mos/+bug/1614914/comments/4 (I've copied that text to the bug description)

description: updated
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :
Changed in mos:
status: Confirmed → In Progress
Changed in mos:
status: In Progress → Fix Committed
Changed in mos:
status: Fix Committed → Fix Released