haproxy check fails after 'restart ceph-all'
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mirantis OpenStack | Fix Released | High | Yury Tregubov |
Bug Description
The subject issue is seen on MOS 9.1 somewhere after snapshot #107.
It's not 100% reproducible, but it is quite stable: we've caught it on roughly every third CI run during the last two weeks.
The problem itself is that OSTF tests fail after 'restart ceph-all' is executed on all controllers and ceph nodes, in the following way:
Test "Check state of haproxy backends on controllers" status is failure; Some haproxy backend has down state.. Please refer to OpenStack logs for more details.
No errors were found in logs.
Diagnostic snapshot is attached.
To reproduce the fault:
- deploy env with 3 controllers and 2 ceph+compute nodes.
- revert it
- run 'restart ceph-all' on each node in the env
- run OSTF tests
The root cause is that fuel-qa restarts the whole ceph cluster at once
and launches OSTF tests immediately after restarting the cluster.
However, ceph is NOT designed to withstand a *whole cluster* outage,
so there is a time interval during which the (ceph) cluster cannot serve clients' requests. fuel-qa should either:
- tolerate a temporarily unavailable cluster, or
- restart the ceph daemons one by one, giving each instance (monitor, OSD) enough time to rejoin the cluster.
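The second option could look roughly like the sketch below: restart one node, then poll the cluster health before touching the next node. This is only an illustration, not the fuel-qa fix — the node names, the ssh-based restart, and the 300-second timeout are assumptions; only `ceph health` and its `HEALTH_OK` output are real ceph CLI behaviour.

```shell
#!/bin/sh
# Poll a health command until its output contains HEALTH_OK, or time out.
# $1 = command to run (e.g. 'ceph health'), $2 = timeout in seconds,
# $3 = poll interval in seconds.
wait_for_health() {
    health_cmd=$1
    timeout=${2:-300}
    interval=${3:-5}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if $health_cmd | grep -q 'HEALTH_OK'; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Rolling restart (hypothetical node list; assumes passwordless ssh):
# for node in node-1 node-2 node-3 node-4 node-5; do
#     ssh "$node" restart ceph-all
#     wait_for_health 'ceph health' 300 || { echo "no recovery after $node"; exit 1; }
# done
```

With this structure the monitors re-establish quorum and the OSDs finish peering on each node before the next node loses its daemons, so the cluster stays serviceable throughout.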
Changed in mos:
assignee: nobody → MOS Ceph (mos-ceph)
status: New → Confirmed
milestone: none → 9.1
importance: Undecided → High
tags: added: area-mos
tags: added: blocker-for-qa

Changed in mos:
assignee: nobody → Fuel QA Team (fuel-qa)
status: Confirmed → New
tags: added: area-qa removed: area-mos

Changed in mos:
status: In Progress → Fix Committed

Changed in mos:
status: Fix Committed → Fix Released
> - deploy env with 3 controllers and 2 ceph+compute nodes
> - revert it
Could you please explain how one could possibly "revert" a ceph cluster?
> - run 'restart ceph-all' on each node in the env
Restarting ceph services on all nodes simultaneously is not a good idea; instead, one
should restart them node by node, giving ceph enough time to establish quorum and perform
peering (and possibly recovery) before proceeding to the next node.
> - run OSTF tests
What is the time interval between restarting ceph and running the tests?
(Establishing quorum and peering/recovery take some time, depending on the number of
nodes, the number of placement groups, etc.)
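Rather than hard-coding a sleep between the restart and the OSTF run, the test could poll the cluster status until every placement group is active+clean. A minimal sketch — the grep patterns assume the plain-text `ceph -s` output format, and the timeout values are arbitrary:

```shell
#!/bin/sh
# Wait until the status command shows PGs in active+clean and none still
# peering, degraded, or recovering; only then is it safe to start OSTF.
# $1 = status command (defaults to 'ceph -s'), $2 = timeout, $3 = interval.
wait_for_clean_pgs() {
    status_cmd="${1:-ceph -s}"
    timeout=${2:-600}
    interval=${3:-10}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        out=$($status_cmd)
        # Done when active+clean PGs are reported and no transient PG
        # states remain in the output.
        if echo "$out" | grep -q 'active+clean' && \
           ! echo "$out" | grep -Eq 'peering|degraded|recovering'; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Usage sketch:
# wait_for_clean_pgs 'ceph -s' 600 10 && run_ostf_tests
```

This removes the guesswork about "how long is long enough" — the wait scales automatically with the number of nodes and placement groups.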