2016-09-07 11:29:53
Alexei Sheplyakov
Description:
The subject issue is seen on MOS 9.1 somewhere after snapshot #107.
It's not 100% reproducible, but it's quite stable: we've caught it on nearly every third CI run during the last two weeks.
The problem is that OSTF tests fail after running 'restart ceph-all' on all controller and ceph nodes, in the following way:
Test "Check state of haproxy backends on controllers" status is failure; Some haproxy backend has down state.. Please refer to OpenStack logs for more details.
No errors were found in the logs.
A diagnostic snapshot is attached.
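To see which backend the OSTF test complains about, the haproxy stats socket can be queried directly on a controller. This is only a sketch; the socket path is an assumption, so check the 'stats socket' line in /etc/haproxy/haproxy.cfg for the actual one:

    # Dump haproxy's CSV stats and print every proxy/server that is DOWN
    # (CSV field 1 = proxy name, field 2 = server name, field 18 = status).
    echo 'show stat' | socat stdio unix-connect:/var/run/haproxy/admin.sock \
        | awk -F, '$18 ~ /DOWN/ {print $1 "/" $2 ": " $18}'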
To reproduce the fault:
- deploy an env with 3 controllers and 2 ceph+compute nodes
- revert it
- run 'restart ceph-all' on each node in the env (see the sketch after this list)
- run OSTF tests
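A minimal sketch of the restart step, assuming the nodes run ceph under upstart (so the 'restart ceph-all' job exists) and are reachable over ssh from the Fuel master; the node names below are hypothetical, substitute the real ones from 'fuel node':

    # Restart all ceph daemons on every controller and ceph node at once,
    # which is exactly what triggers the temporary outage described below.
    for node in node-1 node-2 node-3 node-4 node-5; do
        ssh "$node" restart ceph-all
    done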
The root cause is that fuel-qa restarts the whole ceph cluster at once
and launches OSTF tests immediately after restarting the cluster.
However, ceph is NOT designed to withstand a *whole cluster* outage,
so there is a time interval during which the (ceph) cluster cannot serve clients' requests. fuel-qa should either:
- tolerate a temporarily unavailable cluster, e.g. wait for the cluster to report HEALTH_OK before running OSTF (see the sketch after this list), or
- restart ceph daemons one by one, giving each instance (monitor, OSD) enough time to rejoin the cluster
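A minimal sketch of the first option, to be run before launching OSTF. The 600-second timeout and 10-second poll interval are assumptions, not measured values:

    # Wait until the ceph cluster reports HEALTH_OK, or give up after ~10 minutes.
    wait_for_ceph_health() {
        local deadline=$(( $(date +%s) + 600 ))   # timeout is an assumed value
        until ceph health 2>/dev/null | grep -q HEALTH_OK; do
            if [ "$(date +%s)" -ge "$deadline" ]; then
                echo "ceph did not become healthy in time" >&2
                return 1
            fi
            sleep 10
        done
    }
    wait_for_ceph_health && echo "cluster healthy, safe to run OSTF"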