Activity log for bug #1645313

Date Who What changed Old value New value Message
2016-11-28 12:58:39 Denis Klepikov bug added bug
2016-11-28 12:59:44 Denis Klepikov tags area-qa area-qa support
2016-11-28 13:10:20 Denis Klepikov description
Old value: To increase cluster stability, please create a test that kills or stops cluster services at random. Services to stop/kill: all OpenStack services, one service at a time. What the test should verify: a stopped/killed service comes back online if an automatic recovery procedure exists; the cloud monitoring system successfully reports the problem with a detailed explanation of which service failed on which cluster nodes; if an automatic recovery procedure exists, the cloud monitoring system also reports the recovery. The test log should contain timestamps, each with the service and node, for when the stop/kill command was sent, when the service stopped, when the cloud monitoring system reported the problem, and when the service recovered (if an automatic recovery procedure exists). The time difference in seconds between points p1-p2, p2-p3, p3-p4 and p1-p3 should also be logged, where: point 1 - the service stopped; point 2 - the cloud monitoring system reported the problem; point 3 - the service recovered, either automatically (if an automatic recovery procedure exists) or manually (only for services without one); point 4 - the cloud monitoring system reported the recovery. For services with an automatic recovery procedure, the time differences should satisfy p1-p2 < p1-p3. If some services have no automatic recovery procedure, the test should start such a service again only after the cloud monitoring system has reported a problem with it. What is the benefit? This test will help us check: Do all services recover as expected? Is each service's recovery time as expected? What is the time shift of automatic recovery for each service? Does the cloud monitoring system report issues in the cloud (which service on which node)? What is the time shift between the real problem and the report (which service on which node)? What is the time shift between service recovery and the report (which service on which node)?
New value: To increase cluster stability, please create a test that kills or stops cluster services at random under cluster load (https://blueprints.launchpad.net/fuel/+spec/create-load-before-tests). Services to stop/kill: all OpenStack services, one service at a time. What the test should verify: a stopped/killed service comes back online if an automatic recovery procedure exists; the cloud monitoring system successfully reports the problem with a detailed explanation of which service failed on which cluster nodes; if an automatic recovery procedure exists, the cloud monitoring system also reports the recovery. The test log should contain timestamps, each with the service and node, for when the stop/kill command was sent, when the service stopped, when the cloud monitoring system reported the problem, and when the service recovered (if an automatic recovery procedure exists). The time difference in seconds between points p1-p2, p2-p3, p3-p4 and p1-p3 should also be logged, where: point 1 - the service stopped; point 2 - the cloud monitoring system reported the problem; point 3 - the service recovered, either automatically (if an automatic recovery procedure exists) or manually (only for services without one); point 4 - the cloud monitoring system reported the recovery. For services with an automatic recovery procedure, the time differences should satisfy p1-p2 < p1-p3. If some services have no automatic recovery procedure, the test should start such a service again only after the cloud monitoring system has reported a problem with it. What is the benefit? This test will help us check: Do all services recover as expected? Is each service's recovery time as expected? What is the time shift of automatic recovery for each service? Does the cloud monitoring system report issues in the cloud (which service on which node)? What is the time shift between the real problem and the report (which service on which node)? What is the time shift between service recovery and the report (which service on which node)?
2017-11-20 17:24:18 Oleksiy Molchanov fuel: milestone 9.x-updates
2017-11-20 17:24:21 Oleksiy Molchanov fuel: importance Undecided Medium
2017-11-20 17:24:26 Oleksiy Molchanov fuel: status New Confirmed
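The timing requirements in the bug description can be sketched in Python. This is a minimal sketch under stated assumptions and is not part of the bug report or of Fuel. The names FailureRecord, check() and pick_target() are hypothetical. Timestamps are assumed to be seconds since the epoch and are collected elsewhere. The condition "p1-p2 < p1-p3" is read here as "the interval from point 1 to point 2 is shorter than the interval from point 1 to point 3", meaning monitoring reports the failure before the service recovers.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureRecord:
    """Timestamps (seconds since epoch) for one stop/kill event on one node.

    Hypothetical structure; p1..p4 follow the points defined in the bug.
    """
    service: str
    node: str
    command_sent: float           # stop/kill command was sent
    p1_stopped: float             # point 1: service stopped
    p2_reported: float            # point 2: monitoring reported the problem
    p3_recovered: float           # point 3: service recovered (auto or manual)
    p4_recovery_reported: Optional[float] = None  # point 4: recovery reported
    auto_recovery: bool = True    # does an automatic recovery procedure exist?

    def intervals(self) -> dict:
        """Time differences in seconds between the pairs of points to be logged."""
        result = {
            "p1-p2": self.p2_reported - self.p1_stopped,
            "p2-p3": self.p3_recovered - self.p2_reported,
            "p1-p3": self.p3_recovered - self.p1_stopped,
        }
        if self.p4_recovery_reported is not None:
            result["p3-p4"] = self.p4_recovery_reported - self.p3_recovered
        return result


def check(record: FailureRecord) -> list:
    """Return a list of violated expectations (empty if all hold)."""
    problems = []
    iv = record.intervals()
    if record.auto_recovery and not iv["p1-p2"] < iv["p1-p3"]:
        problems.append("recovered before monitoring reported the failure")
    if not record.auto_recovery and record.p3_recovered < record.p2_reported:
        problems.append("manual restart happened before monitoring reported the failure")
    if record.auto_recovery and record.p4_recovery_reported is None:
        problems.append("monitoring did not report the recovery")
    return problems


def pick_target(rng: random.Random, deployments: list) -> tuple:
    """Pick one (service, node) pair at random, so only one service is hit at a time."""
    return rng.choice(deployments)
```

For example, `FailureRecord("nova-api", "node-1", 100.0, 101.5, 130.0, 145.0, 160.0).intervals()` returns `{"p1-p2": 28.5, "p2-p3": 15.0, "p1-p3": 43.5, "p3-p4": 15.0}`, and `check()` on that record returns an empty list. Drawing from a list of (service, node) pairs rather than choosing a service and a node independently is one possible design. It avoids selecting a node where the chosen service is not deployed.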