[devops] Document and train US team for system tests run recovery

Bug #1322114 reported by Mike Scherbakov
Affects: Fuel for OpenStack
Status: Invalid
Importance: Medium
Assigned to: Fuel DevOps

Bug Description

We need detailed step-by-step instructions (if that is not possible now, then let's create separate tickets to address the technical side of this question) on what a US engineer should do in order to recover a system tests run.

As we know, a system tests run can take up to 7-8 hours. So if any CI element that prepares the run on multiple servers fails for some reason, no tests will run. We need a recovery plan for such situations. At the moment, system tests depend on at least the following:
1) ISO build
2) smoke for ISO
3) CI job which fetches ISO on servers

If any of these 3 fails, currently no tests will run. We could create something like:
a) if any of the 3 things above fails, fuel-core-team receives an email alert
b) a US engineer follows the link sent in the email alert to instructions on what to do
c) the US engineer follows simple step-by-step instructions on how to restart the process, possibly using a backup server or similar.
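
The alerting idea in steps a) to c) could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing CI job: the stage keys, the runbook URLs (placeholders on example.invalid), and the function names are all hypothetical. Only the three stages and the fuel-core-team recipient come from the description above.

```python
from typing import Dict, Optional

# The three prerequisites named in the description, in the order they run.
# The key names are hypothetical labels chosen for this sketch.
PIPELINE_STAGES = ["iso_build", "iso_smoke", "iso_fetch"]

# Hypothetical runbook locations; placeholders, not real URLs.
RUNBOOKS = {
    "iso_build": "https://example.invalid/runbooks/iso-build",
    "iso_smoke": "https://example.invalid/runbooks/iso-smoke",
    "iso_fetch": "https://example.invalid/runbooks/iso-fetch",
}

# Recipient named in the description; the real address is not given there.
ALERT_RECIPIENT = "fuel-core-team"


def first_failed_stage(results: Dict[str, bool]) -> Optional[str]:
    """Return the first stage that did not succeed, or None if all passed.

    A stage missing from `results` is treated as failed, since it could
    not run once an earlier stage broke.
    """
    for stage in PIPELINE_STAGES:
        if not results.get(stage, False):
            return stage
    return None


def build_alert(results: Dict[str, bool]) -> Optional[Dict[str, str]]:
    """Build the alert email fields, or return None if nothing failed."""
    stage = first_failed_stage(results)
    if stage is None:
        return None
    return {
        "to": ALERT_RECIPIENT,
        "subject": f"System tests blocked: {stage} failed",
        "body": (
            f"Stage '{stage}' failed, so system tests will not run.\n"
            f"Recovery instructions: {RUNBOOKS[stage]}"
        ),
    }
```

Keeping one runbook link per stage means the engineer who receives the email lands directly on the instructions for the stage that actually failed.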

We need to have the most frequent failure scenarios documented, and the action plan should differ depending on the scenario. Possible failures:
a) ISO build fails because of a package failure. Roman Vyalov already has a CI job which can restore the packages mirror to some point in the past; it needs to be completed and documented. In this case we should redirect to the instructions which Roman will provide.
b) something else fails because a devops or build script was updated but has a bug. For cases like this, the ideal variant to me is to have replicated backup builds on other hardware nodes, which do the same work but are updated with a 1-2 day delay, while the main CI jobs stay on master (I'm talking about build scripts, seed client, etc.). So we could simply guide the engineer to try the backup build, which would still build the same ISO, but using older CI instructions.
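
The two scenarios above could map to recovery actions along these lines. This is a sketch under assumptions: the enum, the function names, and the way revisions are recorded are all hypothetical, and the default delay of 2 days is just one value from the 1-2 day range mentioned in b).

```python
from datetime import date, timedelta
from enum import Enum
from typing import List, Optional, Tuple


class FailureCause(Enum):
    """Hypothetical labels for the two failure scenarios in the description."""
    PACKAGE_FAILURE = "package_failure"            # scenario a)
    CI_SCRIPT_REGRESSION = "ci_script_regression"  # scenario b)


def recovery_action(cause: FailureCause) -> str:
    """Return the documented action for a given failure cause."""
    if cause is FailureCause.PACKAGE_FAILURE:
        return "restore the packages mirror to an earlier point (mirror-restore instructions)"
    if cause is FailureCause.CI_SCRIPT_REGRESSION:
        return "rerun the ISO build on the delayed backup CI node"
    raise ValueError(f"no recovery action documented for {cause!r}")


def backup_revision(
    history: List[Tuple[date, str]],
    today: date,
    delay_days: int = 2,
) -> Optional[str]:
    """Pick the CI-script revision the backup node should run.

    `history` holds (commit date, revision id) pairs. The backup node
    deliberately lags master, so only revisions at least `delay_days`
    old are eligible; the newest eligible one is returned.
    """
    cutoff = today - timedelta(days=delay_days)
    eligible = [(day, rev) for day, rev in history if day <= cutoff]
    if not eligible:
        return None
    return max(eligible, key=lambda item: item[0])[1]
```

The delay is the point of the design: a bug introduced into the build scripts today cannot reach the backup node until it has had `delay_days` to be noticed on the main CI jobs.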

I'll leave it to the DevOps team to think further and come up with its own approach, but I think I've described the main idea in enough detail.

Igor Shishkin (teran)
tags: added: docs techtalk
Changed in fuel:
status: New → Confirmed
Igor Shishkin (teran)
Changed in fuel:
importance: High → Medium
Igor Shishkin (teran)
Changed in fuel:
milestone: 5.1 → 6.0
Igor Shishkin (teran) wrote :

I think we already have an approach for this that we use with some other teams: we hold regular meetings and keep information clear between teams.
So I think we should find someone who is ready to contact us, let's say once a week, and who has the same access rights as ours on all infrastructure servers.

Dima?

Changed in fuel:
milestone: 6.0 → 6.1
tags: added: non-release
Changed in fuel:
milestone: 6.1 → 7.0
Igor Shishkin (teran)
Changed in fuel:
milestone: 7.0 → 8.0
Dmitry Pyzhov (dpyzhov)
tags: added: area-devops
Igor Shishkin (teran) wrote :

Mike, since our structure has changed somewhat, I don't think this is still relevant.
Please feel free to share your thoughts. Marking as Incomplete for now.

Changed in fuel:
status: Confirmed → Incomplete
Igor Shishkin (teran) wrote :

Marking as Invalid according to our policy.
Please set it back to New if it's still relevant.

Igor Shishkin (teran)
Changed in fuel:
status: Incomplete → Invalid