Backup & Restore HTTPS: Standby controller failed to come available after restore/unlock action in Regular system

Bug #1850714 reported by Senthil Mukundakumar
This bug affects 1 person
Affects    Status        Importance  Assigned to    Milestone
StarlingX  Fix Released  Medium      Ovidiu Poncea

Bug Description

Brief Description
-----------------
In a Regular system Backup & Restore, the active controller became active after restore (1849379 - verified).

After restore and unlock, the standby controller remains in the failed state, as shown by 'system host-list'.

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | disabled | failed |
| 3 | compute-0 | worker | locked | disabled | offline |
| 4 | compute-1 | worker | locked | disabled | offline |
+----+--------------+-------------+----------------+-------------+--------------+
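
For triage, the table above can be checked mechanically. The following is a minimal sketch, not part of any StarlingX tooling: the function names `parse_host_list` and `unhealthy_unlocked` are hypothetical, and the parser assumes the ASCII table layout shown in this report.

```python
# Sample copied from the 'system host-list' output in this report.
SAMPLE = """\
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | disabled | failed |
| 3 | compute-0 | worker | locked | disabled | offline |
| 4 | compute-1 | worker | locked | disabled | offline |
+----+--------------+-------------+----------------+-------------+--------------+
"""


def parse_host_list(text):
    """Parse the ASCII table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip '+---+' borders and shell prompt lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first '|' row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def unhealthy_unlocked(rows):
    """Return hostnames that are unlocked but not reported as available."""
    return [
        row["hostname"]
        for row in rows
        if row["administrative"] == "unlocked"
        and row["availability"] != "available"
    ]


if __name__ == "__main__":
    print(unhealthy_unlocked(parse_host_list(SAMPLE)))
```

Locked hosts are ignored on purpose: the workers have not been unlocked yet at this step, so only controller-1 is reported.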

SM on controller-1 is waiting for the configuration to complete. Meanwhile, the puppet manifest failed with these errors:

2019-10-30T16:00:12.002 Notice: 2019-10-30 16:00:12 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/Exec[Adding StarlingX helm repo: starlingx]/returns: Error: Looks like "http://127.0.0.1:8080/helm_charts/starlingx" is not a valid chart repository or cannot be reached: Get http://127.0.0.1:8080/helm_charts/starlingx/index.yaml: dial tcp 127.0.0.1:8080: connect: connection refused
2019-10-30T16:00:12.005 Error: 2019-10-30 16:00:12 +0000 helm repo add starlingx http://127.0.0.1:8080/helm_charts/starlingx returned 1 instead of one of [0]
2019-10-30T16:00:12.102 Error: 2019-10-30 16:00:12 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/Exec[Adding StarlingX helm repo: starlingx]/returns: change from notrun to 0 failed: helm repo add starlingx http://127.0.0.1:8080/helm_charts/starlingx returned 1 instead of one of [0]
2019-10-30T16:00:12.124 Notice: 2019-10-30 16:00:12 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[stx-platform]/Exec[Adding StarlingX helm repo: stx-platform]/returns: Error: Looks like "http://127.0.0.1:8080/helm_charts/stx-platform" is not a valid chart repository or cannot be reached: Get http://127.0.0.1:8080/helm_charts/stx-platform/index.yaml: dial tcp 127.0.0.1:8080: connect: connection refused
2019-10-30T16:00:12.126 Error: 2019-10-30 16:00:12 +0000 helm repo add stx-platform http://127.0.0.1:8080/helm_charts/stx-platform returned 1 instead of one of [0]
2019-10-30T16:00:12.218 Error: 2019-10-30 16:00:12 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[stx-platform]/Exec[Adding StarlingX helm repo: stx-platform]/returns: change from notrun to 0 failed: helm repo add stx-platform http://127.0.0.1:8080/helm_charts/stx-platform returned 1 instead of one of [0]
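
The log lines above repeat the same two facts: which command failed, and which endpoint refused the connection. A minimal sketch for pulling those out of a puppet log follows. The regular expressions are assumptions derived only from the lines quoted in this report, and `summarize` is a hypothetical helper name.

```python
import re

# Two lines copied verbatim from the puppet log quoted in this report.
SAMPLE = '''\
2019-10-30T16:00:12.002 Notice: 2019-10-30 16:00:12 +0000 /Stage[main]/Platform::Helm::Repositories/Platform::Helm::Repository[starlingx]/Exec[Adding StarlingX helm repo: starlingx]/returns: Error: Looks like "http://127.0.0.1:8080/helm_charts/starlingx" is not a valid chart repository or cannot be reached: Get http://127.0.0.1:8080/helm_charts/starlingx/index.yaml: dial tcp 127.0.0.1:8080: connect: connection refused
2019-10-30T16:00:12.005 Error: 2019-10-30 16:00:12 +0000 helm repo add starlingx http://127.0.0.1:8080/helm_charts/starlingx returned 1 instead of one of [0]
'''

# Command text sits after '+0000 ' or 'failed: ' and before ' returned N'.
CMD_RE = re.compile(
    r"(?:failed: |\+0000 )(?P<cmd>[^/\s].*?) returned (?P<rc>\d+) instead of one of"
)
# Connection failures name the endpoint as 'dial tcp HOST:PORT: connect: ...'.
REFUSED_RE = re.compile(r"dial tcp (?P<endpoint>\S+): connect: connection refused")


def summarize(log_text):
    """Return (failed commands with exit codes, refused endpoints), de-duplicated."""
    commands = []
    endpoints = []
    for line in log_text.splitlines():
        if " Error: " in line:
            match = CMD_RE.search(line)
            if match:
                entry = (match["cmd"], int(match["rc"]))
                if entry not in commands:
                    commands.append(entry)
        match = REFUSED_RE.search(line)
        if match and match["endpoint"] not in endpoints:
            endpoints.append(match["endpoint"])
    return commands, endpoints


if __name__ == "__main__":
    failed, refused = summarize(SAMPLE)
    print(failed)
    print(refused)
```

On the sample, this reports the failing `helm repo add` command with exit code 1 and the refused endpoint 127.0.0.1:8080, the local helm chart repository that puppet could not reach.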

Severity
--------
Critical: Unable to restore standby controller in Regular system

Steps to Reproduce
------------------
1. Bring up the Regular system
2. Back up the system using ansible locally
3. Re-install the controller with the same load
4. Restore the active controller
5. Unlock active controller
6. Boot standby controller via PXE
7. Unlock standby controller

Expected Behavior
------------------
The standby controller should be successfully restored and become available

Actual Behavior
----------------
Standby controller failed to become available after unlock

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Regular System

Branch/Pull Time/Commit
-----------------------
 BUILD_ID="2019-10-27_20-00-00"

Test Activity
-------------
Feature Testing

description: updated
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
summary: - Backup & Restore: Standby controller failed to come abailable after
+ Backup & Restore: Standby controller failed to come available after
restore/unlock action in Regular system
Yang Liu (yliu12)
summary: - Backup & Restore: Standby controller failed to come available after
- restore/unlock action in Regular system
+ Backup & Restore HTTPS: Standby controller failed to come available
+ after restore/unlock action in Regular system
tags: added: stx.retestneeded
Ghada Khalil (gkhalil)
tags: added: stx.update
Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium priority - bug related to B&R which is an stx.3.0 feature deliverable

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.3.0
Ovidiu Poncea (ovidiuponcea) wrote :

Although I have an idea of what's happening, I need to reproduce it exactly. For this I need:
- The configuration of this lab - the commands or configuration files used to set it up initially
- A collect from before the restore and one at the moment of the failure

At a minimum, I need to know which lab was used and have a link to the folder with its configuration files. Otherwise it's hard to determine the state of the setup at the time of the error and to bring another setup into that state.

Thanks

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/693246

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/693246
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=2f7062ffa9055c1ca4885699c0ac262a50f46fea
Submitter: Zuul
Branch: master

commit 2f7062ffa9055c1ca4885699c0ac262a50f46fea
Author: Ovidiu Poncea <email address hidden>
Date: Wed Nov 6 21:57:24 2019 +0200

    B&R: Fix ssl server-cert for standby controller

    Copy server-cert.pem from backup archive to shared filesystem
    so that mate controller can find it and allow unlock to
    proceed.

    Change-Id: I96c7dd11797fcd3a463db1c6a266c2860c35c5ab
    Closes-Bug: 1850714
    Signed-off-by: Ovidiu Poncea <email address hidden>
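
The idea described in the commit message, copying server-cert.pem out of the backup archive into a location the mate controller can read, can be sketched as follows. This is an illustration only and not the merged ansible-playbooks change: the helper name `copy_cert_from_backup`, the archive member path, and the destination directory are all hypothetical.

```python
import io
import os
import shutil
import tarfile
import tempfile


def copy_cert_from_backup(archive_path, dest_dir, cert_name="server-cert.pem"):
    """Copy the first regular file named cert_name from a tar archive into dest_dir.

    Returns the destination path, or None if the archive has no such file.
    """
    with tarfile.open(archive_path) as tar:
        for member in tar:
            if member.isfile() and os.path.basename(member.name) == cert_name:
                os.makedirs(dest_dir, exist_ok=True)
                dest = os.path.join(dest_dir, cert_name)
                # Stream the member out rather than calling extract(), so the
                # archive's internal directory layout is not recreated.
                with tar.extractfile(member) as src, open(dest, "wb") as out:
                    shutil.copyfileobj(src, out)
                return dest
    return None


def demo_roundtrip():
    """Build a throwaway archive with a dummy certificate and copy it out."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "backup.tgz")
        payload = b"dummy certificate\n"
        # Hypothetical member path; the real backup layout may differ.
        info = tarfile.TarInfo("etc/ssl/private/server-cert.pem")
        info.size = len(payload)
        with tarfile.open(archive, "w:gz") as tar:
            tar.addfile(info, io.BytesIO(payload))
        # Hypothetical stand-in for the shared filesystem directory.
        dest = copy_cert_from_backup(archive, os.path.join(tmp, "shared"))
        with open(dest, "rb") as handle:
            return handle.read()


if __name__ == "__main__":
    print(demo_roundtrip())
```

The demo uses a temporary directory in place of the shared filesystem, so it can be run on any machine without touching a real backup.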

Changed in starlingx:
status: In Progress → Fix Released
Senthil Mukundakumar (smukunda) wrote :

Verified in wcp_71_75 using 2020-02-21_04-10-00

tags: removed: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Ovidiu/Frank, This LP is marked as gating for stx.3.0. Please cherry-pick the code changes to the stx.3.0 branch if applicable or add a note explaining why it shouldn't be cherry-picked.

tags: added: stx.4.0