B&R: On restore, inventory update fails when controller-1 host is online results in unlock failure when wipe_ceph_osds=true

Bug #1852065 reported by Ovidiu Poncea
This bug affects 1 person
Affects    Status        Importance  Assigned to    Milestone
StarlingX  Fix Released  Medium      Ovidiu Poncea

Bug Description

On duplex deployments, after controller-1 is reinstalled, its sysinv-agent neither reports inventory nor connects to the RabbitMQ instance on controller-0.

When wipe_ceph_osds=true, the partitions on the OSD drives are wiped, and this wipe is reported to sysinv-conductor through sysinv-agent. Without this report, the OSD partitions remain in the database and, on unlock, the puppet manifests try to create them and fail.

The problem is caused by https://review.opendev.org/691713, merged on 04.11.2019: on restore, it no longer recreates /opt/platform/sysinv/.../sysinv.conf.default. Without this file, nodes other than controller-0 cannot start their sysinv-agents correctly (the service starts but does nothing).
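A quick way to confirm this condition on a restored controller-0 is to look for the file under the sysinv platform directory. The sketch below is illustrative only: the helper name is hypothetical, and it searches recursively instead of assuming the release subdirectory that is elided in the path above.

```python
from pathlib import Path


def find_sysinv_conf_default(root="/opt/platform/sysinv"):
    """Return every sysinv.conf.default found below root (hypothetical helper)."""
    root_path = Path(root)
    if not root_path.is_dir():
        # The whole directory is missing, so the file cannot be there either.
        return []
    return sorted(root_path.rglob("sysinv.conf.default"))


if __name__ == "__main__":
    matches = find_sysinv_conf_default()
    if matches:
        print("found:", ", ".join(str(p) for p in matches))
    else:
        print("sysinv.conf.default is missing")
```

An empty result on a restored system would be consistent with the failure described here.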

Two solutions:
A. Copy the file from backup on restore (this file is backed up)
B. Fix code in https://review.opendev.org/691713 so that it recreates this file on restore, same as before the commit
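
Solution A could be sketched roughly as follows. This is not the change that was eventually merged in ansible-playbooks; it assumes the backup is a tar archive and that the file can be identified by its base name. The function name, the archive layout, and the destination argument are all hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def restore_conf_default(backup_tar, dest_dir, name="sysinv.conf.default"):
    """Copy `name` out of a backup tarball into dest_dir (illustrative only)."""
    dest = Path(dest_dir)
    # Create the target directory in advance; it may not exist after a restore.
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(backup_tar) as tar:
        for member in tar.getmembers():
            # Match on the base name only, since the archive layout is assumed.
            if member.isfile() and Path(member.name).name == name:
                src = tar.extractfile(member)
                with src, open(dest / name, "wb") as out:
                    shutil.copyfileobj(src, out)
                return dest / name
    raise FileNotFoundError(f"{name} not found in {backup_tar}")
```

Creating the destination directory first matters because, after a restore, the directory itself may be absent and not just the file.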

Severity
--------
Major - B&R no longer works with wipe_ceph_osds=true on DX (tested). Also, on standard, installing new hosts will be blocked: if sysinv.conf.default is not present, newly installed nodes will not report their inventory, so we won't be able to install new nodes at all on restored setups (supposition).

Steps to Reproduce
------------------
1. Install an AIO-DX deployment, do a backup
2. Reinstall controller-0
3. Run ansible restore with wipe_ceph_osds=true
4. Unlock controller-0 & wait for it to be available
5. Re-install controller-1
6. Unlock controller-1 => it fails to apply manifests because it tries to create the ceph OSD partitions, which are no longer present

Expected Behavior
------------------
When wipe_ceph_osds is set to true we should see that the partitions for the OSD nodes are removed from the database.
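
The expected outcome can be stated as a simple set comparison. The sketch below is illustrative and does not use the sysinv API; the record layout (dicts with a `device_path` key), the sample data, and the function name are assumptions.

```python
def stale_partitions(db_partitions, on_disk_paths):
    """Return DB partition records whose device path no longer exists on disk."""
    present = set(on_disk_paths)
    return [p for p in db_partitions if p["device_path"] not in present]


# Hypothetical data: the OSD partition was wiped, but the database still lists it.
db = [
    {"device_path": "/dev/sda1", "purpose": "platform"},
    {"device_path": "/dev/sdb1", "purpose": "ceph-osd"},
]
disk = ["/dev/sda1"]
print(stale_partitions(db, disk))
```

An empty result after the restore would correspond to the expected behavior; a non-empty one matches the failure seen on unlock.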

Actual Behavior
----------------
As per description

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
AIO-DX

Branch/Pull Time/Commit
-----------------------
StarlingX_Upstream_build release branch build as of 2018-11-04

Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
description: updated
Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium - issue introduced by recent code changes and affects stx.3.0 B&R feature functionality

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Ghada Khalil (gkhalil)
tags: added: stx.3.0 stx.update
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/694774

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/694774
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=dcdeea0254149233f1c0e5a6536a561cf7453cec
Submitter: Zuul
Branch: master

commit dcdeea0254149233f1c0e5a6536a561cf7453cec
Author: Ovidiu Poncea <email address hidden>
Date: Mon Nov 18 15:48:23 2019 +0200

    Fix missing content in /opt/platform/sysinv/19.09

    Due to changes in https://review.opendev.org/#/c/692439 and
    https://review.opendev.org/#/c/691714, sysinv/19.09 subfolder
    of /opt/platform is no longer created, nor its content.
    This breaks the initial assumptions that:
    1. This folder exists => is ok to just create files there
    2. The content of this folder is recreated on each unlock

    To return #1 assumption we now create the folder in advance
    and for #2 we back-up and restore its content.

    Change-Id: I8dd686a66fcc62bbb05b72fda56e86c353d25fee
    Closes-Bug: 1851424
    Closes-Bug: 1852065
    Closes-Bug: 1852127
    Signed-off-by: Ovidiu Poncea <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released