Backup & Restore: AIO-DX+worker Controller failed to become active after restore

Bug #1854169 reported by Senthil Mukundakumar
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  High        Stefan Dinescu

Bug Description

Brief Description
-----------------

In AIO-DX+worker configuration, the active controller failed to become active after restore and unlock.

/var/log/puppet/latest/puppet.log shows an error while running `drbdadm create-md drbd-pgsql -W--peer-max-bio-size=128k`

Severity
--------
Critical: Unable to restore active controller

Steps to Reproduce
------------------
1. Bring up the AIO-DX+worker system
2. Backup the system using ansible locally
3. Re-install the controller with the same load
4. Restore the active controller
5. Unlock active controller

Expected Behavior
------------------
The active controller should be successfully restored and become active

Actual Behavior
----------------
Active controller failed to become active after unlock

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-DX+worker

Branch/Pull Time/Commit
-----------------------
 BUILD_ID="2019-11-25_20-00-00"

Test Activity
-------------
Feature Testing

Ghada Khalil (gkhalil)
tags: added: stx.update
Ovidiu Poncea (ovidiuponcea) wrote :

Hi Senthil,
We need more details:
1. The lab on which the issue occurred (thanks for the email, it is wp_8_12 and it had IPv6 support)
2. The collect after backup and before the reinstall of controller-0
3. The backup archive

We will do a test on an AIO-DX IPv6 to see if it works, and will need a DX+Worker lab in IPv6 mode to retest.

Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil) wrote :

stx.3.0 / high priority - issue w/ B&R feature which is an stx.3.0 deliverable

Changed in starlingx:
assignee: nobody → Senthil Mukundakumar (smukunda)
importance: Undecided → High
tags: added: stx.3.0
Senthil Mukundakumar (smukunda) wrote :

1. On wp_8_12 it is reproducible in an IPv6 configuration
2. Both the backup and the collect file were copied to /folk/cgts_logs/logs/LP_1854169

Changed in starlingx:
status: Incomplete → New
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
assignee: Senthil Mukundakumar (smukunda) → Ovidiu Poncea (ovidiu.poncea)
Ovidiu Poncea (ovidiuponcea) wrote :

I can't find any issue with it in the logs. It should not be related to IPv6, since "drbdadm create-md drbd-pgsql -W--peer-max-bio-size=128k" is a drive initialization command, nor to the fact that we have workers added to a DX. The command also runs in the middle of a script which seems to have had all its prerequisites met.

We need to test this on a HW deployment. I have a DX IPv6 lab reserved for next week. If it doesn't reproduce we'll have to go on the same lab that had the issue.

Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Stefan Dinescu (stefandinescu)
Stefan Dinescu (stefandinescu) wrote :

This only happens on the restore of a system that had its drbd-synced partitions resized from the default values before the backup was created.

This can only be seen on a restore, because on a normal install you cannot change the partition sizes from the defaults before unlocking controller-0, but on a restore the changed value was already in the database, thus causing the issue.

The fix will be to manually resize the partition from the restore ansible playbook to the correct values found in the sysinv database.
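
The condition described above can be sketched as a comparison between the default sizes and the sizes recorded in the sysinv database. This is only an illustration: the filesystem names and sizes below are hypothetical placeholders, not the real StarlingX defaults, and the function is not part of the restore playbook.

```python
# Hypothetical default sizes in GiB; placeholders, not the real StarlingX defaults.
DEFAULT_SIZES_GIB = {"database": 10, "platform": 10, "extension": 1}


def find_resized_filesystems(recorded_sizes_gib, defaults=DEFAULT_SIZES_GIB):
    """Return {name: (default, recorded)} for every filesystem whose
    recorded size differs from its default size."""
    resized = {}
    for name, recorded in recorded_sizes_gib.items():
        default = defaults.get(name)
        # Filesystems without a known default are ignored in this sketch.
        if default is not None and recorded != default:
            resized[name] = (default, recorded)
    return resized


# Sizes as they might be recorded in the database before the backup (assumed values).
recorded = {"database": 20, "platform": 10, "extension": 1}
print(find_resized_filesystems(recorded))  # {'database': (10, 20)}
```

A non-empty result corresponds to the failing case: a freshly reinstalled controller-0 has default-sized partitions while the restored database expects different sizes.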

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/699990

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/699991

Ghada Khalil (gkhalil)
tags: added: stx.4.0
Ghada Khalil (gkhalil) wrote :

Added stx.4.0 as this is likely an issue in that release as well.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/719782

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/719782
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=4fc8bdcf4a011864aabe9df561e2c9bd2165c481
Submitter: Zuul
Branch: master

commit 4fc8bdcf4a011864aabe9df561e2c9bd2165c481
Author: Stefan Dinescu <email address hidden>
Date: Tue Apr 14 09:59:54 2020 +0000

    Add B&R information comments to DRBD manifest

    This commit adds a series of comments to the DRBD manifest
    so that users making any changes to this manifest know to also
    update the list of DRBD devices in the restore playbook.

    Change-Id: Iae1d9d98391759669871b016721418922aa134ce
    Partial-bug: 1854169
    Signed-off-by: Stefan Dinescu <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/699990
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=b0e76a69277441b6becec6533214bdbbb38e6058
Submitter: Zuul
Branch: master

commit b0e76a69277441b6becec6533214bdbbb38e6058
Author: Stefan Dinescu <email address hidden>
Date: Thu Dec 19 15:47:00 2019 +0200

    Allow yaml formatting for controllerfs-list

    In order to be easily parsed by ansible, the controllerfs-list
    command should support yaml output format.

    Change-Id: Ic766980645d618d54d34bd04d82339fd2cd36562
    Depends-On: https://review.opendev.org/#/c/719782/
    Partial-bug: 1854169
    Signed-off-by: Stefan Dinescu <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/699991
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=5cdd394cb10c2c2d94174fdc32beb989290c6de9
Submitter: Zuul
Branch: master

commit 5cdd394cb10c2c2d94174fdc32beb989290c6de9
Author: Stefan Dinescu <email address hidden>
Date: Thu Dec 19 15:23:23 2019 +0200

    Resize DRBD resources when doing a restore

    In cases where we do a backup of a system that has non-default
    sizes for drbd-backed partitions, the restore fails when first
    unlocking controller-0.

    The normal resize procedure requires all controller nodes to
    be unlocked and available because the puppet manifest does
    not support resizing at unlock.

    To prevent the issue from occurring, as part of the restore
    procedure, we should resize the partitions on controller-0
    with the proper sizes found in sysinv. Controller-1 will
    automatically create the partitions with the proper sizes
    from the very start, so it will not need any resizes.

    Change-Id: Ia73452ce721514d393b486a659730d0ca7c0d7e5
    Closes-bug: 1854169
    Depends-on: https://review.opendev.org/#/c/699990
    Signed-off-by: Stefan Dinescu <email address hidden>
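
The resize step in the commit message above can be pictured as building a plan from two size tables. The sketch below is not the playbook's implementation: the names `ResizeStep` and `plan_resizes` and the sample sizes are hypothetical, and it assumes that filesystems only ever grow relative to what a fresh install creates.

```python
from typing import Dict, List, NamedTuple


class ResizeStep(NamedTuple):
    """One filesystem on controller-0 that must be grown before unlock."""
    filesystem: str
    current_gib: int
    target_gib: int


def plan_resizes(current_gib: Dict[str, int],
                 target_gib: Dict[str, int]) -> List[ResizeStep]:
    """Compare the sizes present after reinstall (current_gib) with the
    sizes recorded in sysinv (target_gib) and list what must grow."""
    steps = []
    for name in sorted(target_gib):
        current = current_gib.get(name)
        if current is None:
            continue  # not present on this host; nothing to resize
        if target_gib[name] > current:
            steps.append(ResizeStep(name, current, target_gib[name]))
    return steps


# Assumed sample values: fresh install sizes versus sizes stored in sysinv.
after_reinstall = {"database": 10, "platform": 10}
from_sysinv = {"database": 20, "platform": 10}
for step in plan_resizes(after_reinstall, from_sysinv):
    print(f"{step.filesystem}: {step.current_gib} GiB -> {step.target_gib} GiB")
```

Keeping the plan as plain data separates the decision (what differs from sysinv) from the commands that would actually perform the resize on controller-0.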

Ghada Khalil (gkhalil) wrote :

Stefan/Frank, This LP is marked as gating for stx.3.0. Please cherry-pick the code changes to the stx.3.0 branch if applicable or add a note to indicate otherwise.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729809

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729812

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729825

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/729825
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=d4617fbad74a05f2af81ee85a47565083991e6f8
Submitter: Zuul
Branch: f/centos8

commit 4134023ab84d8a635b118d5e3ff26ade3bbe535b
Author: Sharath Kumar K <email address hidden>
Date: Thu May 7 10:08:11 2020 +0200

    Tox and Zuul job for the bandit code scan in stx/stx-puppet

    Setting up the bandit tool for the scanning of HIGH severity issues
    in the python codes under Starlingx/stx-puppet folder.
    Expecting this merge will enable zuul job for CI/CD of bandit scan.

    Configuration files:
    1. tox.ini for adding bandit environment and command.
    2. test-requirements.txt for adding bandit version.
    3. .zuul.yaml file for adding bandit job and configuring under
       check job to run code scan every time before code commit.

    Test:
    Run tox -e bandit command inside the fault folder to validate the
    bandit scan and result.

    Story: 2007541
    Task: 39687
    Depends-On: https://review.opendev.org/#/c/721294/

    Change-Id: I2982268db2b5e75feeb287bc95420fedc9b0d816
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 65daac29e4635f32a57e80cd18f96fd59dc8ebe0
Author: Bin Qian <email address hidden>
Date: Tue May 12 22:39:21 2020 -0400

    DC cert manifest should only apply to controller nodes

    DC cert manifest should only apply to controller nodes on system
    controller.
    This fix is for DC with worker nodes in central cloud.

    Change-Id: I4233509a6f0afb3013c01e81dea6f655d9e15371
    Closes-Bug: 1878260
    Signed-off-by: Bin Qian <email address hidden>

commit 04a3cb8cbad9b1700286c5de67aa5d974cf54400
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 08:44:13 2020 +0000

    Changing permissions for conversion folder

    Adding writing permissions to '/opt/conversion' mountpoint
    so openstack image conversion can happen there.

    Change-Id: Id1a91db6570dcbed3b8068e79e72f5bb800f24ad
    Partial-bug: 1819688
    Signed-off-by: Elena Taivan <email address hidden>

commit 4e9153cf234e714e4bbc9a9eb3d9b55b2828145a
Author: Tao Liu <email address hidden>
Date: Mon May 4 14:30:30 2020 -0500

    Move subcloud audit to separate process

    Subcloud audit is being removed from the dcmanager-manager
    process and it is running in dcmanager-audit process.

    This update adds associated puppet config.

    Story: 2007267
    Task: 39640
    Depends-On: https://review.opendev.org/#/c/725627/

    Change-Id: Idd2e675126a01d6113597646ddd9eb4a0bc5be44
    Signed-off-by: Tao Liu <email address hidden>

commit b793518f65ae932f3974ff85b797f505b5ef1c2a
Author: Robert Church <email address hidden>
Date: Wed Apr 29 12:49:04 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    Without setting the stream_server_address in config.toml, containerd was
    binding to the OAM interface. Under most situations this resulted in
    containe...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/729812
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=539d476456277c22d0dcbc3cbbc832e623242264
Submitter: Zuul
Branch: f/centos8

commit 320cc40de8518787c2be234d7fdf88ec0a462df2
Author: Don Penney <email address hidden>
Date: Wed May 13 13:06:11 2020 -0400

    Add auto-versioning to starlingx/config packages

    This update makes use of the PKG_GITREVCOUNT variable to auto-version
    the packages in this repo.

    Change-Id: I3a2c8caeb4b4647608978b1f2ccfcf0661508803
    Depends-On: https://review.opendev.org/727837
    Story: 2006166
    Task: 39766
    Signed-off-by: Don Penney <email address hidden>

commit d9f2aea0fb228ed69eb9c9262e29041eedabc15d
Author: Sharath Kumar K <email address hidden>
Date: Wed Apr 22 16:22:22 2020 +0200

    De-branding in starlingx/config: CGCS -> StarlingX

    1. Rename CGCS to StarlingX for .spec files

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    Layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch9 changes.

    Story: 2006387
    Task: 39524

    Change-Id: Ia1fe0f2baafb78c974551100f16e6a7d99882f15
    Signed-off-by: Sharath Kumar K <email address hidden>

    De-branding in starlingx/config: CGCS -> StarlingX

    1. Rename CGCS to StarlingX for .spec file
    2. Rename TIS to StarlingX for .service files

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    Layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch10 changes.

    Story: 2006387
    Task: 36202

    Change-Id: I404ce0da2621495175ad31489e9ad6f7b0211e26
    Signed-off-by: Sharath Kumar K <email address hidden>

commit d141e954fa6bbf688929ec90d1b6604a97792c43
Author: Teresa Ho <email address hidden>
Date: Tue Mar 31 10:08:57 2020 -0400

    Sysinv extensions for FPGA support

    This update adds cli and restapi to support FPGA device
    programming.

    CLI commands:
    system device-image-apply
    system device-image-create
    system device-image-delete
    system device-image-list
    system device-image-remove
    system device-image-show
    system device-image-state-list
    system device-label-list
    system host-device-image-update
    system host-device-image-update-abort
    system host-device-label-assign
    system host-device-label-list
    system host-device-label-remove

    Story: 2006740
    Task: 39498

    Change-Id: I556c2e7a51b3931b5a66ab27b67f51e3a8aebd9f
    Signed-off-by: Teresa Ho <email address hidden>

commit 491cca42ed854d2cb3ee3646b93c56a4f45f563c
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 11:25:26 2020 +0000

    Qcow2 conversion to raw can be done using 'image-conversion' filesystem

    1. Conversion filesystem can be added before/after
       stx-openstack is applied
    2. If conversion filesystem is added after stx-openstack
       is applied, changes to stx-openstack will only take effec...

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/729809
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=73027425d4501a6b7785e91024c9e8ddbc03115d
Submitter: Zuul
Branch: f/centos8

commit 55c9afd075194f7669fa2a87e546f61034679b04
Author: Dan Voiculeasa <email address hidden>
Date: Wed May 13 14:19:52 2020 +0300

    Restore: disconnect etcd from ceph

    At the moment etcd is restored only if ceph data is kept.
    Etcd should be restored regardless if ceph data is kept or wiped.

    Story: 2006770
    Task 39751
    Change-Id: I9dfb1be0a83c3fdc5f1b29cbb974c5e0e2236ad3
    Signed-off-by: Dan Voiculeasa <email address hidden>

commit 003ddff574c74adf11cf8e4758e93ba0eed45a6a
Author: Don Penney <email address hidden>
Date: Fri May 8 11:35:58 2020 -0400

    Add playbook for updating static images

    This commit introduces a new playbook, upgrade-static-images.yml, used
    for downloading updating images and pushing to the local registry.

    Change-Id: I8884440261a5a4e27b40398e5a75c9d03b09d4ba
    Story: 2006781
    Task: 39706
    Signed-off-by: Don Penney <email address hidden>

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

    Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>

commit bd0f14a7dfb206ccaa3ce0f5e7d9034703b3403c
Author: Robert Church <email address hidden>
Date: Tue May 5 15:11:15 2020 -0400

    Provide an update strategy for Tiller deployment

    In the case of a simplex controller configuration the current patching
    strategy for the Tiller environment will fail as the tiller ports will
    be in use when the new deployment is attempted to be applied. The
    resulting tiller pod will be stuck in a Pending state.

    This will be observed if the node becomes ready after 'helm init'
    installs the initial deployment and before the deployment is patched for
    environment checks.

    The deployment strategy provided by 'helm init' is unspecified. This
    change will allow one additional pod (current + new) and one unavailable
    pod (current) during an update. The maxUnavailable setting allows the
    tiller pod to be deleted which will release its ports, thus allowing the
    patch deployment to spin up an new pod to a Running state.

    Change-Id: I83c43c52a77...

Bill Zvonar (billzvonar) wrote :

Stefan/Frank, This LP is marked as gating for stx.3.0. Please cherry-pick the code changes to the stx.3.0 branch if applicable or add a note to indicate otherwise.

tags: added: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/749491

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/749492

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/749494

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (r/stx.3.0)

Reviewed: https://review.opendev.org/749494
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=9f300cd9f805912db5e9d75c5faf3af4c18641fc
Submitter: Zuul
Branch: r/stx.3.0

commit 9f300cd9f805912db5e9d75c5faf3af4c18641fc
Author: Stefan Dinescu <email address hidden>
Date: Thu Dec 19 15:23:23 2019 +0200

    Resize DRBD resources when doing a restore

    In cases where we do a backup of a system that has non-default
    sizes for drbd-backed partitions, the restore fails when first
    unlocking controller-0.

    The normal resize procedure requires all controller nodes to
    be unlocked and available because the puppet manifest does
    not support resizing at unlock.

    To prevent the issue from occurring, as part of the restore
    procedure, we should resize the partitions on controller-0
    with the proper sizes found in sysinv. Controller-1 will
    automatically create the partitions with the proper sizes
    from the very start, so it will not need any resizes.

    Change-Id: Ia73452ce721514d393b486a659730d0ca7c0d7e5
    Closes-bug: 1854169
    Depends-on: https://review.opendev.org/#/c/699990
    Signed-off-by: Stefan Dinescu <email address hidden>
    (cherry picked from master commit 5cdd394cb10c2c2d94174fdc32beb989290c6de9)

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (r/stx.3.0)

Reviewed: https://review.opendev.org/749492
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=0668d58e0bb4153ac67f957da32bf3a74b0453a6
Submitter: Zuul
Branch: r/stx.3.0

commit 0668d58e0bb4153ac67f957da32bf3a74b0453a6
Author: Stefan Dinescu <email address hidden>
Date: Tue Apr 14 09:59:54 2020 +0000

    Add B&R information comments to DRBD manifest

    This commit adds a series of comments to the DRBD manifest
    so that users making any changes to this manifest know to also
    update the list of DRBD devices in the restore playbook.

    Change-Id: Iae1d9d98391759669871b016721418922aa134ce
    Partial-bug: 1854169
    Signed-off-by: Stefan Dinescu <email address hidden>
    (cherry picked from master commit 4fc8bdcf4a011864aabe9df561e2c9bd2165c481)

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (r/stx.3.0)

Reviewed: https://review.opendev.org/749491
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=922a885aec53523d6d200a638e438dd7d625c045
Submitter: Zuul
Branch: r/stx.3.0

commit 922a885aec53523d6d200a638e438dd7d625c045
Author: Stefan Dinescu <email address hidden>
Date: Thu Dec 19 15:47:00 2019 +0200

    Allow yaml formatting for controllerfs-list

    In order to be easily parsed by ansible, the controllerfs-list
    command should support yaml output format.

    Change-Id: Ic766980645d618d54d34bd04d82339fd2cd36562
    Depends-On: https://review.opendev.org/#/c/749492/
    Partial-bug: 1854169
    Signed-off-by: Stefan Dinescu <email address hidden>
    (cherry picked from master commit b0e76a69277441b6becec6533214bdbbb38e6058)

Bill Zvonar (billzvonar)
tags: removed: stx.cherrypickneeded