AIO-DX: controller-1 failed after host-downgrade and unlock

Bug #1924786 reported by Angie Wang
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Angie Wang
Milestone: (none listed)

Bug Description

Brief Description
-----------------
During an stx4.0 to stx5.0 upgrade, the upgrade abort cannot be completed because controller-1 cannot be unlocked successfully after host-downgrade.
The root cause is that the cgts-vg size is decreased in stx5.0, so a new partition and PV are created during controller-1's upgrade. The sysinv-agent on controller-1 sends the new partition and PV data back to controller-0, which is still running stx4.0. The stx4.0 database is therefore updated with the new partition and PV, and this causes the unlock to fail.

Severity
--------
Major

Steps to Reproduce
------------------
system upgrade-start
system host-lock controller-1
system host-upgrade controller-1
system host-unlock controller-1
system upgrade-abort
system host-lock controller-1
system host-downgrade controller-1
sudo sw-patch host-install controller-1
system host-unlock controller-1

Expected Behavior
------------------
controller-1 is downgraded and unlocked successfully, and becomes available.

Actual Behavior
----------------
controller-1 was downgraded, but after the unlock it stayed in the failed state.

Reproducibility
---------------
100% Reproducible

System Configuration
--------------------
AIO-DX

Last Pass
---------
N/A

Timestamp/Logs
--------------
2021-01-15T00:05:39.684 Notice: 2021-01-15 00:05:39 +0000 /Stage[main]/Platform::Partitions/Platform_manage_partition[check]/Exec[manage-partitions-check]/returns: sysinv 2021-01-15 00:05:39.381 84156 INFO manage-partitions [-] Executing command: 'parted -s /dev/disk/by-path/pci-0000:00:1f.2-ata-5.0 unit mib mkpart primary 198438 262950'
2021-01-15T00:05:39.687 Notice: 2021-01-15 00:05:39 +0000 /Stage[main]/Platform::Partitions/Platform_manage_partition[check]/Exec[manage-partitions-check]/returns: sysinv 2021-01-15 00:05:39.504 84156 CRITICAL sysinv [-] Unhandled error: IOError: Could not create partition 6 of 64512MiB on disk /dev/disk/by-path/pci-0000:00:1f.2-ata-5.0: Error: You requested a partition from 198438MiB to 262950MiB (sectors 406401024..538521599).
2021-01-15T00:05:39.689 Notice: 2021-01-15 00:05:39 +0000 /Stage[main]/Platform::Partitions/Platform_manage_partition[check]/Exec[manage-partitions-check]/returns: The closest location we can manage is 262950MiB to 262950MiB (sectors 538521600..538521600).
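
The figures in the parted error can be cross-checked with a little arithmetic. The sketch below assumes 512-byte logical sectors (an assumption, though it is consistent with the sector numbers in the log) and confirms that the requested range is 64512 MiB long and maps to the reported sectors. The region parted offers instead spans a single sector, which suggests there was no free space left for the partition recorded in the database.

```python
# Cross-check the numbers from the parted error in the log above.
# Assumption: 512-byte logical sectors (not stated in the log).
SECTOR_BYTES = 512
MIB = 1024 * 1024
SECTORS_PER_MIB = MIB // SECTOR_BYTES  # 2048 sectors per MiB


def mib_to_sector(mib):
    """Return the first sector of the given MiB offset."""
    return mib * SECTORS_PER_MIB


# Requested partition, in MiB, taken from the mkpart command.
start_mib, end_mib = 198438, 262950

size_mib = end_mib - start_mib
first_sector = mib_to_sector(start_mib)
last_sector = mib_to_sector(end_mib) - 1  # the end sector is inclusive

# Region that parted reported as the closest it could manage.
offered_first, offered_last = 538521600, 538521600
offered_sectors = offered_last - offered_first + 1

print("requested size (MiB):", size_mib)
print("requested sectors:", first_sector, "..", last_sector)
print("offered sectors:", offered_sectors)
```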

Test Activity
-------------
Developer Testing

Workaround
----------
None

Angie Wang (angiewang)
Changed in starlingx:
assignee: nobody → Angie Wang (angiewang)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/786724

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

screening: stx.5.0 / medium - issue with upgrade framework which should be cherrypicked to the r/stx.5.0 release when ready

Changed in starlingx:
importance: Undecided → High
tags: added: stx.5.0 stx.update
Changed in starlingx:
importance: High → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/786724
Committed: https://opendev.org/starlingx/config/commit/a24cd707a5cafaa2d49b25d1d5285c5c8582eb5d
Submitter: "Zuul (22348)"
Branch: master

commit a24cd707a5cafaa2d49b25d1d5285c5c8582eb5d
Author: Angie Wang <email address hidden>
Date: Fri Apr 16 12:46:52 2021 -0400

    AIO-DX: Controller-1 fails to be unlocked after downgrade

    During stx4.0 to stx5.0 upgrade, controller-1 fails to be unlocked
    after downgrade due to the incorrect disk partition and physical
    volume information stored in stx4.0 DB that causes the puppet
    manifest apply failed during unlock.

    This is because cgts-vg size is decreased in stx5.0 and after
    controller-1 is upgraded to stx5.0, additional partition and pv
    are created at stx5.0 side to match the size in stx4.0. However,
    controller-0 is still running stx4.0 DB and it gets updated with
    the new created partition and pv info sent from controller-1 sysinv
    agent audit.

    This commit updates to ignore the disk partition and physical volume
    information sent back from a different version during upgrade.

    Tested:
    - AIO-DX upgrade from stx4.0 to stx5.0, verified upgrade is completed
    - controller-1 downgrade after it is upgraded and unlocked, verified
      upgrade abort is completed

    Change-Id: I5d7858e4b29d096437a5ddf94cd78c74fadfacad
    Closes-Bug: 1924786
    Signed-off-by: Angie Wang <email address hidden>
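
The commit message describes the fix only in prose. As a rough illustration, and not the actual sysinv change, the guard could look like the sketch below. The Host class, its software_load field, and should_accept_storage_report are hypothetical names invented for this example; the real conductor code may be structured quite differently.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical stand-in for a host record; not the sysinv model."""
    hostname: str
    software_load: str  # version the host is currently running


def should_accept_storage_report(active_version, host, upgrade_in_progress):
    """Decide whether partition/PV data reported by a host should be stored.

    During an upgrade, a report from a host running a different version
    than the active controller is ignored, so the active controller's
    database keeps the layout that matches its own release.
    """
    if upgrade_in_progress and host.software_load != active_version:
        return False
    return True


# controller-0 is active on stx4.0; controller-1 has been upgraded to stx5.0.
upgraded = Host(hostname="controller-1", software_load="stx5.0")
print(should_accept_storage_report("stx4.0", upgraded, upgrade_in_progress=True))
```

With a guard of this kind, the stx5.0 agent's report would be dropped while controller-0 is still on stx4.0, and reports are accepted again once both hosts run the same version or no upgrade is in progress.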

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (r/stx.4.0)

Fix proposed to branch: r/stx.4.0
Review: https://review.opendev.org/c/starlingx/config/+/786972

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (r/stx.5.0)

Fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/config/+/786973

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/config/+/786973
Committed: https://opendev.org/starlingx/config/commit/41846084674b6af611064ce9eac38008349721f8
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 41846084674b6af611064ce9eac38008349721f8
Author: Angie Wang <email address hidden>
Date: Fri Apr 16 12:46:52 2021 -0400

    AIO-DX: Controller-1 fails to be unlocked after downgrade

    During stx4.0 to stx5.0 upgrade, controller-1 fails to be unlocked
    after downgrade due to the incorrect disk partition and physical
    volume information stored in stx4.0 DB that causes the puppet
    manifest apply failed during unlock.

    This is because cgts-vg size is decreased in stx5.0 and after
    controller-1 is upgraded to stx5.0, additional partition and pv
    are created at stx5.0 side to match the size in stx4.0. However,
    controller-0 is still running stx4.0 DB and it gets updated with
    the new created partition and pv info sent from controller-1 sysinv
    agent audit.

    This commit updates to ignore the disk partition and physical volume
    information sent back from a different version during upgrade.

    Tested:
    - AIO-DX upgrade from stx4.0 to stx5.0, verified upgrade is completed
    - controller-1 downgrade after it is upgraded and unlocked, verified
      upgrade abort is completed

    Change-Id: I5d7858e4b29d096437a5ddf94cd78c74fadfacad
    Closes-Bug: 1924786
    Signed-off-by: Angie Wang <email address hidden>
    (cherry picked from commit a24cd707a5cafaa2d49b25d1d5285c5c8582eb5d)

Ghada Khalil (gkhalil)
tags: added: in-r-stx50
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (r/stx.4.0)

Reviewed: https://review.opendev.org/c/starlingx/config/+/786972
Committed: https://opendev.org/starlingx/config/commit/47a03edc3bfe0ce41dbe6f2dd06e07214586415c
Submitter: "Zuul (22348)"
Branch: r/stx.4.0

commit 47a03edc3bfe0ce41dbe6f2dd06e07214586415c
Author: Angie Wang <email address hidden>
Date: Fri Apr 16 12:46:52 2021 -0400

    AIO-DX: Controller-1 fails to be unlocked after downgrade

    During stx4.0 to stx5.0 upgrade, controller-1 fails to be unlocked
    after downgrade due to the incorrect disk partition and physical
    volume information stored in stx4.0 DB that causes the puppet
    manifest apply failed during unlock.

    This is because cgts-vg size is decreased in stx5.0 and after
    controller-1 is upgraded to stx5.0, additional partition and pv
    are created at stx5.0 side to match the size in stx4.0. However,
    controller-0 is still running stx4.0 DB and it gets updated with
    the new created partition and pv info sent from controller-1 sysinv
    agent audit.

    This commit updates to ignore the disk partition and physical volume
    information sent back from a different version during upgrade.

    Tested:
    - AIO-DX upgrade from stx4.0 to stx5.0, verified upgrade is completed
    - controller-1 downgrade after it is upgraded and unlocked, verified
      upgrade abort is completed

    Change-Id: I5d7858e4b29d096437a5ddf94cd78c74fadfacad
    Closes-Bug: 1924786
    Signed-off-by: Angie Wang <email address hidden>
    (cherry picked from commit a24cd707a5cafaa2d49b25d1d5285c5c8582eb5d)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictonary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460
