On AIO hosts, kubernetes is starting before key resources are initialized

Bug #1918139 reported by Bart Wensley
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bin Qian
Milestone: (not set)

Bug Description

Brief Description
-----------------
On AIO hosts, kubernetes (i.e. the kubelet) is started by the controller manifests. Pods are therefore launched before the worker manifests are applied, that is, before those manifests have configured key resources the pods may require. Examples include SRIOV, huge pages, cgroups, PTP and FPGA. All of these can affect the startup of application pods.

This results in pods being launched in a broken state (e.g. LP1896631). We have added some terrible workarounds (e.g. restarting pods after they come up) to deal with this, but we need a proper fix to ensure that all necessary platform configuration has completed before kubernetes is started.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Install an AIO system and create pods that use SRIOV, PTP, huge pages, etc...
Reboot the controller(s)

Expected Behavior
------------------
All platform resources are initialized before kubernetes is started (e.g. SRIOV, huge pages, cgroups, PTP, FPGA). Pods using these resources are not started until the resources have been configured.
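
The ordering requirement above can be sketched as a small dependency graph. This is only an illustration, not the StarlingX implementation: the task names are hypothetical labels for the resources listed in this report, and Python's standard graphlib module stands in for whatever ordering mechanism the manifests use.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks that must
# complete before it; the kubelet depends on every platform resource.
STARTUP_DEPENDENCIES = {
    "sriov": set(),
    "huge_pages": set(),
    "cgroups": set(),
    "ptp": set(),
    "fpga": set(),
    "kubelet": {"sriov", "huge_pages", "cgroups", "ptp", "fpga"},
}


def startup_order(dependencies):
    """Return one valid startup order that honors every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = startup_order(STARTUP_DEPENDENCIES)
```

In any order produced this way the kubelet comes last, so no pod can be scheduled before the resources it needs exist.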

Actual Behavior
----------------
See above

Reproducibility
---------------
Intermittent - even with the workarounds, pods will occasionally fail to come up properly

System Configuration
--------------------
AIO-SX and AIO-DX

Branch/Pull Time/Commit
-----------------------
stx.4.0

Last Pass
---------
Never - day one issue

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Other

Workaround
----------
Reboot the host and hope for better results

Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Ghada Khalil (gkhalil)
tags: added: stx.config
tags: added: stx.5.0
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium - robustness fixes to better handle startup sequence on AIO

tags: added: stx.containers
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Bin Qian (bqian20) wrote :
Changed in starlingx:
status: Triaged → In Progress
Bin Qian (bqian20) wrote :
Ghada Khalil (gkhalil) wrote :

screening: moving to stx.6.0. Given this is a day 1 issue, it doesn't strictly gate stx.5.0. A fix in stx master for the next release is sufficient.

tags: added: stx.6.0
removed: stx.5.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/780600
Committed: https://opendev.org/starlingx/stx-puppet/commit/f4694f8a30f1e5cbe0f7d354f95949a1601eb1e1
Submitter: "Zuul (22348)"
Branch: master

commit f4694f8a30f1e5cbe0f7d354f95949a1601eb1e1
Author: Bin Qian <email address hidden>
Date: Mon Feb 8 13:00:38 2021 -0500

    Single puppet for AIO controllers

    This change includes:
    1. create aio.pp for AIO controller nodes
    2. execute aio.pp for nodes with subfunctions of 'controller,worker'
    3. remove sriov device plugin restart code as now kubelet starts
       after related config are applied.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/784761
    Change-Id: I54b90a76454c6c545bf2891b81225bbf2ba15b03
    Partial-Bug: 1918139
    Signed-off-by: Bin Qian <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/780603
Committed: https://opendev.org/starlingx/config/commit/6acd2e3564d3d708e496c1a7e78b064419f1fdbf
Submitter: "Zuul (22348)"
Branch: master

commit 6acd2e3564d3d708e496c1a7e78b064419f1fdbf
Author: Bin Qian <email address hidden>
Date: Tue Feb 23 12:59:28 2021 -0500

    Single puppet manifest for AIO controllers

    Create a single puppet manifest for AIO controllers.
    This change includes:
    1. remove workerconfig from an AIO controller deployment
    2. running puppet based on subfunctions of the nodes

    Depends-on: https://review.opendev.org/c/starlingx/stx-puppet/+/780600
    Partial-Bug: 1918139
    Signed-off-by: Bin Qian <email address hidden>
    Change-Id: Ie3693219e3c19460ac5b617cc216cbc809ec2403

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/starlingx/config/+/785683
Committed: https://opendev.org/starlingx/config/commit/7ce3d16eeaa4c186e70aad56a1c92a8279dd0aae
Submitter: "Zuul (22348)"
Branch: master

commit 7ce3d16eeaa4c186e70aad56a1c92a8279dd0aae
Author: Bin Qian <email address hidden>
Date: Thu Apr 8 11:08:27 2021 -0400

    Add sysinv-reset-n3000-fpgas cmd

    When AIO runs single manifest, reset N3000 FPGA needs to complete
    without docker local registry and other SM managed services.

    This adds sysinv-reset-n3000-fpgas cmd for puppet to reset
    N3000 FPGAS at host start-up.
    The sysinv-reset-n3000-fpgas cmd separates the function of
    resetting N3000 FPGAs from sysinv-fpgas-agent, as
    sysinv-fpgas-agent depends on rabbit, which is not
    available until the manifest completes.

    Change-Id: Ic3c4b2a00515d194793257729362f71e2951286c
    Partial-Bug: 1918139
    Signed-off-by: Bin Qian <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/785736
Committed: https://opendev.org/starlingx/stx-puppet/commit/70971df9f35886f5ece04c82bfccee105d3d0861
Submitter: "Zuul (22348)"
Branch: master

commit 70971df9f35886f5ece04c82bfccee105d3d0861
Author: Bin Qian <email address hidden>
Date: Tue Mar 30 15:58:15 2021 -0400

    AIO manifest to start kubernetes once

    This change is to avoid restarting kubernetes.
    Also calling sysinv-reset-n3000-fpgas to reset N3000 FPGAS
    on host start up.

    Depends-On: https://review.opendev.org/c/starlingx/config/+/785683
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/780600
    Change-Id: I4a27840820fd45ad86cef4dfce6ea0389e583f68
    Partial-Bug: 1918139
    Signed-off-by: Bin Qian <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/785737
Committed: https://opendev.org/starlingx/integ/commit/8abcbf6fb1951b25e9964933558b75b9aff88135
Submitter: "Zuul (22348)"
Branch: master

commit 8abcbf6fb1951b25e9964933558b75b9aff88135
Author: Bin Qian <email address hidden>
Date: Thu Apr 8 12:58:44 2021 -0400

    Remove recover operations to "restart-on-reboot" pods

    Pods are labeled "restart-on-reboot" to work around the
    kubernetes restart triggered by the worker manifest. As AIO
    now runs a single manifest that starts kubernetes only once,
    the operation is no longer needed.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/785736
    Change-Id: I0d6c549199559b2bc19d8edff52f64ea0b08b50d
    Closes-Bug: 1918139
    Signed-off-by: Bin Qian <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/788066

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/788066
Committed: https://opendev.org/starlingx/stx-puppet/commit/139ba4aa6c143e495b8b7136b359254ceb3ba296
Submitter: "Zuul (22348)"
Branch: master

commit 139ba4aa6c143e495b8b7136b359254ceb3ba296
Author: Bin Qian <email address hidden>
Date: Mon Apr 26 14:59:51 2021 -0400

    Reset N3000 fpgas only when it exists

    Stop calling the N3000 FPGA reset before detecting that the h/w exists.

    Closes-Bug: 1918139
    Change-Id: I81b7fbc9500fac7e86424537551c1e9aac7492ec
    Signed-off-by: Bin Qian <email address hidden>

Andre Kantek (akantek) wrote :

I detected a side-effect of this correction

Brief Description
-----------------
After the change https://review.opendev.org/c/starlingx/stx-puppet/+/780600 a regression occurred: the kube-sriov-device-plugin is no longer restarted at runtime.
Without the pod restart, user pods will not be able to use SRIOV interfaces.

Severity
--------
Major

Steps to Reproduce
------------------
Executing interface-datanetwork-assign on an UNLOCKED system should trigger a restart of the kube-sriov-device-plugin pod.

1) check the pod hash before running the system command:
kubectl get pods -n kube-system
2) execute the command:
system interface-datanetwork-assign controller-0 sriov0 datanet1
3) check that kube-sriov-device-plugin was restarted (different pod hash)
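
The before/after comparison in the steps above can be sketched as follows. This is a minimal illustration that assumes the pod names have already been captured from the NAME column of `kubectl get pods -n kube-system`; the helper names and the sample pod suffixes are hypothetical.

```python
def plugin_pods(pod_names, prefix="kube-sriov-device-plugin"):
    """Return the set of pod names that belong to the SRIOV device plugin."""
    return {name for name in pod_names if name.startswith(prefix)}


def plugin_restarted(before, after, prefix="kube-sriov-device-plugin"):
    """True when a plugin pod exists afterwards and none of the old pods remain."""
    old = plugin_pods(before, prefix)
    new = plugin_pods(after, prefix)
    return bool(new) and old.isdisjoint(new)


# Hypothetical pod names; real suffixes are generated by kubernetes.
before = ["kube-sriov-device-plugin-amd64-aaaaa", "coredns-11111"]
after = ["kube-sriov-device-plugin-amd64-bbbbb", "coredns-11111"]
restarted = plugin_restarted(before, after)
```

A changed name suffix means the pod was recreated; an unchanged name corresponds to the failure described in this comment.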

Expected Behavior
-----------------
The pod should have restarted

Actual Behavior
---------------
The pod is not restarted

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-SX

Ghada Khalil (gkhalil) wrote :

Re-opening so that the fix for the above issue can be linked against the same LP

Changed in starlingx:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/788570

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/788676

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/788814

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/788570
Committed: https://opendev.org/starlingx/stx-puppet/commit/736199af4106378b86b4cdca784105fe2cd8ed05
Submitter: "Zuul (22348)"
Branch: master

commit 736199af4106378b86b4cdca784105fe2cd8ed05
Author: Andre Fernando Zanella Kantek <email address hidden>
Date: Wed Apr 28 14:50:21 2021 -0400

    On runtime, kube-sriov-device-plugin needs to be restarted

    The previous correction for bug 1918139 removed the sriov plugin
    restart necessary during runtime, done during the interface sriov
    assign to a datanetwork (allowed on an unlocked AIO-SX). Without
    it, the pod creation will not be able to use a datanetwork created
    on runtime.

    The correction brings back the platform::kubernetes::worker::sriovdp
    class, to be used only at runtime.

    Closes-Bug: 1918139

    Signed-off-by: Andre Fernando Zanella Kantek <email address hidden>
    Change-Id: Ied19bf3138b58b279b350d067ae0c1080e220f31

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (r/stx.5.0)

Fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/788923

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (r/stx.5.0)

Change abandoned by "Andre Kantek <email address hidden>" on branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/788923
Reason: code not needed on stx-5.0

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Released → In Progress
Ghada Khalil (gkhalil) wrote :
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by "Bin Qian <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/config/+/788676
Reason: The real issue is in upgrade: the n3000 fpga image cache is unexpectedly deleted. Fix: https://review.opendev.org/c/starlingx/ansible-playbooks/+/789562

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/788814
Committed: https://opendev.org/starlingx/stx-puppet/commit/5695a29e6a5ed8ee5d211e937496384027d7fd4e
Submitter: "Zuul (22348)"
Branch: master

commit 5695a29e6a5ed8ee5d211e937496384027d7fd4e
Author: Bin Qian <email address hidden>
Date: Thu Apr 29 13:35:38 2021 -0400

    Fix missing kubelet service enable for worker nodes

    Previous commit:
      https://review.opendev.org/c/starlingx/stx-puppet/+/780600/
    kubelet enable is skipped for the worker nodes.

    Change-Id: I7769aebb4a9e38404af0c883640e1a27bb1e9e84
    Closes-Bug: 1918139
    Signed-off-by: Bin Qian <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792029

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/792029
Committed: https://opendev.org/starlingx/stx-puppet/commit/2b026190a3cb6d561b6ec4a46dfb3add67f1fa69
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 3e3940824dfb830ebd39fd93265b983c6a22fc51
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300

    Enable kubelet support for pod pid limit

    Enable limiting the number of pids inside of pods.

    Add a default value to protect against a missing value.
    Default to 750 pids limit to align with service parameter default
    value for most resource consuming StarlingX optional app (openstack).
    In fact any value above service parameter minimum value is good for the
    default.

    Closes-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I10c1684fe3145e0a46b011f8e87f7a23557ddd4a

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400

    Safe restart of the etcd SM service in etcd upgrade runtime class

    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.

    This commit modifies platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service and thus, solve the
    issue.

    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1928135

commit eec3008f600aeeb69a42338ed44332228a862d11
Author: Mihnea Saracin <email address hidden>
Date: Mon May 10 13:09:52 2021 +0300

    Serialize updates to global_filter in the AIO manifest

    Right now, looking at the aio manifest:
    https://review.opendev.org/c/starlingx/stx-puppet/+/780600/15/puppet-manifests/src/manifests/aio.pp
    there are 3 classes that update
    in parallel the lvm global_filter:
    - include ::platform::lvm::controller
    - include ::platform::worker::storage
    - include ::platform::lvm::compute
    And this generates some errors.

    We fix this by adding dependencies between the above classes
    in order to update the global_filter in a serial mode.

    Closes-Bug: 1927762
    Signed-off-by: Mihnea Saracin <email address hidden>
    Change-Id: If6971e520454cdef41138b2f29998c036d8307ff
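
The serialization described in this commit message can be pictured as a chain of ordering relationships between the three classes. The sketch below is illustrative only: the class names come from the commit message, but the chosen order and the helper name are assumptions, and the commit text here does not show which Puppet ordering construct (for example the `->` chaining arrow or `require`) was actually used.

```python
# Class names are taken from the commit message; the order is an assumption.
LVM_FILTER_CLASSES = [
    "platform::lvm::controller",
    "platform::worker::storage",
    "platform::lvm::compute",
]


def chain_dependencies(classes):
    """Pair each class with its successor so updates run one after another."""
    return list(zip(classes, classes[1:]))


# Each (earlier, later) pair means "later must wait for earlier".
edges = chain_dependencies(LVM_FILTER_CLASSES)
```

With N classes the chain has N-1 edges, which is enough to prevent any two of them from updating global_filter at the same time.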

commit 97371409b9b2ae3f0db6a6a0acaeabd74927160e
Author: Steven Webster <email address hidden>
Date: Fri May 7 15:33:43 2021 -0400

    Add SR-IOV rate-limit dependency

    Currently, the binding of an SR-IOV virtual function (VF) to a
    driver has a dependency on platform::networking. This is needed
    to ensure that SR-IOV is enabled (VFs created) before actually
    doing the bind.

    This dependency does not exist for configuring the VF rate-limits
    however. There is a cha...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 50
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 30
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
      type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictionary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460
