StarlingX

kube-apiserver endpoints not configured correctly

Bug #1877383 reported by Bart Wensley on 2020-05-07

This bug affects 1 person

Affects     Status        Importance  Assigned to   Milestone
StarlingX   Fix Released  High        Matt Peters

Bug Description

Brief Description
-----------------
The kube-apiserver endpoints are being configured incorrectly:
- controller-0 endpoint set to floating cluster IP
- controller-1 endpoint set to OAM unit IP

The endpoints can change when controllers are locked/unlocked and re-installed.

The correct configuration would be:
- controller-0 endpoint set to controller-0 cluster IP
- controller-1 endpoint set to controller-1 cluster IP

Severity
--------
Major: kube-apiserver endpoints should not be on the OAM network and should be pinned to the controller they are running on.

Steps to Reproduce
------------------
Install a lab.

Expected Behavior
------------------
See above.

Actual Behavior
----------------
It looks like this was always broken and got worse when we moved to using “kubeadm join” on the second controller instead of “kubeadm init”.

In short:
- controller-0 (ansible bootstrap does “kubeadm init” with kubeadm.yaml config file):
  - sets the InitConfiguration localAPIEndpoint/advertiseAddress to the floating cluster IP (wrong)
  - sets the ClusterConfiguration controlPlaneEndpoint to the floating cluster IP (correct)
- controller-1 (runs “kubeadm join” using cluster configuration):
  - the InitConfiguration localAPIEndpoint/advertiseAddress is not set (it is not part of the ClusterConfiguration), so it defaults to “the IP of the default interface”.
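The address selection described above can be modelled in a few lines of Python. This is a simplified sketch of the precedence rule given in the kubeadm documentation quoted in this report, not kubeadm's actual code, and `effective_advertise_address` is a hypothetical name. Fed the WC-4 values, it reproduces the ClusterStatus entries in the configmap output below.

```python
import ipaddress
from typing import Optional


def effective_advertise_address(local_api_endpoint: Optional[dict],
                                default_interface_ip: str) -> str:
    """Return the address a control plane node's kube-apiserver advertises.

    Simplified model (hypothetical, not kubeadm source): an explicit
    localAPIEndpoint.advertiseAddress (or --apiserver-advertise-address on
    "kubeadm join") wins; otherwise the default interface IP is used.
    """
    if local_api_endpoint and local_api_endpoint.get("advertiseAddress"):
        chosen = local_api_endpoint["advertiseAddress"]
    else:
        # Nothing configured: fall back to the default interface (OAM here).
        chosen = default_interface_ip
    # Normalise the textual form so IPv6 addresses compare reliably.
    return str(ipaddress.ip_address(chosen))


# controller-0: bootstrap set advertiseAddress to the floating cluster IP.
controller_0 = effective_advertise_address({"advertiseAddress": "aefd::1"},
                                           "2620:10a:a001:a103::1182")
# controller-1: "kubeadm join" set nothing, so the OAM interface IP is used.
controller_1 = effective_advertise_address(None, "2620:10a:a001:a103::1199")
```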

Using the WC-4 lab as an example, this results in:

# kubectl -n kube-system get configmap kubeadm-config -o yaml
  ClusterStatus: |
    apiEndpoints:
      controller-0:
        advertiseAddress: aefd::1 <- floating cluster IP
        bindPort: 6443
      controller-1:
        advertiseAddress: 2620:10a:a001:a103::1199 <- controller-1 OAM IP
        bindPort: 6443
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterStatus

# kubectl get ep kubernetes
NAME         ENDPOINTS                                                         AGE
kubernetes   [2620:10a:a001:a103::1182]:6443,[2620:10a:a001:a103::1199]:6443   30h   <- controller-0 and controller-1 OAM IPs
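A mismatch like this can be checked mechanically. The sketch below parses the ENDPOINTS column of `kubectl get ep kubernetes` and compares it with the cluster-host unit IPs the endpoints should have. The helper names are hypothetical, and the 2001:db8:: addresses in the example are documentation placeholders, not the real WC-4 cluster-host IPs.

```python
import ipaddress
from typing import Iterable, Set, Tuple

Endpoint = Tuple[str, int]


def parse_endpoints(column: str) -> Set[Endpoint]:
    """Parse an ENDPOINTS column such as "[a::1]:6443,10.0.0.3:6443"."""
    endpoints = set()
    for entry in column.split(","):
        entry = entry.strip()
        if entry.startswith("["):
            # IPv6 endpoints are printed as [address]:port.
            host, _, port = entry[1:].partition("]:")
        else:
            host, _, port = entry.rpartition(":")
        # Normalise the address text so equal IPs compare equal.
        endpoints.add((str(ipaddress.ip_address(host)), int(port)))
    return endpoints


def compare_endpoints(column: str, expected_ips: Iterable[str],
                      port: int = 6443) -> Tuple[Set[Endpoint], Set[Endpoint]]:
    """Return (unexpected, missing) endpoints versus the expected unit IPs."""
    actual = parse_endpoints(column)
    expected = {(str(ipaddress.ip_address(ip)), port) for ip in expected_ips}
    return actual - expected, expected - actual


# Observed on WC-4, compared against hypothetical cluster-host unit IPs.
observed = "[2620:10a:a001:a103::1182]:6443,[2620:10a:a001:a103::1199]:6443"
unexpected, missing = compare_endpoints(observed, ["2001:db8::3", "2001:db8::4"])
```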

When looking at the endpoints, both API servers are using the local OAM IPs, which doesn’t line up with the config map. I believe this is explained here:
https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2

    // LocalAPIEndpoint represents the endpoint of the API server instance that's deployed on this control plane node
    // In HA setups, this differs from ClusterConfiguration.ControlPlaneEndpoint in the sense that ControlPlaneEndpoint
    // is the global endpoint for the cluster, which then loadbalances the requests to each individual API server. This
    // configuration object lets you customize what IP/DNS name and port the local API server advertises it's accessible
    // on. By default, kubeadm tries to auto-detect the IP of the default interface and use that, but in case that process
    // fails you may set the desired value here.
    LocalAPIEndpoint APIEndpoint `json:"localAPIEndpoint,omitempty"`

Not sure of the fix yet, but I think:
- At bootstrap we need to set InitConfiguration localAPIEndpoint/advertiseAddress to the controller-0 cluster IP (not the floating IP).
- When doing the join on controller-1 (or on a controller-0 reinstall), we need to pass the --apiserver-advertise-address parameter with the unit-specific cluster IP. From the kubeadm join docs: “--apiserver-advertise-address string: If the node should host a new control plane instance, the IP address the API Server will advertise it’s listening on. If not set the default network interface will be used.”
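The two proposed changes can be sketched as follows, assuming the kubeadm v1beta2 field names shown in the configmap and Go documentation above. `host_port`, `init_configuration`, `cluster_configuration` and `join_command` are hypothetical helpers, the IPs are placeholders, including the port in controlPlaneEndpoint is an assumption, and the token/discovery flags a real `kubeadm join` also needs are left to the caller via `extra_args`.

```python
import ipaddress
from typing import List, Sequence

API_PORT = 6443  # bindPort shown in the ClusterStatus above


def host_port(ip: str, port: int = API_PORT) -> str:
    """Join an IP and port, bracketing IPv6 addresses ("[aefd::1]:6443")."""
    addr = ipaddress.ip_address(ip)
    return f"[{addr}]:{port}" if addr.version == 6 else f"{addr}:{port}"


def init_configuration(unit_ip: str) -> dict:
    """InitConfiguration for bootstrap: advertise controller-0's own unit IP."""
    return {
        "apiVersion": "kubeadm.k8s.io/v1beta2",
        "kind": "InitConfiguration",
        "localAPIEndpoint": {"advertiseAddress": unit_ip, "bindPort": API_PORT},
    }


def cluster_configuration(floating_ip: str) -> dict:
    """ClusterConfiguration: the shared endpoint stays on the floating IP."""
    return {
        "apiVersion": "kubeadm.k8s.io/v1beta2",
        "kind": "ClusterConfiguration",
        "controlPlaneEndpoint": host_port(floating_ip),
    }


def join_command(control_plane_endpoint: str, unit_ip: str,
                 extra_args: Sequence[str] = ()) -> List[str]:
    """argv for joining a controller as a control plane (master) node."""
    return ["kubeadm", "join", control_plane_endpoint, *extra_args,
            "--control-plane", "--apiserver-advertise-address", unit_ip]
```

Serialized as two YAML documents, the dictionaries give a kubeadm.yaml in which localAPIEndpoint.advertiseAddress carries the unit IP while controlPlaneEndpoint keeps the floating IP.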

There may be other changes required - we need to check that the static manifests in /etc/kubernetes/manifests have the right --advertise-address set for the kube-apiserver and that these changes are preserved over reboots/re-installs.
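For the manifest check, a minimal sketch that extracts `--advertise-address` from the kube-apiserver static pod manifest text and compares it with the expected unit IP. It scans lines with a regular expression rather than parsing YAML, to stay within the standard library, so it assumes the usual kubeadm layout of one `- --flag=value` argument per line. The function names are hypothetical.

```python
import ipaddress
import re
from pathlib import Path
from typing import Optional

# Standard kubeadm static pod manifest for the API server.
MANIFEST = Path("/etc/kubernetes/manifests/kube-apiserver.yaml")

# Matches an argument line such as "    - --advertise-address=10.0.0.3".
_ADVERTISE_RE = re.compile(r"^\s*-\s*--advertise-address=(\S+)\s*$", re.MULTILINE)


def manifest_advertise_address(text: str) -> Optional[str]:
    """Return the --advertise-address value from manifest text, if present."""
    match = _ADVERTISE_RE.search(text)
    # Strip any quotes around the value itself.
    return match.group(1).strip("\"'") if match else None


def check_manifest(text: str, expected_ip: str) -> bool:
    """True if the manifest advertises exactly the expected unit IP."""
    actual = manifest_advertise_address(text)
    return (actual is not None
            and ipaddress.ip_address(actual) == ipaddress.ip_address(expected_ip))
```

On a controller this would be run as `check_manifest(MANIFEST.read_text(), "<controller cluster-host IP>")`, and repeated after a reboot and after a re-install.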

Reproducibility
---------------
Reproducible

System Configuration
--------------------
All configurations are affected

Branch/Pull Time/Commit
-----------------------
stx.4.0 load built from master on 2020-05-05

Last Pass
---------
Unknown

Timestamp/Logs
--------------
See above

Test Activity
-------------
Developer Testing

Workaround
----------
None

Tags:

OpenStack Infra (hudson-openstack) wrote on 2020-05-07: Related fix proposed to ansible-playbooks (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/726231

OpenStack Infra (hudson-openstack) wrote on 2020-05-08: Related fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/726231
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Submitter: Zuul
Branch: master

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>
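In Calico, failsafe ports are configured through the FelixConfiguration fields `failsafeInboundHostPorts` and `failsafeOutboundHostPorts`, each a list of protocol/port entries. The sketch below assumes the change targets those fields (the commit message does not say how the rule is applied) and adds the kube-apiserver port 6443 to both lists without duplicating an existing entry. Because an explicit list is used instead of Calico's built-in defaults, the sketch extends the list it is given rather than starting from scratch. The helper names are hypothetical.

```python
from typing import Dict, List, Union

ProtoPort = Dict[str, Union[str, int]]

KUBE_APISERVER_PORT = 6443


def add_failsafe_port(ports: List[ProtoPort], port: int,
                      protocol: str = "tcp") -> List[ProtoPort]:
    """Return a copy of `ports` that contains (protocol, port) exactly once."""
    present = {(str(p["protocol"]).lower(), int(p["port"])) for p in ports}
    if (protocol.lower(), port) in present:
        return list(ports)
    return list(ports) + [{"protocol": protocol, "port": port}]


def felix_failsafe_patch(inbound: List[ProtoPort],
                         outbound: List[ProtoPort]) -> dict:
    """Merge-patch body for a FelixConfiguration keeping the API server reachable."""
    return {"spec": {
        "failsafeInboundHostPorts": add_failsafe_port(inbound, KUBE_APISERVER_PORT),
        "failsafeOutboundHostPorts": add_failsafe_port(outbound, KUBE_APISERVER_PORT),
    }}
```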

Ghada Khalil (gkhalil) on 2020-05-11

tags:	added: stx.4.0 stx.containers
Changed in starlingx:
importance:	Undecided → High
status:	New → Triaged
assignee:	nobody → Frank Miller (sensfan22)

OpenStack Infra (hudson-openstack) wrote on 2020-05-21: Related fix proposed to ansible-playbooks (f/centos8)

Related fix proposed to branch: f/centos8
Review: https://review.opendev.org/729809

Frank Miller (sensfan22) wrote on 2020-05-25:

Assigning to Paul to prime the proper solution for this LP. Please consult with Bart as needed.

Changed in starlingx:
assignee:	Frank Miller (sensfan22) → Paul-Ionut Vaduva (pvaduva)

OpenStack Infra (hudson-openstack) wrote on 2020-06-03: Related fix merged to ansible-playbooks (f/centos8)

Reviewed:  https://review.opendev.org/729809
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=73027425d4501a6b7785e91024c9e8ddbc03115d
Submitter: Zuul
Branch:    f/centos8

commit 55c9afd075194f7669fa2a87e546f61034679b04
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Wed May 13 14:19:52 2020 +0300

Restore: disconnect etcd from ceph
    
    At the moment etcd is restored only if ceph data is kept.
    Etcd should be restored regardless if ceph data is kept or wiped.
    
    Story: 2006770
    Task 39751
    Change-Id: I9dfb1be0a83c3fdc5f1b29cbb974c5e0e2236ad3
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>

commit 003ddff574c74adf11cf8e4758e93ba0eed45a6a
Author: Don Penney <don.penney@windriver.com>
Date:   Fri May 8 11:35:58 2020 -0400

Add playbook for updating static images
    
    This commit introduces a new playbook, upgrade-static-images.yml, used
    for downloading updating images and pushing to the local registry.
    
    Change-Id: I8884440261a5a4e27b40398e5a75c9d03b09d4ba
    Story: 2006781
    Task: 39706
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <matt.peters@windriver.com>
Date:   Thu May 7 14:29:02 2020 -0500

Add kube-apiserver port to calico failsafe rules
    
    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies.  It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.
    
    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.
    
    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <matt.peters@windriver.com>

commit bd0f14a7dfb206ccaa3ce0f5e7d9034703b3403c
Author: Robert Church <robert.church@windriver.com>
Date:   Tue May 5 15:11:15 2020 -0400

Provide an update strategy for Tiller deployment
    
    In the case of a simplex controller configuration the current patching
    strategy for the Tiller environment will fail as the tiller ports will
    be in use when the new deployment is attempted to be applied. The
    resulting tiller pod will be stuck in a Pending state.
    
    This will be observed if the node becomes ready after 'helm init'
    installs the initial deployment and before the deployment is patched for
    environment checks.
    
    The deployment strategy provided by 'helm init' is unspecified. This
    change will allow one additional pod (current + new) and one unavailable
    pod (current) during an update. The maxUnavailable setting allows the
    tiller pod to be deleted which will release its ports, thus allowing the
    patch deployment to spin up a new pod to a Running state.
    
    Change-Id: I83c43c52a77bce9f085bfb6c6a2c4171f2ba8f97
    Partial-Bug: #1876396
    Signed-off-by: Robert Church <robert.church@windriver.com>
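The pod arithmetic in this commit message can be made concrete. Assuming the patch sets `maxSurge: 1` and `maxUnavailable: 1` (the message describes one additional and one unavailable pod, but the exact values are not shown), a single-replica deployment may briefly run two pods and may drop to zero available, which is what lets the old tiller pod release its ports. The names below are illustrative.

```python
from typing import Tuple

# Strategic-merge patch for the tiller Deployment (values assumed, see above).
TILLER_STRATEGY_PATCH = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 1},
        }
    }
}


def rolling_update_bounds(replicas: int, max_surge: int,
                          max_unavailable: int) -> Tuple[int, int]:
    """Return (max pods that may exist, min pods that must stay available).

    Integer settings only; percentage values (which Kubernetes rounds up for
    maxSurge and down for maxUnavailable) are not modelled here.
    """
    return replicas + max_surge, max(replicas - max_unavailable, 0)
```

For comparison, Kubernetes' default 25%/25% works out to maxSurge 1 and maxUnavailable 0 for one replica, giving bounds of (2, 1): the old pod must stay up until the new one is Running, and the new one cannot start while the old one holds the ports, hence the Pending pod.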

commit 0dc9e173855792c38bec90360c0c4c066c36d66b
Author: Robert Church <robert.church@windriver.com>
Date:   Mon May 4 12:59:49 2020 -0400

Ensure containerd binds to the loopback interface
    
    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.
    
    This will explicitly update the containerd configuration to use the IP
    address of the loopback interface based on the system's network
    configuration.
    
    Change-Id: I76a4ad1c123b8b701cb1fa74b16609b50cdf9bd2
    Partial-Bug: #1875891
    Signed-off-by: Robert Church <robert.church@windriver.com>
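Choosing the loopback address by IP family takes only a few lines with Python's `ipaddress` module. The sketch assumes the family is derived from a configured address or subnet, such as the cluster-host network; how the playbook actually determines it is not shown in the commit, and the function name is hypothetical.

```python
import ipaddress


def stream_server_address(network: str) -> str:
    """Loopback address matching the IP family of `network`.

    `network` may be an address or a subnet, e.g. "192.168.206.0/24".
    """
    family = ipaddress.ip_network(network, strict=False).version
    return "::1" if family == 6 else "127.0.0.1"
```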

commit 2ea3ce6a7fdff5c2079acd76bd8eee7001b4127c
Author: Andy Ning <andy.ning@windriver.com>
Date:   Thu Apr 30 13:41:33 2020 -0400

Increase wait time for certificate during subcloud bootstrap
    
    Currently during subcloud ansible bootstrap, it waits up to 15s for
    certificate secret to be ready after the yaml file applies. For some
    slow hosts (VBox for example) 15s appears not long enough so the
    extracted certificate is partial, which in turn fails haproxy.
    
    This commit updates to use the better "kubectl wait" mechanism to wait
    for the certificate to be ready, with a timeout of 30s.
    
    Change-Id: Ibd8cab9339c6d532353b45b49cc4d141f0cf5ace
    Closes-Bug: 1876099
    Signed-off-by: Andy Ning <andy.ning@windriver.com>
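For reference, a `kubectl wait` invocation of the kind this commit describes could be built as below. cert-manager Certificates expose a `Ready` condition, so `--for=condition=Ready` applies. The certificate and namespace names are hypothetical placeholders, not the ones used by the playbook.

```python
import subprocess
from typing import List


def certificate_wait_command(name: str, namespace: str,
                             timeout_s: int = 30) -> List[str]:
    """argv that blocks until a cert-manager Certificate reports Ready."""
    return ["kubectl", "wait", "--for=condition=Ready", f"certificate/{name}",
            "--namespace", namespace, f"--timeout={timeout_s}s"]


def wait_for_certificate(name: str, namespace: str, timeout_s: int = 30) -> None:
    # kubectl exits non-zero on timeout, so check=True surfaces the failure.
    subprocess.run(certificate_wait_command(name, namespace, timeout_s), check=True)
```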

commit d05785ffd9add6553662fcab43f30bf8d9f6d2e3
Author: Stefan Dinescu <stefan.dinescu@windriver.com>
Date:   Fri Apr 24 10:48:20 2020 +0000

Upversion Netapp application
    
    Changes included in this commit:
    - updated netapp required docker images
    - add support for PVC snapshots (beta feature since K8s
      1.17);
    - create new ansible role for enabling PVC snapshot
      support and start required pod
    - import role for bootstrap as well, so any backend
      added in the future will also have support enabled
      by default
    - also use snapshot role for the netapp backend
      configuration (for upgrade considerations)
    - change netapp backend configuration of mapping backends
      and storage classes from 1-to-1 mapping to many-to-many
      mapping; instead of one backend configured for each
      storage-class, now any number of backends can be
      configured for any number of storage classes
    - add a new VolumeSnapshotClass configuration option for
      PVC snapshot support
    
    Change-Id: Ib1cf5a5b46f24a6864ac6d894e37db8732e0c6fb
    Depends-On: https://review.opendev.org/#/c/724237/
    Story: 2007391
    Task: 39566
    Signed-off-by: Stefan Dinescu <stefan.dinescu@windriver.com>

commit 204641a5b3082c9873109169f93ae1845eb79813
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Wed Apr 15 15:54:58 2020 +0300

DC subcloud restore registry.central certs
    
    During restore a certificate is missing.
    Docker needs the certificate to connect to registry.central.
    Extract it from backup archive.
    
    Closes-Bug: 1870389
    
    Depends-On: I64c8b38a51bf04714931d70e126e0f63782deb20
    Depends-On: Ieb12ffc0ad769dd6ca22eb4c15f9d6d55778fd4b
    Depends-On: I86166da31491736d6695e04fa287f79871975b55
    Depends-On: Iebab8dc059435c7e2b0f19947fedce88bd71bb65
    Depends-On: I278f19be32d1fe87687feb75e26b2898237de86f
    
    Change-Id: Ief65a8963b81ef489171c264964d472a66fec282
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>

commit acd84841d201f1d5777edd2996086732cb3a3104
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Thu Apr 23 17:37:23 2020 +0300

Fix SystemController filesystem at restore
    
    The filesystem `dc-vault` is created at unlock.
    It doesn't exist at restore time to be resized.
    It will be correctly sized during unlock.
    
    It is not mounted into /dev/cgts-vg/dc-vault-lv.
    
    Closes-Bug: 1873617
    Change-Id: Ia2748756eaa8109065af1848374cc058c447910e
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>

commit 885cfe61269a43c7cff7e56732baefc2190d5be1
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Apr 29 11:58:14 2020 -0400

Set root certification duration
    
    Setting root certification to 5 years and renew 30 days ahead.
    
    Change-Id: I780edaab0c041a0db1e9faf47bcd473e20068247
    Story: 3007347
    Task: 39428
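cert-manager's `duration` and `renewBefore` fields take Go duration strings, which have no day or year unit, so values like those in this commit end up expressed in hours. A small sketch of the conversion, assuming a year is counted as 365 days; the exact strings used by the change are not shown, and `go_duration_hours` is a hypothetical helper.

```python
from datetime import timedelta


def go_duration_hours(delta: timedelta) -> str:
    """Format a timedelta as a Go duration string in whole hours (e.g. "720h")."""
    hours, remainder = divmod(delta, timedelta(hours=1))
    if remainder:
        raise ValueError("duration must be a whole number of hours")
    return f"{hours}h"


# Root CA: 5 years (ignoring leap days), renewed 30 days ahead.
ROOT_CA_DURATION = go_duration_hours(timedelta(days=5 * 365))
ROOT_CA_RENEW_BEFORE = go_duration_hours(timedelta(days=30))
```

The same helper yields 4320h and 720h for the 180-day subcloud admin endpoint certificate described in a later commit.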

commit 54e9b94773f3ae9c6be7eb14e141537cad373915
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Wed Apr 22 15:44:15 2020 +0300

Fix restore without ceph backend
    
    When ceph backend is not configured there is no ceph crushmap to be
    restored, nor ceph monitors data. Skip restoring those.
    
    The rest of the logic regarding ceph osds can be treated as if osds were
    wiped.
    
    Closes-Bug: 1873974
    Depends-On: Ic2b7a77f4a54d3d30aedd6c00747fc4586428997
    Change-Id: I2776d7c2d5801ce6e81c487da263075b6f6873c8
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>

commit dd89ba118d21027da28f860f2da47e6794d0453b
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Wed Apr 22 13:32:21 2020 +0300

Fix backup without ceph backend
    
    When ceph backend is not configured there is no ceph crushmap to be
    backed up. Skip the crushmap backup step.
    
    Partial-Bug: 1873974
    Change-Id: Ic2b7a77f4a54d3d30aedd6c00747fc4586428997
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>

commit 3bb26d81d51f0590dba2a19caf9cc430673f6018
Author: Andy Ning <andy.ning@windriver.com>
Date:   Wed Apr 8 09:42:10 2020 -0400

Setup https admin endpoint certificates for subcloud
    
    This commit updated ansible bootstrap to generate, install and
    configure certificates for https enabled admin endpoints. This change
    applies to subcloud of a DC system only.
    
    The subcloud admin endpoint certificate has valid duration of 180 days
    and renew before of 30 days.
    
    Tests:
      - Successfully deploy subcloud by "dcmanager subcloud add"
      - Verify haproxy admin endpoint certificate is generated and
        installed properly in subcloud.
      - Verify DC admin endpoint root CA certificate is installed in
        subcloud's trusted CA cert list in subcloud.
      - Verify the haproxy admin endpoint certificate can be validated by
        the DC endpoint root CA certificate successfully in subcloud.
    
    Change-Id: Ib24d27ac4cafe345fb57ba906ea5baf0930af892
    Story: 2007347
    Task: 39465
    Depends-On: https://review.opendev.org/#/c/720224/
    Signed-off-by: Andy Ning <andy.ning@windriver.com>

commit 2b287b1050fa2b1a7b5f5d983eaa634a055b8ec2
Author: Bin Qian <bin.qian@windriver.com>
Date:   Tue Apr 7 23:48:11 2020 -0400

Install dc root cert
    
    This is to create a distributed cloud specific root CA issuer with
    cert-manager.
    
    The root CA issuer is to authorize intermediate issuers for each
    subcloud, the latter then to issue certificate for admin endpoints.
    
    Test cases:
    Bootstrap systemcontroller from local/remote
    Replay systemcontroller bootstrap playbook
    
    Story: 3007347
    Task: 39428
    
    Change-Id: I7546d6562f0bc072c3cf76f422a258a2c32b4a34
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 36a01e8ba38f3e0d1e2ea7a2bce31edbedfde04e
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Tue Apr 21 17:54:53 2020 +0300

B&R: Do keystone db backup for subcloud
    
    Keystone db backup file is missing for subclouds.
    Create the keystone db backup file when running the backup playbook on
    subcloud.
    
    Partial-Bug: 1870389
    Change-Id: I64c8b38a51bf04714931d70e126e0f63782deb20
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>

commit df25466798d2487c933f7d2fc1d04ec968f4bcd2
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Fri Apr 24 15:23:37 2020 -0400

Rename the existing /opt/patch-vault filesystem to /opt/dc-vault
    
    The filesystem /opt/patch-vault is renamed to /opt/dc-vault so that
    it can be re-used to store FPGA images and software loads. Thus,
    necessary changes have been made to the ansible playbook files.
    
    Change-Id: I3358fe2d87c79785a8803815b1bbd2727ae80a24
    Story: 2006740
    Task: 39550
    Depends-On: https://review.opendev.org/#/c/723007/
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>

commit d3341102189031551e8d4d194e42d86d8878920f
Author: Jerry Sun <jerry.sun@windriver.com>
Date:   Sun Apr 19 21:30:57 2020 -0400

Enable applying applications after bootstrap
    
    This commit adds the ability to specify applications to be applied
    directly after bootstrap, before controller-0 has been unlocked.
    This is needed for cert manager.
    
    Currently, nginx and cert manager will be applied by default, with
    no overrides. The user can optionally specify overrides if they wish.
    
    NOTE: This aligns with long term direction for platform applications
    to:
    - move away from the existing platform application framework in sysinv
      due to wanting to decouple application behaviour from sysinv code
      in order to support such things as independent upgrades of these
      platform applications.
    - support auto-upload/apply of platform applications in either:
         a) bootstrap playbook, if app required for supporting bootstrap
            functions, or
         b) a post-bootstrap deployment-type playbook.
    In the case of cert-manager, in near future, it will be required at
    bootstrap to support initial configuration around generating
    certificates for kubernetes and https connections.
    
    Story: 2007360
    Task: 39471
    
    Change-Id: I91ee31c7c2d35c2a101b156ef8633fc69139938d
    Signed-off-by: Jerry Sun <jerry.sun@windriver.com>

commit 0a1c06a66bc286b306bfdf4ada7cf823787b7a94
Author: Tao Liu <tao.liu@windriver.com>
Date:   Tue Apr 21 15:36:29 2020 -0400

Increase wait timeout for service endpoints reconfig
    
    Install/bootstrap HP EL8000 as subcloud timed out, while
    waiting for endpoints reconfiguration to complete
    during bootstrapping.
    
    This server has a single processor which takes around 9 mins
    to apply the runtime manifest, which is greater timeout
    value than 450 seconds. In general, everything is slower on this
    particular hardware, e.g. install is slower and cli commands
    take almost twice longer to complete than other servers.
    
    This update increases the endpoints reconfiguration wait
    timeout to 720 seconds which provides a safety margin.
    
    Testcases:
    Install/bootstrap HP EL8000 as a subcloud.
    
    Closes-Bug: 1871699
    
    Change-Id: If284281aa13e79cc14d0369e44e8cacebb24f415
    Signed-off-by: Tao Liu <tao.liu@windriver.com>
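The timeout change is easy to sanity-check: about 9 minutes of manifest apply time is 540 seconds, which overruns the old 450-second timeout by 90 seconds and leaves 180 seconds of headroom under the new 720-second value. A trivial sketch of that arithmetic, using the figures from the commit message:

```python
OLD_TIMEOUT_S = 450
NEW_TIMEOUT_S = 720
OBSERVED_APPLY_S = 9 * 60  # "around 9 mins" on the HP EL8000


def headroom(timeout_s: int, observed_s: int = OBSERVED_APPLY_S) -> int:
    """Seconds left before the timeout fires (negative means it fires first)."""
    return timeout_s - observed_s
```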

commit abbf21f7fcef00e90e75d393f638a73d58b41adb
Author: Robert Church <robert.church@windriver.com>
Date:   Mon Dec 16 12:53:10 2019 -0500

Patch tiller deployment to provide environment validation
    
    There appears to be a race condition between when kubelet sees a pod and
    when kubelet sees a service. Due to this race, required environment
    variables are missing to allow tiller to function properly.
    
    See the comment at
    https://github.com/kubernetes/kubernetes/blob/v1.18.1/pkg/kubelet/kubelet_pods.go#L566
    
    This change patches the tiller deployment to make sure the four classes
    of environment variables are present prior to starting tiller. If any
    class of variables are not present in the environment, then exit. This
    will recreate the pod and will populate the correct environment for
    tiller to function.
    
    Since the upgrade to v1.18.1, this has been seen in simplex and duplex
    controller configurations.
    
    This will cover patching during initial provisioning via ansible and
    will be reverted once StarlingX moves to helm v3.
    
    Change-Id: I78e43459fedab611a67b8d9b6b3121b78ef048a6
    Partial-Bug: #1856078
    Signed-off-by: Robert Church <robert.church@windriver.com>

commit 9a8136b5b11a874da9a5b67519a59b27530b4aad
Author: Tao Liu <tao.liu@windriver.com>
Date:   Sat Apr 18 13:54:45 2020 -0400

Backup & restore: subcloud deploy files
    
    Backup the subcloud deploy files if available on the system.
    Restore the subcloud deploy files if included in the archive.
    
    Testcases:
    Backup & restore System Controller with the subcloud deploy
    files.
    Backup & restore a regular system without the subcloud
    deploy files
    
    Partial-Bug: 1864508
    
    Change-Id: Ic14f6c02dd187a082b03458b0a766c690400e317
    Signed-off-by: Tao Liu <tao.liu@windriver.com>

commit 40cfef7c417709c234e50a1a034fb4a11dbf180a
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Tue Apr 14 14:18:29 2020 +0300

Remove subcloud task from restore mode
    
    A task supposed to run only during bootstrap is running during restore.
    
    Keystone dc variables (dc_admin_user_id and dc_admin_project_id) are
    added during bootstrap to hieradata static.yaml file.
    When doing the restore the information is already present in the file in
    the backup archive.
    
    Partial-Bug: 1870389
    Change-Id: Iebab8dc059435c7e2b0f19947fedce88bd71bb65
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>

commit 5cdd394cb10c2c2d94174fdc32beb989290c6de9
Author: Stefan Dinescu <stefan.dinescu@windriver.com>
Date:   Thu Dec 19 15:23:23 2019 +0200

Resize DRBD resources when doing a restore
    
    In cases where we do a backup of a system that has non-default
    sizes for drbd-backed partitions, the restore fails when first
    unlocking controller-0.
    
    The normal resize procedure requires all controller nodes to
    be unlocked and available because the puppet manifest does
    not support resizing at unlock.
    
    To prevent the issue from occurring, as part of the restore
    procedure, we should resize the partitions on controller-0
    with the proper sizes found in sysinv. Controller-1 will
    automatically create the partitions with the proper sizes
    from the very start, so it will not need any resizes.
    
    Change-Id: Ia73452ce721514d393b486a659730d0ca7c0d7e5
    Closes-bug: 1854169
    Depends-on: https://review.opendev.org/#/c/699990
    Signed-off-by: Stefan Dinescu <stefan.dinescu@windriver.com>

commit a027bcf50a037166f84d897e22535c8dedf2590f
Author: Robert Church <robert.church@windriver.com>
Date:   Mon Mar 23 20:32:08 2020 -0400

Support for upversioning of k8s to v1.18.1
    
    Changes include:
    - Renamed the v1.16.2 versioned directories to v1.18.1.
    - Updated kubeadm.yaml to align the kubernetesVersion and enable the
      featureGate for multiple hugepage support
    
    Change-Id: I7241164f0185496093c0c8b5cb541fd09926b2ed
    Story: 2006999
    Task: 39334
    Depends-On: https://review.opendev.org/#/c/718568/
    Signed-off-by: Robert Church <robert.church@windriver.com>

commit 1b50022d55a9da2bbab284b1fdda2ddc78c30c79
Author: Shuicheng Lin <shuicheng.lin@intel.com>
Date:   Wed Apr 8 10:57:50 2020 +0800

Fix account be locked due to access registry without password
    
    Correct code to let exception be raised when password cannot be
    got from keyring. Account is locked due to exception is not raised,
    and client try to access registry with None password, which is
    incorrect.
    
    Closes-Bug: #1871141
    Change-Id: Ia68b4a4f25756fdad7a198a31d5870245ff9dc1a
    Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>

commit 9080db419d559d3d5d33c0a6459e9f5e8b7700e5
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Thu Apr 9 16:07:30 2020 +0300

Add registry.central host for DC subcloud restore
    
    During bootstrap management network is temporarily assigned on lo
    interface. Backup archive contains /etc/resolv.conf and /etc/hosts
    of an already unlocked controller. Before backup registry.central is
    resolved through dns (nameserver `floating central management`).
    
    During restore a temporary host for registry.central must be created.
    Since there is no reference of a backup/shadow management network that
    provides connectivity for such use cases the `floating central oam`
    can be used.
    
    Partial-Bug: 1870389
    
    Change-Id: I86166da31491736d6695e04fa287f79871975b55
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>

commit 46e9c405cb13972a3bf08cbfcdfe4181c12b3cfc
Author: Jerry Sun <jerry.sun@windriver.com>
Date:   Fri Mar 27 14:09:45 2020 -0400

Add default pod security policies
    
    This commit adds default pod security policies. We need this
    pod security plugin. Starting pod security plugin without any
    policies will result in all pods being denied. These default
    policies prevent the user from putting the system into an
    unusable state if they accidentally enable pod security
    policies without adding policies first.
    
    Story: 2007351
    Task: 38897
    
    Change-Id: Iac49f81ef44e6cb82ff884717888dfc1a7cd2a45
    Signed-off-by: Jerry Sun <jerry.sun@windriver.com>

commit f3340a3b5379f8c33de42aeaf11e96cc886df020
Author: Stefan Dinescu <stefan.dinescu@windriver.com>
Date:   Tue Apr 7 11:36:19 2020 +0300

Backup & restore: Restore license files
    
    STX offers support for installing license files through the
    "system license-install" command.
    
    While these licenses are not enforced, they are part of the
    backups created, but they are not restored when doing a full
    backup & restore.
    
    Since license is optional, it is not expected to always be
    present in the backup archive, so we only restore it if it
    is present in the archive.
    
    Change-Id: Ibd4cdcb53d1d55409d947c1f3af45659ed21a7ae
    Closes-bug: 1871034
    Signed-off-by: Stefan Dinescu <stefan.dinescu@windriver.com>

commit 5c542524e4cd9fb65da698c1d4cba4d50f56bdab
Author: Shuicheng Lin <shuicheng.lin@intel.com>
Date:   Wed Apr 1 15:58:07 2020 +0800

Add kubelet_vol_plugin_dir definition to fix ansible failure
    
    When do host-swact, upgrade-k8s-networking.yml will be called to check
    calico upgrade. And kubelet_vol_plugin_dir is missed in definition
    and cause ansible fail. Add definition from main.yml to fix it.
    
    Closes-Bug: 1870038
    Change-Id: I30287ebca7f0d4a1d3c5ee656136375a7b1c182f
    Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>

commit d6cff0496dcf52655eba340e1e57b1d973040edf
Author: Shuicheng Lin <shuicheng.lin@intel.com>
Date:   Thu Mar 12 14:34:09 2020 +0800

Refresh local registry auth info each time when access local registry
    
    Local registry uses admin account password as authentication info.
    And this password may be changed by openstack client at any time.
    When try to download images from local registry, auth info cannot
    be cached, otherwise it may lead to authentication failure in keystone,
    and account be locked at the end.
    For this specific case, there is host-swact first, then function
    "_upgrade_downgrade_kube_networking" in sysinv conductor is called.
    And upgrade-k8s-networking.yml is executed which will try to download
    kube network images from local registry. During this period, admin
    account password is changed. And lead to account be locked due to
    authentication failure in keystone.
    With this update, there is still possibility that password be changed
    just after get operation. And due to the images download are run in
    parallel with multi threads, so account lock may still hit. This
    change could minimize the issue rate, but cannot fix all.
    
    Closes-Bug: 1853017
    
    Change-Id: I686616937031a3f7ac6d65e5b118511dc549ab85
    Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
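The two registry fixes above (raise instead of using a `None` password, and do not cache credentials) combine into one pattern: look the password up on every access and fail loudly when it is missing. A minimal sketch, where `lookup` is a hypothetical callable standing in for the keyring query used by the real code and the other names are illustrative:

```python
from typing import Callable, Dict, Optional


class MissingRegistryPassword(RuntimeError):
    """Raised instead of attempting registry access with a None password."""


def registry_auth(lookup: Callable[[], Optional[str]],
                  username: str = "admin") -> Dict[str, str]:
    """Fetch fresh credentials for one local-registry access.

    `lookup` is called every time (nothing is cached), so a password changed
    by another client is picked up on the next access.
    """
    password = lookup()
    if not password:
        # Failing here avoids repeated bad logins that would lock the account.
        raise MissingRegistryPassword("no admin password available for the local registry")
    return {"username": username, "password": password}
```

As the commit itself notes, this narrows but does not close the window in which the password can change between the lookup and its use by parallel image downloads.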

tags:	added: in-f-centos8

Matt Peters (mpeters-wrs) on 2020-06-04

Changed in starlingx:
assignee:	Paul-Ionut Vaduva (pvaduva) → Matt Peters (mpeters-wrs)

Matt Peters (mpeters-wrs) on 2020-06-08

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2020-06-10: Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/734865

OpenStack Infra (hudson-openstack) wrote on 2020-06-10: Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/734879

OpenStack Infra (hudson-openstack) wrote on 2020-06-10: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/734865
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=966bce1398dd23be49acc9c55e833987148db454
Submitter: Zuul
Branch: master

commit 966bce1398dd23be49acc9c55e833987148db454
Author: Matt Peters <email address hidden>
Date: Mon Jun 8 09:59:01 2020 -0500

Fix kubernetes apiserver advertise address

    Set the kube-apiserver advertise address to the local
    controller cluster-host unit address to ensure kubeadm
    does not attempt to discover the default address which
    in most cases will be invalid.

    Set the kubeadm InitConfiguration advertiseAddress for
    the initial controller master node.

    Closes-Bug: 1877383
    Change-Id: I759234685966234bf987a9e06be77a5f793ee782
    Signed-off-by: Matt Peters <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2020-06-10: Fix merged to config (master)

Reviewed: https://review.opendev.org/734879
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=601612676e5883ed220ae86fb40dd1a02c76139c
Submitter: Zuul
Branch: master

commit 601612676e5883ed220ae86fb40dd1a02c76139c
Author: Matt Peters <email address hidden>
Date: Mon Jun 8 09:55:55 2020 -0500

Fix kubernetes apiserver advertise address

    The option --apiserver-advertise-address is added to the
    join command for controller nodes joining as master nodes.

    Closes-Bug: 1877383
    Depends-On: https://review.opendev.org/734865
    Change-Id: I1575da6d28d08731a8aaf4200f920f5e8f510fa0
    Signed-off-by: Matt Peters <email address hidden>

Bug watches keep track of this bug in other bug trackers.