helm cmd failed after host swact

Bug #1875891 reported by Yang Liu
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
After a few host swacts, the helm command failed with the following error.

[sysadmin@controller-1 ~(keystone_admin)]$ helm ls
Error: forwarding ports: error upgrading connection: unable to upgrade connection: error dialing backend: dial tcp [2620:10a:a001:a103::11]:40870: connect: connection timed out
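
For triage, the port in the error (40870) can be checked directly on the standby controller; a minimal sketch, assuming sudo access on that node (the netstat output in the analysis below shows the listener is containerd, bound to the OAM floating address):

# Find which process owns the port the port-forward is trying to dial.
sudo netstat -tulpn6 | grep 40870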

Severity
--------
Major

Steps to Reproduce
------------------
(Not sure if the sysadmin password change plays a role here, but these steps were performed before the issue was seen; see the command sketch after this list.)
- change the sysadmin password
- swact
- revert the sysadmin password (multiple changes are needed to revert it due to password-rule restrictions)
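
A sketch of the corresponding commands, assuming a standard StarlingX CLI session (the swact command matches the one logged under Timestamp/Logs; the passwd steps are the regular Linux flow and depend on the password rules in effect):

passwd                              # change the sysadmin password
source /etc/platform/openrc
system host-swact controller-0      # swact to the standby controller
passwd                              # repeat as needed to revert the password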

TC-name: test_sysadmin_aging_and_swact[swact]

Expected Behavior
------------------
- system works as expected

Actual Behavior
----------------
- the helm ls command hung for a long time and eventually returned an error.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
AIO-DX
Lab-name: r430-3-4, wcp-78-79

Branch/Pull Time/Commit
-----------------------
stx master as of 2020-04-27

Last Pass
---------
Lab: WCP_7_10
Load: 2020-04-17_10-33-46

This is an intermittent issue, so the last pass may not be accurate.

Timestamp/Logs
--------------
[2020-04-29 02:48:19,046] 460 INFO MainThread test_linux_user_password_aging.execute_cmd:: Sending cmd:source /etc/platform/openrc; system host-swact controller-0

Test Activity
-------------
Regression Testing

Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/724384

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.4.0 stx.containers
Changed in starlingx:
importance: Undecided → High
Bob Church (rchurch) wrote :

Analysis:
- [1] was done to ensure that fixed IPs are used for the cluster and mgmt
  networks when multiple (floating) IPs are present in an IPv6 config.
  - We don't do this for the OAM network, so after controller-0 unlocks we end
    up using the OAM floating address instead of the fixed host address when
    containerd starts.
  - When controller-1 is unlocked, it is the standby controller with only one
    OAM-specific host address, so containerd starts successfully with the
    correct IP.
- The containerd template does not set the stream_server_address [2]
  (see the config check sketched below).
  - Should we be providing an address here that aligns with the address we
    want containerd to listen on (i.e. the OAM host-specific address)?
  - Do we want to use the OAM address here? Should this be the MGMT
    host-specific address or loopback?
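
A quick way to confirm what the rendered template produced on a node (a sketch; /etc/containerd/config.toml is assumed to be where the template in [2] is rendered):

# If stream_server_address is empty or unset, containerd falls back to IP
# discovery at startup and can end up bound to the OAM floating address.
grep stream_server_address /etc/containerd/config.toml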

I suspect that if you restarted containerd on controller-0 (currently standby)
it would get the correct address and things would work correctly.

We should specify an explicit address for containerd to bind to, to avoid IP
discovery when the containerd process starts up.
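
A minimal workaround sketch along those lines, assuming containerd is managed by systemd on the node:

# Restart containerd on the affected (standby) controller so it re-binds,
# then confirm the listener is no longer on the OAM floating address.
sudo systemctl restart containerd
sudo netstat -tulpn6 | grep containerd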

[1] https://review.opendev.org/#/c/715120/
[2] https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/templates/config.toml.erb#L29

Looking at the lab and related logs:

# Ansible sets up the following:
external_oam_subnet: 2620:10a:a001:a103::6:0/64
external_oam_gateway_address: 2620:10a:a001:a103::6:0
external_oam_floating_address: 2620:10A:A001:A103::11
external_oam_node_0_address: 2620:10A:A001:A103::8
external_oam_node_1_address: 2620:10A:A001:A103::9

# Controller-0 (standby) after ansible, unlocked, and swact to controller-1:
# Problem: Containerd is listening on the OAM controller address -> should be the node address 2620:10A:A001:A103::8
controller-0:~$ sudo netstat -tulpn6 | grep contain
tcp6 0 0 2620:10a:a001:a10:40870 :::* LISTEN 89515/containerd

controller-0:~$ sudo netstat -tulp6 | grep contain
tcp6 0 0 oamcontroller:40870 [::]:* LISTEN 89515/containerd

controller-0:~$ cat /etc/resolv.conf
nameserver face::1
nameserver 2620:10a:a001:a103::2

controller-0:~$ cat /etc/hosts | grep oamcontroller
2620:10a:a001:a103::11 oamcontroller

controller-0:~$ ip a
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:66:da:af:b5:88 brd ff:ff:ff:ff:ff:ff
    inet6 2620:10a:a001:a103::8/64 scope global <------- containerd should be listening here.
       valid_lft forever preferred_lft forever
    inet6 fe80::1a66:daff:feaf:b588/64 scope link
       valid_lft forever preferred_lft forever
14: vlan157@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d5:c0 brd ff:ff:ff:ff:ff:ff
    inet6 face::2/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d5c0/64 scope link
       valid_lft forever preferred_lft forever
15: vlan158@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d5:c0 brd ...


Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_78_79
Load: 2020-04-29_20-00-00

https://files.starlingx.kube.cengn.ca/launchpad/1875891

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/725394

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/725394
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=0dc9e173855792c38bec90360c0c4c066c36d66b
Submitter: Zuul
Branch: master

commit 0dc9e173855792c38bec90360c0c4c066c36d66b
Author: Robert Church <email address hidden>
Date: Mon May 4 12:59:49 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    This will explicitly update the containerd configuration to use the IP
    address of the loopback interface based on the system's network
    configuration.

    Change-Id: I76a4ad1c123b8b701cb1fa74b16609b50cdf9bd2
    Partial-Bug: #1875891
    Signed-off-by: Robert Church <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/724384
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=b793518f65ae932f3974ff85b797f505b5ef1c2a
Submitter: Zuul
Branch: master

commit b793518f65ae932f3974ff85b797f505b5ef1c2a
Author: Robert Church <email address hidden>
Date: Wed Apr 29 12:49:04 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    Without setting the stream_server_address in config.toml, containerd was
    binding to the OAM interface. Under most situations this resulted in
    containerd binding to the OAM fixed host address. But in an IPv6
    configuration there were occasions where after controller-0 unlock, the
    OAM floating IP would be used. When this happened, swacting away from
    controller-0 would move the OAM floating IP to controller-1 and break
    access to containers residing on controller-0.

    This will explicitly update the containerd configuration to use the IP
    address of the loopback interface based on the system's network
    configuration.

    This also removes any security concerns with containerd binding to the
    OAM interface.

    Change-Id: I0f914d738e94b525cf217712675d3b4575817d1d
    Depends-On: https://review.opendev.org/#/c/725394/
    Closes-Bug: #1875891
    Signed-off-by: Robert Church <email address hidden>
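
For reference, a verification sketch for a node running this fix (the rendered config path and sudo access are assumptions; expected values per the commit message above):

grep stream_server_address /etc/containerd/config.toml   # expect "::1" (IPv6) or "127.0.0.1" (IPv4)
sudo netstat -tulpn6 | grep containerd                   # the stream server should now listen on loopback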

Yang Liu (yliu12) wrote :

This issue is no longer seen in sanity on 3 different systems with the 2020-05-05_20-29-49 load.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729809

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729825

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/729825
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=d4617fbad74a05f2af81ee85a47565083991e6f8
Submitter: Zuul
Branch: f/centos8

commit 4134023ab84d8a635b118d5e3ff26ade3bbe535b
Author: Sharath Kumar K <email address hidden>
Date: Thu May 7 10:08:11 2020 +0200

    Tox and Zuul job for the bandit code scan in stx/stx-puppet

    Setting up the bandit tool for the scanning of HIGH severity issues
    in the python codes under Starlingx/stx-puppet folder.
    Expecting this merge will enable zuul job for CI/CD of bandit scan.

    Configuration files:
    1. tox.ini for adding bandit environment and command.
    2. test-requirements.txt for adding bandit version.
    3. .zuul.yaml file for adding bandit job and configuring under
       check job to run code scan every time before code commit.

    Test:
    Run tox -e bandit command inside the fault folder to validate the
    bandit scan and result.

    Story: 2007541
    Task: 39687
    Depends-On: https://review.opendev.org/#/c/721294/

    Change-Id: I2982268db2b5e75feeb287bc95420fedc9b0d816
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 65daac29e4635f32a57e80cd18f96fd59dc8ebe0
Author: Bin Qian <email address hidden>
Date: Tue May 12 22:39:21 2020 -0400

    DC cert manifest should only apply to controller nodes

    DC cert manifest should only apply to controller nodes on system
    controller.
    This fix is for DC with worker nodes in central cloud.

    Change-Id: I4233509a6f0afb3013c01e81dea6f655d9e15371
    Closes-Bug: 1878260
    Signed-off-by: Bin Qian <email address hidden>

commit 04a3cb8cbad9b1700286c5de67aa5d974cf54400
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 08:44:13 2020 +0000

    Changing permissions for conversion folder

    Adding writing permissions to '/opt/conversion' mountpoint
    so openstack image conversion can happen there.

    Change-Id: Id1a91db6570dcbed3b8068e79e72f5bb800f24ad
    Partial-bug: 1819688
    Signed-off-by: Elena Taivan <email address hidden>

commit 4e9153cf234e714e4bbc9a9eb3d9b55b2828145a
Author: Tao Liu <email address hidden>
Date: Mon May 4 14:30:30 2020 -0500

    Move subcloud audit to separate process

    Subcloud audit is being removed from the dcmanager-manager
    process and it is running in dcmanager-audit process.

    This update adds associated puppet config.

    Story: 2007267
    Task: 39640
    Depends-On: https://review.opendev.org/#/c/725627/

    Change-Id: Idd2e675126a01d6113597646ddd9eb4a0bc5be44
    Signed-off-by: Tao Liu <email address hidden>

commit b793518f65ae932f3974ff85b797f505b5ef1c2a
Author: Robert Church <email address hidden>
Date: Wed Apr 29 12:49:04 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    Without setting the stream_server_address in config.toml, containerd was
    binding to the OAM interface. Under most situations this resulted in
    containe...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/729809
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=73027425d4501a6b7785e91024c9e8ddbc03115d
Submitter: Zuul
Branch: f/centos8

commit 55c9afd075194f7669fa2a87e546f61034679b04
Author: Dan Voiculeasa <email address hidden>
Date: Wed May 13 14:19:52 2020 +0300

    Restore: disconnect etcd from ceph

    At the moment etcd is restored only if ceph data is kept.
    Etcd should be restored regardless if ceph data is kept or wiped.

    Story: 2006770
    Task 39751
    Change-Id: I9dfb1be0a83c3fdc5f1b29cbb974c5e0e2236ad3
    Signed-off-by: Dan Voiculeasa <email address hidden>

commit 003ddff574c74adf11cf8e4758e93ba0eed45a6a
Author: Don Penney <email address hidden>
Date: Fri May 8 11:35:58 2020 -0400

    Add playbook for updating static images

    This commit introduces a new playbook, upgrade-static-images.yml, used
    for downloading updating images and pushing to the local registry.

    Change-Id: I8884440261a5a4e27b40398e5a75c9d03b09d4ba
    Story: 2006781
    Task: 39706
    Signed-off-by: Don Penney <email address hidden>

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

    Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>

commit bd0f14a7dfb206ccaa3ce0f5e7d9034703b3403c
Author: Robert Church <email address hidden>
Date: Tue May 5 15:11:15 2020 -0400

    Provide an update strategy for Tiller deployment

    In the case of a simplex controller configuration the current patching
    strategy for the Tiller environment will fail as the tiller ports will
    be in use when the new deployment is attempted to be applied. The
    resulting tiller pod will be stuck in a Pending state.

    This will be observed if the node becomes ready after 'helm init'
    installs the initial deployment and before the deployment is patched for
    environment checks.

    The deployment strategy provided by 'helm init' is unspecified. This
    change will allow one additional pod (current + new) and one unavailable
    pod (current) during an update. The maxUnavailable setting allows the
    tiller pod to be deleted which will release its ports, thus allowing the
    patch deployment to spin up an new pod to a Running state.

    Change-Id: I83c43c52a77...
