Controller-0 did not become active after installation unlock

Bug #1871638 reported by Difu Hu
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
During a fresh installation, controller-0 did not become active after it was unlocked.

Severity
--------
Major

Steps to Reproduce
------------------
DC installation
ssh to floating ip
source /etc/platform/openrc

Expected Behavior
------------------
controller-0 becomes the active controller

Actual Behavior
----------------
controller-0 did not become the active controller

Reproducibility
---------------
Not sure; suspected to be intermittent, because installation and configuration passed on a different DC system with a similar load.

System Configuration
--------------------
DC system
Lab-name: wp_18_21 (DC-3)

Branch/Pull Time/Commit
-----------------------
2020-04-07_04-10-00

Last Pass
---------
Same system: 2020-04-05_04-10-00
DC-1: 2020-04-07_00-10-00

Timestamp/Logs
--------------
controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-0:~$ sudo sm-dump
Password:
/var/run/sm/sm.db not available.

Test Activity
-------------
installation

Yang Liu (yliu12)
description: updated
Yang Liu (yliu12)
description: updated
Revision history for this message
Difu Hu (difuhu) wrote :
Revision history for this message
Bart Wensley (bartwensley) wrote :

After the controller is first unlocked, the controller manifest fails to apply. The failure happens because the docker daemon cannot be started:

2020-04-08T06:55:57.588 Debug: 2020-04-08 06:55:57 +0000 Exec[perform systemctl daemon reload for docker proxy](provider=posix): Executing 'systemctl daemon-reload'
2020-04-08T06:55:57.590 Debug: 2020-04-08 06:55:57 +0000 Executing: 'systemctl daemon-reload'
2020-04-08T06:55:57.592 Notice: 2020-04-08 06:55:57 +0000 /Stage[main]/Platform::Docker::Config/Exec[perform systemctl daemon reload for docker proxy]: Triggered 'refresh' from 1 events
2020-04-08T06:55:57.594 Info: 2020-04-08 06:55:57 +0000 /Stage[main]/Platform::Docker::Config/Exec[perform systemctl daemon reload for docker proxy]: Scheduling refresh of Service[docker]
2020-04-08T06:55:57.596 Debug: 2020-04-08 06:55:57 +0000 /Stage[main]/Platform::Docker::Config/Exec[perform systemctl daemon reload for docker proxy]: The container Class[Platform::Docker::Config] will propagate my refresh event
2020-04-08T06:55:57.598 Debug: 2020-04-08 06:55:57 +0000 Executing: '/usr/bin/systemctl is-active docker'
2020-04-08T06:55:57.600 Debug: 2020-04-08 06:55:57 +0000 Executing: '/usr/bin/systemctl is-enabled docker'
2020-04-08T06:55:57.602 Debug: 2020-04-08 06:55:57 +0000 Executing: '/usr/bin/systemctl unmask docker'
2020-04-08T06:55:57.643 Debug: 2020-04-08 06:55:57 +0000 Executing: '/usr/bin/systemctl start docker'
2020-04-08T06:55:57.816 Debug: 2020-04-08 06:55:57 +0000 Runing journalctl command to get logs for systemd start failure: journalctl -n 50 --since '5 minutes ago' -u docker --no-pager
2020-04-08T06:55:57.819 Debug: 2020-04-08 06:55:57 +0000 Executing: 'journalctl -n 50 --since '5 minutes ago' -u docker --no-pager'
2020-04-08T06:55:57.828 Error: 2020-04-08 06:55:57 +0000 Systemd start for docker failed!

daemon.log shows dockerd and containerd being restarted, but dockerd then gets stuck, emitting the following messages thousands of times:

2020-04-08T06:55:57.472 controller-0 dockerd[100339]: info time="2020-04-08T06:55:57.472875804Z" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" module=libcontainerd namespace=moby
2020-04-08T06:55:57.472 controller-0 dockerd[100339]: info time="2020-04-08T06:55:57.472870255Z" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" module=libcontainerd namespace=plugins.moby
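A hedged diagnostic sketch (not part of the original comment): since dockerd is repeatedly retrying the containerd socket, the commands below would confirm that containerd itself is down; the socket path is taken from the dockerd errors above.

# Check whether containerd is running and inspect its recent logs.
sudo systemctl status containerd --no-pager
ls -l /run/containerd/containerd.sock
sudo journalctl -u containerd -n 50 --no-pager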

The containerd/dockerd startup was modified recently:
https://review.opendev.org/#/c/716911
https://review.opendev.org/#/c/715593
https://review.opendev.org/#/c/717044

Paul or Bob should take a look at this LP. Note that the puppet exec that is failing is ...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - as per above analysis, this looks like it may have been introduced by recent code changes

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.4.0 stx.containers
Changed in starlingx:
assignee: nobody → Frank Miller (sensfan22)
Revision history for this message
Yang Liu (yliu12) wrote :

A similar issue was seen on a duplex subcloud on DC-1 with the 2020-04-07_00-10-00 load. New logs added to:
https://files.starlingx.kube.cengn.ca/launchpad/1871638

Revision history for this message
Frank Miller (sensfan22) wrote :

The dockerd failure occurs when containerd dies during dockerd startup. It looks like containerd cannot bind to an address: [2620:10a:a001:a103::1065]:0

2020-04-08T06:55:57.789 controller-0 dockerd[101117]: info time="2020-04-08T06:55:57.789513836Z" level=info msg="Loading containers: start."
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801433786Z" level=info msg="Start event monitor"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801482117Z" level=info msg="Start snapshots syncer"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801493162Z" level=info msg="Start streaming server"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801571334Z" level=error msg="Failed to start streaming server" error="listen tcp [2620:10a:a001:a103::1065]:0: bind: cannot assign requested address"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801635792Z" level=info msg="Stop CRI service"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801687544Z" level=info msg="Event monitor stopped"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801723120Z" level=info msg="Stream server stopped"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801742976Z" level=fatal msg="Failed to run CRI service" error="stream server error: listen tcp [2620:10a:a001:a103::1065]:0: bind: cannot assign requested address"
2020-04-08T06:55:57.806 controller-0 dockerd[101117]: info time="2020-04-08T06:55:57.806166493Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42096e1e0, TRANSIENT_FAILURE" module=grpc
2020-04-08T06:55:57.806 controller-0 dockerd[101117]: info time="2020-04-08T06:55:57.806192558Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42096e1e0, CONNECTING" module=grpc
2020-04-08T06:55:57.806 controller-0 systemd[1]: notice containerd.service: main process exited, code=exited, status=1/FAILURE

The next step is to determine why containerd cannot bind to that address.
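A hedged diagnostic sketch (not part of the original comment): an IPv6 address that is still in Duplicate Address Detection "tentative" state cannot be bound, which would produce exactly this "cannot assign requested address" error. Checking the flag on the OAM address would confirm it:

# If the address is still flagged "tentative", DAD has not finished and
# bind() will fail until the flag clears.
ip -6 addr show | grep -F '2620:10a:a001:a103::1065'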

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Frank Miller (sensfan22) → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Bin Qian (bqian20) wrote :

The issue was reproduced in a recent test, with identical error messages in puppet.log and daemon.log.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/720205

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/720205
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=9a18b7086035062bd326a279aea47c23c3c3f96e
Submitter: Zuul
Branch: master

commit 9a18b7086035062bd326a279aea47c23c3c3f96e
Author: Paul Vaduva <email address hidden>
Date: Wed Apr 15 09:56:42 2020 -0400

    Introduce a wait until network interfaces are ready

    The DAD (Duplicate Address Detection) mechanism keeps
    ipv6 network interface in tentative state until it finishes.
    During this time no binding to this interface address is
    possible and networking dependent services fail to start

    Change-Id: I9cfa604a0d75400f6d3c7172b3b973b0d50c3578
    Closes-bug: 1871638
    Signed-off-by: Paul Vaduva <email address hidden>
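A minimal sketch of the kind of wait the commit describes, assuming a generic interface name; the actual change is in the review linked above:

# Illustrative only, not the stx-puppet code. IFACE and the 30-second timeout
# are assumptions; poll until no address on the interface is still "tentative".
IFACE=oam0
for i in $(seq 1 30); do
    ip -6 addr show dev "$IFACE" | grep -q tentative || break
    sleep 1
done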

Changed in starlingx:
status: In Progress → Fix Released
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Yang Liu (yliu12) wrote :

This issue was not seen in the 2 most recent DC installations with the 04-19 load.

tags: removed: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729825

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/729825
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=d4617fbad74a05f2af81ee85a47565083991e6f8
Submitter: Zuul
Branch: f/centos8

commit 4134023ab84d8a635b118d5e3ff26ade3bbe535b
Author: Sharath Kumar K <email address hidden>
Date: Thu May 7 10:08:11 2020 +0200

    Tox and Zuul job for the bandit code scan in stx/stx-puppet

    Setting up the bandit tool for the scanning of HIGH severity issues
    in the python codes under Starlingx/stx-puppet folder.
    Expecting this merge will enable zuul job for CI/CD of bandit scan.

    Configuration files:
    1. tox.ini for adding bandit environment and command.
    2. test-requirements.txt for adding bandit version.
    3. .zuul.yaml file for adding bandit job and configuring under
       check job to run code scan every time before code commit.

    Test:
    Run tox -e bandit command inside the fault folder to validate the
    bandit scan and result.

    Story: 2007541
    Task: 39687
    Depends-On: https://review.opendev.org/#/c/721294/

    Change-Id: I2982268db2b5e75feeb287bc95420fedc9b0d816
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 65daac29e4635f32a57e80cd18f96fd59dc8ebe0
Author: Bin Qian <email address hidden>
Date: Tue May 12 22:39:21 2020 -0400

    DC cert manifest should only apply to controller nodes

    DC cert manifest should only apply to controller nodes on system
    controller.
    This fix is for DC with worker nodes in central cloud.

    Change-Id: I4233509a6f0afb3013c01e81dea6f655d9e15371
    Closes-Bug: 1878260
    Signed-off-by: Bin Qian <email address hidden>

commit 04a3cb8cbad9b1700286c5de67aa5d974cf54400
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 08:44:13 2020 +0000

    Changing permissions for conversion folder

    Adding writing permissions to '/opt/conversion' mountpoint
    so openstack image conversion can happen there.

    Change-Id: Id1a91db6570dcbed3b8068e79e72f5bb800f24ad
    Partial-bug: 1819688
    Signed-off-by: Elena Taivan <email address hidden>

commit 4e9153cf234e714e4bbc9a9eb3d9b55b2828145a
Author: Tao Liu <email address hidden>
Date: Mon May 4 14:30:30 2020 -0500

    Move subcloud audit to separate process

    Subcloud audit is being removed from the dcmanager-manager
    process and it is running in dcmanager-audit process.

    This update adds associated puppet config.

    Story: 2007267
    Task: 39640
    Depends-On: https://review.opendev.org/#/c/725627/

    Change-Id: Idd2e675126a01d6113597646ddd9eb4a0bc5be44
    Signed-off-by: Tao Liu <email address hidden>

commit b793518f65ae932f3974ff85b797f505b5ef1c2a
Author: Robert Church <email address hidden>
Date: Wed Apr 29 12:49:04 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    Without setting the stream_server_address in config.toml, containerd was
    binding to the OAM interface. Under most situations this resulted in
    containe...
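A hedged illustration of the setting this commit describes (the exact TOML section name depends on the containerd version, so treat it as a sketch):

# Verify the configured stream server address after the change; per the commit
# above it should be "::1" on an IPv6 system and "127.0.0.1" on IPv4.
sudo grep -n 'stream_server_address' /etc/containerd/config.toml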

tags: added: in-f-centos8