AIO Plus system controller-1 failed after initial unlock

Bug #1868728 reported by Peng Peng
This bug affects 1 person
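+-----------+--------------+------------+-------------+-----------+
| Affects   | Status       | Importance | Assigned to | Milestone |
+-----------+--------------+------------+-------------+-----------+
| StarlingX | Fix Released | Medium     | Bob Church  |           |
+-----------+--------------+------------+-------------+-----------+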

Bug Description

Brief Description
-----------------
In an AIO-DX plus system, after controller-1 was initially unlocked, it went to "failed" status.

Severity
--------
Major

Steps to Reproduce
------------------
During initial setup, after controller-0 is unlocked/available,
unlock controller-1.

Expected Behavior
------------------
controller-1 status unlocked/available

Actual Behavior
----------------
controller-1 status unlocked/failed

Reproducibility
---------------
Unknown. This is the first time this has been seen in sanity; will monitor.

System Configuration
--------------------
Multi-node system

Lab-name: WP_8-12

Branch/Pull Time/Commit
-----------------------
2020-03-23_20-00-00

Last Pass
---------
2020-03-22_16-04-38

Timestamp/Logs
--------------

[2020-03-24 05:50:49,028]

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-2    | worker      | locked         | disabled    | online       |
| 3  | controller-1 | controller  | unlocked       | disabled    | failed       |
| 4  | compute-1    | worker      | locked         | disabled    | online       |
| 5  | compute-0    | worker      | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+
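For anyone monitoring a sanity run for this failure, the table above can be scanned for hosts whose availability is "failed". The following is a minimal sketch only; the function names `parse_host_list` and `failed_hosts` are hypothetical helpers, not part of any StarlingX tooling, and the sample text is copied from the output above.

```python
def parse_host_list(table: str) -> list[dict[str, str]]:
    """Parse the ASCII table printed by `system host-list` into row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.splitlines()
        if line.strip().startswith("|")  # skip the +----+ border lines
    ]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def failed_hosts(table: str) -> list[str]:
    """Return hostnames whose availability column reads 'failed'."""
    return [
        host["hostname"]
        for host in parse_host_list(table)
        if host.get("availability") == "failed"
    ]


SAMPLE = """\
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 3  | controller-1 | controller  | unlocked       | disabled    | failed       |
+----+--------------+-------------+----------------+-------------+--------------+
"""

if __name__ == "__main__":
    print(failed_hosts(SAMPLE))
```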

Test Activity
-------------
installation

Yang Liu (yliu12) wrote :

The same load installed successfully on other Duplex systems, such as wcp99-103 (DX+worker) and a Distributed Cloud system (AIO-DX system controllers).

description: updated
Dariush Eslimi (deslimi) wrote :

Assign to Eric for triage.

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Eric MacDonald (rocksolidmtce) wrote :

Maintenance experienced a controller-1 configuration failure as reported by the following log ...

2020-03-24T17:00:58.960 :Error : controller-1 critical config failure

The specific configuration failure is indicated in the controller-1 puppet logs as a 'kubeadm config' failure.

2020-03-24-16-54-53_controller/puppet.log:2020-03-24T16:57:31.843 Error: 2020-03-24 16:57:31 +0000 kubeadm config images list --kubernetes-version v1.16.2 --image-repository registry.local:9001/k8s.gcr.io | xargs -i crictl pull --creds admin:Li69nux* {} returned 123 instead of one of [0]
2020-03-24-16-54-53_controller/puppet.log:2020-03-24T16:57:32.047 Error: 2020-03-24 16:57:31 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[pre pull k8s images]/returns: change from notrun to 0 failed: kubeadm config images list --kubernetes-version v1.16.2 --image-repository registry.local:9001/k8s.gcr.io | xargs -i crictl pull --creds admin:Li69nux* {} returned 123 instead of one of [0]
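A note on the return code: GNU xargs exits with status 123 when any command it invoked exited with a status in the range 1-125, so the log above suggests that at least one `crictl pull` failed rather than xargs itself. The sketch below pulls the actual and expected return codes out of such a puppet log line. It is an illustration only; `exec_failure` is a hypothetical helper and the sample line is an abbreviated stand-in for the log entries above.

```python
import re

# Matches the tail of a puppet Exec failure: "returned N instead of one of [a, b]"
RETURNED = re.compile(r"returned (\d+) instead of one of \[([\d,\s]*)\]")


def exec_failure(line: str):
    """Return (actual_code, expected_codes) for a puppet Exec failure line, else None."""
    match = RETURNED.search(line)
    if match is None:
        return None
    actual = int(match.group(1))
    expected = [int(code) for code in match.group(2).split(",") if code.strip()]
    return actual, expected


# Abbreviated, hypothetical sample in the same shape as the puppet.log entries.
SAMPLE_LINE = (
    "2020-03-24T16:57:31.843 Error: 2020-03-24 16:57:31 +0000 "
    "kubeadm config images list | xargs -i crictl pull {} "
    "returned 123 instead of one of [0]"
)

if __name__ == "__main__":
    print(exec_failure(SAMPLE_LINE))
```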

Ghada Khalil (gkhalil) wrote :

Seems to be an issue with image pulls. Assigning to Bob for further investigation.

Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Bob Church (rchurch)
Ghada Khalil (gkhalil)
tags: added: stx.containers
tags: added: stx.4.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/715593

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master)

Fix proposed to branch: master
Review: https://review.opendev.org/717044

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/715593
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=296bd3d1f733e10b11f3dc2601e9fa1f08c9c719
Submitter: Zuul
Branch: master

commit 296bd3d1f733e10b11f3dc2601e9fa1f08c9c719
Author: Robert Church <email address hidden>
Date: Fri Mar 27 23:38:24 2020 -0400

    Ensure network config has been applied before containerd

    If containerd is started prior to networking providing a default route,
    the containerd cri plugin will fail to load with the following message:

    msg="failed to load plugin io.containerd.grpc.v1.cri" error="failed to
    create CRI service: failed to create stream server: failed to get stream
    server address: no default routes found in \"/proc/net/route\" or
    \"/proc/net/ipv6_route\""

    and the status of the plugin will be in 'error'

    TYPE                     ID     PLATFORMS      STATUS
    io.containerd.grpc.v1    cri    linux/amd64    error

    This will prevent any crictl image pulls from working.

    This change will ensure the network config is applied prior to
    configuring and restarting containerd.

    Docker and containerd also have a dependency, so also ensure the
    network config is applied prior to configuring and restarting
    docker.

    Change-Id: I94a3349b438816d21b147cbd62054862d07d8bee
    Partial-Bug: #1868728
    Signed-off-by: Robert Church <email address hidden>
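The precondition described in the commit message above (a default route must exist before containerd starts) can be checked by hand. The following is a rough sketch, not containerd's own logic: it treats an entry in `/proc/net/route` whose Destination and Mask columns are both all zeros as an IPv4 default route. The function name and the sample interface names and addresses are hypothetical, and the IPv6 table is not covered.

```python
def has_default_ipv4_route(route_table: str) -> bool:
    """Return True if text in /proc/net/route format contains a 0.0.0.0/0 entry."""
    lines = route_table.splitlines()
    if not lines:
        return False
    header = lines[0].split()
    dest_idx = header.index("Destination")
    mask_idx = header.index("Mask")
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= max(dest_idx, mask_idx):
            continue  # skip short or malformed rows
        # Addresses are printed as 8 hex digits; all zeros means 0.0.0.0.
        if fields[dest_idx] == "00000000" and fields[mask_idx] == "00000000":
            return True
    return False


HEADER = "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
# Hypothetical tables: one with a default route via a gateway, one with only a subnet route.
WITH_DEFAULT = HEADER + (
    "eth0\t00000000\t0102A8C0\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
    "eth0\t0002A8C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"
)
WITHOUT_DEFAULT = HEADER + "eth0\t0002A8C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"

if __name__ == "__main__":
    # On a live Linux host, pass the real file contents instead of the samples.
    print(has_default_ipv4_route(WITH_DEFAULT))
    print(has_default_ipv4_route(WITHOUT_DEFAULT))
```

On a running host the same function could be fed `open("/proc/net/route").read()`; a False result before containerd starts would be consistent with the CRI plugin error quoted in the commit message.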

OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/717044
Committed: https://git.openstack.org/cgit/starlingx/config-files/commit/?id=966611331efccf09699c9bd326928ad2f03365a0
Submitter: Zuul
Branch: master

commit 966611331efccf09699c9bd326928ad2f03365a0
Author: Robert Church <email address hidden>
Date: Thu Apr 2 09:51:46 2020 -0500

    Defer monitoring of docker until after last config manifest is complete

    To prevent pmon from restarting docker at an inopportune time, align
    docker's pmon config to be the same as kubelet's config which is to
    defer monitoring until the after the worker manifest is complete on
    an AIO.

    Change-Id: Icf3859645295bda4238f8e5f79ca6f7faf603561
    Depends-On: https://review.opendev.org/#/c/715593/
    Closes-Bug: #1868728
    Signed-off-by: Robert Church <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Yang Liu (yliu12)
tags: added: stx.retestneeded
Peng Peng (ppeng) wrote :

Verified on
Lab: WP_8_12
Load: 2020-04-20_20-00-00

tags: removed: stx.retestneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729813

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729825

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/729825
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=d4617fbad74a05f2af81ee85a47565083991e6f8
Submitter: Zuul
Branch: f/centos8

commit 4134023ab84d8a635b118d5e3ff26ade3bbe535b
Author: Sharath Kumar K <email address hidden>
Date: Thu May 7 10:08:11 2020 +0200

    Tox and Zuul job for the bandit code scan in stx/stx-puppet

    Setting up the bandit tool for the scanning of HIGH severity issues
    in the python codes under Starlingx/stx-puppet folder.
    Expecting this merge will enable zuul job for CI/CD of bandit scan.

    Configuration files:
    1. tox.ini for adding bandit environment and command.
    2. test-requirements.txt for adding bandit version.
    3. .zuul.yaml file for adding bandit job and configuring under
       check job to run code scan every time before code commit.

    Test:
    Run tox -e bandit command inside the fault folder to validate the
    bandit scan and result.

    Story: 2007541
    Task: 39687
    Depends-On: https://review.opendev.org/#/c/721294/

    Change-Id: I2982268db2b5e75feeb287bc95420fedc9b0d816
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 65daac29e4635f32a57e80cd18f96fd59dc8ebe0
Author: Bin Qian <email address hidden>
Date: Tue May 12 22:39:21 2020 -0400

    DC cert manifest should only apply to controller nodes

    DC cert manifest should only apply to controller nodes on system
    controller.
    This fix is for DC with worker nodes in central cloud.

    Change-Id: I4233509a6f0afb3013c01e81dea6f655d9e15371
    Closes-Bug: 1878260
    Signed-off-by: Bin Qian <email address hidden>

commit 04a3cb8cbad9b1700286c5de67aa5d974cf54400
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 08:44:13 2020 +0000

    Changing permissions for conversion folder

    Adding writing permissions to '/opt/conversion' mountpoint
    so openstack image conversion can happen there.

    Change-Id: Id1a91db6570dcbed3b8068e79e72f5bb800f24ad
    Partial-bug: 1819688
    Signed-off-by: Elena Taivan <email address hidden>

commit 4e9153cf234e714e4bbc9a9eb3d9b55b2828145a
Author: Tao Liu <email address hidden>
Date: Mon May 4 14:30:30 2020 -0500

    Move subcloud audit to separate process

    Subcloud audit is being removed from the dcmanager-manager
    process and it is running in dcmanager-audit process.

    This update adds associated puppet config.

    Story: 2007267
    Task: 39640
    Depends-On: https://review.opendev.org/#/c/725627/

    Change-Id: Idd2e675126a01d6113597646ddd9eb4a0bc5be44
    Signed-off-by: Tao Liu <email address hidden>

commit b793518f65ae932f3974ff85b797f505b5ef1c2a
Author: Robert Church <email address hidden>
Date: Wed Apr 29 12:49:04 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    Without setting the stream_server_address in config.toml, containerd was
    binding to the OAM interface. Under most situations this resulted in
    containe...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (f/centos8)

Reviewed: https://review.opendev.org/729813
Committed: https://git.openstack.org/cgit/starlingx/config-files/commit/?id=fd87b6f58a1774fafa44a9f3574d3d0da55f2a69
Submitter: Zuul
Branch: f/centos8

commit 23ff1680680331d87e797a7744fa9e7a154ab324
Author: Tao Liu <email address hidden>
Date: Mon May 4 14:27:26 2020 -0500

    Define syslog file for dcmanager-audit

    Subcloud audit is being removed from the dcmanager-manager
    process and it is running in dcmanager-audit process.

    This update adds associated syslog config.

    Story: 2007267
    Task: 39638
    Depends-On: https://review.opendev.org/#/c/725627/

    Change-Id: I95d421b19096c7b321e9559bdc776081185bfd48
    Signed-off-by: Tao Liu <email address hidden>

commit 3ada8ea05cc348658f3a78c7669615c2db36ee45
Author: Lin, Shuicheng <email address hidden>
Date: Sat Apr 18 06:21:11 2020 +0000

    Configure containerd to be monitored by pmon

    Conf file parameter's value follow docker setting.

    Closes-Bug: #1869811

    Change-Id: Ib32d80d688ffba0e51a4e48bd63564282ac535b8
    Signed-off-by: Lin, Shuicheng <email address hidden>

commit 8683043a240aa14b15928a61ed40158ed75421f4
Author: Sharath Kumar K <email address hidden>
Date: Mon Apr 6 10:00:30 2020 +0200

    De-branding in starlingx/config-files: Titanium Cloud -> StarlingX

    1. Rename Titanium Cloud to StarlingX for .service file

    Test:
    After the de-brand change, bootimage.iso has built in the flock layer
    and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch5 changes.

    Story: 2006387
    Task: 39271

    Change-Id: I642fad20048435b2bbaa34da235bd0597fc26525
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 966611331efccf09699c9bd326928ad2f03365a0
Author: Robert Church <email address hidden>
Date: Thu Apr 2 09:51:46 2020 -0500

    Defer monitoring of docker until after last config manifest is complete

    To prevent pmon from restarting docker at an inopportune time, align
    docker's pmon config to be the same as kubelet's config which is to
    defer monitoring until the after the worker manifest is complete on
    an AIO.

    Change-Id: Icf3859645295bda4238f8e5f79ca6f7faf603561
    Depends-On: https://review.opendev.org/#/c/715593/
    Closes-Bug: #1868728
    Signed-off-by: Robert Church <email address hidden>

commit 8eaf729eb46fb53dc61a22e10db0c83aa3e1c355
Author: Gerry Kopec <email address hidden>
Date: Mon Mar 30 00:52:12 2020 -0400

    Remove dcorch-snmp

    dcorch-snmp process/service is being removed from distributed cloud.
    Remove associated syslog config.

    Change-Id: I46d9fa35edb34378dfa273bbe59aa141a9efce7e
    Story: 2007267
    Task: 39191
    Depends-On: https://review.opendev.org/#/c/715765
    Signed-off-by: Gerry Kopec <email address hidden>
