AIO-DX: dockerd did not recover after power cycling both controllers

Bug #1869193 reported by Bart Wensley
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
In AIO-DX Distributed Cloud system controller, after power off/on both system controller nodes, ssh connection lost for 50 mins.

Investigation of the above issue (bug 1868604) revealed that one of the failures was that the application of the worker manifest on controller-0 causes containerd to be restarted, which also restarts dockerd and kubelet. Dockerd never came back. I believe the restart of containerd was introduced with the kata feature (https://review.opendev.org/#/c/703266/).

See bug 1868604 for the full analysis.

Severity
--------
Major

Steps to Reproduce
------------------
In Distributed Cloud, power off/on both (AIO-DX) system controller nodes, check ssh connection.

Expected Behavior
------------------
ssh connection should be resume after nodes boot up, within 5 mins

Actual Behavior
----------------
ssh re-connected in 50 mins

Reproducibility
---------------
Unknown - first time this is seen in sanity, will monitor

System Configuration
--------------------
DC system (AIO-DX system controller)

Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-03-20_00-10-00

Last Pass
---------
Last passed on same system with following load:
Load: 2020-03-14_04-10-00

Timestamp/Logs
--------------
See bug 1868604

Test Activity
-------------
Sanity

Ghada Khalil (gkhalil)
tags: added: stx.containers
Revision history for this message
Bart Wensley (bartwensley) wrote :

I did some digging into this issue - believe the root cause is controller/worker manifest overlap. Looking at the code, the docker.pp manifest is applied on both controller and worker nodes. I think that is a mistake and probably the cause of the issue. Normally, for manifests that must be included for controllers and worker nodes, we have different classes in the manifest for “controller” and “worker” and a hook in the worker class that drops out if we are running on an AIO host. For example, this is done in kubernetes.pp to avoid reconfiguring kubelet, which would have been done already when the manifest was applied on the controller.

The containerd.pp manifest just copied what docker.pp was doing and added a restart for the containerd service to pick up modified config (it says that dockerd may have already started containerd, which requires the restart).

So… seems like the solution here would be to update both the docker.pp and containerd.pp manifests to have controller/worker classes and have the worker class exit immediately if running on an AIO host.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue w/ dead office recovery

Changed in starlingx:
status: New → Triaged
assignee: nobody → Frank Miller (sensfan22)
importance: Undecided → Medium
tags: added: stx.4.0
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Frank Miller (sensfan22) → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/716911

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/716911
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=93d22c438ed6939bd4b1723b37e23794eacb7006
Submitter: Zuul
Branch: master

commit 93d22c438ed6939bd4b1723b37e23794eacb7006
Author: Paul Vaduva <email address hidden>
Date: Thu Apr 2 13:06:57 2020 +0300

    Configure docker and containerd once per AIO deploy

    Prevent a double configuration of docker and containerd
    for AIO scenarios.

    Change-Id: I0cb9fdde5acf8d5d44d526e70ae4af726932709f
    Closes-bug: 1869193
    Signed-off-by: Paul Vaduva <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729825

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)
Download full text (16.7 KiB)

Reviewed: https://review.opendev.org/729825
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=d4617fbad74a05f2af81ee85a47565083991e6f8
Submitter: Zuul
Branch: f/centos8

commit 4134023ab84d8a635b118d5e3ff26ade3bbe535b
Author: Sharath Kumar K <email address hidden>
Date: Thu May 7 10:08:11 2020 +0200

    Tox and Zuul job for the bandit code scan in stx/stx-puppet

    Setting up the bandit tool for the scanning of HIGH severity issues
    in the python codes under Starlingx/stx-puppet folder.
    Expecting this merge will enable zuul job for CI/CD of bandit scan.

    Configuration files:
    1. tox.ini for adding bandit environment and command.
    2. test-requirements.txt for adding bandit version.
    3. .zuul.yaml file for adding bandit job and configuring under
       check job to run code scan every time before code commit.

    Test:
    Run tox -e bandit command inside the fault folder to validate the
    bandit scan and result.

    Story: 2007541
    Task: 39687
    Depends-On: https://review.opendev.org/#/c/721294/

    Change-Id: I2982268db2b5e75feeb287bc95420fedc9b0d816
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 65daac29e4635f32a57e80cd18f96fd59dc8ebe0
Author: Bin Qian <email address hidden>
Date: Tue May 12 22:39:21 2020 -0400

    DC cert manifest should only apply to controller nodes

    DC cert manifest should only apply to controller nodes on system
    controller.
    This fix is for DC with worker nodes in central cloud.

    Change-Id: I4233509a6f0afb3013c01e81dea6f655d9e15371
    Closes-Bug: 1878260
    Signed-off-by: Bin Qian <email address hidden>

commit 04a3cb8cbad9b1700286c5de67aa5d974cf54400
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 08:44:13 2020 +0000

    Changing permissions for conversion folder

    Adding writing permissions to '/opt/conversion' mountpoint
    so openstack image conversion can happen there.

    Change-Id: Id1a91db6570dcbed3b8068e79e72f5bb800f24ad
    Partial-bug: 1819688
    Signed-off-by: Elena Taivan <email address hidden>

commit 4e9153cf234e714e4bbc9a9eb3d9b55b2828145a
Author: Tao Liu <email address hidden>
Date: Mon May 4 14:30:30 2020 -0500

    Move subcloud audit to separate process

    Subcloud audit is being removed from the dcmanager-manager
    process and it is running in dcmanager-audit process.

    This update adds associated puppet config.

    Story: 2007267
    Task: 39640
    Depends-On: https://review.opendev.org/#/c/725627/

    Change-Id: Idd2e675126a01d6113597646ddd9eb4a0bc5be44
    Signed-off-by: Tao Liu <email address hidden>

commit b793518f65ae932f3974ff85b797f505b5ef1c2a
Author: Robert Church <email address hidden>
Date: Wed Apr 29 12:49:04 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    Without setting the stream_server_address in config.toml, containerd was
    binding to the OAM interface. Under most situations this resulted in
    containe...

tags: added: in-f-centos8
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.