AIO-DX: dockerd did not recover after power cycling both controllers
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| StarlingX | Fix Released | Medium | Paul-Ionut Vaduva | |
Bug Description
Brief Description
-----------------
On an AIO-DX Distributed Cloud system controller, after powering both system controller nodes off and on, the ssh connection was lost for 50 minutes.
Investigation of the above issue (bug 1868604) revealed that one of the failures was that applying the worker manifest on controller-0 causes containerd to be restarted, which also restarts dockerd and kubelet. Dockerd never came back. I believe the restart of containerd was introduced with the kata feature (https:/
See bug 1868604 for the full analysis.
Severity
--------
Major
Steps to Reproduce
------------------
In Distributed Cloud, power off/on both (AIO-DX) system controller nodes, check ssh connection.
Expected Behavior
------------------
The ssh connection should resume within 5 minutes after the nodes boot up.
Actual Behavior
----------------
ssh reconnected after 50 minutes.
Reproducibility
---------------
Unknown - first time this was seen in sanity; will monitor.
System Configuration
-------
DC system (AIO-DX system controller)
Lab-name: DC-3
Branch/Pull Time/Commit
-------
2020-03-20_00-10-00
Last Pass
---------
Last passed on same system with following load:
Load: 2020-03-14_04-10-00
Timestamp/Logs
--------------
See bug 1868604
Test Activity
-------------
Sanity
tags: added: stx.containers
Changed in starlingx:
assignee: Frank Miller (sensfan22) → Paul-Ionut Vaduva (pvaduva)
I did some digging into this issue and believe the root cause is controller/worker manifest overlap. Looking at the code, the docker.pp manifest is applied on both controller and worker nodes. I think that is a mistake and the likely cause of the issue. Normally, for manifests that must be included for both controllers and worker nodes, we have separate "controller" and "worker" classes in the manifest, with a hook in the worker class that drops out if we are running on an AIO host. For example, kubernetes.pp does this to avoid reconfiguring kubelet, which was already configured when the manifest was applied on the controller.
The containerd.pp manifest simply copied what docker.pp was doing and added a restart of the containerd service to pick up the modified config (a comment there notes that dockerd may have already started containerd, which is why the restart is required).
So the solution here seems to be to update both the docker.pp and containerd.pp manifests to have controller/worker classes, and to have the worker class exit immediately when running on an AIO host.
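As a rough illustration of that split, a manifest following the kubernetes.pp pattern might look like the sketch below. This is a hypothetical fragment only: the class names, the AIO guard condition, and the fact name (`$::is_controller_host`) are placeholders, not the actual StarlingX manifest contents.

```puppet
# Hypothetical sketch of the proposed controller/worker split.
# All names here are illustrative assumptions, not real StarlingX code.

class platform::containerd::controller {
  # Full containerd configuration, applied when the controller
  # manifest runs (this covers AIO hosts as well).
  service { 'containerd':
    ensure => running,
    enable => true,
  }
}

class platform::containerd::worker {
  # Hook: on an AIO host the controller class above has already
  # configured and restarted containerd, so drop out here to avoid
  # a second restart that would also take down dockerd and kubelet.
  unless str2bool($::is_controller_host) {
    exec { 'restart containerd':
      command => '/usr/bin/systemctl restart containerd',
    }
  }
}
```

The key point is that the restart lives only in the guarded worker path, so applying the worker manifest on an AIO controller becomes a no-op for containerd.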