Worker node double-reboots / fails after unlock due to docker start failure

Bug #1884111 reported by Ghada Khalil
This bug affects 3 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Chris Friesen

Bug Description

Brief Description
-----------------
After the initial unlock of a worker node, docker failed to start. The subsequent reboot succeeded.

Severity
--------
Major

Steps to Reproduce
------------------
Nothing special. Initial unlock of a worker node.
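
For reference, this is the standard host unlock issued from the active controller, e.g. (hostname is illustrative):

    system host-unlock worker-0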

Expected Behavior
------------------
Worker node should come up after the initial unlock.

Actual Behavior
----------------
Worker node fails and requires an additional reboot.

Reproducibility
---------------
Intermittent. Seen a few times so far.

System Configuration
--------------------
multi-node system

Branch/Pull Time/Commit
-----------------------
stx master load since May 2020

Last Pass
---------
N/A - Issue is intermittent

Timestamp/Logs
--------------
Logs are available from 2 occurrences:

May 2020:
2020-05-13T11:34:39.533 Error: 2020-05-13 11:34:38 +0000 /Stage[main]/Platform::Docker::Config/Service[docker]/ensure: change from stopped to running failed: Systemd start for docker failed!
2020-05-13T11:34:39.538 journalctl log for docker:
2020-05-13T11:34:39.545 -- No entries --

June 2020:
2020-06-17T15:13:28.172 Debug: 2020-06-17 15:13:28 +0000 /Stage[main]/Platform::Containerd::Config/Exec[restart-containerd]: The container Class[Platform::Containerd::Config] will propagate my refresh event
2020-06-17T15:13:28.178 Debug: 2020-06-17 15:13:28 +0000 Class[Platform::Containerd::Config]: The container Stage[main] will propagate my refresh event
2020-06-17T15:13:28.183 Debug: 2020-06-17 15:13:28 +0000 Exec[mount /dev/cgts-vg/scratch-lv](provider=posix): Executing check 'mount | awk '{print $3}' | grep -Fxq /scratch'
2020-06-17T15:13:28.186 Debug: 2020-06-17 15:13:28 +0000 Executing: 'mount | awk '{print $3}' | grep -Fxq /scratch'
2020-06-17T15:13:28.189 Debug: 2020-06-17 15:13:28 +0000 Exec[Change /scratch dir permissions](provider=posix): Executing 'chmod 0770 /scratch'
2020-06-17T15:13:28.193 Debug: 2020-06-17 15:13:28 +0000 Executing: 'chmod 0770 /scratch'
2020-06-17T15:13:28.198 Notice: 2020-06-17 15:13:28 +0000 /Stage[main]/Platform::Filesystem::Scratch/Platform::Filesystem[scratch-lv]/Exec[Change /scratch dir permissions]/returns: executed successfully
2020-06-17T15:13:28.200 Debug: 2020-06-17 15:13:28 +0000 /Stage[main]/Platform::Filesystem::Scratch/Platform::Filesystem[scratch-lv]/Exec[Change /scratch dir permissions]: The container Platform::Filesystem[scratch-lv] will propagate my refresh event
2020-06-17T15:13:28.203 Debug: 2020-06-17 15:13:28 +0000 Exec[Change /scratch dir group](provider=posix): Executing 'chgrp sys_protected /scratch'
2020-06-17T15:13:28.206 Debug: 2020-06-17 15:13:28 +0000 Executing: 'chgrp sys_protected /scratch'
2020-06-17T15:13:28.209 Notice: 2020-06-17 15:13:28 +0000 /Stage[main]/Platform::Filesystem::Scratch/Platform::Filesystem[scratch-lv]/Exec[Change /scratch dir group]/returns: executed successfully
2020-06-17T15:13:28.212 Debug: 2020-06-17 15:13:28 +0000 /Stage[main]/Platform::Filesystem::Scratch/Platform::Filesystem[scratch-lv]/Exec[Change /scratch dir group]: The container Platform::Filesystem[scratch-lv] will propagate my refresh event
2020-06-17T15:13:28.215 Debug: 2020-06-17 15:13:28 +0000 Platform::Filesystem[scratch-lv]: The container Class[Platform::Filesystem::Scratch] will propagate my refresh event
2020-06-17T15:13:28.217 Debug: 2020-06-17 15:13:28 +0000 Class[Platform::Filesystem::Scratch]: The container Stage[main] will propagate my refresh event
2020-06-17T15:13:28.220 Info: 2020-06-17 15:13:28 +0000 Class[Platform::Docker::Config]: Scheduling refresh of Service[docker]
2020-06-17T15:13:28.224 Info: 2020-06-17 15:13:28 +0000 Class[Platform::Docker::Config]: Scheduling refresh of Exec[enable-docker]
2020-06-17T15:13:28.227 Debug: 2020-06-17 15:13:28 +0000 Executing: '/usr/bin/systemctl is-active docker'
2020-06-17T15:13:28.231 Debug: 2020-06-17 15:13:28 +0000 Executing: '/usr/bin/systemctl is-enabled docker'
2020-06-17T15:13:28.238 Debug: 2020-06-17 15:13:28 +0000 Executing: '/usr/bin/systemctl unmask docker'
2020-06-17T15:13:28.446 Debug: 2020-06-17 15:13:28 +0000 Executing: '/usr/bin/systemctl start docker'
2020-06-17T15:13:28.459 Debug: 2020-06-17 15:13:28 +0000 Runing journalctl command to get logs for systemd start failure: journalctl -n 50 --since '5 minutes ago' -u docker --no-pager
2020-06-17T15:13:28.466 Debug: 2020-06-17 15:13:28 +0000 Executing: 'journalctl -n 50 --since '5 minutes ago' -u docker --no-pager'
2020-06-17T15:13:28.471 Error: 2020-06-17 15:13:28 +0000 Systemd start for docker failed!
2020-06-17T15:13:28.474 journalctl log for docker:
2020-06-17T15:13:28.476 -- No entries --
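
In both occurrences the journalctl query for the docker unit returns "No entries" even though puppet had just executed '/usr/bin/systemctl start docker'. That pattern suggests systemd refused the start before dockerd produced any output, for example because a unit docker depends on (such as containerd.service, which the fix below replaces) was not in a startable state. A minimal sketch of how one might confirm this on the failed node, using standard systemd tooling (these commands are illustrative and not taken from the original logs):

    # show unit state and last start result for both services
    systemctl status containerd.service docker.service --no-pager
    # list what docker.service requires; a failed dependency blocks the start
    systemctl list-dependencies docker.service --no-pager
    # check both units' journals together
    journalctl -u containerd -u docker --no-pager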

Test Activity
-------------
General Use

Workaround
----------
N/A - the worker node recovers, but requires multiple reboots

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - slow recovery due to multiple reboots. Should be fixed for stx.4.0.

Changed in starlingx:
assignee: nobody → Chris Friesen (cbf123)
description: updated
tags: added: stx.4.0 stx.containers
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/736853

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/736853
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=9400e1d2a6504d96c354876b7c76c7a2061b404c
Submitter: Zuul
Branch: master

commit 9400e1d2a6504d96c354876b7c76c7a2061b404c
Author: Chris Friesen <email address hidden>
Date: Wed Jun 17 19:39:25 2020 -0400

    switch to containerd.service file from upstream

    Back when the containerd package was first added to the build,
    the designer who added it didn't realize that the upstream source
    already contained a "containerd.service" file and so they added
    a separate one.

    It turns out that the upstream source *does* have a service file,
    and it also contains some additional settings that we might want
    to pick up. Furthermore, there are additional changes in more
    recent versions of the package. As such, we want to switch to
    use the service file from the upstream source instead of a custom
    one.

    The upstream service file wants to run /usr/local/bin/containerd
    so we just make a symlink at that location pointing to the
    current binary location.

    Closes-Bug: 1884111
    Change-Id: I5ed4f46a7bcceb0d0f71abb26590160fb62c0b7b
    Signed-off-by: Chris Friesen <email address hidden>
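
The symlink change described in the commit message above can be sketched roughly as follows. This is a hedged illustration only: the message says just "the current binary location", so /usr/bin/containerd is an assumed source path; the authoritative change is the review linked above.

    # upstream containerd.service execs /usr/local/bin/containerd,
    # so point that path at the packaged binary (source path assumed)
    ln -s /usr/bin/containerd /usr/local/bin/containerd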

Changed in starlingx:
status: In Progress → Fix Released