Controller-0 Not Ready after force rebooting active controller (Controller-1)

Bug #1887438 reported by Andrew Vaillancourt
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Chris Friesen

Bug Description

Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.

Severity
--------
Major

Steps to Reproduce
------------------
Force reboot active controller

Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in a ready state, and the system pods, applications, and any test pods are up and running.

Actual Behavior
----------------

After force rebooting controller-1, controller-0 did not reach 'Ready' status.

controller-0:~$ kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
compute-0      Ready      <none>   8h    v1.18.1
compute-1      Ready      <none>   8h    v1.18.1
compute-2      Ready      <none>   8h    v1.18.1
controller-0   NotReady   master   9h    v1.18.1
controller-1   Ready      master   8h    v1.18.1

Following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m

Reproducibility
---------------
Reproduced in the same lab with two different builds.

System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75

Branch/Pull Time/Commit
-----------------------
First failure:

BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"

Second failure:
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"

Last Pass
---------
With build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and application pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.

Timestamp/Logs
--------------
Collect-all logs and describe output for the unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438

Test Activity
-------------
System Test Automation Development

Workaround
----------
Possible workaround:

From https://github.com/kubernetes/kubernetes/issues/93268:

"... after all nodes were running again [...] restarting kubelet on the "NotReady" node was enough to make it go "Ready" again."

summary: Controller-0 Not Ready after force rebooting active controller
- (Controller-1))
+ (Controller-1)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Can you provide the output of kubectl get nodes?

tags: added: stx.containers
Changed in starlingx:
status: New → Incomplete
description: updated
Revision history for this message
Chris Friesen (cbf123) wrote :

I've reproduced the same behaviour (node stuck in "NotReady" status after forced reboot of active controller) in another lab.

I tracked it down to a bug in upstream Kubernetes, which in turn is caused by a bug in upstream Go.

I opened a Kubernetes issue to track the inclusion of the go fix into Kubernetes. The issue is at https://github.com/kubernetes/kubernetes/issues/93268

The bug report for go is at https://github.com/golang/go/issues/40213 and there are two patches currently in review to fix it.

We're still sorting out how we plan on dealing with this issue.

Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 - plan is to pick a new version of K8s in the next release to get the required fixes.

Changed in starlingx:
importance: Undecided → Low
tags: added: stx.5.0
Changed in starlingx:
importance: Low → Medium
assignee: nobody → Frank Miller (sensfan22)
description: updated
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Frank Miller (sensfan22) → Chris Friesen (cbf123)
Revision history for this message
Frank Miller (sensfan22) wrote :

Moving the tag to stx.6.0, as the decision was made not to upversion Kubernetes in stx.5.0.

tags: added: stx.6.0
removed: stx.5.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to compile (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/compile/+/793743

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to compile (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/compile/+/793743
Committed: https://opendev.org/starlingx/compile/commit/30556b9bd01d82af5f6b67ee80afeba5521c8354
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 5f0f710f66852568b3a27bffafaedf0247227984
Author: Chris Friesen <email address hidden>
Date: Wed Jul 15 19:46:08 2020 -0400

    fix net/http caching of broken persistent connections

    The net/http transport code is currently broken, it keeps broken
    persistent connections in the cache if a write error happens during
    h2 handshake.

    This is documented in the upstream bug at:
    https://github.com/golang/go/issues/40213

    The problem occurs because in the "go" compiler the http2 code is
    imported into http as a bundle, with an additional "http2" prefix
    applied. This messes up the erringRoundTripper handling because
    the name doesn't match.

    The solution is to have the "go" compiler look for an interface
    instead, so we add a new dummy function that doesn't actually do
    anything and then the "go" compiler can check whether the specified
    RoundTripper implements the dummy function.

    This is slightly different from the proposed upstream fixes for the
    above upstream bug, it more closely follows how the equivalent
    problem was solved by IsHTTP2NoCachedConnError().

    Change-Id: Ia6e91acb15ff4fe996c8ea9b8a1032cede6c4aab
    Partial-Bug: 1887438
    Signed-off-by: Chris Friesen <email address hidden>

commit 49e4df5e538b239d9267baa28b100fa0edfbec69
Author: Zhixiong Chi <email address hidden>
Date: Fri Mar 5 04:05:50 2021 -0500

    bash: enable to log the shell command

    After merging the upversion commit
     https://review.opendev.org/c/starlingx/compile/+/771784,
    the new version add a condition check "syslog_history" variable to
    enable/disable the syslog of bash command.
    If the syslog_history shopt variable is unset as default, the shell
    commands won't be logged.

    Now we always enable it, since the commands run by every user in a
    login shell need to be logged to /var/log/bash.log. This is very
    important as an aid in troubleshooting and debugging issues.

    Closes-Bug: #1917864

    Change-Id: I4aa2f49a0ea4c54a0e836b8ccb33bcc173653252
    Signed-off-by: Zhixiong Chi <email address hidden>

commit 95c560dffeeeeab6a05766f327a05c06b9b3d65d
Author: Li Zhou <email address hidden>
Date: Wed Jan 27 00:50:01 2021 -0500

    python: fix CVE-2019-9636 CVE-2019-10160 CVE-2019-9948 CVE-2019-16056 in srpm build

    Upgrade python to python-2.7.5-89 for fixing above CVEs.

    This commit need work together with the commit
    <python: fix CVE-2019-9636 CVE-2019-10160 CVE-2019-9948 CVE-2019-16056
    in rpm list> for repository starlingx/tools.

    Depends-On: https://review.opendev.org/c/starlingx/tools/+/772627

    Story: 2008532
    Task: 41665
    Signed-off-by: Li Zhou <email address hidden>
    Change-Id: Iead83a4f8e617bed8182020d21d582273ae1e67e

commit 9af8123c7a2b8277408b47efc9128b9dfdcf5763
Author: Zhixiong Chi <email address hidden>
Date: Thu Jan 21 05:...


tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
        physicalVolumes:
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Chris Friesen, has this been addressed by the k8s upversion in stx.6.0? StoryBoard: https://storyboard.openstack.org/#!/story/2008972
If so, please add a note and mark as Fix Released.

Revision history for this message
Chris Friesen (cbf123) wrote :

Yes, I think this should be resolved in STX 6.0.

Changed in starlingx:
status: Confirmed → Fix Released