Backup & Restore: Calico pods don't recover after restore on a multi-node system

Bug #1893149 reported by Ghada Khalil

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Cole Walker

Bug Description

Brief Description
-----------------
Restore fails on a system with multiple worker nodes. This happens under a specific scenario where the calico-kube-controllers pod is running on a node other than controller-0 when the backup is taken.

Severity
--------
Major

Steps to Reproduce
------------------
- On a multi-node system, ensure the calico-kube-controllers pod is running on a node other than controller-0
- Perform a backup
- Perform a restore

Expected Behavior
------------------
The restore should pass

Actual Behavior
----------------
The restore fails
2020-08-07 20:32:04,505 p=11111 u=sysadmin | TASK [bootstrap/bringup-essential-services : Fail if any of the Kubernetes component, Networking and Tiller pods is not ready by this time] ***
2020-08-07 20:32:04,551 p=11111 u=sysadmin | failed: [localhost] (item={'_ansible_parsed': True, 'stderr_lines': [u'error: timed out waiting for the condition on pods/calico-kube-controllers-5cd4695574-9kg45'], u'changed': True, u'stderr': u'error: timed out waiting for the condition on pods

Reproducibility
---------------
Reproducible under the conditions explained above

System Configuration
--------------------
multi-node system with > 1 worker node

Branch/Pull Time/Commit
-----------------------
stx master as of 2020-06-28, but expected to be a day 1 issue

Last Pass
---------
Other B&R tests pass, but this particular config was not explicitly tested previously

Timestamp/Logs
--------------
**Start of first restore attempt**
2020-08-07T20:16:19.000 localhost sh: info HISTORY: PID=11108 UID=42425 ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e 'initial_backup_dir=/home/sysadmin ansible_become_pass=4MTCE-Edge! admin_password=xxxxxX backup_filename=cr3_localhost_platform_backup_2020_08_07_19_23_54.tgz'

**First failure from ansible log** (my opinion, based on the failures below, is that the calico networking pod failing to start is the trigger)
2020-08-07 20:32:04,505 p=11111 u=sysadmin | TASK [bootstrap/bringup-essential-services : Fail if any of the Kubernetes component, Networking and Tiller pods is not ready by this time] ***
2020-08-07 20:32:04,551 p=11111 u=sysadmin | failed: [localhost] (item={'_ansible_parsed': True, 'stderr_lines': [u'error: timed out waiting for the condition on pods/calico-kube-controllers-5cd4695574-9kg45'], u'changed': True, u'stderr': u'error: timed out waiting for the condition on pods

**Calico**
"start": "2020-08-07 20:30:27.693427", "stderr": "error: timed out waiting for the condition on pods/calico-kube-controllers-5cd4695574-9kg45", "stderr_lines": ["error: timed out waiting for the condition on pods/calico-kube-controllers-5cd4695574-9kg45"], "stdout": "", "stdout_lines": []}, "msg": "Pod k8s-app=calico-kube-controllers is still not ready."}

**Tiller**
"2020-08-07 20:30:32.115038", "stderr": "error: timed out waiting for the condition on pods/tiller-deploy-5c8dd9fb56-2v674", "stderr_lines": ["error: timed out waiting for the condition on pods/tiller-deploy-5c8dd9fb56-2v674"], "stdout": "", "stdout_lines": []}, "msg": "Pod app=helm is still not ready."}

**Tiller and calico-kube-controllers pods are deleted.**

**Second restore run**
2020-08-07T21:45:44.000 localhost sh: info HISTORY: PID=229428 UID=42425 ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e 'initial_backup_dir=/home/sysadmin ansible_become_pass=4MTCE-Edge! admin_password=xxxxxx backup_filename=cr3_localhost_platform_backup_2020_08_07_19_23_54.tgz'

Test Activity
-------------
Testing

Workaround
----------
Unknown

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - specific B&R failure

Changed in starlingx:
assignee: nobody → Cole Walker (cwalops)
tags: added: stx.5.0 stx.containers stx.update
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Cole Walker (cwalops) wrote :

The issue has been reproduced and the cause identified.

Scenario:

        System with at least 4 nodes

        All nodes are in the same "zone" (i.e. no zone label (default), or all have the same zone label failure-domain.beta.kubernetes.io/zone=foo)

        When the backup is taken, the calico-kube-controllers-xyz pod must be running on a different node than the one the restore will be run from (i.e. something other than controller-0); a quick way to check the pod's placement is sketched after this list

        Attempting to restore will result in the same failure (calico-kube-controllers does not start because it is not evicted from the node it was on during the backup)
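
For reproducing this, a minimal sketch (assuming the kubernetes Python client and a usable kubeconfig on controller-0; not part of the original report) to confirm which node the calico-kube-controllers pod is scheduled on before taking the backup:

    # Sketch only: confirm where calico-kube-controllers is running before the backup.
    # Assumes the `kubernetes` Python client and admin credentials (e.g. /etc/kubernetes/admin.conf).
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod("kube-system",
                                  label_selector="k8s-app=calico-kube-controllers")
    for pod in pods.items:
        # To hit this bug, the node must be something other than controller-0.
        print(pod.metadata.name, "->", pod.spec.node_name)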

Cause:

        The indicative log can be found in kube-controller-manager-controller-0 after an attempted restore:
        2020-08-07T20:28:57.139321428Z stderr F I0807 20:28:57.139254 1 node_lifecycle_controller.go:1249] Controller detected that zone is now in state PartialDisruption.

        The PartialDisruption condition is set when:
        NotReady nodes >= 3, and
        (NotReady nodes in zone) / (Total nodes in zone) >= unhealthy-zone-threshold (default value = 0.55)
        In this case, during restore: 3/4 = 0.75 >= 0.55 (a worked sketch follows below)

        The PartialDisruption state causes the eviction rate to be set to 0 (pods will never be evicted)
        The calico-kube-controllers pod is therefore never evicted from the downed node and recreated, causing the Ansible failure

        See: https://github.com/kubernetes/kubernetes/blob/7879fc12a63337efff607952a323df90cdc7a335/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L1519
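
A worked sketch of the zone-state arithmetic described above (illustrative only; the authoritative logic is in the linked node_lifecycle_controller.go):

    # Simplified sketch of the PartialDisruption decision described above;
    # the real implementation is in kube-controller-manager (Go), linked above.
    UNHEALTHY_ZONE_THRESHOLD = 0.55  # kube-controller-manager default

    def zone_state(ready_nodes: int, not_ready_nodes: int) -> str:
        total = ready_nodes + not_ready_nodes
        if not_ready_nodes >= 3 and not_ready_nodes / total >= UNHEALTHY_ZONE_THRESHOLD:
            return "PartialDisruption"  # eviction rate becomes 0: pods are never evicted
        return "Normal"

    # Restore scenario from this bug: only controller-0 is up on a 4-node system.
    print(zone_state(ready_nodes=1, not_ready_nodes=3))  # PartialDisruption (3/4 = 0.75 >= 0.55)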

Solutions:

    The band-aid solution for any given cluster is to raise the unhealthy-zone-threshold value. I don't feel this is something we want to change for all deployments, but a user can test it by adding the flag to kube-controller-manager and restarting the related container:
        vim /etc/kubernetes/manifests/kube-controller-manager.yaml
        add the flag --unhealthy-zone-threshold=<some value higher than 0.75>
        Restart the kube-controller-manager container: crictl stop kube-controller-manager; crictl start kube-controller-manager
        After a short period of time, pods will begin to be evicted and the calico-kube-controllers pod will move to an active node
        NOTE: this could have unintended consequences; if many pods are being evicted, they will all be rescheduled onto the active node.

    A more robust solution could be to label the kubernetes master nodes as part of a different zone than the workers. This would prevent the PartialDisruption state from being reached (a labelling sketch follows this list).
        This would require further analysis and testing; it is unclear what other implications this change might have
        Recommend testing this solution as a next step
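
A minimal sketch of the zone-labelling idea, assuming the kubernetes Python client and illustrative node names (the same could be done with kubectl label); this is not the fix that was ultimately merged:

    # Sketch of the "separate zone for master nodes" idea; node names and zone
    # values are assumptions for illustration only.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    def set_zone(node_name: str, zone: str) -> None:
        # Patch the (then-current) beta zone label onto the node.
        body = {"metadata": {"labels": {"failure-domain.beta.kubernetes.io/zone": zone}}}
        v1.patch_node(node_name, body)

    # Keep the controllers in their own zone so worker outages alone cannot push
    # the controllers' zone into PartialDisruption.
    for node in ("controller-0", "controller-1"):
        set_zone(node, "controllers")
    for node in ("worker-0", "worker-1"):
        set_zone(node, "workers")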

Ghada Khalil (gkhalil)
tags: added: stx.networking
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/749815

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/749815
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=27c5fa22864c452b8adfdece3f19968f01d100ed
Submitter: Zuul
Branch: master

commit 27c5fa22864c452b8adfdece3f19968f01d100ed
Author: Cole Walker <email address hidden>
Date: Thu Sep 3 14:26:16 2020 -0400

    Rework restore functionality for 4 or more nodes

    This commit reworks the tasks in the bringup-essential-services role to
    handle cases where a system with 4 or more nodes would be unable to
    bring up the required services.

    Testing performed:
    - Backup and restore on AIO-DX+ (5 nodes)
    - Backup and restore on AIO-SX (Single node)
    - Fresh install (AIO-SX)
    - Bootstrap replay (AIO-SX)

    The problem is caused by the kube-controller-manager placing the cluster
    zone in the PartialDisruption state which causes pods to never be
    evicted from downed nodes. This state can be reached when a single node
    is up (ie. controller-0 during restore) and 3 or more nodes are down. In
    this state, essential service pods like calico-kube-controllers, armada,
    coredns etc, might be stuck on one of the nodes that are down and will
    not be evicted to be recreated on controller-0. This is most commonly an
    issue for pods running as a deployment.

    The fix provided here is to scale the required deployments down to 0 and
    then scale them back to their required values. This forces the pods to
    be rescheduled onto the active node and allows the services to become
    available.

    This fix also reworks how we are checking the availability of services
    running as deployments. The status of the deployment is now checked
    rather than just the presence of pods.

    Coredns was previously being checked via grepping for a running pod, but
    this could give a false-positive in the PartialDisruption state, because
    the active node during restore would see coredns as running on one of
    the downed nodes. We now check the status of the deployment.

    Other possible approaches would be to alter the unhealthy-zone-threshold
    value for kube-controller-manager or by setting controller-0 as part of
    a different k8s zone. Both of these approaches would cause every
    deployment pod on the downed nodes to be evicted to the active node, and
    this behaviour is probably unwanted for heavily utilized systems.

    Testing also revealed an issue with the task "Get wait tasks results"
    where restoring a deployment with many nodes would time out before the
    asynchronous tasks that check for k8s components completed. Reworked the
    task to wait for a longer amount of time based on the number of nodes in
    the system when performing a restore.

    Closes-Bug: 1893149

    Change-Id: I641125ab1c32dcc55d46009efd4654b6e388d621
    Signed-off-by: Cole Walker <email address hidden>
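
For context, a minimal sketch of the scale-down/scale-up approach described in the commit message, written against the kubernetes Python client rather than the actual Ansible tasks (deployment names and namespaces are illustrative; the merged change lives in the bringup-essential-services role):

    # Illustrative only: force deployment pods off downed nodes by scaling to 0
    # and back, then wait on the deployment status (not just pod presence).
    import time
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    def bounce_deployment(name: str, namespace: str) -> None:
        desired = apps.read_namespaced_deployment(name, namespace).spec.replicas or 1
        # Scale to 0 so pods stuck on NotReady nodes are removed...
        apps.patch_namespaced_deployment_scale(name, namespace, {"spec": {"replicas": 0}})
        # ...then restore the original replica count so the pods are rescheduled
        # onto the active node (controller-0 during a restore).
        apps.patch_namespaced_deployment_scale(name, namespace, {"spec": {"replicas": desired}})

    def deployment_available(name: str, namespace: str, timeout_s: int = 600) -> bool:
        # Check the deployment status instead of grepping for a running pod, which
        # can be a false positive while old pods linger on downed nodes.
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            dep = apps.read_namespaced_deployment(name, namespace)
            if (dep.status.available_replicas or 0) >= (dep.spec.replicas or 0):
                return True
            time.sleep(10)
        return False

    # Example (names assumed): essential deployments touched during a restore.
    for dep_name in ("calico-kube-controllers", "coredns", "tiller-deploy"):
        bounce_deployment(dep_name, "kube-system")
        print(dep_name, "available:", deployment_available(dep_name, "kube-system"))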

Changed in starlingx:
status: In Progress → Fix Released