Bug #2056326 “k8s upgrade fails while upgrading control plane” : Bugs : StarlingX

Saba Touheed Mujawar (smujawar) on 2024-03-06

Changed in starlingx:
assignee:	nobody → Saba Touheed Mujawar (smujawar)

OpenStack Infra (hudson-openstack) on 2024-03-06

Changed in starlingx:
status:	New → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2024-03-15: Fix proposed to integ (master)

#1

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/913422

OpenStack Infra (hudson-openstack) wrote on 2024-03-19: Fix merged to nfv (master)

#2

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/912806
Committed: https://opendev.org/starlingx/nfv/commit/471d1001e0eba0adeba8bdfed020df3f3a0b83f9
Submitter: "Zuul (22348)"
Branch: master

commit 471d1001e0eba0adeba8bdfed020df3f3a0b83f9
Author: Saba Touheed Mujawar <email address hidden>
Date: Wed Mar 13 12:54:40 2024 -0400

Set timeout for KubeHostUpgradeControlPlaneStep to 420s

    The KubeHostUpgradeControlPlaneStep timeout was historically 600s
    to give significant headroom for the control-plane upgrade.
    This step was known to run long, but we had limited data, so
    we set the value large. The underlying kubeadm
    UpgradeManifestTimeout was 5 minutes, so a timeout larger than
    300s was ineffective.

    This updates KubeHostUpgradeControlPlaneStep timeout
    to 420s. This is intentionally engineered to be larger than
    the resultant time for sysinv code to reach completion of the
    Kubernetes Upgrade control-plane step with retries and
    accounting for failure.

    The timeout is engineered using the following equation.
    This accounts for retries, hitting kubeadm upgrade timeout
    each try, and some buffer for the sysinv report callback
    mechanism.

    nfv_timeout = ImageDownloadTime +
                  retries * (UpgradeControlPlaneTimeout + buffer)

    Following are the engineered parameters:

    ImageDownloadTime = 0s (images are pre-pulled before this step)
    UpgradeControlPlaneTimeout = kubeadm UpgradeManifestTimeout = 3 minutes
    buffer = 30s
    retries = 2

    Result:
    Engineered puppet timeout for upgrade control-plane:
    = UpgradeControlPlaneTimeout + buffer = 3*60s + 30s = 210s

    Engineered NFV timeout:
    = 0s + 2 * (180s + 30s) = 420s

    Test Plan:
    PASS: Perform orchestrated k8s upgrade, manually STOP kubeadm process
          during k8s upgrade control-plane step. Check logs to verify
          puppet timeout and also verify sysinv attempts retry mechanism
          before nfv timeout.

Partial-Bug: 2056326

Change-Id: I73ab8ea7cd7fc3816372260983c4b54a02cdcc4c
Signed-off-by: Saba Touheed Mujawar <email address hidden>
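
The timeout arithmetic in the commit message above can be checked with a short Python sketch. The constant and function names are illustrative only and are not taken from the nfv, sysinv, or stx-puppet code; the values are the engineered parameters quoted in the commit message.

```python
# Illustrative names (assumptions); values come from the commit message.
IMAGE_DOWNLOAD_TIME_S = 0            # images are pre-pulled before this step
UPGRADE_MANIFEST_TIMEOUT_S = 3 * 60  # kubeadm UpgradeManifestTimeout
BUFFER_S = 30                        # allowance for the sysinv report callback
RETRIES = 2                          # multiplier used in the commit's equation


def puppet_timeout_s(upgrade_timeout_s=UPGRADE_MANIFEST_TIMEOUT_S,
                     buffer_s=BUFFER_S):
    """Timeout for one control-plane upgrade attempt."""
    return upgrade_timeout_s + buffer_s


def nfv_timeout_s(image_download_s=IMAGE_DOWNLOAD_TIME_S,
                  retries=RETRIES,
                  upgrade_timeout_s=UPGRADE_MANIFEST_TIMEOUT_S,
                  buffer_s=BUFFER_S):
    """Orchestration step timeout covering every attempt."""
    return image_download_s + retries * (upgrade_timeout_s + buffer_s)


if __name__ == "__main__":
    print(puppet_timeout_s())  # 210
    print(nfv_timeout_s())     # 420
```

Keeping the per-attempt value as a separate function makes the layering visible: the orchestration timeout is the per-attempt timeout multiplied by the number of attempts, so neither layer should expire before the one beneath it.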

OpenStack Infra (hudson-openstack) wrote on 2024-03-19: Fix merged to config (master)

#3

Reviewed: https://review.opendev.org/c/starlingx/config/+/911100
Committed: https://opendev.org/starlingx/config/commit/4c42927040f93ff68f3521b8f2408b26de8d4212
Submitter: "Zuul (22348)"
Branch: master

commit 4c42927040f93ff68f3521b8f2408b26de8d4212
Author: Saba Touheed Mujawar <email address hidden>
Date: Tue Mar 5 08:06:16 2024 -0500

Add retry robustness for Kubernetes upgrade control plane

    This addresses a rare intermittent failure during the
    upgrading control plane step, where either puppet hits its timeout
    before the upgrade completes or kubeadm hits its own Upgrade
    Manifest timeout (at 5m).

    This change will retry running the process by
    reporting failure to conductor when puppet manifest apply fails.
    Since it is using RPC to send messages with options, we don't get
    the return code directly and hence, cannot use a retry decorator.
    So we use the sysinv report callback feature to handle the
    success/failure path.

    TEST PLAN:
    PASS: Perform simplex and duplex k8s upgrade successfully.
    PASS: Install iso successfully.
    PASS: Manually send STOP signal to pause the process so that
          the puppet manifest times out, and check that the retry code
          works and the upgrade completes on a retry attempt.
    PASS: Manually decrease the puppet timeout to a very low number
          and verify that the code retries 2 times and updates the
          failure state.
    PASS: Perform orchestrated k8s upgrade, manually send STOP
          signal to pause the kubeadm process during step
          upgrading-first-master and perform system kube-upgrade-abort.
          Verify that the upgrade aborted successfully and also verify
          that the code does not attempt the retry mechanism for
          k8s upgrade control-plane, as it is not in the desired
          KUBE_UPGRADING_FIRST_MASTER or KUBE_UPGRADING_SECOND_MASTER
          state.
    PASS: Perform manual k8s upgrade, for k8s upgrade control-plane
          failure perform manual upgrade-abort successfully.
          Perform Orchestrated k8s upgrade, for k8s upgrade control-plane
          failure after retries nfv aborts automatically.

Closes-Bug: 2056326

    Depends-on: https://review.opendev.org/c/starlingx/nfv/+/912806
                https://review.opendev.org/c/starlingx/stx-puppet/+/911945
                https://review.opendev.org/c/starlingx/integ/+/913422

Change-Id: I5dc3b87530be89d623b40da650b7ff04c69f1cc5
Signed-off-by: Saba Touheed Mujawar <email address hidden>
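
The callback-driven retry described in the commit message above can be sketched roughly as follows. This is a simplified model, not the sysinv conductor code: the class name, method names, the "upgraded" and "failed" state strings, the value assumed for KUBE_UPGRADING_SECOND_MASTER, and the limit of two attempts are all assumptions made for illustration. The point it shows is that the caller never sees a return code from the asynchronous manifest apply, so the retry decision has to live in the report callback, and it is skipped once the upgrade has left the upgrading-first-master or upgrading-second-master state (for example after an abort).

```python
# State names: "upgrading-first-master" appears in the bug text; the other
# string values here are assumptions for illustration.
KUBE_UPGRADING_FIRST_MASTER = "upgrading-first-master"
KUBE_UPGRADING_SECOND_MASTER = "upgrading-second-master"
RETRYABLE_STATES = {KUBE_UPGRADING_FIRST_MASTER, KUBE_UPGRADING_SECOND_MASTER}


class ControlPlaneUpgradeSketch:
    """Hypothetical model of a retry driven by a report callback."""

    def __init__(self, apply_manifest, max_attempts=2):
        # apply_manifest stands in for the asynchronous (RPC) request to
        # apply the puppet manifest; it returns nothing useful.
        self.apply_manifest = apply_manifest
        self.max_attempts = max_attempts
        self.attempts = 0
        self.state = KUBE_UPGRADING_FIRST_MASTER

    def start(self):
        """Request one manifest apply and count it as an attempt."""
        self.attempts += 1
        self.apply_manifest()

    def report_callback(self, success):
        """Invoked later with the outcome of the manifest apply."""
        if self.state not in RETRYABLE_STATES:
            # The upgrade was aborted or moved on: do not retry.
            return
        if success:
            self.state = "upgraded"
        elif self.attempts < self.max_attempts:
            self.start()
        else:
            self.state = "failed"
```

A retry decorator would wrap a call that returns or raises synchronously; here the outcome arrives in a separate invocation, which is why the attempt counter is kept on the object rather than in a loop.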

Changed in starlingx:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2024-03-19: Fix merged to integ (master)

#4

Reviewed: https://review.opendev.org/c/starlingx/integ/+/913422
Committed: https://opendev.org/starlingx/integ/commit/6633522643550037186d924de982823f017a6c15
Submitter: "Zuul (22348)"
Branch: master

commit 6633522643550037186d924de982823f017a6c15
Author: Ramesh Kumar Sivanandam <email address hidden>
Date: Fri Mar 15 12:08:59 2024 -0400

Set kubernetes kubeadm UpgradeManifestTimeout to 3 minutes

    This modifies kubeadm UpgradeManifestTimeout from the 5 minute
    default to 3 minutes to reduce the unnecessary delay in retries
    during kubeadm-upgrade-apply failures.

    The typical control-plane upgrade of static pods takes 75 to 85
    seconds, so 3 minutes gives adequate buffer to complete the operation.

    TEST PLAN:
    PASS: All Kubernetes packages build successfully from 1.24 to 1.28.
    PASS: Perform k8s upgrade and verify kubeadm-upgrade-apply.log
          shows the UpgradeManifestTimeout value as 3 minutes.

Partial-Bug: 2056326

Change-Id: Ief35c63dacc92af861525f03fa25ceb7b8253622
Signed-off-by: Ramesh Kumar Sivanandam <email address hidden>

OpenStack Infra (hudson-openstack) wrote on 2024-03-19: Fix merged to stx-puppet (master)

#5

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/911945
Committed: https://opendev.org/starlingx/stx-puppet/commit/6c15b7a41b950a102e96e55d16be4df8acffe06b
Submitter: "Zuul (22348)"
Branch: master

commit 6c15b7a41b950a102e96e55d16be4df8acffe06b
Author: Saba Touheed Mujawar <email address hidden>
Date: Thu Mar 7 11:39:15 2024 -0500

Set Kubernetes control-plane upgrade timeout to 210s

    This addresses a rare intermittent failure during the
    upgrading control plane step, where either puppet hits its timeout
    before the upgrade completes or kubeadm hits its own Upgrade
    Manifest timeout (at 5m).

    This change sets puppet timeouts slightly larger than the
    engineered kubeadm timeout settings. Typical puppet apply times
    are less than 90 seconds, though we have seen infrequent outliers
    hit the default 5m timeout.

    We engineer the timeout for kubeadm-upgrade-apply and
    kubeadm-upgrade-node to 210 seconds, based on a 3 minute
    kubeadm UpgradeManifestTimeout plus a 30 second buffer.

    Note: 'kubeadm-upgrade-apply' and 'kubeadm-upgrade-node' take the
    same amount of time for the control-plane upgrade.

    TEST PLAN:
    PASS: Perform k8s upgrade and verify puppet does not time out
          before kubeadm-upgrade-apply and kubeadm-upgrade-node.

Partial-Bug: 2056326

Change-Id: Iec60476c964140f7b717c6d4dcdb266b0229b556
Signed-off-by: Saba Touheed Mujawar <email address hidden>
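
The idea of an outer timeout set slightly larger than the inner one can be illustrated in Python. This is not the stx-puppet manifest, which sets the timeout in Puppet; the helper name run_with_timeout is hypothetical, and the sketch only shows a wrapper that gives up on a child command after a fixed number of seconds so that the caller can report failure.

```python
import subprocess
import sys

# Values from the commit message: 3 minute kubeadm timeout plus 30s buffer.
KUBEADM_TIMEOUT_S = 3 * 60
OUTER_TIMEOUT_S = KUBEADM_TIMEOUT_S + 30  # 210s


def run_with_timeout(cmd, timeout_s=OUTER_TIMEOUT_S):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    # A stand-in command that finishes immediately.
    print(run_with_timeout([sys.executable, "-c", "pass"]))
```

Because the outer limit exceeds the inner one, a hung inner command is expected to fail with its own timeout first, and the outer limit only fires if that inner mechanism itself does not return.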

OpenStack Infra (hudson-openstack) wrote on 2024-03-24: Fix proposed to config (master)

#6

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/914038

OpenStack Infra (hudson-openstack) wrote on 2024-03-25: Fix merged to config (master)

#7

Reviewed: https://review.opendev.org/c/starlingx/config/+/914038
Committed: https://opendev.org/starlingx/config/commit/4522150c87f635dfdacfbced00174130d39a62c5
Submitter: "Zuul (22348)"
Branch: master

commit 4522150c87f635dfdacfbced00174130d39a62c5
Author: Jim Gauld <email address hidden>
Date: Sat Mar 23 19:56:04 2024 -0400

Correct Kubernetes control-plane upgrade robustness skip_update_config

    This removes the skip_update_config parameter from the
    _config_apply_runtime_manifest() call when upgrading Kubernetes
    control-plane. This parameter was unintentionally set to True,
    so this configuration step did not persist. This caused
    generation of 250.001 config-out-of-date alarms during kube
    upgrade.

The review that introduced the bug:
https://review.opendev.org/c/starlingx/config/+/911100

    TEST PLAN:
    - watch /var/log/nfv-vim.log for each orchestrated upgrade
    PASS: orchestrated k8s upgrade (no faults)
          - AIO-SX, AIO-DX, Standard

    PASS: orchestrated k8s upgrade, with fault insertion during
          control-plane upgrade first attempt
          - AIO-SX
          - AIO-DX (both controller-0, controller-1)
          - Standard (both controller-0, controller-1)

    PASS: orchestrated k8s upgrade, with fault insertion during
          control-plane upgrade first and second attempt, trigger abort
          - AIO-SX
          - AIO-DX (first controller)

Closes-Bug: 2056326

Change-Id: I629c8133312faa5c95d06960b15d3e516e48e4cb
Signed-off-by: Jim Gauld <email address hidden>

Ghada Khalil (gkhalil) on 2024-04-10

Changed in starlingx:
importance:	Undecided → Medium
tags:	added: stx.10.0 stx.containers
