Standalone SX K8s upgrade from 1.21 to 1.24 failed at upgrade-aborting-failed

Bug #2042353 reported by Boovan Rajendran
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Boovan Rajendran
Milestone: (none)

Bug Description

Brief Description

On a standalone AIO-SX system, a K8s upgrade from 1.21 to 1.24, run after a platform upgrade, failed and left the upgrade in the upgrade-aborting-failed state.

There are 2 problems here:

1) The control-plane upgrade to 1.22.5 failed.

2) The subsequent upgrade abort also failed (state upgrade-aborting-failed).

Severity:

Critical

Steps to Reproduce:

Multi-version Kubernetes Version Upgrade Cloud Orchestration Strategy Procedure (Simplex)

sw-manager kube-upgrade-strategy create --to-version v1.24.4
sw-manager kube-upgrade-strategy apply

Expected Behavior:

The K8s upgrade completes successfully.

Actual Behavior:

[sysadmin@controller-0 ~(keystone_admin)]$ sw-manager kube-upgrade-strategy show
Strategy Kubernetes Upgrade Strategy:
  strategy-uuid: 496f68ef-66dd-41df-b1a8-394f22477ac0
  controller-apply-type: serial
  storage-apply-type: serial
  worker-apply-type: serial
  default-instance-action: stop-start
  alarm-restrictions: strict
  current-phase: abort
  current-phase-completion: 100%
  state: abort-failed
  apply-result: timed-out
  apply-reason:
  abort-result: failed
  abort-reason: Unexpected state: upgrade-aborting
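
For triage, the key/value output above can be reduced to the few fields that matter. The following is a minimal sketch in Python, assuming the output has been captured as text and that every field appears on an indented `key: value` line. The function name `parse_strategy` is hypothetical and is not part of sw-manager.

```python
def parse_strategy(text):
    """Parse indented 'key: value' lines from captured strategy output."""
    fields = {}
    for line in text.splitlines():
        # Only indented lines carry fields; the shell prompt and the
        # heading line are not indented and are skipped.
        if not line.startswith(" "):
            continue
        # Split on the first colon only, so values that themselves
        # contain a colon (such as abort-reason) stay intact.
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = """\
Strategy Kubernetes Upgrade Strategy:
  strategy-uuid: 496f68ef-66dd-41df-b1a8-394f22477ac0
  current-phase: abort
  state: abort-failed
  apply-result: timed-out
  apply-reason:
  abort-result: failed
  abort-reason: Unexpected state: upgrade-aborting
"""

if __name__ == "__main__":
    info = parse_strategy(SAMPLE)
    print(info["state"], "|", info["abort-reason"])
```

Splitting on the first colon only is deliberate: the abort reason in this report contains a second colon.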

[sysadmin@controller-0 scratch(keystone_admin)]$ system kube-upgrade-show
+--------------+--------------------------------------+
| Property     | Value                                |
+--------------+--------------------------------------+
| uuid         | ee87012f-f2d2-418c-8156-2b399aece993 |
| from_version | v1.21.8                              |
| to_version   | v1.24.4                              |
| state        | upgrade-aborting-failed              |
| created_at   | 2023-10-22T03:31:54.187459+00:00     |
| updated_at   | 2023-10-22T13:00:42.916532+00:00     |
+--------------+--------------------------------------+

Reproducibility:

100%

System Configuration:

AIO-SX

Last Pass

NA

Timestamp/Logs

sysinv 2023-10-22 09:31:24.175 16390 ERROR sysinv.puppet.common [-] Failed to execute runtime manifest for host abcd:204::3: subprocess.CalledProcessError: Command '['/usr/local/bin/puppet-manifest-apply.sh', '/opt/platform/puppet/22.12/hieradata', 'abcd:204::3', 'controller', 'runtime', '/tmp/tmpxk901ovy.yaml']' returned non-zero exit status 1.
2023-10-22 09:31:24.175 16390 ERROR sysinv.puppet.common Traceback (most recent call last):
2023-10-22 09:31:24.175 16390 ERROR sysinv.puppet.common File "/usr/lib/python3/dist-packages/sysinv/puppet/common.py", line 91, in puppet_apply_manifest
2023-10-22 09:31:24.175 16390 ERROR sysinv.puppet.common subprocess.check_call(cmd, stdout=fnull, stderr=fnull) # pylint: disable=not-callable
2023-10-22 09:31:24.175 16390 ERROR sysinv.puppet.common File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
2023-10-22 09:31:24.175 16390 ERROR sysinv.puppet.common raise CalledProcessError(retcode, cmd)
2023-10-22 09:31:24.175 16390 ERROR sysinv.puppet.common subprocess.CalledProcessError: Command '['/usr/local/bin/puppet-manifest-apply.sh', '/opt/platform/puppet/22.12/hieradata', 'abcd:204::3', 'controller', 'runtime', '/tmp/tmpxk901ovy.yaml']' returned non-zero exit status 1.

sysinv 2023-10-22 09:31:24.192 16390 INFO sysinv.agent.manager [-] Manifests application failed. Reporting failure to conductor. Details: {'personalities': ['controller'], 'classes': ['platform::kubernetes::upgrade_abort'], 'report_status': 'upgrade_abort', 'force': False, 'config_type': 'config_apply_runtime_manifest', 'created_at': '2023-10-22T03:49:37.552467', 'host_uuids': ['be9b4716-05f2-432d-9de5-334f7c3d2b49'], 'host_uuid': 'be9b4716-05f2-432d-9de5-334f7c3d2b49'}.
sysinv 2023-10-22 09:31:24.193 16390 ERROR sysinv.openstack.common.rpc.common [-] Returning exception Failed to execute runtime manifest for host abcd:204::3 to caller: sysinv.common.exception.SysinvException: Failed to execute runtime manifest for host abcd:204::3
sysinv 2023-10-22 09:31:24.193 16390 ERROR sysinv.openstack.common.rpc.common [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/python3/dist-packages/sysinv/puppet/common.py", line 91, in puppet_apply_manifest\n subprocess.check_call(cmd, stdout=fnull, stderr=fnull) # pylint: disable=not-callable\n', ' File "/usr/lib/python3.9/subprocess.py", line 373, in check_call\n raise CalledProcessError(retcode, cmd)\n', "subprocess.CalledProcessError: Command '['/usr/local/bin/puppet-manifest-apply.sh', '/opt/platform/puppet/22.12/hieradata', 'abcd:204::3', 'controller', 'runtime', '/tmp/tmpxk901ovy.yaml']' returned non-zero exit status 1.\n", '\nDuring handling of the above exception, another exception occurred:\n\n', 'Traceback (most recent call last):\n', ' File "/usr/lib/python3/dist-packages/sysinv/agent/manager.py", line 1926, in config_apply_runtime_manifest\n self._apply_runtime_manifest(config_dict)\n', ' File "/usr/lib/python3/dist-packages/sysinv/agent/manager.py", line 1994, in _apply_runtime_manifest\n puppet.puppet_apply_manifest(self._mgmt_ip,\n', ' File "/usr/lib/python3/dist-packages/sysinv/puppet/common.py", line 96, in puppet_apply_manifest\n raise exception.SysinvException(_(msg))\n', 'sysinv.common.exception.SysinvException: Failed to execute runtime manifest for host abcd:204::3\n']: sysinv.common.exception.SysinvException: Failed to execute runtime manifest for host abcd:204::3
sysinv 2023-10-22 09:31:24.200 20887 WARNING sysinv.conductor.manager [-] k8s upgrade abort failed 3 times, giving up

Alarms:

NA

Test Activity:

Feature Testing

Workaround:

NA

Changed in starlingx:
assignee: nobody → Boovan Rajendran (brajendr)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/899743

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/899743
Committed: https://opendev.org/starlingx/stx-puppet/commit/7837dd0206084abbf26bfaa074684899959e6a2f
Submitter: "Zuul (22348)"
Branch: master

commit 7837dd0206084abbf26bfaa074684899959e6a2f
Author: Boovan Rajendran <email address hidden>
Date: Tue Oct 31 14:19:59 2023 -0400

    Fix for cordon command failure

    During k8s upgrade abort process cordon is failing for a pod which
    is not evicting due to podDisruptionBudget setting, causing retries
    to occur for every 5s until puppet default 300s timeout expired.
    This change will allow the cordon to succeed even if some pods
    cannot evict.

    --timeout=60s is added in the command to limit the cordon operation
    before puppet timeout.

    Test Plan:
    Pass: Perform k8s upgrade abort and make cordon fail and verify
    k8s upgrade aborted successfully.

    Closes-Bug: 2042353

    Change-Id: I1af8e7f0d06a2bfda28b5f319366b0a2e36b4cdf
    Signed-off-by: Boovan Rajendran <email address hidden>
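
The behaviour described in this commit message, bounding the operation with a timeout and not failing the abort when some pods cannot be evicted, can be illustrated as follows. This is a sketch in Python, not the stx-puppet change itself. It assumes the command in question is `kubectl drain` (which cordons the node and then evicts its pods, and which accepts `--timeout`). The helper names and the `--ignore-daemonsets` flag are assumptions chosen for illustration.

```python
import subprocess


def build_drain_command(node, timeout_s=60):
    """Return a kubectl drain command bounded by --timeout.

    The flag set here is illustrative, not the one used by stx-puppet.
    """
    return [
        "kubectl", "drain", node,
        "--ignore-daemonsets",
        f"--timeout={timeout_s}s",
    ]


def drain_tolerant(node, timeout_s=60, run=subprocess.run):
    """Run the drain and report the outcome without raising.

    Returns True if kubectl exited with status 0 and False otherwise,
    for example when a PodDisruptionBudget blocked an eviction until
    the timeout expired. Either way the caller can continue.
    """
    result = run(build_drain_command(node, timeout_s), check=False)
    return result.returncode == 0
```

With a 60 second limit the command gives up well before the 300 second Puppet timeout mentioned in the commit message, so the manifest is no longer failed by the timeout itself.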

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/900272

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/900272
Committed: https://opendev.org/starlingx/config/commit/bbab84b094bbd1a40ab38ccf3c69b97ae2a9fabf
Submitter: "Zuul (22348)"
Branch: master

commit bbab84b094bbd1a40ab38ccf3c69b97ae2a9fabf
Author: Boovan Rajendran <email address hidden>
Date: Tue Nov 7 03:55:54 2023 -0500

    Fix for cordon command failure during k8s upgrade

    During k8s upgrade process cordon is failing for a pod which
    is not evicting due to podDisruptionBudget setting.
    This change will allow the cordon to succeed even if some
    pods cannot evict.

    --timeout=60s is added in the command to limit the cordon operation.

    Test Plan:
    Pass: Test by running 'system kube-host-cordon controller-0' on
    AIO-SX and make cordon operation fail by setting pod disruption budget
    for a pod and verify cordon operation completed successfully.
    Pass: Perform k8s upgrade using orchestration method on AIO-SX and make
    cordon operation fail by setting pod disruption budget for
    a pod and verify k8s upgraded successfully.

    Closes-Bug: 2042353

    Change-Id: I314ae3f678aca90888e6e7ff39a61c5d08ed674f
    Signed-off-by: Boovan Rajendran <email address hidden>
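
The test plan above relies on a pod disruption budget that blocks eviction. One way to produce such a budget is sketched below in Python: it prints a manifest that could be piped to `kubectl apply -f -`, which accepts JSON as well as YAML. The object name, namespace, and label are hypothetical. Setting `maxUnavailable` to 0 is what forbids voluntary eviction of the matching pods.

```python
import json


def blocking_pdb(name, namespace, match_labels):
    """Build a policy/v1 PodDisruptionBudget that permits no voluntary evictions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Zero pods may be unavailable, so every eviction request
            # for a matching pod is refused.
            "maxUnavailable": 0,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


if __name__ == "__main__":
    # Hypothetical target: pods labelled app=demo in the default namespace.
    print(json.dumps(blocking_pdb("demo-pdb", "default", {"app": "demo"}), indent=2))
```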

Changed in starlingx:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/900576

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/900577

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/900576
Committed: https://opendev.org/starlingx/config/commit/25a42e06dfdd49ae1876841b05ed281f3097b6b7
Submitter: "Zuul (22348)"
Branch: master

commit 25a42e06dfdd49ae1876841b05ed281f3097b6b7
Author: Boovan Rajendran <email address hidden>
Date: Fri Nov 10 03:11:45 2023 -0500

    Increase timeout for cordon operation

    Based on lab testing we need to increase the overall timeout for the
    host cordon operation to give pods more time to shut down cleanly.
    In lab testing 150 seconds was sufficient, but it's possible that
    some conditions exist which might still cause us to hit the timeout.
    If this happens, we still want to treat the cordon operation as a
    success.

    Test Plan:
    Pass: Test by running 'system kube-host-cordon controller-0' on
    AIO-SX and verify cordon operation completed successfully.
    Pass: Perform k8s upgrade using orchestration method on AIO-SX
    and verify k8s upgraded successfully.

    Closes-Bug: 2042353

    Change-Id: I47fad26c98297227f6352c2df666c384be42d252
    Signed-off-by: Boovan Rajendran <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/900577
Committed: https://opendev.org/starlingx/stx-puppet/commit/f4e384c03bb3ff8282afc9dbef70fa27eefbf2f3
Submitter: "Zuul (22348)"
Branch: master

commit f4e384c03bb3ff8282afc9dbef70fa27eefbf2f3
Author: Boovan Rajendran <email address hidden>
Date: Fri Nov 10 03:40:42 2023 -0500

    Increase timeout for cordon operation during k8s upgrade abort

    This change is to increase the timeout of cordon command
    from 60 sec to 150 sec during k8s upgrade abort operation.

    Based on lab testing, we need to increase the overall timeout for the
    host cordon operation (during the K8s upgrade abort operation) to
    give pods more time to shut down cleanly. In lab testing 150 seconds
    was sufficient.

    Test Plan:
    Pass: Perform k8s upgrade abort and verify k8s upgrade
    aborted successfully.

    Closes-Bug: 2042353

    Change-Id: Iaf39fb80a64c90af615f7f6dad0dc4efb2066faa
    Signed-off-by: Boovan Rajendran <email address hidden>

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.containers stx.update