Subcloud K8s Upgrade | Vim strategy apply failed | Problem getting kubelet versions

Bug #2003360 reported by Boovan Rajendran
Affects | Status | Importance | Assigned to | Milestone
StarlingX | Fix Released | Medium | Boovan Rajendran |

Bug Description

Brief Description

Subcloud k8s upgrade orchestration failed as it had a problem getting kubelet versions.

$ dcmanager strategy-step list

| subcloud229  | 1 | failed | kube applying vim kube upgrade strategy: (kube-upgrade) Vim strategy apply failed. Unexpected State: aborted. | 2022-12-08 16:40:07.163173 | 2022-12-08 17:05:26.931960 |

Severity

Major.

Steps to Reproduce

Run a System Controller with 1000 subclouds.
Check that there is a 50 ms delay between the System Controller and the subclouds. If not, add the delay using Delayomatic.
Apply the subcloud K8s upgrade orchestration (prerequisites: install the System Controller and subclouds with K8s 1.23, then upgrade K8s on the System Controller first):
$ dcmanager kube-upgrade-strategy create --max-parallel-subclouds 250 --subcloud-apply-type parallel --to-version v1.24.4
$ dcmanager kube-upgrade-strategy apply

Expected Behavior

Subcloud K8s upgraded to 1.24.4

Actual Behavior

K8s upgrade failed

Reproducibility

6 out of 1000 subclouds.

System Configuration

Distributed Cloud (DC1000-2)

Last Pass

NA

Timestamp/Logs

// Collect all

System Controller: /folk/cgts_logs/CGTS-41773/ALL_NODES_20221208.172849.tar
Subcloud: /folk/cgts_logs/CGTS-41773/subcloud229_20221208.174415.tar
...
sysinv 2022-12-08 17:02:50.620 74053 WARNING urllib3.connectionpool [-] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f62567a3640>: Failed to establish a new connection: [Errno 111] ECONNREFUSED')': /api/v1/nodes
sysinv 2022-12-08 17:02:50.620 74053 WARNING urllib3.connectionpool [-] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f62569e7d60>: Failed to establish a new connection: [Errno 111] ECONNREFUSED')': /api/v1/nodes
sysinv 2022-12-08 17:02:50.621 74053 WARNING urllib3.connectionpool [-] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f62569e7640>: Failed to establish a new connection: [Errno 111] ECONNREFUSED')': /api/v1/nodes
sysinv 2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager [-] Problem getting kubelet versions.: urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='aefd::1', port=6443): Max retries exceeded with url: /api/v1/nodes (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f62569e76d0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager Traceback (most recent call last):
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     conn = connection.create_connection(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 96, in create_connection
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     raise err
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 86, in create_connection
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     sock.connect(sa)
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 253, in connect
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     socket_checkerr(fd)
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 51, in socket_checkerr
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     raise socket.error(err, errno.errorcode[err])
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager During handling of the above exception, another exception occurred:
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager Traceback (most recent call last):
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     httplib_response = self._make_request(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 382, in _make_request
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     self._validate_conn(conn)
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1012, in _validate_conn
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     conn.connect()
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     conn = self._new_conn()
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 181, in _new_conn
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     raise NewConnectionError(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f62569e76d0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager During handling of the above exception, another exception occurred:
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager Traceback (most recent call last):
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/sysinv/conductor/manager.py", line 14818, in kube_upgrade_kubelet
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     kubelet_versions = kube_operator.kube_get_kubelet_versions()
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/sysinv/common/kubernetes.py", line 893, in kube_get_kubelet_versions
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     api_response = c.list_node()
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/kubernetes/client/api/core_v1_api.py", line 16414, in list_node
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.list_node_with_http_info(**kwargs)  # noqa: E501
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/kubernetes/client/api/core_v1_api.py", line 16517, in list_node_with_http_info
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.api_client.call_api(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/kubernetes/client/api_client.py", line 348, in call_api
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.__call_api(resource_path, method,
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/kubernetes/client/api_client.py", line 180, in __call_api
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     response_data = self.request(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/kubernetes/client/api_client.py", line 373, in request
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.rest_client.GET(url,
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/kubernetes/client/rest.py", line 239, in GET
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.request("GET", url,
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/kubernetes/client/rest.py", line 212, in request
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     r = self.pool_manager.request(method, url,
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/request.py", line 74, in request
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.request_encode_url(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/request.py", line 96, in request_encode_url
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.urlopen(method, url, **extra_kw)
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 375, in urlopen
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     response = conn.urlopen(method, u.request_uri, **kw)
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 783, in urlopen
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.urlopen(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 783, in urlopen
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.urlopen(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 783, in urlopen
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     return self.urlopen(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     retries = retries.increment(
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager   File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager     raise MaxRetryError(_pool, url, error or ResponseError(cause))
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='aefd::1', port=6443): Max retries exceeded with url: /api/v1/nodes (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f62569e76d0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))
2022-12-08 17:02:50.622 74053 ERROR sysinv.conductor.manager
sysinv 2022-12-08 17:03:00.634 74053 WARNING urllib3.connectionpool [-] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f62569e7af0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED')': /api/v1/nodes
sysinv 2022-12-08 17:03:00.635 74053 WARNING urllib3.connectionpool [-] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f62569e7280>: Failed to establish a new connection: [Errno 111] ECONNREFUSED')': /api/v1/nodes
sysinv 2022-12-08 17:03:00.637 74053 WARNING urllib3.connectionpool [-] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6257bf8220>: Failed to establish a new connection: [Errno 111] ECONNREFUSED')': /api/v1/nodes
sysinv 2022-12-08 17:03:00.638 74053 ERROR sysinv.conductor.manager [-] Problem getting kubelet versions.: urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='aefd::1', port=6443): Max retries exceeded with url: /api/v1/nodes (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6256711280>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))
...

...
sysinv 2022-12-08 17:03:28.592 79432 ERROR wsme.api [-] Server-side error: "min() arg is an empty sequence". Detail:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/wsmeext/pecan.py", line 84, in callfunction
    result = f(self, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/sysinv/api/controllers/v1/kube_host_upgrade.py", line 159, in get_all
    cp_versions = self._kube_operator.kube_get_control_plane_versions()
  File "/usr/lib/python3/dist-packages/sysinv/common/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/sysinv/common/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/usr/lib/python3/dist-packages/sysinv/common/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/sysinv/common/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3/dist-packages/sysinv/common/kubernetes.py", line 883, in kube_get_control_plane_versions
    node_versions[node_name] = str(min(versions))
ValueError: min() arg is an empty sequence
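
For reference, the ValueError in this traceback is what Python's built-in min() raises when it is given an empty sequence. The snippet below is a minimal sketch of that behaviour and of one way to guard against it. The node names, version lists, and the helpers version_key and lowest_version are hypothetical and are not taken from sysinv.

```python
def version_key(version):
    """Turn a string such as 'v1.24.4' into (1, 24, 4) for numeric comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def lowest_version(versions):
    """Return the lowest version string, or None when no versions are known.

    A bare min(versions) raises ValueError on an empty sequence; passing
    default=None returns None instead of raising.
    """
    return min(versions, key=version_key, default=None)


# Hypothetical data: controller-1 has no control plane versions reported yet,
# as could happen while the Kubernetes API server is unavailable.
reported = {
    "controller-0": ["v1.24.4", "v1.23.1"],
    "controller-1": [],
}

node_versions = {name: lowest_version(v) for name, v in reported.items()}
print(node_versions)
```

Whether an empty result should be mapped to None or retried is a design choice; the merged fix described later in this report retries the call.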

Alarms

NA

Test Activity

Scalability Testing

Workaround

Re-apply the k8s strategy.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/871114

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Boovan Rajendran (brajendr)
Ghada Khalil (gkhalil)
tags: added: stx.containers stx.update
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/871114
Committed: https://opendev.org/starlingx/config/commit/af012621885c637aea4c64aa043af8d353082d0a
Submitter: "Zuul (22348)"
Branch: master

commit af012621885c637aea4c64aa043af8d353082d0a
Author: Boovan Rajendran <email address hidden>
Date: Thu Jan 19 12:00:17 2023 -0500

    Fix getting kubelet versions during k8s api server down

    When k8s api server is down, "system kube-host-upgrade
    controller-0 kubelet" fails during k8s upgrade.

    The fix is to add retry when receives max retries
    exceeded exception, so that k8s upgrade will not fail.

    This handles an exception seen in kube_get_control_plane_versions
    when k8s server is down. We now retry the function when there are
    no versions, so that k8s upgrade will not fail.

    Test Plan:
    PASS: Manually disrupt kube-apiserver by temporarily removing and
    later adding the file: /etc/kubernetes/manifests/kube-apiserver.yaml.
    PASS: Tested by manually killing kube-apiserver process.
    PASS: Tested by deleting kube-apiserver pod.

    Closes-Bug: 2003360

    Signed-off-by: Boovan Rajendran <email address hidden>
    Change-Id: I86c231fdeec16abacf2d1667c21eea471b9ecc9e
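
The retry behaviour described in the commit message can be sketched roughly as follows. This is not the merged sysinv code: retry_call, NoVersionsError, and flaky_get_versions are hypothetical names, and the built-in ConnectionError stands in for the urllib3 MaxRetryError seen in the logs.

```python
import time


class NoVersionsError(Exception):
    """Hypothetical error: the call succeeded but returned no versions."""


def retry_call(fn, attempts=3, delay_s=0.0,
               retry_on=(ConnectionError, NoVersionsError)):
    """Call fn() until it returns a non-empty result or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            result = fn()
            if not result:
                # Treat an empty result like a transient failure.
                raise NoVersionsError("no versions returned")
            return result
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            time.sleep(delay_s)


# Simulated API: refuses the connection once, then returns nothing,
# then returns a version map (hypothetical values).
calls = {"n": 0}


def flaky_get_versions():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionRefusedError(111, "ECONNREFUSED")
    if calls["n"] == 2:
        return {}
    return {"controller-0": "v1.24.4"}


versions = retry_call(flaky_get_versions, attempts=5)
print(versions, "after", calls["n"], "calls")
```

In a real deployment a non-zero delay_s would give the API server time to come back between attempts.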

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.9.0