charm errors during update-status hook with 502 Gateway Error

Bug #1999427 reported by Alexander Balderson
Affects                       Status        Importance  Assigned to    Milestone
Charm AWS Kubernetes Storage  Fix Released  Medium      Adam Dyess     1.26+ck1
Charm Azure Cloud Provider    Fix Released  Medium      Mateo Florido  1.26+ck1
Charm GCP Kubernetes Storage  Fix Released  Medium      Adam Dyess     1.26+ck1
Multus Charm                  Fix Released  Medium      Mateo Florido  1.26+ck1
OPA Gatekeeper Operator       Fix Released  Medium      Mateo Florido  1.26+ck1
SR-IOV CNI Charm              Fix Released  Medium      Adam Dyess     1.26+ck1
vSphere Cloud Provider Charm  Fix Released  Medium      Adam Dyess     1.26+ck1

Bug Description

During a test of k8s 1.26 GA, the aws-k8s-storage charm went into an error state because a request to the kubeapi-loadbalancer returned a 502.

The failure happened during the `test_audit_empty_policy` testcase from the k8s validation suite.

From the aws-k8s-storage Juju log:
unit-aws-k8s-storage-0: 23:20:56 DEBUG unit.aws-k8s-storage/0.juju-log HTTP Request: GET https://18.215.245.193/api/v1/namespaces/kube-system/secrets/aws-secret "HTTP/1.1 502 Bad Gateway"
unit-aws-k8s-storage-0: 23:20:56 ERROR unit.aws-k8s-storage/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/./src/charm.py", line 183, in <module>
    main(AwsK8sStorageCharm)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/main.py", line 438, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/main.py", line 150, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 355, in emit
    framework._emit(event) # noqa
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 856, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/./src/charm.py", line 87, in _update_status
    unready = self.collector.unready
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 90, in unready
    return sorted(
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 93, in <genexpr>
    for obj in manifest.status()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 192, in status
    return frozenset(_ for _ in self.installed_resources() if _.status_conditions)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 199, in installed_resources
    next_rsc = self.client.get(
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/client.py", line 108, in get
    return self._client.request("get", res=res, name=name, namespace=namespace)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 245, in request
    return self.handle_response(method, resp, br)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 196, in handle_response
    self.raise_for_status(resp)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 190, in raise_for_status
    raise transform_exception(e)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
    resp.raise_for_status()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/httpx/_models.py", line 745, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '502 Bad Gateway' for url 'https://18.215.245.193/api/v1/namespaces/kube-system/secrets/aws-secret'
For more information check: https://httpstatuses.com/502

Looking into what 18.215.245.193 (172.31.35.42) was, it turned out to be the kubeapi-loadbalancer; its error log at that time shows:

2022/12/09 23:20:56 [error] 42357#42357: *439 no live upstreams while connecting to upstream, client: 172.31.40.35, server: server_443, request: "PUT /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s HTTP/2.0", upstream: "https://upstream_443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s", host: "172.31.35.42:443"
2022/12/09 23:20:56 [error] 42357#42357: *439 no live upstreams while connecting to upstream, client: 172.31.40.35, server: server_443, request: "PUT /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s HTTP/2.0", upstream: "https://upstream_443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s", host: "172.31.35.42:443"
2022/12/09 23:20:56 [error] 42357#42357: *439 no live upstreams while connecting to upstream, client: 172.31.40.35, server: server_443, request: "PUT /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s HTTP/2.0", upstream: "https://upstream_443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s", host: "172.31.35.42:443"

172.31.40.35 is one of the k8s-workers (kubernetes-worker/0), but I don't see anything weird going on in its logs.

However, one of the k8s-cp units (the leader) was restarting the kube-apiserver.daemon at this exact time, which I suspect caused the failure. From the show-status-log for k8s-cp_1:
09 Dec 2022 23:20:35Z juju-unit executing running config-changed hook
09 Dec 2022 23:20:55Z workload maintenance Restarting snap.kube-apiserver.daemon service
09 Dec 2022 23:21:47Z juju-unit idle

I'm not sure what the ideal fix here would be; maybe a few additional retries with a back-off, in case services are restarting somewhere?
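
A rough sketch of that retry-with-backoff idea, applied around the lightkube call from the traceback above. This is illustrative only: the helper name, retry counts, and delays are assumptions, not anything the charm actually does.

```python
# Hedged sketch: retry a lightkube GET a few times with exponential backoff so a
# transient 502 from kubeapi-loadbalancer (e.g. while kube-apiserver restarts)
# does not fail the hook. get_aws_secret and the retry parameters are illustrative.
import time

import httpx
from lightkube import Client
from lightkube.resources.core_v1 import Secret


def get_aws_secret(client: Client, attempts: int = 3, delay: float = 2.0) -> Secret:
    """Fetch kube-system/aws-secret, retrying transient gateway errors."""
    for attempt in range(attempts):
        try:
            return client.get(Secret, name="aws-secret", namespace="kube-system")
        except httpx.HTTPStatusError as err:
            # Only retry server-side errors (502/503/...) and only while attempts
            # remain; anything else is re-raised immediately.
            if err.response.status_code < 500 or attempt == attempts - 1:
                raise
            time.sleep(delay * 2**attempt)
```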

The testrun can be found at:
https://solutions.qa.canonical.com/v2/testruns/de06d7a8-8680-40c8-a8d9-10a2ab73e1d5/
with the crashdump at:
https://oil-jenkins.canonical.com/artifacts/de06d7a8-8680-40c8-a8d9-10a2ab73e1d5/generated/generated/kubernetes-aws/juju-crashdump-kubernetes-aws-2022-12-09-23.21.31.tar.gz

Adam Dyess (addyess)
Changed in charm-aws-k8s-storage:
status: New → In Progress
Revision history for this message
Adam Dyess (addyess) wrote :

The workaround is to just resolve the charm; in most cases the issue will go away.

Changed in charm-aws-k8s-storage:
milestone: none → 1.26+ck1
Changed in charm-azure-cloud-provider:
milestone: none → 1.26+ck1
Changed in charm-gcp-k8s-storage:
milestone: none → 1.26+ck1
Changed in charm-multus:
milestone: none → 1.26+ck1
Changed in charm-vsphere-cloud-provider:
milestone: none → 1.25
Changed in charm-aws-k8s-storage:
importance: Undecided → Medium
Changed in charm-azure-cloud-provider:
importance: Undecided → Medium
Changed in charm-gcp-k8s-storage:
importance: Undecided → Medium
Changed in charm-multus:
importance: Undecided → Medium
Changed in charm-vsphere-cloud-provider:
importance: Undecided → Medium
Changed in charm-aws-k8s-storage:
assignee: nobody → Adam Dyess (addyess)
Changed in charm-azure-cloud-provider:
assignee: nobody → Adam Dyess (addyess)
Changed in charm-gcp-k8s-storage:
assignee: nobody → Adam Dyess (addyess)
Changed in charm-multus:
assignee: nobody → Adam Dyess (addyess)
Changed in charm-vsphere-cloud-provider:
assignee: nobody → Adam Dyess (addyess)
Adam Dyess (addyess)
summary: - charm errors if querying kube-system/secrets/aws-secret times errors
+ charm errors during update-status hook with 502 Gateway Error
Revision history for this message
Adam Dyess (addyess) wrote (last edit ):

The charm's status may appear like this:

aws-k8s-storage/0* error idle 54.80.73.214 hook failed: "update-status"

When the affected charms are deployed on a cloud with a `kube-api-loadbalancer`, the load balancer can respond to client requests with a 502 Gateway Error, among other error statuses not produced by the API server itself. The charm's Kubernetes client library raises an unhandled exception in this case. This results in the charm being in an error state, which is easily resolved by running

  ```bash
  juju resolve <charm/unit>
  ```
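
For illustration, this is the kind of guard a charm could put around its update-status handler so the 502 surfaces as a waiting status rather than a hook error. This is a hedged sketch only; SketchCharm, the messages, and the `collector` attribute are stand-ins, and the actual fix lives in ops-lib-manifest and the affected charms.

```python
# Hedged sketch, not the actual fix: catch the client error during update-status
# and report WaitingStatus so the next hook run can retry.
import httpx
from ops.charm import CharmBase
from ops.model import ActiveStatus, WaitingStatus


class SketchCharm(CharmBase):
    """Illustrative charm; `self.collector` stands in for the manifests Collector."""

    def _update_status(self, event) -> None:
        try:
            unready = self.collector.unready  # queries the Kubernetes API via the LB
        except httpx.HTTPStatusError as err:
            self.unit.status = WaitingStatus(
                f"Kubernetes API unavailable ({err.response.status_code}); will retry"
            )
            return
        if unready:
            self.unit.status = WaitingStatus(", ".join(unready))
        else:
            self.unit.status = ActiveStatus("Ready")
```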

Adam Dyess (addyess)
Changed in opa-gatekeeper-operator:
importance: Undecided → Medium
assignee: nobody → Adam Dyess (addyess)
milestone: none → 1.26+ck1
Changed in charm-sriov-cni:
importance: Undecided → Medium
assignee: nobody → Adam Dyess (addyess)
milestone: none → 1.26+ck1
Revision history for this message
Adam Dyess (addyess) wrote (last edit ):
Changed in charm-azure-cloud-provider:
status: New → In Progress
assignee: Adam Dyess (addyess) → Mateo Florido (mateoflorido)
Changed in charm-multus:
assignee: Adam Dyess (addyess) → Mateo Florido (mateoflorido)
status: New → In Progress
Changed in charm-gcp-k8s-storage:
status: New → In Progress
Adam Dyess (addyess)
Changed in charm-vsphere-cloud-provider:
status: New → In Progress
milestone: 1.25 → 1.26+ck1
Adam Dyess (addyess)
Changed in charm-sriov-cni:
status: New → In Progress
Changed in opa-gatekeeper-operator:
assignee: Adam Dyess (addyess) → Mateo Florido (mateoflorido)
status: New → In Progress
Revision history for this message
Mateo Florido (mateoflorido) wrote :
Adam Dyess (addyess)
Changed in charm-aws-k8s-storage:
status: In Progress → Fix Committed
Changed in charm-gcp-k8s-storage:
status: In Progress → Fix Committed
Changed in charm-sriov-cni:
status: In Progress → Fix Committed
Changed in charm-vsphere-cloud-provider:
status: In Progress → Fix Committed
Changed in charm-azure-cloud-provider:
status: In Progress → Fix Committed
Changed in charm-multus:
status: In Progress → Fix Committed
Changed in opa-gatekeeper-operator:
status: In Progress → Fix Committed
Adam Dyess (addyess)
tags: added: backport-needed
Revision history for this message
Adam Dyess (addyess) wrote (last edit ):

* sriov-cni backport completed
* aws-k8s-storage completed
* azure-cloud-provider PR open
* vsphere-cloud-provider PR open
* gcp-k8s-storage PR open
* multus completed
* opa-gatekeeper-operator PR open

Adam Dyess (addyess)
tags: removed: backport-needed
Changed in charm-aws-k8s-storage:
status: Fix Committed → Fix Released
Changed in charm-azure-cloud-provider:
status: Fix Committed → Fix Released
Changed in charm-gcp-k8s-storage:
status: Fix Committed → Fix Released
Changed in charm-multus:
status: Fix Committed → Fix Released
Changed in opa-gatekeeper-operator:
status: Fix Committed → Fix Released
Changed in charm-sriov-cni:
status: Fix Committed → Fix Released
Changed in charm-vsphere-cloud-provider:
status: Fix Committed → Fix Released
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

I'm seeing this again on 1.26 with 3.1.2/candidate, except that the bad gateway is for a different URL:

httpx.HTTPStatusError: Server error '502 Bad Gateway' for url 'https://54.224.75.143/apis/apiextensions.k8s.io/v1/customresourcedefinitions'

Is this the same issue, or should we open a new bug?

Logs can be found here: https://oil-jenkins.canonical.com/artifacts/1593612e-7c7c-4717-ae3d-4969fb00ff09/index.html

Revision history for this message
George Kraft (cynerva) wrote (last edit ):

Traceback from failure in comment #7:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 100, in client
    load_in_cluster_generic_resources(client)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/generic_resource.py", line 206, in load_in_cluster_generic_resources
    for crd in crds:
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 252, in list
    cont, chunk = self.handle_response('list', resp, br)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 196, in handle_response
    self.raise_for_status(resp)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 190, in raise_for_status
    raise transform_exception(e)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
    resp.raise_for_status()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/httpx/_models.py", line 749, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '502 Bad Gateway' for url 'https://54.224.75.143/apis/apiextensions.k8s.io/v1/customresourcedefinitions'
For more information check: https://httpstatuses.com/502

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./src/charm.py", line 206, in <module>
    main(AwsK8sStorageCharm)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/main.py", line 438, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/main.py", line 150, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 355, in emit
    framework._emit(event) # noqa
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 856, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 99, in _update_status
    unready = self.collector.unready
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 92, in unready
    for (name, obj), cond in self.conditions.items()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 108, in conditions
    return {
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 111, in <dictcomp>
    for obj in manifest.status()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 226, in status
    return frozenset(_ for _ in self.installed_resources() if _.status_conditions)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 233, in installed_resources
    next_rsc = self.client.get(
  File "/usr/lib/python3.8/...

Revision history for this message
George Kraft (cynerva) wrote :

It definitely looks related, but the traceback is in a different code path. I think there is a fix on the way already via https://github.com/canonical/ops-lib-manifest/pull/21 and https://github.com/charmed-kubernetes/aws-k8s-storage/pull/8.

I would recommend opening a new issue so we can track the new fix for the new code path separately.

Revision history for this message
George Kraft (cynerva) wrote :

Scratch what I said in comment #9. There's still an uncaught ManifestClientError via self.collector.unready and I don't think the open PRs will fix it. I'm also seeing now that the tracebacks share a common path through self.collector.unready.

This might be the same issue. I'll get Adam to take a look.

Revision history for this message
Adam Dyess (addyess) wrote :

I agree, it's likely the same root cause just manifesting itself in a different place -- this time in the `install_resources`

https://github.com/canonical/ops-lib-manifest/pull/23 addresses this; it will require a force rebuild of the above set of charms.
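
A rough sketch of that library-side idea: translate the underlying HTTP failure into a single exception type (the ManifestClientError mentioned above) that the charms can catch in one place. The wrapper below is illustrative, not the actual ops-lib-manifest change.

```python
# Hedged sketch of exception translation: wrap lightkube/httpx failures in one
# library-level error so callers catch a single exception type. The class name
# mirrors the ManifestClientError mentioned above; everything else is illustrative.
import httpx


class ManifestClientError(Exception):
    """Raised when the Kubernetes API cannot be reached (e.g. a 502 from the LB)."""


def api_call(method, *args, **kwargs):
    """Invoke a lightkube client method, translating HTTP status failures."""
    try:
        return method(*args, **kwargs)
    except httpx.HTTPStatusError as err:
        raise ManifestClientError(f"API request failed: {err}") from err
```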
