charm errors during update-status hook with 502 Gateway Error

Bug #1999427 reported by Alexander Balderson
Affects                       Status        Importance  Assigned to    Milestone
Charm AWS Kubernetes Storage  Fix Released  Medium      Adam Dyess     1.26+ck1
Charm Azure Cloud Provider    Fix Released  Medium      Mateo Florido  1.26+ck1
Charm GCP Kubernetes Storage  Fix Released  Medium      Adam Dyess     1.26+ck1
Multus Charm                  Fix Released  Medium      Mateo Florido  1.26+ck1
OPA Gatekeeper Operator       Fix Released  Medium      Mateo Florido  1.26+ck1
SR-IOV CNI Charm              Fix Released  Medium      Adam Dyess     1.26+ck1
vSphere Cloud Provider Charm  Fix Released  Medium      Adam Dyess     1.26+ck1

Bug Description

During a test of k8s 1.26 GA, the aws-k8s-storage charm went into an error state because a request to the kubeapi-loadbalancer returned a 502.

The failure happened during the `test_audit_empty_policy` testcase from the k8s validation suite.

From the aws-k8s-storage Juju log:
unit-aws-k8s-storage-0: 23:20:56 DEBUG unit.aws-k8s-storage/0.juju-log HTTP Request: GET https://18.215.245.193/api/v1/namespaces/kube-system/secrets/aws-secret "HTTP/1.1 502 Bad Gateway"
unit-aws-k8s-storage-0: 23:20:56 ERROR unit.aws-k8s-storage/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/./src/charm.py", line 183, in <module>
    main(AwsK8sStorageCharm)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/main.py", line 438, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/main.py", line 150, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 355, in emit
    framework._emit(event) # noqa
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 856, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/./src/charm.py", line 87, in _update_status
    unready = self.collector.unready
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 90, in unready
    return sorted(
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 93, in <genexpr>
    for obj in manifest.status()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 192, in status
    return frozenset(_ for _ in self.installed_resources() if _.status_conditions)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 199, in installed_resources
    next_rsc = self.client.get(
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/client.py", line 108, in get
    return self._client.request("get", res=res, name=name, namespace=namespace)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 245, in request
    return self.handle_response(method, resp, br)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 196, in handle_response
    self.raise_for_status(resp)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 190, in raise_for_status
    raise transform_exception(e)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
    resp.raise_for_status()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/httpx/_models.py", line 745, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '502 Bad Gateway' for url 'https://18.215.245.193/api/v1/namespaces/kube-system/secrets/aws-secret'
For more information check: https://httpstatuses.com/502

Looking into what 18.215.245.193 (172.31.35.42) was, it turned out to be the kubeapi-loadbalancer; its error log at that time shows:

2022/12/09 23:20:56 [error] 42357#42357: *439 no live upstreams while connecting to upstream, client: 172.31.40.35, server: server_443, request: "PUT /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s HTTP/2.0", upstream: "https://upstream_443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s", host: "172.31.35.42:443"
2022/12/09 23:20:56 [error] 42357#42357: *439 no live upstreams while connecting to upstream, client: 172.31.40.35, server: server_443, request: "PUT /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s HTTP/2.0", upstream: "https://upstream_443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s", host: "172.31.35.42:443"
2022/12/09 23:20:56 [error] 42357#42357: *439 no live upstreams while connecting to upstream, client: 172.31.40.35, server: server_443, request: "PUT /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s HTTP/2.0", upstream: "https://upstream_443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-31-40-35.ec2.internal?timeout=10s", host: "172.31.35.42:443"

172.31.40.35 is one of the k8s-workers (kubernetes-worker/0), but I don't see anything weird going on in its logs.

However, one of the k8s-cp units (the leader) was restarting the kube-apiserver.daemon at this exact time, which I suspect caused the failure. From the show-status-log for k8s-cp_1:
09 Dec 2022 23:20:35Z juju-unit executing running config-changed hook
09 Dec 2022 23:20:55Z workload maintenance Restarting snap.kube-apiserver.daemon service
09 Dec 2022 23:21:47Z juju-unit idle

I'm not sure what the ideal fix here would be; maybe a few additional retries with a back-off, in case services are restarting somewhere?
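
A rough sketch of that retry-with-backoff idea, applied around the lightkube call from the traceback above. This is illustrative only: the helper name, retry counts, and delays are assumptions, not anything the charm actually does.

```python
# Hedged sketch: retry a lightkube GET a few times with exponential backoff so a
# transient 502 from kubeapi-loadbalancer (e.g. while kube-apiserver restarts)
# does not fail the hook. get_aws_secret and the retry parameters are illustrative.
import time

import httpx
from lightkube import Client
from lightkube.resources.core_v1 import Secret


def get_aws_secret(client: Client, attempts: int = 3, delay: float = 2.0) -> Secret:
    """Fetch kube-system/aws-secret, retrying transient gateway errors."""
    for attempt in range(attempts):
        try:
            return client.get(Secret, name="aws-secret", namespace="kube-system")
        except httpx.HTTPStatusError as err:
            # Only retry server-side errors (502/503/...) and only while attempts
            # remain; anything else is re-raised immediately.
            if err.response.status_code < 500 or attempt == attempts - 1:
                raise
            time.sleep(delay * 2**attempt)
```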

The testrun can be found at:
https://solutions.qa.canonical.com/v2/testruns/de06d7a8-8680-40c8-a8d9-10a2ab73e1d5/
with the crashdump at:
https://oil-jenkins.canonical.com/artifacts/de06d7a8-8680-40c8-a8d9-10a2ab73e1d5/generated/generated/kubernetes-aws/juju-crashdump-kubernetes-aws-2022-12-09-23.21.31.tar.gz

Adam Dyess (addyess)
Changed in charm-aws-k8s-storage:
status: New → In Progress
Revision history for this message
Adam Dyess (addyess) wrote :

The workaround is to just resolve the charm; in most cases the issue will go away.

Changed in charm-aws-k8s-storage:
milestone: none → 1.26+ck1
Changed in charm-azure-cloud-provider:
milestone: none → 1.26+ck1
Changed in charm-gcp-k8s-storage:
milestone: none → 1.26+ck1
Changed in charm-multus:
milestone: none → 1.26+ck1
Changed in charm-vsphere-cloud-provider:
milestone: none → 1.25
Changed in charm-aws-k8s-storage:
importance: Undecided → Medium
Changed in charm-azure-cloud-provider:
importance: Undecided → Medium
Changed in charm-gcp-k8s-storage:
importance: Undecided → Medium
Changed in charm-multus:
importance: Undecided → Medium
Changed in charm-vsphere-cloud-provider:
importance: Undecided → Medium
Changed in charm-aws-k8s-storage:
assignee: nobody → Adam Dyess (addyess)
Changed in charm-azure-cloud-provider:
assignee: nobody → Adam Dyess (addyess)
Changed in charm-gcp-k8s-storage:
assignee: nobody → Adam Dyess (addyess)
Changed in charm-multus:
assignee: nobody → Adam Dyess (addyess)
Changed in charm-vsphere-cloud-provider:
assignee: nobody → Adam Dyess (addyess)
Adam Dyess (addyess)
summary: - charm errors if querying kube-system/secrets/aws-secret times errors
+ charm errors during update-status hook with 502 Gateway Error
Revision history for this message
Adam Dyess (addyess) wrote (last edit ):

The charm's status may appear like this:

aws-k8s-storage/0* error idle 54.80.73.214 hook failed: "update-status"

When the affected charms are deployed on a cloud with a `kube-api-loadbalancer`, the load balancer can respond to client requests with a 502 Gateway Error, among other error statuses not produced by the API server itself. The charm's Kubernetes client library raises an unhandled exception in this case. This results in the charm being in an error state, which is easily resolved by running

  ```bash
  juju resolve <charm/unit>
  ```
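
For illustration, this is the kind of guard a charm could put around its update-status handler so the 502 surfaces as a waiting status rather than a hook error. This is a hedged sketch only; SketchCharm, the messages, and the `collector` attribute are stand-ins, and the actual fix lives in ops-lib-manifest and the affected charms.

```python
# Hedged sketch, not the actual fix: catch the client error during update-status
# and report WaitingStatus so the next hook run can retry.
import httpx
from ops.charm import CharmBase
from ops.model import ActiveStatus, WaitingStatus


class SketchCharm(CharmBase):
    """Illustrative charm; `self.collector` stands in for the manifests Collector."""

    def _update_status(self, event) -> None:
        try:
            unready = self.collector.unready  # queries the Kubernetes API via the LB
        except httpx.HTTPStatusError as err:
            self.unit.status = WaitingStatus(
                f"Kubernetes API unavailable ({err.response.status_code}); will retry"
            )
            return
        if unready:
            self.unit.status = WaitingStatus(", ".join(unready))
        else:
            self.unit.status = ActiveStatus("Ready")
```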

Adam Dyess (addyess)
Changed in opa-gatekeeper-operator:
importance: Undecided → Medium
assignee: nobody → Adam Dyess (addyess)
milestone: none → 1.26+ck1
Changed in charm-sriov-cni:
importance: Undecided → Medium
assignee: nobody → Adam Dyess (addyess)
milestone: none → 1.26+ck1
Revision history for this message
Adam Dyess (addyess) wrote (last edit ):
Changed in charm-azure-cloud-provider:
status: New → In Progress
assignee: Adam Dyess (addyess) → Mateo Florido (mateoflorido)
Changed in charm-multus:
assignee: Adam Dyess (addyess) → Mateo Florido (mateoflorido)
status: New → In Progress
Changed in charm-gcp-k8s-storage:
status: New → In Progress
Adam Dyess (addyess)
Changed in charm-vsphere-cloud-provider:
status: New → In Progress
milestone: 1.25 → 1.26+ck1
Adam Dyess (addyess)
Changed in charm-sriov-cni:
status: New → In Progress
Changed in opa-gatekeeper-operator:
assignee: Adam Dyess (addyess) → Mateo Florido (mateoflorido)
status: New → In Progress
Revision history for this message
Mateo Florido (mateoflorido) wrote :
Adam Dyess (addyess)
Changed in charm-aws-k8s-storage:
status: In Progress → Fix Committed
Changed in charm-gcp-k8s-storage:
status: In Progress → Fix Committed
Changed in charm-sriov-cni:
status: In Progress → Fix Committed
Changed in charm-vsphere-cloud-provider:
status: In Progress → Fix Committed
Changed in charm-azure-cloud-provider:
status: In Progress → Fix Committed
Changed in charm-multus:
status: In Progress → Fix Committed
Changed in opa-gatekeeper-operator:
status: In Progress → Fix Committed
Adam Dyess (addyess)
tags: added: backport-needed
Revision history for this message
Adam Dyess (addyess) wrote (last edit ):

* sriov-cni backport completed
* aws-k8s-storage completed
* azure-cloud-provider PR open
* vsphere-cloud-provider PR open
* gcp-k8s-storage PR open
* multus completed
* opa-gatekeeper-operator PR open

Adam Dyess (addyess)
tags: removed: backport-needed
Changed in charm-aws-k8s-storage:
status: Fix Committed → Fix Released
Changed in charm-azure-cloud-provider:
status: Fix Committed → Fix Released
Changed in charm-gcp-k8s-storage:
status: Fix Committed → Fix Released
Changed in charm-multus:
status: Fix Committed → Fix Released
Changed in opa-gatekeeper-operator:
status: Fix Committed → Fix Released
Changed in charm-sriov-cni:
status: Fix Committed → Fix Released
Changed in charm-vsphere-cloud-provider:
status: Fix Committed → Fix Released
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

I'm seeing this again on 1.26 with 3.1.2/candidate, except that the bad gateway is for a different URL:

httpx.HTTPStatusError: Server error '502 Bad Gateway' for url 'https://54.224.75.143/apis/apiextensions.k8s.io/v1/customresourcedefinitions'

Is this the same issue, or should we open a new bug?

Logs can be found here: https://oil-jenkins.canonical.com/artifacts/1593612e-7c7c-4717-ae3d-4969fb00ff09/index.html

Revision history for this message
George Kraft (cynerva) wrote (last edit ):

Traceback from failure in comment #7:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 100, in client
    load_in_cluster_generic_resources(client)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/generic_resource.py", line 206, in load_in_cluster_generic_resources
    for crd in crds:
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 252, in list
    cont, chunk = self.handle_response('list', resp, br)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 196, in handle_response
    self.raise_for_status(resp)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 190, in raise_for_status
    raise transform_exception(e)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
    resp.raise_for_status()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/httpx/_models.py", line 749, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '502 Bad Gateway' for url 'https://54.224.75.143/apis/apiextensions.k8s.io/v1/customresourcedefinitions'
For more information check: https://httpstatuses.com/502

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./src/charm.py", line 206, in <module>
    main(AwsK8sStorageCharm)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/main.py", line 438, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/main.py", line 150, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 355, in emit
    framework._emit(event) # noqa
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 856, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 99, in _update_status
    unready = self.collector.unready
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 92, in unready
    for (name, obj), cond in self.conditions.items()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 108, in conditions
    return {
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/collector.py", line 111, in <dictcomp>
    for obj in manifest.status()
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 226, in status
    return frozenset(_ for _ in self.installed_resources() if _.status_conditions)
  File "/var/lib/juju/agents/unit-aws-k8s-storage-0/charm/venv/ops/manifests/manifest.py", line 233, in installed_resources
    next_rsc = self.client.get(
  File "/usr/lib/python3.8/...

Revision history for this message
George Kraft (cynerva) wrote :

It definitely looks related, but the traceback is in a different code path. I think there is a fix on the way already via https://github.com/canonical/ops-lib-manifest/pull/21 and https://github.com/charmed-kubernetes/aws-k8s-storage/pull/8.

I would recommend opening a new issue so we can track the new fix for the new code path separately.

Revision history for this message
George Kraft (cynerva) wrote :

Scratch what I said in comment #9. There's still an uncaught ManifestClientError via self.collector.unready and I don't think the open PRs will fix it. I'm also seeing now that the tracebacks share a common path through self.collector.unready.

This might be the same issue. I'll get Adam to take a look.

Revision history for this message
Adam Dyess (addyess) wrote :

I agree, it's likely the same root cause just manifesting itself in a different place -- this time in the `install_resources`

https://github.com/canonical/ops-lib-manifest/pull/23 addresses this; it will require a force rebuild of the above set of charms.
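
A rough sketch of that library-side idea: translate the underlying HTTP failure into a single exception type (the ManifestClientError mentioned above) that the charms can catch in one place. The wrapper below is illustrative, not the actual ops-lib-manifest change.

```python
# Hedged sketch of exception translation: wrap lightkube/httpx failures in one
# library-level error so callers catch a single exception type. The class name
# mirrors the ManifestClientError mentioned above; everything else is illustrative.
import httpx


class ManifestClientError(Exception):
    """Raised when the Kubernetes API cannot be reached (e.g. a 502 from the LB)."""


def api_call(method, *args, **kwargs):
    """Invoke a lightkube client method, translating HTTP status failures."""
    try:
        return method(*args, **kwargs)
    except httpx.HTTPStatusError as err:
        raise ManifestClientError(f"API request failed: {err}") from err
```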
