controller restart caused sidecar charm k8s workload restarts

Bug #2036594 reported by Tom Haddon
This bug affects 10 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Harry Pidcock
Milestone: none

Bug Description

We recently restarted the controllers to run mgopurge, to try to address some performance issues with them (juju status taking more than 2 minutes on particular models, for instance). Here's what was done (sorry, Canonical internal only): https://pastebin.canonical.com/p/rkH6RNJXgJ/

In doing so, we saw k8s models attached to this cluster get pods rescheduled. We assume this is because pebble was having problems contacting the controller during the restarts. Here's a charm log from the time of the incident: https://pastebin.canonical.com/p/8JWNMkB8y3/

The controller and model version is juju 2.9.44.

Tags: canonical-is
Tom Haddon (mthaddon)
tags: added: canonical-is
description: updated
Revision history for this message
Tom Haddon (mthaddon) wrote :

I've been able to reproduce this locally. If I deploy juju 3.1.5 on microk8s and then deploy a sidecar charm into a model (in my case I've been testing with discourse-k8s), I can take the application from working fine to having its charm container restarted by running `/opt/pebble stop jujud` in the api-server container of the controller-0 pod.
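
For reference, a minimal reproduction looks roughly like the following (a sketch; the controller name, model name and namespace are illustrative, and discourse-k8s also needs its usual relations to go fully active, but any sidecar charm should show the same behaviour):

# bootstrap a small controller on microk8s and deploy a sidecar charm
juju bootstrap microk8s micro
juju add-model test
juju deploy discourse-k8s
# once the unit has settled, stop jujud in the controller pod's api-server container
kubectl exec -n controller-micro controller-0 -c api-server -- /opt/pebble stop jujud
# watch the charm container get restarted once the probe failure threshold is hit
kubectl get pods -n test -w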

Here are the logs from the charm container, from the point I run `/opt/pebble stop jujud` until it's killed:

2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 INFO juju.worker.logger logger.go:136 logger worker stopped
2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 INFO juju.worker.uniter uniter.go:338 unit "discourse-k8s/0" shutting down: catacomb 0xc00054e000 is dying
2023-09-19T15:25:51.971Z [pebble] Check "liveness" failure 1 (threshold 3): received non-20x status code 404
2023-09-19T15:25:51.972Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 404
2023-09-19T15:26:01.972Z [pebble] Check "liveness" failure 2 (threshold 3): received non-20x status code 404
2023-09-19T15:26:01.972Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 404
2023-09-19T15:26:04.589Z [container-agent] 2023-09-19 15:26:04 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [b7ee1c] "unit-discourse-k8s-0" cannot open api: unable to connect to API: dial tcp 10.152.183.49:17070: connect: connection refused
2023-09-19T15:26:11.970Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure 3 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure threshold 3 hit, triggering action
2023-09-19T15:26:21.970Z [pebble] Check "readiness" failure 4 (threshold 3): received non-20x status code 404
2023-09-19T15:26:21.970Z [pebble] Check "liveness" failure 4 (threshold 3): received non-20x status code 404
2023-09-19T15:26:25.552Z [container-agent] 2023-09-19 15:26:25 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [b7ee1c] "unit-discourse-k8s-0" cannot open api: unable to connect to API: dial tcp 10.152.183.49:17070: connect: connection refused
2023-09-19T15:26:31.970Z [pebble] Check "liveness" failure 5 (threshold 3): received non-20x status code 404
2023-09-19T15:26:31.970Z [pebble] Check "readiness" failure 5 (threshold 3): received non-20x status code 404
2023-09-19T15:26:41.970Z [pebble] Check "liveness" failure 6 (threshold 3): received non-20x status code 404
2023-09-19T15:26:41.970Z [pebble] Check "readiness" failure 6 (threshold 3): received non-20x status code 404

Haw Loeung (hloeung)
Changed in juju:
status: New → Confirmed
Revision history for this message
Harry Pidcock (hpidcock) wrote (last edit):

I think the correct course of action here is to remove the uniter's influence on the readiness/liveness probes and have it influence only the startup probe.
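
For anyone following along, the probes currently wired onto the charm container can be inspected with something like this (a sketch; pod, namespace and container name are assumptions, and the output depends on the Juju version):

kubectl get pod discourse-k8s-0 -n test -o jsonpath='{.spec.containers[?(@.name=="charm")].startupProbe}{"\n"}'
kubectl get pod discourse-k8s-0 -n test -o jsonpath='{.spec.containers[?(@.name=="charm")].livenessProbe}{"\n"}'
kubectl get pod discourse-k8s-0 -n test -o jsonpath='{.spec.containers[?(@.name=="charm")].readinessProbe}{"\n"}'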

Changed in juju:
importance: Undecided → High
milestone: none → 2.9.46
status: Confirmed → Triaged
assignee: nobody → Harry Pidcock (hpidcock)
Revision history for this message
Harry Pidcock (hpidcock) wrote :

The fix for https://bugs.launchpad.net/juju/+bug/2037478 mitigates this somewhat, reducing the importance of this one.

Changed in juju:
importance: High → Medium
Revision history for this message
Haw Loeung (hloeung) wrote :

What's changed in LP:2037478? I see it's linked to https://github.com/juju/juju/pull/16325/files which doesn't have much?

Revision history for this message
Harry Pidcock (hpidcock) wrote :

LP:2037478 deals specifically with the case where the controller addresses have changed (i.e. model migration, api addresses changing, new HA controller machines, etc.) or something else in agent.conf has changed: if this error (LP:2036594) is then triggered, it causes a failure that requires manual intervention (i.e. deleting the pods or manually updating the template-agent.conf).

With just LP:2037478 fixed, the worst case is that the charm containers bounce and the pod becomes unhealthy. We are still going to fix this bug; it just might happen in a few weeks.

John A Meinel (jameinel)
Changed in juju:
importance: Medium → High
Ian Booth (wallyworld)
no longer affects: juju/3.2
Revision history for this message
Ian Booth (wallyworld) wrote :

The next 2.9.46 candidate release will not include a fix for this bug and we don't plan on any more 2.9 releases. As such it is being removed from its 2.9 milestone.

If the bug is still important to you, let us know and we can consider it for inclusion on a 3.x milestone.

no longer affects: juju/3.1
Changed in juju:
milestone: 2.9.46 → none
Revision history for this message
Tom Haddon (mthaddon) wrote :

We recently experienced an outage on juju 3.1.6 controllers, with messages in the controller logs like the following:

2024-02-06 10:16:07 INFO juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [e57cec] "machine-0" cannot open api: try again (try again)
2024-02-06 10:16:07 INFO juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [869b95] "machine-0" cannot open api: try again (try again)

During this, I noticed that the `charm` containers for pods connected to this controller were experiencing problems, but the workload container (`discourse`) had no restarts during the period of the outage. However, looking at the kubernetes service defined for this application, all pods had been removed from the service's endpoints, so no traffic was reaching the backends.

Looking at the output of `kubectl describe pod` we get this: https://pastebin.canonical.com/p/RmKwCFdRxz/ (sorry, Canonical only). As you can see, the `charm` container has restarted more recently than the `discourse` container, but look at this particular section:

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True

The pod itself is marked as "Ready False", so no traffic is being sent to it.
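
This can be cross-checked from the Kubernetes side with something like the following (a sketch; the service, pod and namespace names are illustrative, use the ones from your model):

# with the pod not Ready, the Service should show no endpoints behind it
kubectl get endpoints discourse-k8s -n <model-namespace>
# pod conditions in a compact form
kubectl get pod discourse-k8s-0 -n <model-namespace> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'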
