controller restart caused sidecar charm k8s workload restarts

Bug #2036594 reported by Tom Haddon
This bug affects 10 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Harry Pidcock
Milestone: none

Bug Description

We recently restarted the controllers to run mgopurge, to try to address some performance issues with them (juju status taking more than 2 minutes on particular models, for instance). Here's what was done (sorry, Canonical internal only): https://pastebin.canonical.com/p/rkH6RNJXgJ/

In doing so, we saw k8s models attached to this cluster get pods rescheduled. We assume this is because pebble was having problems contacting the controller during the restarts. Here's a charm log from the time of the incident: https://pastebin.canonical.com/p/8JWNMkB8y3/

The controller and model version is juju 2.9.44.

Tags: canonical-is
Tom Haddon (mthaddon)
tags: added: canonical-is
description: updated
Revision history for this message
Tom Haddon (mthaddon) wrote :

I've been able to reproduce this locally. If I deploy juju 3.1.5 on microk8s and then deploy a sidecar charm into a model (in my case I've been testing with discourse-k8s), I can take the application from working fine to having its charm container restarted by running `/opt/pebble stop jujud` in the api-server container of the controller-0 pod.
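
For reference, a minimal reproduction looks roughly like the following (a sketch; the controller name, model name and namespace are illustrative, and discourse-k8s also needs its usual relations to go fully active, but any sidecar charm should show the same behaviour):

# bootstrap a small controller on microk8s and deploy a sidecar charm
juju bootstrap microk8s micro
juju add-model test
juju deploy discourse-k8s
# once the unit has settled, stop jujud in the controller pod's api-server container
kubectl exec -n controller-micro controller-0 -c api-server -- /opt/pebble stop jujud
# watch the charm container get restarted once the probe failure threshold is hit
kubectl get pods -n test -w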

Here are the logs from the charm container, from the point I run `/opt/pebble stop jujud` until it's killed:

2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 INFO juju.worker.logger logger.go:136 logger worker stopped
2023-09-19T15:25:45.468Z [container-agent] 2023-09-19 15:25:45 INFO juju.worker.uniter uniter.go:338 unit "discourse-k8s/0" shutting down: catacomb 0xc00054e000 is dying
2023-09-19T15:25:51.971Z [pebble] Check "liveness" failure 1 (threshold 3): received non-20x status code 404
2023-09-19T15:25:51.972Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 404
2023-09-19T15:26:01.972Z [pebble] Check "liveness" failure 2 (threshold 3): received non-20x status code 404
2023-09-19T15:26:01.972Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 404
2023-09-19T15:26:04.589Z [container-agent] 2023-09-19 15:26:04 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [b7ee1c] "unit-discourse-k8s-0" cannot open api: unable to connect to API: dial tcp 10.152.183.49:17070: connect: connection refused
2023-09-19T15:26:11.970Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure 3 (threshold 3): received non-20x status code 404
2023-09-19T15:26:11.970Z [pebble] Check "liveness" failure threshold 3 hit, triggering action
2023-09-19T15:26:21.970Z [pebble] Check "readiness" failure 4 (threshold 3): received non-20x status code 404
2023-09-19T15:26:21.970Z [pebble] Check "liveness" failure 4 (threshold 3): received non-20x status code 404
2023-09-19T15:26:25.552Z [container-agent] 2023-09-19 15:26:25 ERROR juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [b7ee1c] "unit-discourse-k8s-0" cannot open api: unable to connect to API: dial tcp 10.152.183.49:17070: connect: connection refused
2023-09-19T15:26:31.970Z [pebble] Check "liveness" failure 5 (threshold 3): received non-20x status code 404
2023-09-19T15:26:31.970Z [pebble] Check "readiness" failure 5 (threshold 3): received non-20x status code 404
2023-09-19T15:26:41.970Z [pebble] Check "liveness" failure 6 (threshold 3): received non-20x status code 404
2023-09-19T15:26:41.970Z [pebble] Check "readiness" failure 6 (threshold 3): received non-20x status code 404

Haw Loeung (hloeung)
Changed in juju:
status: New → Confirmed
Revision history for this message
Harry Pidcock (hpidcock) wrote (last edit):

I think the correct course of action here is to remove the uniter's influence on the readiness/liveness probes and have it influence only the startup probe.
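
For anyone following along, the probes currently wired onto the charm container can be inspected with something like this (a sketch; pod, namespace and container name are assumptions, and the output depends on the Juju version):

kubectl get pod discourse-k8s-0 -n test -o jsonpath='{.spec.containers[?(@.name=="charm")].startupProbe}{"\n"}'
kubectl get pod discourse-k8s-0 -n test -o jsonpath='{.spec.containers[?(@.name=="charm")].livenessProbe}{"\n"}'
kubectl get pod discourse-k8s-0 -n test -o jsonpath='{.spec.containers[?(@.name=="charm")].readinessProbe}{"\n"}'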

Changed in juju:
importance: Undecided → High
milestone: none → 2.9.46
status: Confirmed → Triaged
assignee: nobody → Harry Pidcock (hpidcock)
Revision history for this message
Harry Pidcock (hpidcock) wrote :

The fix for https://bugs.launchpad.net/juju/+bug/2037478 mitigates this somewhat, reducing the importance of this one.

Changed in juju:
importance: High → Medium
Revision history for this message
Haw Loeung (hloeung) wrote :

What's changed in LP:2037478? I see it's linked to https://github.com/juju/juju/pull/16325/files which doesn't have much?

Revision history for this message
Harry Pidcock (hpidcock) wrote :

LP:2037478 deals specifically with the case where the controller addresses have changed (i.e. model migration, api addresses changing, new HA controller machines, etc.) or something else in agent.conf has changed: if this error (LP:2036594) is then triggered, it causes a failure that requires manual intervention (i.e. deleting the pods or manually updating the template-agent.conf).

With just LP:2037478 fixed, the worst case is that the charm containers bounce and the pod becomes unhealthy. We are still going to fix this bug; it just might happen in a few weeks.

John A Meinel (jameinel)
Changed in juju:
importance: Medium → High
Ian Booth (wallyworld)
no longer affects: juju/3.2
Revision history for this message
Ian Booth (wallyworld) wrote :

The next 2.9.46 candidate release will not include a fix for this bug and we don't plan on any more 2.9 releases. As such it is being removed from its 2.9 milestone.

If the bug is still important to you, let us know and we can consider it for inclusion on a 3.x milestone.

no longer affects: juju/3.1
Changed in juju:
milestone: 2.9.46 → none
Revision history for this message
Tom Haddon (mthaddon) wrote :

We recently experienced an outage on juju 3.1.6 controllers, with messages in the controller logs like the following:

2024-02-06 10:16:07 INFO juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [e57cec] "machine-0" cannot open api: try again (try again)
2024-02-06 10:16:07 INFO juju.worker.dependency engine.go:695 "api-caller" manifold worker returned unexpected error: [869b95] "machine-0" cannot open api: try again (try again)

During this, I noticed that the `charm` containers for pods connected to this controller were experiencing problems, but the workload container (`discourse`) had no restarts during the period of the outage. However, looking at the kubernetes service defined for this application, all pods had been removed from the service's endpoints, so no traffic was reaching the backends.

Looking at the output of `kubectl describe pod` we get this: https://pastebin.canonical.com/p/RmKwCFdRxz/ (sorry, Canonical only). As you can see, the `charm` container has restarted more recently than the `discourse` container, but look at this particular section:

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True

The pod itself is marked as "Ready False", so no traffic is being sent to it.
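
This can be cross-checked from the Kubernetes side with something like the following (a sketch; the service, pod and namespace names are illustrative, use the ones from your model):

# with the pod not Ready, the Service should show no endpoints behind it
kubectl get endpoints discourse-k8s -n <model-namespace>
# pod conditions in a compact form
kubectl get pod discourse-k8s-0 -n <model-namespace> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'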
