action stuck in pending

Bug #2056193 reported by Tom Haddon
This bug affects 1 person
Affects         Status  Importance  Assigned to  Milestone
Canonical Juju  New     High        Unassigned

Bug Description

I've got a k8s charm running on juju 3.1.6, and ran an action against it over an hour ago. The action simply runs a django management command which performs a read-only query against a PostgreSQL database; when testing locally it returns within a few seconds. For some reason on this production instance the action is stuck in pending:

```
$ juju show-task 2
Task 2 pending
```

I can't see any process in the charm or workload container related to this.

The charm itself has errors in the debug-log related to being unable to connect to external resources to download the promtail binary for the loki_push_api charm library:
```
unit-httprequest-lego-provider-0: 14:38:24 ERROR unit.httprequest-lego-provider/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/usr/lib/python3.10/http/client.py", line 1448, in connect
    super().connect()
  File "/usr/lib/python3.10/http/client.py", line 942, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.10/socket.py", line 845, in create_connection
    raise err
  File "/usr/lib/python3.10/socket.py", line 833, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/./src/charm.py", line 98, in <module>
    ops.main.main(DjangoCharm)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/main.py", line 456, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/framework.py", line 865, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/framework.py", line 955, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/lib/charms/loki_k8s/v0/loki_push_api.py", line 1821, in _on_pebble_ready
    self._setup_promtail()
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/lib/charms/loki_k8s/v0/loki_push_api.py", line 2318, in _setup_promtail
    self._obtain_promtail(promtail_binaries[self._arch])
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/lib/charms/loki_k8s/v0/loki_push_api.py", line 1991, in _obtain_promtail
    self._download_and_push_promtail_to_workload(promtail_info)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/lib/charms/loki_k8s/v0/loki_push_api.py", line 2132, in _download_and_push_promtail_to_workload
    with opener.open(promtail_info["url"]) as r:
  File "/usr/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
unit-httprequest-lego-provider-0: 14:38:24 ERROR juju.worker.uniter.operation hook "django-app-pebble-ready" (via hook dispatching script: dispatch) failed: exit status 1
```
This is being fixed separately (by adding support for proxies in this situation). Is it possible that this is blocking the action(s) from running? I don't see anything related to actions in `juju debug-log --no-tail --replay`.

Please let me know how to debug this further.

Tags: canonical-is
Tom Haddon (mthaddon)
tags: added: canonical-is
Revision history for this message
Tom Haddon (mthaddon) wrote :

We've now done a charm upgrade to fix the error in django-app-pebble-ready, so the charm in question is no longer showing any errors in juju debug-log, but rerunning the action leaves it still stuck in pending.

Revision history for this message
Joseph Phillips (manadart) wrote :

Not quite clear on the "fixed and now does not show errors" part.

Does every run create a new pending action, or is it just that one action?

It is a charm-defined action, correct? Not doing some out-of-band trick with exec?

Changed in juju:
status: New → Incomplete
Revision history for this message
Tom Haddon (mthaddon) wrote :

I mean that the django-app-pebble-ready error included when the bug was originally filed has been fixed, but we're still seeing problems with any new action we run getting stuck. These are charm-defined actions.

Tom Haddon (mthaddon)
Changed in juju:
status: Incomplete → New
Revision history for this message
John A Meinel (jameinel) wrote :

I did mention trying out:
```
juju exec --execution-group=abcfd -- command
```

That lets you use a different, named execution group, but I don't think we expose --execution-group on the CLI.
According to Ian, as a charmer you can put the execution-group into the action definition (as part of actions.yaml), but that wouldn't be a workaround here.
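
For reference, a rough sketch of what that might look like in an action definition. The `execution-group` and `parallel` field names are my reading of the Juju 3.x actions schema, and the group name is made up, so verify against the current docs before relying on it:
```
$ cat actions.yaml
list-domains:
  description: List the domains for a given user.
  params:
    username:
      type: string
  execution-group: maintenance   # hypothetical group name
  parallel: false
```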

Revision history for this message
Tom Haddon (mthaddon) wrote :

A further update here, since I just learned of the `juju operations` command:

```
$ juju operations
ID  Status   Started  Finished  Task IDs  Summary
 7  pending                     8         list-domains run on unit-httprequest-lego-provider-0
$ juju show-operation 7
summary: list-domains run on unit-httprequest-lego-provider-0
status: pending
action:
  name: list-domains
  parameters:
    username: prod-is-cos-ps6
timing:
  enqueued: 2024-03-07 08:47:26 +0000 UTC
tasks:
  "8":
    host: httprequest-lego-provider/0
    status: pending
    timing:
      enqueued: 2024-03-07 08:47:26 +0000 UTC
$ juju show-task 8
Task 8 pending

$ date
Fri Mar 8 12:31:38 UTC 2024
```

Revision history for this message
Ian Booth (wallyworld) wrote :

We'll need to see some additional info to help diagnose the issue:
- the content of the unitstates collection relevant to the affected unit
- the content of the action and actionnotifications collections relevant to the pending task
- the content of the operations collection relevant to the pending operation
- logging for the affected unit agent with "juju.worker.uniter=DEBUG"
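
For the logging item, one way to get it is model-level config, e.g. the command Tom runs later in this thread:
```
$ juju model-config logging-config='<root>=INFO;juju.worker.uniter=DEBUG'
```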

There have been a couple of fixes landed to better handle orphaned actions, i.e. actions which the unit might be executing but whose record no longer exists in state. Juju 3.1.6 should have the earlier fixes. There's a recent one:

https://github.com/juju/juju/pull/16922

But that handled the case where an action had already been started, not pending. It might be that the fix is also relevant to this case. But we'd need to first see via the info above what actions are queued to run for the unit and whether the actions are stored in the db.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Thanks for the update. Can you provide a bit more information on how to retrieve the following:

- the content of the unitstates collection relevant to the affected unit
- the content of the action and actionnotifications collections relevant to the pending task
- the content of the operations collection relevant to the pending operation

Revision history for this message
Joseph Phillips (manadart) wrote :

If you connect to Mongo...

unitstates - ID has the model UUID and unit ID, for example:

db.unitstates.find({"_id" : "bdd12a44-2fdb-4596-87e5-a2fe851f36de:u#ubuntu/0#charm"}).pretty()

actionnotifications is similar, but the suffix is _a_<task ID>:

db.actionnotifications.find({"_id" : "bdd12a44-2fdb-4596-87e5-a2fe851f36de:mongodb/0_a_4"}).pretty()

operations is just with the operation ID:

db.operations.find({"_id" : "bdd12a44-2fdb-4596-87e5-a2fe851f36de:3"}).pretty()

Task and operation ID are returned when you run an action:

Running operation 7 with 1 task
  - task 8 on unit-mongodb-0
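
For anyone following along, a rough sketch of one way to get that Mongo shell on a 3.x machine controller. This assumes the juju-db snap and the usual agent.conf layout; flags and paths vary by controller version (and a k8s controller needs a kubectl exec into the controller pod instead), so treat it as approximate:
```
$ juju ssh -m controller 0
$ sudo -s
# credentials live in the controller agent's config:
#   user     -> the "tag" value (e.g. machine-0)
#   password -> the "statepassword" value
$ grep -E '^(tag|statepassword):' /var/lib/juju/agents/machine-*/agent.conf
$ juju-db.mongo --ssl --sslAllowInvalidCertificates \
    --authenticationDatabase admin -u <tag> -p <statepassword> 127.0.0.1:37017/juju
```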

Revision history for this message
Tom Haddon (mthaddon) wrote :

Thanks. I'm not sure if this includes sensitive information so I've used a canonical pastebin in case it does: https://pastebin.canonical.com/p/wY5435q25F/

Revision history for this message
Tom Haddon (mthaddon) wrote :

I've also run `juju model-config logging-config='<root>=INFO;juju.worker.uniter=DEBUG'` on the model in question, but all I see (for now) in the logs is:

```
unit-httprequest-lego-provider-0: 16:12:14 DEBUG juju.worker.uniter.remotestate got update status interval change for httprequest-lego-provider/0: ok=true
unit-nginx-ingress-integrator-0: 16:12:14 DEBUG juju.worker.uniter.remotestate got update status interval change for nginx-ingress-integrator/0: ok=true
```

Juju operations still reports the task as pending:

```
$ juju operations
ID  Status   Started  Finished  Task IDs  Summary
 7  pending                     8         list-domains run on unit-httprequest-lego-provider-0
```

Revision history for this message
Tom Haddon (mthaddon) wrote :

Ok, a bit more output now:

```
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.remotestate update status timer triggered for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter no operations in progress; waiting for changes
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/0" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/1" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/2" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "nginx-ingress-integrator/0" already joined relation 3
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "prometheus-scrape/0" already joined relation 4
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "grafana/0" already joined relation 5
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "loki/0" already joined relation 6
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation running operation run update-status hook for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation acquiring machine lock for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation preparing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation executing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.runner starting jujuc server {unix @/var/lib/juju/agents/unit-httprequest-lego-provider-0/agent.socket <nil>}
unit-httprequest-lego-provider-0: 16:17:52 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation committing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation lock released for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter no operations in progress; waiting for changes
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "grafana/0" already joined relation 5
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "loki/0" already joined relation 6
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/0" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/1" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/2" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "nginx-ingress-integrator/0" already joi...


Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the pastebin. We seem to be missing the content of the actions collection though. We have actionnotifications but not actions.

Also, the operations collection query is on id "8" but this is the id of the action. The operation id is "7".

The logging output doesn't have much. What would be useful, if possible, would be to keep the log level at DEBUG and restart the affected unit agent. This will trigger the pending action number 8 to be notified to the unit agent and we can see what happens as it is processed.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Ah, sorry for getting the wrong ID for the operations. Here's the correct one https://pastebin.canonical.com/p/nGPgWffj4T/. Can you let me know the format for the actions collection? I tried `db.actions.find({"_id" : "${model-uuid}:${unit-and-id}_a_8"}).pretty()` but that didn't seem to work. Not quite sure how to restart the unit agent on k8s as we don't seem to have `juju_` bash functions. Will ask in the Juju channel on Matrix.

Revision history for this message
Joseph Phillips (manadart) wrote (last edit ):

Actions key is model:task, for example:

{"_id" : "bdd12a44-2fdb-4596-87e5-a2fe851f36de:6"}

For K8s units I think you want to kill the entry process to bounce it. Something like:

kubectl exec POD_NAME -c CONTAINER_NAME -- kill -9 1

Revision history for this message
Tom Haddon (mthaddon) wrote :

Here's the actions collection output https://pastebin.canonical.com/p/WNYhcm2Hn3/

Revision history for this message
Tom Haddon (mthaddon) wrote :

I've tried the `kill -9 1` and then had this in the logs:

```
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.remotestate update status timer triggered for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter no operations in progress; waiting for changes
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "prometheus-scrape/0" already joined relation 4
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "grafana/0" already joined relation 5
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "loki/0" already joined relation 6
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "postgresql/0" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "postgresql/1" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "postgresql/2" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "nginx-ingress-integrator/0" already joined relation 3
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.operation running operation run update-status hook for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.operation acquiring machine lock for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.operation preparing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.operation executing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.runner starting jujuc server {unix @/var/lib/juju/agents/unit-httprequest-lego-provider-0/agent.socket <nil>}
unit-httprequest-lego-provider-0: 10:41:04 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.operation committing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.operation lock released for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter no operations in progress; waiting for changes
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "loki/0" already joined relation 6
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "postgresql/0" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "postgresql/1" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "postgresql/2" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "nginx-ingress-integrator/0" already joined relation 3
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relati...


Revision history for this message
Joseph Phillips (manadart) wrote :

This seems like the action claims to have "successfully" spawned the task, but the task has gone missing as far as the agent is concerned.

The start-up logs have nothing about it, because the agent doesn't think there is something to be done.

This looks similar to the issue that occurred when the charm was updating. I'll need to do some archaeology for it.

Revision history for this message
Tom Haddon (mthaddon) wrote (last edit ):

I've just tried to run another action on a different charm in this model to see what happens, in case that's helpful:

```
prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$ juju run nginx-ingress-integrator/0 get-certificate
Running operation 13 with 1 task
  - task 14 on unit-nginx-ingress-integrator-0

Waiting for task 14...
ERROR the following task failed:
 - id "14" with return code 1

use 'juju show-task' to inspect the failure

prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$ juju show-task 14
prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$
```

The failure is expected because I didn't pass a parameter that it needs (we need to update the charm to make that a required parameter).
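
As an aside, a sketch of how a parameter can be marked required in the action definition; the action name is the one above, but the parameter name is a stand-in since the real one isn't shown in this thread:
```
$ cat actions.yaml
get-certificate:
  description: Return the certificate for a given hostname.
  params:
    hostname:      # stand-in parameter name
      type: string
  required: [hostname]
```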

Revision history for this message
Joseph Phillips (manadart) wrote :

The above does not appear to be correct. The spawned task count does not appear to be factored by the uniter itself.

Revision history for this message
Joseph Phillips (manadart) wrote :

Can you run "juju cancel-task" on it?

Revision history for this message
Tom Haddon (mthaddon) wrote :

```
prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$ juju cancel-task --verbose 8
actions:
- action: list-domains
  completed at: "2024-03-13 17:12:42"
  id: "8"
  status: cancelled
  unit: httprequest-lego-provider/0
prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$ juju show-task 8
Task 8 cancelled

```

Revision history for this message
Ian Booth (wallyworld) wrote :

Are you 100% sure comment #16 captures ALL of the logs after an agent restart?

e.g. I deployed postgresql and stopped the unit agent, then ran an action, which stays pending since the agent isn't running to process it. I then restarted the agent; you can see there should be an initial message about starting the workers and then an initial message from each of the watchers. Note the actions watcher, which reports on the pending action; that's what we're looking for here. The entry in the actionnotifications collection should trigger an event from the actions watcher on agent startup, and we need to see what happens next.

We could set "juju.worker.uniter=TRACE" as well just to get everything.

unit-postgresql-0: 11:08:07 INFO juju Starting unit workers for "postgresql/0"
unit-postgresql-0: 11:08:07 INFO juju.agent.setup setting logging config to "juju.worker.uniter=DEBUG"
unit-postgresql-0: 11:08:07 WARNING juju.agent.setup developer feature flags enabled: "actions-v2", "developer-mode", "image-metadata", "secrets", "secrets-stores"
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter.relation unit "postgresql/0" (leader=true) entered scope for relation "postgresql:database-peers"
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter.relation unit "postgresql/0" (leader=true) entered scope for relation "postgresql:restart"
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter.relation unit "postgresql/0" (leader=true) entered scope for relation "postgresql:upgrade"
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter starting local juju-exec listener on {unix /var/lib/juju/agents/unit-postgresql-0/run.socket <nil>}
unit-postgresql-0: 11:08:07 INFO juju.worker.uniter unit "postgresql/0" started
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter juju-exec listener running
unit-postgresql-0: 11:08:07 INFO juju.worker.uniter hooks are retried true
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate starting secrets rotation watcher
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate starting secret revisions expiry watcher
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate starting obsolete secret revisions watcher (leader=true)
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.secretsrotate got rotate secret changes: []watcher.SecretTriggerChange{}
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got config change for postgresql/0: ok=true, hashes=[426e5e77af179901fdcde601e0edf8d10a6586e6918ccfcbd7baca0361b786f7]
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got unit change for postgresql/0
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.secretrevisionsexpire got revision expiry secret changes: []watcher.SecretTriggerChange{}
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got upgrade series change
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate no upgrade series in progress, reinitializing local upgrade series state
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got leader settings change for postgresql/0: ok=true
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got application change for postgr...


Revision history for this message
Tom Haddon (mthaddon) wrote :
Revision history for this message
Tom Haddon (mthaddon) wrote :

I don't see "Starting unit workers" in the logs at all. Can you confirm if the way to restart a unit agent on k8s is `kill -9 1`? Having sourced `/etc/profile.d/juju-introspection.sh` I tried the following but it doesn't seem to be available on k8s:
```
# juju_stop_unit httprequest-lego-provider/0 && juju_start_unit httprequest-lego-provider/0
error: stop units for {[httprequest-lego-provider/0]} not supported
error: start units for {[httprequest-lego-provider/0]} not supported
```
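
For what it's worth, two approaches that should bounce a sidecar unit agent on k8s; the container name and PID here are assumptions, so check `kubectl describe pod` for the actual values:
```
# delete the pod and let the StatefulSet recreate it
$ kubectl -n <model-namespace> delete pod httprequest-lego-provider-0
# or signal the agent process inside the charm container
$ kubectl -n <model-namespace> exec httprequest-lego-provider-0 -c charm -- kill -9 <agent-pid>
```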

Revision history for this message
Tom Haddon (mthaddon) wrote :

Having restarted the container agent (restarting the pod via `kill -9 15`) I see the following in the logs: https://pastebin.canonical.com/p/3yC7Hwhsx9/. The task is still saying it's "cancelled", fwiw. Let me know if I should retry another action.

Revision history for this message
Tom Haddon (mthaddon) wrote :

After some discussions on Matrix I tried running the action again. It is also in "pending" state again, and all I could see in the logs was this:
```
unit-httprequest-lego-provider-0: 10:51:45 DEBUG juju.worker.uniter.remotestate got action change for httprequest-lego-provider/0: [16] ok=true
unit-httprequest-lego-provider-0: 10:51:45 DEBUG juju.worker.uniter no operations in progress; waiting for changes
```

Revision history for this message
Ian Booth (wallyworld) wrote :

For k8s charms, when the unit agent first starts up, actions are blocked from running until Juju records that the default container is initialised and running. The running status is updated when the agent starts, and thereafter when a container status change in the cluster is detected. The latter would emit a log line:

got running status change ...

I don't see any of these. So we must be polling the cluster on agent start and not getting any subsequent changes. And the initial poll at start up must not be returning that the pod container is running. To diagnose further, can we please get:

- the recorded "provider id" value on httprequest-lego-provider/0, available by running juju status --format yaml (if it's not there, it will be on the unit doc in mongo)
- model logging config updated to also include "juju.worker.caasoperator=DEBUG"
- kubectl get -o yaml pods/httprequest-lego-provider-0
- kubectl describe pods/httprequest-lego-provider-0

An inspection using kubectl might even show that there's an issue with the containers in the pod and they are not running, hence the actions being blocked.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Sure, thx.

- the recorded "provider id" value on httprequest-lego-provider/0, available by running juju status --format yaml (if it's not there, it will be on the unit doc in mongo)

The unit has `provider-id: httprequest-lego-provider-0` but the application has `provider-id: 4d8dbb8e-41f2-483c-b31d-fe09ac238084`.

- model logging config updated to also include "juju.worker.caasoperator=DEBUG"

This has been done

- kubectl get -o yaml pods/httprequest-lego-provider-0
- kubectl describe pods/httprequest-lego-provider-0

Here's the info for the two items above https://pastebin.canonical.com/p/6vtBBj8YKX/. And here's `kubectl get pods` for the namespace:

```
$ kubectl get pods
NAME                             READY   STATUS    RESTARTS      AGE
httprequest-lego-provider-0      2/2     Running   1 (22h ago)   7d23h
modeloperator-5cf8598576-d6k4r   1/1     Running   0             34d
nginx-ingress-integrator-0       1/1     Running   1 (18d ago)   27d
```

The restart on httprequest-lego-provider-0 is from when I ran `kill -9 15` as part of troubleshooting above.

Revision history for this message
Ian Booth (wallyworld) wrote :

Can we get the new logs with the extra logging from another container restart?
The k8s info shows the containers are running. It's currently a mystery why actions which are being notified to the unit agent are not being acted on. I assume it's just this one unit that is affected, which makes it more weird.

Revision history for this message
Tom Haddon (mthaddon) wrote :
Revision history for this message
Ian Booth (wallyworld) wrote :

I realise after looking into the latest info that the blocked-action processing is not relevant for sidecar charms. It was intended for the case where an old-style charm acted as an operator and created the workload in a separate pod.

I added some extra local debugging and can see that "ActionsBlocked" for sidecar charms should always be false; notification of pending actions should always be acted on, and yet the logs show an action notification that is ignored.

Given hooks appear to be running, there's nothing obvious to explain why actions are being ignored.
There's also not enough logging in the right places to shed more light on it. And I can't reproduce locally with the snappass-test charm. I tried emulating what happened by first forcing a hook to fail and then refreshing the charm. Even with the failed hook, an action could still be run, as expected.

To make this easier to debug, more logging is needed, but that doesn't help right now with the immediate problem.

Revision history for this message
Ian Booth (wallyworld) wrote :

I also deployed httprequest-lego-provider locally on microk8s. Since I didn't configure it, it complained about missing config - git repo, git ssh keys - but I could still run the list-domains action. I supplied bogus values for the missing config, resulting in an error running the config-changed hook. Again, I could still run the list-domains action. So even under adverse circumstances with a charm in error, actions still get run.

I wonder if it's possible in your deployment scenario to delete the httprequest-lego-provider app and, once cleaned up, deploy it again.

Revision history for this message
Ian Booth (wallyworld) wrote :

It won't help right now, but I've added some extra logging for next time.

https://github.com/juju/juju/pull/17051

Revision history for this message
Tom Haddon (mthaddon) wrote :

We'll give that a try (removing the application and redeploying). I'm out of the office until Wednesday, but will report back then.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Ok, I tried redeploying the application (I did a `juju remove-application httprequest-lego-provider`, waited until that had completed, and confirmed there were no k8s resources associated with that application in the k8s namespace for the model). Unfortunately actions are still stuck. If I run `juju operations` I see previous operations I've tried to execute on the model, including those from the previous iteration of the httprequest-lego-provider charm. I assume that's expected, but just mentioning it in case.

I also tried running an action while running `juju debug-code` on the application, but that didn't seem to work (it didn't intercept the charm code at all). I also tried adding some logging to the charm to notify when various parts of the code are triggered, but from doing so it seems like when running the action the charm's __init__ method isn't even called (which possibly explains why `juju debug-code` doesn't intercept it).

I'm kind of out of ideas here, apart from potentially trying to deploy it under a different name. I'd really appreciate a live debugging session with someone who understands the internals of this.

Revision history for this message
Ian Booth (wallyworld) wrote :

debug-code won't help with actions - the agent just execs the relevant "binary" without going through the hook dispatch workflow.

Without extra debug logging, which isn't in the codebase, it's hard to figure out what's going on. From memory, we saw that the unit agent noticed an action is pending. Can we confirm that the agent is idle and nothing is holding the "machine" execution lock? Presumably hooks continue to run, including update-status every 5 minutes. If that's the case, it's a mystery why actions are stalled. It may have been mentioned - is it just this charm/app or do actions not work on any app in the model?
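
A couple of hedged ways to check the lock from inside the unit; the introspection helpers may not all be available on k8s (as seen above), and the lock log path can differ between machines and k8s:
```
$ source /etc/profile.d/juju-introspection.sh
$ juju_machine_lock                          # current lock holder and waiters, if the helper is supported
$ tail -n 20 /var/log/juju/machine-lock.log  # lock history on machines; the k8s path is discussed below
```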

Revision history for this message
Joseph Phillips (manadart) wrote :

Good point on the lock. Is there a machine-lock.log on K8s? It's in /var/log/juju/machine-lock.log on machines.

Revision history for this message
Tom Haddon (mthaddon) wrote :

The unit is idle when looking at juju status:

```
Unit                          Workload  Agent  Address         Ports  Message
httprequest-lego-provider/0*  active    idle   192.168.101.81
```

And it looks like the machine lock confirms nothing's holding it:
```
2024-03-21 09:35:27 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 0s
2024-03-21 09:37:38 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
2024-03-21 09:42:08 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
2024-03-21 09:44:56 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 1s
2024-03-21 09:48:05 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 1s
2024-03-21 09:49:03 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 1s
2024-03-21 09:53:47 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 0s
2024-03-21 09:53:53 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
2024-03-21 09:58:10 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 0s
2024-03-21 09:59:37 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
2024-03-21 10:04:39 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
```
It does look like update-status (and other hooks) are running fine.

I have tested running an action on another app in this model (nginx-ingress-integrator) and that worked fine.

Revision history for this message
Tom Haddon (mthaddon) wrote :

(The machine lock seems to be /var/log/machine-lock.log on k8s fwiw)

Revision history for this message
John A Meinel (jameinel) wrote :

Summary from today's investigation.

We checked that the actions were still listed as "pending" but nothing was running. We *were* able to run "juju cancel-task XX" and have it move from "pending" to "cancelled".

We also saw that "juju exec httprequest-lego-provider/0 -- hostname" would not run, and neither would "juju exec --execution-group=test -- hostname".
Both of them show up in `juju operations`, but only as "pending".

We also tested and saw that "juju exec nginx-ingress-integrator/0 -- hostname" worked correctly, showed up in `juju operations` and returned the output to the juju CLI correctly. (so we know exec as actions works, just not for this unit agent)

We then dug into `juju debug-log` a bit, and turned the logging for `juju.worker.uniter` and `juju.worker.caasoperator` up to TRACE.

We didn't get a chance to try restarting just the unit agent, and seeing if there were any obvious failures in the log.

We did inspect /var/log/juju/machine-lock.log and could see that it has content for hook executions (the update-status hook) but does not appear to ever hold anything for actions. (We did not see any record of anything for nginx-ingress-integrator, which did have successful hook runs.)

We did see one thing that was a red herring: a debug log entry saying "juju-exec listener stopping". That listener is the local "as a separate process, I want to be able to run a command inside a charm context" mechanism.

We confirmed that on the local machine `/usr/bin/juju-exec httprequest-lego-provider/0 relation-ids` returned correct information. (it doesn't use Actions to run the command, but at least the unit agent is receiving requests and responding)

We sourced /etc/profile.d/juju_introspection.sh and ran `juju_engine_report`:
https://pastebin.canonical.com/p/BGXRjzYvNF/

Everything seems reasonable. All the start-counts are 1, and the things that are stopped look like they should be stopped:
```
  signal-handler:
    error: '"dead-flag" not set: dependency not available'
    inputs:
    - dead-flag
    state: stopped
  uniter:
...
    report:
      local-state:
        installed: true
        leader: true
        operation-kind: continue
        operation-step: pending
        removed: false
        started: true
        stopped: false
  caas-zombie-prober:
    error: '"dead-flag" not set: dependency not available'
    inputs:
    - probe-http-server
    - dead-flag
    state: stopped
  upgrade-steps-runner:
    inputs:
    - agent
    - api-caller
    - upgrade-steps-gate
    - not-dead-flag
    start-count: 1
    state: stopped
```

Revision history for this message
John A Meinel (jameinel) wrote :

Next steps are to dig through the db dumps from Tom, and see if we can find the missing link between "we clearly see that there is an action that gets queued, but the unit agent doesn't actually see it and try to run it".
We also need to dig through the 3.1.6 code base to see what log messages might be useful. (The changes from Ian landed in 3.3, so it will be a while before we can use them to debug this.)

The other hypothesis is that, because httprequest-lego-provider/0 is the same unit after it was removed and redeployed (a K8s vs machines difference), there is some cruft that we failed to clean up on app removal and that still remains when the new application is deployed.

Revision history for this message
Tom Haddon (mthaddon) wrote :

One thing to mention on the hypothesis that httprequest-lego-provider/0 is the same unit after it was removed and redeployed: we only redeployed because we were seeing the error reported in this bug. In other words, we were seeing this problem before the redeployment as well. We had some errors in the application when we initially deployed it (per the initial bug report), and we tried to run actions while it was in that state and they were blocked, but they remained blocked after that error was fixed, so we're not sure if that was related in any way.

There was no change from k8s to machines. I tried to confirm there were no k8s resources related to this application after the application removal (kubectl get all | grep httprequest-lego-provider) before redeploying.

John A Meinel (jameinel)
Changed in juju:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Caner Derici (cderici) wrote :

I noticed in the engine report (the one that John shared, https://pastebin.canonical.com/p/BGXRjzYvNF/) that the is-peer field for one of the relations for the unit is true. So I'm wondering if we have a stale peer relation for that unit that hasn't been cleaned up during the redeploy. Maybe that's something we could take a look at in the db during the meeting.

```
      relations:
        "7":
          application-members:
            httprequest-lego-provider: 1
          dying: false
          endpoint: secret-storage
          is-peer: true
          members: {}
          relation: httprequest-lego-provider:secret-storage
```

https://github.com/juju/juju/blob/de682d4a80a1d0c104b2cfd31e58e149aaede9a1/state/application.go#L1387 should've cleaned it up.

Revision history for this message
Nicolas Vinuesa (nvinuesa) wrote :

Update from the debug session (25/03):

- Actions collection for the model:
https://pastebin.canonical.com/p/XxMRqC6VnV/

- unitstates collection on the httprequest-lego-provider unit:
https://pastebin.canonical.com/p/MVtsmRTcWt/

- unitstates collection on the nginx-ingress-integrator unit:
https://pastebin.canonical.com/p/j5TtfzJ8nM/

- $ juju operations:
https://pastebin.canonical.com/p/WdF7pwjVfp/

Revision history for this message
John A Meinel (jameinel) wrote :

Logs with "<root>=DEBUG;unit=DEBUG;juju.worker.uniter=TRACE;juju.worker.caas?=TRACE"
https://pastebin.canonical.com/p/WZdbkSS2PF/

One very interesting piece is that we do see:
unit-httprequest-lego-provider-0: 16:25:14 DEBUG juju.worker.uniter.remotestate got action change for httprequest-lego-provider/0: [4 6 24] ok=true
unit-httprequest-lego-provider-0: 16:35:48 DEBUG juju.worker.uniter.remotestate got action change for httprequest-lego-provider/0: [32] ok=true

I'm guessing the numbers (4 6 24) and (32) are the action task IDs that it detected. But we are seeing that there is an action change (we saw that log message show up after trying to run `juju exec ... -- hostname`).

We see a lot of DEBUG messages about the relation states, and what has already been seen (no relation-created to run, no -joined to run)

As Ian noted in his PR against 3.3, we also see
https://github.com/juju/juju/blob/f6a66aa91eec620f5ac04a19d8c06bef03ae6228/worker/uniter/actions/resolver.go#L59

There are a couple of conditions, `remoteState.ActionsBlocked` and `localState.OutdatedRemoteCharm`, which could cause us to not run any actions (but which, until his PR, would also not generate any logging to tell us that is happening).

The first condition might be set if:
        // ActionsBlocked is true on CAAS when actions cannot be run due to
        // pod initialization.
        ActionsBlocked bool

The only places that seem to set it are:
worker/uniter/remotestate/watcher.go
183: ActionsBlocked: config.ContainerRunningStatusChannel != nil,
1325: w.current.ActionsBlocked = !runningStatus.Running

The ContainerRunningStatusChannel seems to be about whether this is running locally or remotely. E.g. validate considers it an error if:
        if w.ModelType == model.CAAS && !w.Sidecar {
                if w.ContainerRunningStatusChannel != nil &&
                        w.ContainerRunningStatusFunc == nil {
                        return errors.NotValidf("watcher config for CAAS model with nil container running status func")
                }
        }

However, IIRC httprequest-lego-provider *is* a sidecar charm. *And* this only triggers if you had the channel but no valid status func. So there is very little that says anything about when it is OK for it to be nil (just one particular condition saying that if it isn't nil you can't have a nil status func).

And caasoperator sets it when:

                                if op.deploymentMode != caas.ModeOperator {
                                        params.IsRemoteUnit = true
                                        params.ContainerRunningStatusChannel = unitRunningChannels[unitID]

That also doesn't do any checks that unitID is in the map, so it might be just returning nil there.

Revision history for this message
Nicolas Vinuesa (nvinuesa) wrote :

To complete John's remarks on the retrieved logs, here are other takeaways from today's debug session.

As Caner correctly pointed out, the django charm had the `peer` metadata set to false in one of the logs we were verifying, and indeed the value should have been true, as we verified on the charm itself. This might not have a direct impact on the bug we are seeing, but it's something to follow up on.

John also noted that we have a code path in the actionsResolver that will return an `ErrNoOperation` in some cases:

https://github.com/juju/juju/blob/3.1/worker/uniter/actions/resolver.go#L56-L71

We should log this better; there are 3 possible return cases in that snippet.

We continued testing exec actions (`hostname`) against both units (`httprequest-lego-provider` and `nginx-ingress-integrator`) to double check, and they stayed pending on httprequest-lego-provider until a timeout of 5 minutes.

On top of the db queries and logs that I posted in https://bugs.launchpad.net/juju/+bug/2056193/comments/44, we also retrieved a full log of the agent and controller for a detailed inspection.

The next steps are to analyze everything we were able to retrieve and thereby unblock Tom for further debugging.

Changed in juju:
status: Triaged → New