action stuck in pending

Bug #2056193 reported by Tom Haddon
This bug affects 1 person
Affects         Status  Importance  Assigned to  Milestone
Canonical Juju  New     High        Unassigned

Bug Description

I've got a k8s charm running on juju 3.1.6, and ran an action against it over an hour ago. The action simply runs a django management command which performs a read-only query against a PostgreSQL database; when testing locally it returns within a few seconds. For some reason on this production instance the action is stuck in pending:

```
$ juju show-task 2
Task 2 pending
```

I can't see any process in the charm or workload container related to this.

The charm itself has errors in the debug-log related to being unable to connect to external resources to download the promtail binary for the loki_push_api charm library:
```
unit-httprequest-lego-provider-0: 14:38:24 ERROR unit.httprequest-lego-provider/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/usr/lib/python3.10/http/client.py", line 1448, in connect
    super().connect()
  File "/usr/lib/python3.10/http/client.py", line 942, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.10/socket.py", line 845, in create_connection
    raise err
  File "/usr/lib/python3.10/socket.py", line 833, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/./src/charm.py", line 98, in <module>
    ops.main.main(DjangoCharm)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/main.py", line 456, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/framework.py", line 865, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/venv/ops/framework.py", line 955, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/lib/charms/loki_k8s/v0/loki_push_api.py", line 1821, in _on_pebble_ready
    self._setup_promtail()
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/lib/charms/loki_k8s/v0/loki_push_api.py", line 2318, in _setup_promtail
    self._obtain_promtail(promtail_binaries[self._arch])
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/lib/charms/loki_k8s/v0/loki_push_api.py", line 1991, in _obtain_promtail
    self._download_and_push_promtail_to_workload(promtail_info)
  File "/var/lib/juju/agents/unit-httprequest-lego-provider-0/charm/lib/charms/loki_k8s/v0/loki_push_api.py", line 2132, in _download_and_push_promtail_to_workload
    with opener.open(promtail_info["url"]) as r:
  File "/usr/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
unit-httprequest-lego-provider-0: 14:38:24 ERROR juju.worker.uniter.operation hook "django-app-pebble-ready" (via hook dispatching script: dispatch) failed: exit status 1
```
This is being fixed separately (by adding support for proxies in this situation). Is it possible that this is blocking the action(s) from running? I don't see anything related to actions in `juju debug-log --no-tail --replay`.

Please let me know how to debug this further.

Tags: canonical-is
Tom Haddon (mthaddon)
tags: added: canonical-is
Revision history for this message
Tom Haddon (mthaddon) wrote :

We've now done a charm upgrade to fix the error in django-app-pebble-ready, so the charm in question is no longer showing any errors in juju debug-log, but rerunning the action leaves it still stuck in pending.

Revision history for this message
Joseph Phillips (manadart) wrote :

Not quite clear on the "fixed and now does not show errors" part.

Does every run create a new pending action, or is it just that one action?

It is a charm-defined action, correct? Not doing some out-of-band trick with exec?

Changed in juju:
status: New → Incomplete
Revision history for this message
Tom Haddon (mthaddon) wrote :

I mean that the django-app-pebble-ready error included when the bug was originally filed has been fixed, but we're still seeing problems with any new action we run getting stuck. These are charm-defined actions.

Tom Haddon (mthaddon)
Changed in juju:
status: Incomplete → New
Revision history for this message
John A Meinel (jameinel) wrote :

I did mention trying out:
```
juju exec --execution-group=abcfd -- command
```

That lets you use a different, named execution group, but I don't think we expose --execution-group on the CLI.
According to Ian, as a charmer you can put the execution-group into the action definition (as part of actions.yaml), but that wouldn't be a workaround here.
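
For reference, a rough sketch of what that might look like in an action definition. The `execution-group` and `parallel` field names are my reading of the Juju 3.x actions schema, and the group name is made up, so verify against the current docs before relying on it:
```
$ cat actions.yaml
list-domains:
  description: List the domains for a given user.
  params:
    username:
      type: string
  execution-group: maintenance   # hypothetical group name
  parallel: false
```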

Revision history for this message
Tom Haddon (mthaddon) wrote :

A further update here, since I just learned of the `juju operations` command:

```
$ juju operations
ID  Status   Started  Finished  Task IDs  Summary
 7  pending                     8         list-domains run on unit-httprequest-lego-provider-0
$ juju show-operation 7
summary: list-domains run on unit-httprequest-lego-provider-0
status: pending
action:
  name: list-domains
  parameters:
    username: prod-is-cos-ps6
timing:
  enqueued: 2024-03-07 08:47:26 +0000 UTC
tasks:
  "8":
    host: httprequest-lego-provider/0
    status: pending
    timing:
      enqueued: 2024-03-07 08:47:26 +0000 UTC
$ juju show-task 8
Task 8 pending

$ date
Fri Mar 8 12:31:38 UTC 2024
```

Revision history for this message
Ian Booth (wallyworld) wrote :

We'll need to see some additional info to help diagnose the issue:
- the content of the unitstates collection relevant to the affected unit
- the content of the action and actionnotifications collections relevant to the pending task
- the content of the operations collection relevant to the pending operation
- logging for the affected unit agent with "juju.worker.uniter=DEBUG"
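
For the logging item, one way to get it is model-level config, e.g. the command Tom runs later in this thread:
```
$ juju model-config logging-config='<root>=INFO;juju.worker.uniter=DEBUG'
```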

There have been a couple of fixes landed to better handle orphaned actions, i.e. actions which the unit might be executing but whose record no longer exists in state. Juju 3.1.6 should have the earlier fixes. There's a recent one:

https://github.com/juju/juju/pull/16922

But that handled the case where an action had already been started, not pending. It might be that the fix is also relevant to this case. But we'd need to first see via the info above what actions are queued to run for the unit and whether the actions are stored in the db.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Thanks for the update. Can you provide a bit more information on how to retrieve the following:

- the content of the unitstates collection relevant to the affected unit
- the content of the action and actionnotifications collections relevant to the pending task
- the content of the operations collection relevant to the pending operation

Revision history for this message
Joseph Phillips (manadart) wrote :

If you connect to Mongo...

unitstates - ID has the model UUID and unit ID, for example:

db.unitstates.find({"_id" : "bdd12a44-2fdb-4596-87e5-a2fe851f36de:u#ubuntu/0#charm"}).pretty()

actionnotifications is similar, but the suffix is _a_<task ID>:

db.actionnotifications.find({"_id" : "bdd12a44-2fdb-4596-87e5-a2fe851f36de:mongodb/0_a_4"}).pretty()

operations is just with the operation ID:

db.operations.find({"_id" : "bdd12a44-2fdb-4596-87e5-a2fe851f36de:3"}).pretty()

Task and operation ID are returned when you run an action:

Running operation 7 with 1 task
  - task 8 on unit-mongodb-0
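
For anyone following along, a rough sketch of one way to get that Mongo shell on a 3.x machine controller. This assumes the juju-db snap and the usual agent.conf layout; flags and paths vary by controller version (and a k8s controller needs a kubectl exec into the controller pod instead), so treat it as approximate:
```
$ juju ssh -m controller 0
$ sudo -s
# credentials live in the controller agent's config:
#   user     -> the "tag" value (e.g. machine-0)
#   password -> the "statepassword" value
$ grep -E '^(tag|statepassword):' /var/lib/juju/agents/machine-*/agent.conf
$ juju-db.mongo --ssl --sslAllowInvalidCertificates \
    --authenticationDatabase admin -u <tag> -p <statepassword> 127.0.0.1:37017/juju
```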

Revision history for this message
Tom Haddon (mthaddon) wrote :

Thanks. I'm not sure if this includes sensitive information so I've used a canonical pastebin in case it does: https://pastebin.canonical.com/p/wY5435q25F/

Revision history for this message
Tom Haddon (mthaddon) wrote :

I've also run `juju model-config logging-config='<root>=INFO;juju.worker.uniter=DEBUG'` on the model in question, but all I see (for now) in the logs is:

```
unit-httprequest-lego-provider-0: 16:12:14 DEBUG juju.worker.uniter.remotestate got update status interval change for httprequest-lego-provider/0: ok=true
unit-nginx-ingress-integrator-0: 16:12:14 DEBUG juju.worker.uniter.remotestate got update status interval change for nginx-ingress-integrator/0: ok=true
```

Juju operations still reports the task as pending:

```
$ juju operations
ID  Status   Started  Finished  Task IDs  Summary
 7  pending                     8         list-domains run on unit-httprequest-lego-provider-0
```

Revision history for this message
Tom Haddon (mthaddon) wrote :

Ok, a bit more output now:

```
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.remotestate update status timer triggered for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter no operations in progress; waiting for changes
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/0" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/1" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/2" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "nginx-ingress-integrator/0" already joined relation 3
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "prometheus-scrape/0" already joined relation 4
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "grafana/0" already joined relation 5
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "loki/0" already joined relation 6
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation running operation run update-status hook for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation acquiring machine lock for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation preparing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation executing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.runner starting jujuc server {unix @/var/lib/juju/agents/unit-httprequest-lego-provider-0/agent.socket <nil>}
unit-httprequest-lego-provider-0: 16:17:52 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation committing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.operation lock released for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter no operations in progress; waiting for changes
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "grafana/0" already joined relation 5
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "loki/0" already joined relation 6
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/0" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/1" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "postgresql/2" already joined relation 2
unit-httprequest-lego-provider-0: 16:17:52 DEBUG juju.worker.uniter.relation unit "nginx-ingress-integrator/0" already joi...


Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the pastebin. We seem to be missing the content of the actions collection though. We have actionnotifications but not actions.

Also, the operations collection query is on id "8" but this is the id of the action. The operation id is "7".

The logging output doesn't have much. What would be useful, if possible, would be to keep the log level at DEBUG and restart the affected unit agent. This will trigger the pending action number 8 to be notified to the unit agent and we can see what happens as it is processed.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Ah, sorry for getting the wrong ID for the operations. Here's the correct one https://pastebin.canonical.com/p/nGPgWffj4T/. Can you let me know the format for the actions collection? I tried `db.actions.find({"_id" : "${model-uuid}:${unit-and-id}_a_8"}).pretty()` but that didn't seem to work. Not quite sure how to restart the unit agent on k8s as we don't seem to have `juju_` bash functions. Will ask in the Juju channel on Matrix.

Revision history for this message
Joseph Phillips (manadart) wrote (last edit ):

Actions key is model:task, for example:

{"_id" : "bdd12a44-2fdb-4596-87e5-a2fe851f36de:6"}

For K8s units I think you want to kill the entry process to bounce it. Something like:

kubectl exec POD_NAME -c CONTAINER_NAME -- kill -9 1

Revision history for this message
Tom Haddon (mthaddon) wrote :

Here's the actions collection output https://pastebin.canonical.com/p/WNYhcm2Hn3/

Revision history for this message
Tom Haddon (mthaddon) wrote :

I've tried the `kill -9 1` and then had this in the logs:

```
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.remotestate update status timer triggered for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter no operations in progress; waiting for changes
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "prometheus-scrape/0" already joined relation 4
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "grafana/0" already joined relation 5
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "loki/0" already joined relation 6
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "postgresql/0" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "postgresql/1" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "postgresql/2" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.relation unit "nginx-ingress-integrator/0" already joined relation 3
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.operation running operation run update-status hook for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.operation acquiring machine lock for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.operation preparing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.operation executing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:03 DEBUG juju.worker.uniter.runner starting jujuc server {unix @/var/lib/juju/agents/unit-httprequest-lego-provider-0/agent.socket <nil>}
unit-httprequest-lego-provider-0: 10:41:04 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.operation committing operation "run update-status hook" for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.operation lock released for httprequest-lego-provider/0
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter no operations in progress; waiting for changes
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "loki/0" already joined relation 6
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "postgresql/0" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "postgresql/1" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "postgresql/2" already joined relation 2
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relation unit "nginx-ingress-integrator/0" already joined relation 3
unit-httprequest-lego-provider-0: 10:41:04 DEBUG juju.worker.uniter.relati...


Revision history for this message
Joseph Phillips (manadart) wrote :

This seems like the action claims to have "successfully" spawned the task, but the task has gone missing as far as the agent is concerned.

The start-up logs have nothing about it, because the agent doesn't think there is something to be done.

This looks similar to the issue that occurred when the charm was updating. I'll need to do some archaeology for it.

Revision history for this message
Tom Haddon (mthaddon) wrote (last edit ):

I've just tried to run another action on a different charm in this model to see what happens, in case that's helpful:

```
prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$ juju run nginx-ingress-integrator/0 get-certificate
Running operation 13 with 1 task
  - task 14 on unit-nginx-ingress-integrator-0

Waiting for task 14...
ERROR the following task failed:
 - id "14" with return code 1

use 'juju show-task' to inspect the failure

prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$ juju show-task 14
prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$
```

The failure is expected because I didn't pass a parameter that it needs (we need to update the charm to make that a required parameter).
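
As an aside, a sketch of how a parameter can be marked required in the action definition; the action name is the one above, but the parameter name is a stand-in since the real one isn't shown in this thread:
```
$ cat actions.yaml
get-certificate:
  description: Return the certificate for a given hostname.
  params:
    hostname:      # stand-in parameter name
      type: string
  required: [hostname]
```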

Revision history for this message
Joseph Phillips (manadart) wrote :

The above does not appear to be correct. The spawned task count does not appear to be factored by the uniter itself.

Revision history for this message
Joseph Phillips (manadart) wrote :

Can you run "juju cancel-task" on it?

Revision history for this message
Tom Haddon (mthaddon) wrote :

```
prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$ juju cancel-task --verbose 8
actions:
- action: list-domains
  completed at: "2024-03-13 17:12:42"
  id: "8"
  status: cancelled
  unit: httprequest-lego-provider/0
prod-httprequest-lego-provider-k8s@is-bastion-ps6:~$ juju show-task 8
Task 8 cancelled

```

Revision history for this message
Ian Booth (wallyworld) wrote :

Are you 100% sure comment #16 captures ALL of the logs after an agent restart?

e.g. I deployed postgresql and stopped the unit agent, then ran an action, which stays pending since the agent isn't running to process it. I then restarted the agent; you can see there should be an initial message about starting the workers and then an initial message from each of the watchers. Note the actions watcher, which reports on the pending action; that's what we're looking for here. The entry in the actionnotifications collection should trigger an event from the actions watcher on agent startup, and we need to see what happens next.

We could set "juju.worker.uniter=TRACE" as well just to get everything.

unit-postgresql-0: 11:08:07 INFO juju Starting unit workers for "postgresql/0"
unit-postgresql-0: 11:08:07 INFO juju.agent.setup setting logging config to "juju.worker.uniter=DEBUG"
unit-postgresql-0: 11:08:07 WARNING juju.agent.setup developer feature flags enabled: "actions-v2", "developer-mode", "image-metadata", "secrets", "secrets-stores"
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter.relation unit "postgresql/0" (leader=true) entered scope for relation "postgresql:database-peers"
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter.relation unit "postgresql/0" (leader=true) entered scope for relation "postgresql:restart"
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter.relation unit "postgresql/0" (leader=true) entered scope for relation "postgresql:upgrade"
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter starting local juju-exec listener on {unix /var/lib/juju/agents/unit-postgresql-0/run.socket <nil>}
unit-postgresql-0: 11:08:07 INFO juju.worker.uniter unit "postgresql/0" started
unit-postgresql-0: 11:08:07 DEBUG juju.worker.uniter juju-exec listener running
unit-postgresql-0: 11:08:07 INFO juju.worker.uniter hooks are retried true
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate starting secrets rotation watcher
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate starting secret revisions expiry watcher
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate starting obsolete secret revisions watcher (leader=true)
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.secretsrotate got rotate secret changes: []watcher.SecretTriggerChange{}
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got config change for postgresql/0: ok=true, hashes=[426e5e77af179901fdcde601e0edf8d10a6586e6918ccfcbd7baca0361b786f7]
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got unit change for postgresql/0
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.secretrevisionsexpire got revision expiry secret changes: []watcher.SecretTriggerChange{}
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got upgrade series change
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate no upgrade series in progress, reinitializing local upgrade series state
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got leader settings change for postgresql/0: ok=true
unit-postgresql-0: 11:08:08 DEBUG juju.worker.uniter.remotestate got application change for postgr...


Revision history for this message
Tom Haddon (mthaddon) wrote :
Revision history for this message
Tom Haddon (mthaddon) wrote :

I don't see "Starting unit workers" in the logs at all. Can you confirm if the way to restart a unit agent on k8s is `kill -9 1`? Having sourced `/etc/profile.d/juju-introspection.sh` I tried the following but it doesn't seem to be available on k8s:
```
# juju_stop_unit httprequest-lego-provider/0 && juju_start_unit httprequest-lego-provider/0
error: stop units for {[httprequest-lego-provider/0]} not supported
error: start units for {[httprequest-lego-provider/0]} not supported
```
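
For what it's worth, two approaches that should bounce a sidecar unit agent on k8s; the container name and PID here are assumptions, so check `kubectl describe pod` for the actual values:
```
# delete the pod and let the StatefulSet recreate it
$ kubectl -n <model-namespace> delete pod httprequest-lego-provider-0
# or signal the agent process inside the charm container
$ kubectl -n <model-namespace> exec httprequest-lego-provider-0 -c charm -- kill -9 <agent-pid>
```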

Revision history for this message
Tom Haddon (mthaddon) wrote :

Having restarted the container agent (restarting the pod via `kill -9 15`) I see the following in the logs: https://pastebin.canonical.com/p/3yC7Hwhsx9/. The task is still saying it's "cancelled", fwiw. Let me know if I should retry another action.

Revision history for this message
Tom Haddon (mthaddon) wrote :

After some discussions on Matrix I tried running the action again. It is also in "pending" state again, and all I could see in the logs was this:
```
unit-httprequest-lego-provider-0: 10:51:45 DEBUG juju.worker.uniter.remotestate got action change for httprequest-lego-provider/0: [16] ok=true
unit-httprequest-lego-provider-0: 10:51:45 DEBUG juju.worker.uniter no operations in progress; waiting for changes
```

Revision history for this message
Ian Booth (wallyworld) wrote :

For k8s charms, when the unit agent first starts up, actions are blocked from running until Juju records that the default container is initialised and running. The running status is updated when the agent starts, and thereafter when a container status change in the cluster is detected. The latter would emit a log line:

got running status change ...

I don't see any of these. So we must be polling the cluster on agent start and not getting any subsequent changes. And the initial poll at start up must not be returning that the pod container is running. To diagnose further, can we please get:

- the recorded "provider id" value on httprequest-lego-provider/0, available by running juju status --format yaml (if it's not there, it will be on the unit doc in mongo)
- model logging config updated to also include "juju.worker.caasoperator=DEBUG"
- kubectl get -o yaml pods/httprequest-lego-provider-0
- kubectl describe pods/httprequest-lego-provider-0

An inspection using kubectl might even show that there's an issue with the containers in the pod and they are not running, hence the actions being blocked.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Sure, thx.

- the recorded "provider id" value on httprequest-lego-provider/0, available by running juju status --format yaml (if it's not there, it will be on the unit doc in mongo)

The unit has `provider-id: httprequest-lego-provider-0` but the application has `provider-id: 4d8dbb8e-41f2-483c-b31d-fe09ac238084`.

- model logging config updated to also include "juju.worker.caasoperator=DEBUG"

This has been done

- kubectl get -o yaml pods/httprequest-lego-provider-0
- kubectl describe pods/httprequest-lego-provider-0

Here's the info for the two items above https://pastebin.canonical.com/p/6vtBBj8YKX/. And here's `kubectl get pods` for the namespace:

```
$ kubectl get pods
NAME                             READY   STATUS    RESTARTS      AGE
httprequest-lego-provider-0      2/2     Running   1 (22h ago)   7d23h
modeloperator-5cf8598576-d6k4r   1/1     Running   0             34d
nginx-ingress-integrator-0       1/1     Running   1 (18d ago)   27d
```

The restart on httprequest-lego-provider-0 is from when I ran `kill -9 15` as part of troubleshooting above.

Revision history for this message
Ian Booth (wallyworld) wrote :

Can we get the new logs with the extra logging from another container restart?
The k8s info shows the containers are running. It's currently a mystery why actions which are being notified to the unit agent are not being acted on. I assume it's just this one unit that is affected, which makes it more weird.

Revision history for this message
Tom Haddon (mthaddon) wrote :
Revision history for this message
Ian Booth (wallyworld) wrote :

I realise after looking into the latest info that the blocked-action processing is not relevant for sidecar charms. It was intended for the case where an old-style charm acted as an operator and created the workload in a separate pod.

I added some extra local debugging and can see that "ActionsBlocked" for sidecar charms should always be false; notification of pending actions should always be acted on, and yet the logs show an action notification that is ignored.

Given hooks appear to be running, there's nothing obvious to explain why actions are being ignored.
There's also not enough logging in the right places to shed more light on it. And I can't reproduce locally with the snappass-test charm. I tried emulating what happened by first forcing a hook to fail and then refreshing the charm. Even with the failed hook, an action could still be run, as expected.

To make this easier to debug, more logging is needed, but that doesn't help right now with the immediate problem.

Revision history for this message
Ian Booth (wallyworld) wrote :

I also deployed httprequest-lego-provider locally on microk8s. Since I didn't configure it, it complained about missing config - git repo, git ssh keys - but I could still run the list-domains action. I supplied bogus values for the missing config, resulting in an error running the config-changed hook. Again, I could still run the list-domains action. So even under adverse circumstances with a charm in error, actions still get run.

I wonder if it's possible in your deployment scenario to delete the httprequest-lego-provider app and, once cleaned up, deploy it again.

Revision history for this message
Ian Booth (wallyworld) wrote :

It won't help right now, but I've added some extra logging for next time.

https://github.com/juju/juju/pull/17051

Revision history for this message
Tom Haddon (mthaddon) wrote :

We'll give that a try (removing the application and redeploying). I'm out of the office until Wednesday, but will report back then.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Ok, I tried redeploying the application (I did a `juju remove-application httprequest-lego-provider`, waited until that had completed, and confirmed there were no k8s resources associated with that application in the k8s namespace for the model). Unfortunately actions are still stuck. If I run `juju operations` I see previous operations I've tried to execute on the model, including those from the previous iteration of the httprequest-lego-provider charm. I assume that's expected, but just mentioning it in case.

I also tried running an action while running `juju debug-code` on the application, but that didn't seem to work (it didn't intercept the charm code at all). I also tried adding some logging to the charm to notify when various parts of the code are triggered, but from doing so it seems like when running the action the charm's __init__ method isn't even called (which possibly explains why `juju debug-code` doesn't intercept it).

I'm kind of out of ideas here, apart from potentially trying to deploy it under a different name. I'd really appreciate a live debugging session with someone who understands the internals of this.

Revision history for this message
Ian Booth (wallyworld) wrote :

debug-code won't help with actions - the agent just execs the relevant "binary" without going through the hook dispatch workflow.

Without extra debug logging, which isn't in the codebase, it's hard to figure out what's going on. From memory, we saw that the unit agent noticed an action is pending. Can we confirm that the agent is idle and nothing is holding the "machine" execution lock? Presumably hooks continue to run, including update-status every 5 minutes. If that's the case, it's a mystery why actions are stalled. It may have been mentioned - is it just this charm/app or do actions not work on any app in the model?
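
A couple of hedged ways to check the lock from inside the unit; the introspection helpers may not all be available on k8s (as seen above), and the lock log path can differ between machines and k8s:
```
$ source /etc/profile.d/juju-introspection.sh
$ juju_machine_lock                          # current lock holder and waiters, if the helper is supported
$ tail -n 20 /var/log/juju/machine-lock.log  # lock history on machines; the k8s path is discussed below
```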

Revision history for this message
Joseph Phillips (manadart) wrote :

Good point on the lock. Is there a machine-lock.log on K8s? It's in /var/log/juju/machine-lock.log on machines.

Revision history for this message
Tom Haddon (mthaddon) wrote :

The unit is idle when looking at juju status:

```
Unit                          Workload  Agent  Address         Ports  Message
httprequest-lego-provider/0*  active    idle   192.168.101.81
```

And it looks like the machine lock confirms nothing's holding it:
```
2024-03-21 09:35:27 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 0s
2024-03-21 09:37:38 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
2024-03-21 09:42:08 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
2024-03-21 09:44:56 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 1s
2024-03-21 09:48:05 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 1s
2024-03-21 09:49:03 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 1s
2024-03-21 09:53:47 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 0s
2024-03-21 09:53:53 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
2024-03-21 09:58:10 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run relation-changed (12; app: postgresql) hook), waited 0s, held 0s
2024-03-21 09:59:37 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
2024-03-21 10:04:39 unit-httprequest-lego-provider-0: httprequest-lego-provider/0 uniter (run update-status hook), waited 0s, held 0s
```
It does look like update-status (and other hooks) are running fine.

I have tested running an action on another app in this model (nginx-ingress-integrator) and that worked fine.

Revision history for this message
Tom Haddon (mthaddon) wrote :

(The machine lock seems to be /var/log/machine-lock.log on k8s fwiw)

Revision history for this message
John A Meinel (jameinel) wrote :

Summary from today's investigation.

We checked that the actions were still listed as "pending" but nothing was running. We *were* able to run "juju cancel-task XX" and have it move from "pending" to "cancelled".

We also saw that "juju exec httprequest-lego-provider/0 -- hostname" would not run, and neither would "juju exec --execution-group=test -- hostname".
Both of them show up in `juju operations`, but only as "pending".

We also tested and saw that "juju exec nginx-ingress-integrator/0 -- hostname" worked correctly, showed up in `juju operations` and returned the output to the juju CLI correctly. (so we know exec as actions works, just not for this unit agent)

We then dug into `juju debug-log` a bit, and turned the logging for `juju.worker.uniter` and `juju.worker.caasoperator` up to TRACE.

We didn't get a chance to try restarting just the unit agent, and seeing if there were any obvious failures in the log.

We did inspect /var/log/juju/machine-lock.log and could see that it has content for hook executions (the update-status hook) but does not appear to ever hold anything for actions. (We did not see any record of anything for nginx-ingress-integrator, which did have successful hook runs.)

We did see one thing that was a red herring: a debug log entry saying "juju-exec listener stopping". That listener is the local "as a separate process, I want to be able to run a command inside a charm context" mechanism.

We confirmed that on the local machine `/usr/bin/juju-exec httprequest-lego-provider/0 relation-ids` returned correct information. (it doesn't use Actions to run the command, but at least the unit agent is receiving requests and responding)

We sourced /etc/profile.d/juju_introspection.sh and ran `juju_engine_report`:
https://pastebin.canonical.com/p/BGXRjzYvNF/

Everything seems reasonable. All the start-counts are 1, and the things that are stopped look like they should be stopped:
```
  signal-handler:
    error: '"dead-flag" not set: dependency not available'
    inputs:
    - dead-flag
    state: stopped
  uniter:
...
    report:
      local-state:
        installed: true
        leader: true
        operation-kind: continue
        operation-step: pending
        removed: false
        started: true
        stopped: false
  caas-zombie-prober:
    error: '"dead-flag" not set: dependency not available'
    inputs:
    - probe-http-server
    - dead-flag
    state: stopped
  upgrade-steps-runner:
    inputs:
    - agent
    - api-caller
    - upgrade-steps-gate
    - not-dead-flag
    start-count: 1
    state: stopped
```

Revision history for this message
John A Meinel (jameinel) wrote :

Next steps are to dig through the db dumps from Tom, and see if we can find the missing link between "we clearly see that there is an action that gets queued, but the unit agent doesn't actually see it and try to run it".
We also need to dig through the 3.1.6 code base to see what log messages might be useful. (The changes from Ian landed in 3.3, so it will be a while before we can use them to debug this.)

The other hypothesis is that, because httprequest-lego-provider/0 is the same unit after it was removed and redeployed (a K8s vs machines difference), there is some cruft that we failed to clean up on app removal and that still remains when the new application is deployed.

Revision history for this message
Tom Haddon (mthaddon) wrote :

One thing to mention on the hypothesis that httprequest-lego-provider/0 is the same unit after it was removed and redeployed: we only redeployed because we were seeing the error reported in this bug. In other words, we were seeing this problem before the redeployment as well. We had some errors in the application when we initially deployed it (per the initial bug report), and we tried to run actions while it was in that state and they were blocked, but they remained blocked after that error was fixed, so we're not sure if that was related in any way.

There was no change from k8s to machines. I tried to confirm there were no k8s resources related to this application after the application removal (kubectl get all | grep httprequest-lego-provider) before redeploying.

John A Meinel (jameinel)
Changed in juju:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Caner Derici (cderici) wrote :

I noticed in the engine report (the one that John shared, https://pastebin.canonical.com/p/BGXRjzYvNF/) that the is-peer field for one of the relations for the unit is true. So I'm wondering if we have a stale peer relation for that unit that hasn't been cleaned up during the redeploy. Maybe that's something we could take a look at in the db during the meeting.

```
      relations:
        "7":
          application-members:
            httprequest-lego-provider: 1
          dying: false
          endpoint: secret-storage
          is-peer: true
          members: {}
          relation: httprequest-lego-provider:secret-storage
```

https://github.com/juju/juju/blob/de682d4a80a1d0c104b2cfd31e58e149aaede9a1/state/application.go#L1387 should've cleaned it up.

Revision history for this message
Nicolas Vinuesa (nvinuesa) wrote :

Update from the debug session (25/03):

- Actions collection for the model:
https://pastebin.canonical.com/p/XxMRqC6VnV/

- unitstates collection on the httprequest-lego-provider unit:
https://pastebin.canonical.com/p/MVtsmRTcWt/

- unitstates collection on the nginx-ingress-integrator unit:
https://pastebin.canonical.com/p/j5TtfzJ8nM/

- $ juju operations:
https://pastebin.canonical.com/p/WdF7pwjVfp/

Revision history for this message
John A Meinel (jameinel) wrote :

Logs with "<root>=DEBUG;unit=DEBUG;juju.worker.uniter=TRACE;juju.worker.caas?=TRACE"
https://pastebin.canonical.com/p/WZdbkSS2PF/

One very interesting piece is that we do see:
unit-httprequest-lego-provider-0: 16:25:14 DEBUG juju.worker.uniter.remotestate got action change for httprequest-lego-provider/0: [4 6 24] ok=true
unit-httprequest-lego-provider-0: 16:35:48 DEBUG juju.worker.uniter.remotestate got action change for httprequest-lego-provider/0: [32] ok=true

I'm guessing the numbers (4 6 24) and (32) are the action task IDs that it detected. But we are seeing that there is an action change (we saw that log message show up after trying to run `juju exec ... -- hostname`).

We see a lot of DEBUG messages about the relation states, and what has already been seen (no relation-created to run, no -joined to run)

As Ian noted in his PR against 3.3, we also see
https://github.com/juju/juju/blob/f6a66aa91eec620f5ac04a19d8c06bef03ae6228/worker/uniter/actions/resolver.go#L59

There are a couple of conditions, `remoteState.ActionsBlocked` and `localState.OutdatedRemoteCharm`, which could cause us to not run any actions (but which, until his PR, would also not generate any logging to tell us that is happening).

The first condition might be set if:
        // ActionsBlocked is true on CAAS when actions cannot be run due to
        // pod initialization.
        ActionsBlocked bool

The only places that seem to set it are:
worker/uniter/remotestate/watcher.go
183: ActionsBlocked: config.ContainerRunningStatusChannel != nil,
1325: w.current.ActionsBlocked = !runningStatus.Running

The ContainerRunningStatusChannel seems to be about whether this is running locally or remotely. E.g. validate considers it an error if:
        if w.ModelType == model.CAAS && !w.Sidecar {
                if w.ContainerRunningStatusChannel != nil &&
                        w.ContainerRunningStatusFunc == nil {
                        return errors.NotValidf("watcher config for CAAS model with nil container running status func")
                }
        }

However, IIRC httprequest-lego-provider *is* a sidecar charm. *And* this only triggers if you had the channel but no valid status func. So there is very little that says anything about when it is OK for it to be nil (just one particular condition saying that if it isn't nil you can't have a nil status func).

And caasoperator sets it when:

                                if op.deploymentMode != caas.ModeOperator {
                                        params.IsRemoteUnit = true
                                        params.ContainerRunningStatusChannel = unitRunningChannels[unitID]

That also doesn't do any checks that unitID is in the map, so it might be just returning nil there.

Revision history for this message
Nicolas Vinuesa (nvinuesa) wrote :

To complete John's remarks on the retrieved logs, here are other takeaways from today's debug session.

As Caner correctly pointed out, the django charm had the `peer` metadata set to false in one of the logs we were verifying, and indeed the value should have been true, as we verified on the charm itself. This might not have a direct impact on the bug we are seeing, but it's something to follow up on.

John also noted that we have a code path in the actionsResolver that will return an `ErrNoOperation` in some cases:

https://github.com/juju/juju/blob/3.1/worker/uniter/actions/resolver.go#L56-L71

We should log this better; there are 3 possible return cases in that snippet.

We continued testing exec actions (`hostname`) against both units (`httprequest-lego-provider` and `nginx-ingress-integrator`) to double check, and they stayed pending on httprequest-lego-provider until a timeout of 5 minutes.

On top of the db queries and logs that I posted in https://bugs.launchpad.net/juju/+bug/2056193/comments/44, we also retrieved a full log of the agent and controller for a detailed inspection.

The next steps are to analyze everything we were able to retrieve and thereby unblock Tom for further debugging.

Changed in juju:
status: Triaged → New