[3.0-rc2 / 2.9.36] juju-lint stuck executing

Bug #1993334 reported by Bas de Bruijne
Affects          Status        Importance  Assigned to       Milestone
Canonical Juju   Invalid       Critical    Heather Lanigan
charm-juju-lint  Fix Released  Medium      Gabriel Cocenza

Bug Description

Testrun https://solutions.qa.canonical.com/testruns/testRun/b08a074d-3587-43f9-8e7e-fe70ac4f09ac fails because juju-lint is stuck executing:

```
App Version Status Scale Charm Channel Rev Exposed Message
apache2 active 1 apache2 stable 38 yes Unit is ready
canonical-livepatch active 9 canonical-livepatch stable 48 no Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
elasticsearch 6.8.23 active 3 elasticsearch candidate 66 no Unit is ready
filebeat 6.8.23 active 9 filebeat stable 38 no Filebeat ready.
grafana active 1 grafana stable 59 yes Ready
graylog 3.3.16-1 active 1 graylog stable 59 no Ready with: filebeat, elasticsearch, mongodb
juju-lint active 1 juju-lint stable 6 no Unit is ready
mongodb-graylog 3.6.8 active 1 mongodb stable 75 no Unit is ready
ntp 3.5 active 9 ntp stable 50 no chrony: Ready
prometheus active 1 prometheus2 stable 33 no Ready
telegraf active 9 telegraf stable 62 no Monitoring elasticsearch/1 (source version/commit d208a64)

Unit Workload Agent Machine Public address Ports Message
apache2/0* active idle 0 54.224.142.195 Unit is ready
  canonical-livepatch/1 active idle 54.224.142.195 Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/1 active idle 54.224.142.195 Filebeat ready.
  ntp/1 active idle 54.224.142.195 123/udp chrony: Ready
  telegraf/1 active idle 54.224.142.195 9103/tcp Monitoring apache2/0 (source version/commit d208a64)
elasticsearch/0 active idle 1 54.198.106.29 9200/tcp Unit is ready
  canonical-livepatch/2 active idle 54.198.106.29 Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/2 active idle 54.198.106.29 Filebeat ready.
  ntp/2 active idle 54.198.106.29 123/udp chrony: Ready
  telegraf/2 active idle 54.198.106.29 9103/tcp Monitoring elasticsearch/0 (source version/commit d208a64)
elasticsearch/1* active idle 2 52.90.226.82 9200/tcp Unit is ready
  canonical-livepatch/0* active idle 52.90.226.82 Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/0* active idle 52.90.226.82 Filebeat ready.
  ntp/0* active idle 52.90.226.82 123/udp chrony: Ready
  telegraf/0* active idle 52.90.226.82 9103/tcp Monitoring elasticsearch/1 (source version/commit d208a64)
elasticsearch/2 active idle 3 54.81.113.57 9200/tcp Unit is ready
  canonical-livepatch/3 active idle 54.81.113.57 Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/3 active idle 54.81.113.57 Filebeat ready.
  ntp/3 active idle 54.81.113.57 123/udp chrony: Ready
  telegraf/3 active idle 54.81.113.57 9103/tcp Monitoring elasticsearch/2 (source version/commit d208a64)
grafana/0* active idle 4 54.173.91.75 3000/tcp Ready
  canonical-livepatch/7 active idle 54.173.91.75 Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/7 active idle 54.173.91.75 Filebeat ready.
  ntp/7 active idle 54.173.91.75 123/udp chrony: Ready
  telegraf/7 active idle 54.173.91.75 9103/tcp Monitoring grafana/0 (source version/commit d208a64)
graylog/0* active idle 5 3.91.13.20 5044,9000/tcp Ready with: filebeat, elasticsearch, mongodb
  canonical-livepatch/6 active idle 3.91.13.20 Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/6 active idle 3.91.13.20 Filebeat ready.
  ntp/5 active idle 3.91.13.20 123/udp chrony: Ready
  telegraf/5 active idle 3.91.13.20 9103/tcp Monitoring graylog/0 (source version/commit d208a64)
juju-lint/0* active executing 6 54.146.51.220 (config-changed) Unit is ready
  canonical-livepatch/4 active idle 54.146.51.220 Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/4 active idle 54.146.51.220 Filebeat ready.
  ntp/4 active idle 54.146.51.220 123/udp chrony: Ready
  telegraf/4 active idle 54.146.51.220 9103/tcp Monitoring juju-lint/0 (source version/commit d208a64)
mongodb-graylog/0* active idle 7 3.87.157.136 27017,27019,27021,28017/tcp Unit is ready
  canonical-livepatch/5 active idle 3.87.157.136 Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/5 active idle 3.87.157.136 Filebeat ready.
  ntp/6 active idle 3.87.157.136 123/udp chrony: Ready
  telegraf/6 active idle 3.87.157.136 9103/tcp Monitoring mongodb-graylog/0 (source version/commit d208a64)
prometheus/0* active idle 8 54.92.140.189 9090,12321/tcp Ready
  canonical-livepatch/8 active idle 54.92.140.189 Running kernel 5.15.0-1021.25~20.04.1-aws, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/8 active idle 54.92.140.189 Filebeat ready.
  ntp/8 active idle 54.92.140.189 123/udp chrony: Ready
  telegraf/8 active idle 54.92.140.189 9103/tcp Monitoring prometheus/0 (source version/commit d208a64)

Machine State Address Inst id Base AZ Message
0 started 54.224.142.195 i-02d1a96e1005b4092 ubuntu:20.04 us-east-1c running
1 started 54.198.106.29 i-03f17eab6061298a7 ubuntu:20.04 us-east-1c running
2 started 52.90.226.82 i-0df579fab5115a6a5 ubuntu:20.04 us-east-1c running
3 started 54.81.113.57 i-0d7b15c74add6832b ubuntu:20.04 us-east-1c running
4 started 54.173.91.75 i-005b792bd9f7b2c7f ubuntu:20.04 us-east-1c running
5 started 3.91.13.20 i-0f87ddc05aa214c5e ubuntu:20.04 us-east-1c running
6 started 54.146.51.220 i-077aabcb44180d678 ubuntu:20.04 us-east-1c running
7 started 3.87.157.136 i-004e8efaf4cfcb371 ubuntu:20.04 us-east-1c running
8 started 54.92.140.189 i-0a716c52d306ee8c0 ubuntu:20.04 us-east-1c running
```

We saw this in earlier testruns of 3.0 as well, also on the juju-lint charm. This could be a bug in juju-lint, but I'm opening it here first.

In the juju-lint debug log we only see:
```
2022-10-18 15:46:26 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 15:50:31 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 15:55:24 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 15:59:31 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:03:47 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:07:57 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:13:17 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:19:15 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:24:04 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:29:13 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:33:39 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:37:42 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:43:18 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:47:52 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:51:57 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
2022-10-18 16:56:37 DEBUG juju.worker.uniter.remotestate watcher.go:801 update status timer triggered for juju-lint/0
```

So it looks like the `executing` status is incorrect.

Logs and configs can be found here:
https://oil-jenkins.canonical.com/artifacts/b08a074d-3587-43f9-8e7e-fe70ac4f09ac/index.html

Tags: cdo-qa

Changed in juju:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 3.0-rc3
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

From the logs, it looks like the charm is stuck in the config-changed hook:

At the start of the hook execution:
2022-10-18 13:58:37 DEBUG juju.machinelock machinelock.go:176 machine lock "machine-lock" acquired for juju-lint/0 uniter (run config-changed hook)
2022-10-18 13:58:37 DEBUG juju.worker.uniter.operation executor.go:135 preparing operation "run config-changed hook" for juju-lint/0
2022-10-18 13:58:37 DEBUG juju.worker.uniter.operation executor.go:135 executing operation "run config-changed hook" for juju-lint/0

Normally, when the hook finishes, we see:
2022-10-18 13:58:37 DEBUG juju.worker.uniter.remotestate watcher.go:650 got config change for juju-lint/0: ok=true, hashes=[b9faa1f948a0e1b6cbb819711cc8f94a4c0fc21c0dd531b97610cce62609fde9]
2022-10-18 13:58:37 INFO juju.worker.uniter.operation runhook.go:173 ran "config-changed" hook (via hook dispatching script: dispatch)
2022-10-18 13:58:37 DEBUG juju.worker.uniter.operation executor.go:135 committing operation "run config-changed hook" for juju-lint/0
2022-10-18 13:58:37 DEBUG juju.machinelock machinelock.go:206 created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
2022-10-18 13:58:37 DEBUG juju.machinelock machinelock.go:190 machine lock "machine-lock" released for juju-lint/0 uniter (run config-changed hook)
2022-10-18 13:58:37 DEBUG juju.worker.uniter.operation executor.go:124 lock released for juju-lint/0

However, for this config-changed hook we never see the second half; the machine lock is still held.

Changed in juju:
status: Triaged → In Progress
Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

The juju-lint charm failing to return from a config-changed (or install) hook can be reproduced with:

* Bootstrap lxd with juju 3.0-rc2
* juju add-model one
* juju deploy juju-lint (rev 6 installed)
* configure the application from `juju show-controller --show-password --format=json | jq` results (a rough sketch of this step follows below)
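
The "configure the application" step above boils down to pulling the controller API endpoint, CA certificate and credentials out of the `juju show-controller` JSON and handing them to the charm. A minimal sketch of that extraction, with the JSON field names assumed from typical juju output rather than taken from this test run, and the charm option names omitted:

```
# Sketch only: extract controller connection details from
# `juju show-controller --show-password --format=json`.
# The field names ("details", "api-endpoints", "ca-cert", "account") are
# assumptions; verify them against your own controller's output.
import json
import subprocess

raw = subprocess.run(
    ["juju", "show-controller", "--show-password", "--format=json"],
    check=True, capture_output=True, text=True,
).stdout

controllers = json.loads(raw)
name, data = next(iter(controllers.items()))  # first (only) controller

endpoint = data["details"]["api-endpoints"][0]
ca_cert = data["details"]["ca-cert"]
user = data["account"]["user"]
password = data["account"]["password"]

print(f"controller={name} endpoint={endpoint} user={user}")
# These are the values juju-lint needs to talk to the controller; the exact
# charm config option names are left out here.
```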

You can verify that the machine lock is held (this is an internal Juju lock, taken while a hook is being run by a charm and released when the charm returns from the hook):
$ juju_machine_lock
machine-0:
  holder: none
unit-juju-lint-0:
  holder: juju-lint/0 uniter (run install hook), holding 2m24s

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Passing to the juju-lint charm for resolution

Changed in juju:
status: In Progress → Invalid
Revision history for this message
Gabriel Cocenza (gabrielcocenza) wrote :

After investigation, the problem is that with a controller running 3.0-rc2, python-libjuju (3.0.2) is not able to properly connect to a model, and this hangs the hook that is executing the script.

Steps to reproduce:

* Bootstrap lxd with juju 3.0-rc2
* juju add-model one
* pip3 install juju
* python3 -m asyncio
  from juju.model import Model
  model = Model()
  await model.connect_current()

This is the log it produces: https://pastebin.canonical.com/p/tTVKVcsCfx/

juju-lint can't reach a healthy state if python-libjuju doesn't connect to a model.
We can enhance the code with a timeout and better logging so that this error shows up more easily, as sketched below.
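
A minimal sketch of that enhancement, not the charm's actual code: bound the libjuju connection with a timeout and log the failure, so a controller/libjuju mismatch surfaces as an error instead of a hook that never returns. The 60-second timeout is an arbitrary choice.

```
import asyncio
import logging

from juju.model import Model

CONNECT_TIMEOUT = 60  # seconds; an arbitrary value, not from the charm

async def connect_with_timeout():
    logging.basicConfig(level=logging.INFO)
    model = Model()
    try:
        # connect() with no arguments targets the current model, like the
        # connect_current() call in the repro above.
        await asyncio.wait_for(model.connect(), timeout=CONNECT_TIMEOUT)
    except asyncio.TimeoutError:
        logging.error(
            "timed out connecting to the current model after %ss; "
            "check juju controller / python-libjuju version compatibility",
            CONNECT_TIMEOUT)
        raise
    try:
        logging.info("connected to the current model")
    finally:
        await model.disconnect()

if __name__ == "__main__":
    asyncio.run(connect_with_timeout())
```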

Revision history for this message
Bas de Bruijne (basdbruijne) wrote (last edit ):

This output looks very similar to LP: #1993533, which is using libjuju 2.*. It's good to know that this problem persists with libjuju 3.

@cderici Is this one of the known issues for libjuju 3.0.3?

Revision history for this message
Caner Derici (cderici) wrote :

We did have a couple of updates on libjuju related to this, including https://github.com/juju/python-libjuju/pull/745 that updates the facades/clients. Also I'm unable to reproduce the connection error mentioned above with the libjuju head, so I think the next release will fix all of this.

Can any of you confirm that things are working with the latest head of libjuju?

Changed in charm-juju-lint:
status: New → In Progress
assignee: nobody → Gabriel Angelo Sgarbi Cocenza (gabrielcocenza)
importance: Undecided → Medium
Revision history for this message
Gabriel Cocenza (gabrielcocenza) wrote :

@cderici I confirm that the head of libjuju is working as expected.

Thanks

Changed in juju:
milestone: 3.0-rc3 → 3.0.0
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

This seems to be a blocking issue for the juju 2.9.36 release as well.

Following the reproducer from #4, I can confirm that pylibjuju is working. Also, the test runs with juju 2.9.35 are working fine: https://solutions.qa.canonical.com/testruns/testRun/3ccba0e0-315f-4c53-bd37-b9198059b4a6. Is there something else that has changed that could influence juju-lint?

summary: - [3.0-rc2] juju-lint stuck executing
+ [3.0-rc2 / 2.9.36] juju-lint stuck executing
Revision history for this message
Gabriel Cocenza (gabrielcocenza) wrote :

We are facing the same issue as the one described in this python-libjuju bug[0].

The change proposed for the juju-lint charm will prevent the hook from being held, but the unit still can't reach a healthy state; a generic sketch of that kind of guard is shown below.

[0] https://github.com/juju/python-libjuju/issues/743
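
For context, the kind of charm-side guard being proposed could look roughly like the following. This is a generic sketch, not the actual juju-lint patch: the script path, timeout value and status messages are made up; only the `status-set` hook tool and the standard-library subprocess timeout behaviour are standard.

```
# Generic sketch (not the actual juju-lint change): run the lint step with a
# hard timeout so the hook always returns and the machine lock is released,
# reporting failures via the status-set hook tool instead of hanging.
import subprocess

LINT_CMD = ["/usr/local/bin/run-juju-lint"]  # hypothetical script path
TIMEOUT = 300                                # seconds; arbitrary choice

def set_status(state, message):
    # status-set is a standard Juju hook tool available inside hook context.
    subprocess.run(["status-set", state, message], check=False)

def run_lint():
    try:
        subprocess.run(LINT_CMD, check=True, timeout=TIMEOUT)
    except subprocess.TimeoutExpired:
        set_status("blocked",
                   "juju-lint timed out; check libjuju/controller compatibility")
    except subprocess.CalledProcessError as err:
        set_status("blocked", f"juju-lint failed with exit code {err.returncode}")
    else:
        set_status("active", "Unit is ready")

if __name__ == "__main__":
    run_lint()
```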

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

@gabrielcocenza did some investigation and found a reproducer for the problem: https://pastebin.canonical.com/p/C9YfB9DZTk/

I'm still a little confused about what is happening. When I deploy the lma charms with juju 2.9.35 there is no problem, but this bug shows up when I redeploy with juju 2.9.36. In both cases the pylibjuju version is 3.0.1.

Here is a crashdump from 2.9.35 (without this failure): https://oil-jenkins.canonical.com/artifacts/a619937a-e7b9-4957-bac9-6bc6940ca47e/generated/generated/lma-maas/juju-crashdump-lma-maas-2022-10-26-09.34.00.tar.gz
Here is a crashdump from 2.9.36 (with this failure): https://oil-jenkins.canonical.com/artifacts/7df9a1a2-cde9-48c1-94f7-3894c01d27bd/generated/generated/lma-maas/juju-crashdump-lma-maas-2022-10-26-06.57.34.tar.gz

@gabrielcocenza will try to pin the pylibjuju version for the juju-lint charm to a stable version. This may still be a release blocker though as, unlike with juju 3.0, I think we should expect full backwards compatibility with 2.9.35.

Changed in juju:
status: Invalid → New
Changed in juju:
status: New → Invalid
Revision history for this message
Juan M. Tirado (tiradojm) wrote :

Could you please try pylibjuju 3.0.4?

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

pylibjuju 3.0.4 solves the problem indeed, thanks!

I will ask bootstack to rebuild and upload the charm.

Revision history for this message
Gabriel Cocenza (gabrielcocenza) wrote :

@basdbruijne we have rebuilt the charm and it's now available on the edge channel with pylibjuju 3.0.4 for testing.

Revision history for this message
Eric Chen (eric-chen) wrote :

It's also available in the candidate channel.

Changed in charm-juju-lint:
status: In Progress → Fix Committed
tags: added: cdo-qa
Changed in charm-juju-lint:
milestone: none → 23.01
Changed in charm-juju-lint:
status: Fix Committed → Fix Released
Changed in charm-juju-lint:
status: Fix Released → Fix Committed
Changed in charm-juju-lint:
status: Fix Committed → Fix Released