[2.9] juju is slow processing relations

Bug #1931120 reported by Przemyslaw Hausman
This bug affects 1 person
Affects: Canonical Juju
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

juju 2.9.3

Juju 2.9 processes relations much slower than previous juju versions.

Example 1:

`juju add-relation nrpe-host-storage nagios` between 11 nrpe-host-storage units (nrpe charm) and nagios took ~23 minutes. I'd expect this to happen in a fraction of that time. `juju debug-log` output with timestamps is below.

unit-nagios-0: 12:11:04 INFO juju.worker.uniter.operation skipped "monitors-relation-created" hook (missing)
unit-nagios-0: 12:11:04 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:12:48 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:12:49 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:14:34 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:14:40 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:16:27 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:16:27 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:18:17 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:18:17 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:20:09 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:20:15 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:22:08 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:22:09 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:24:03 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:24:04 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:26:01 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:26:07 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:28:05 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:28:05 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:30:06 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:30:13 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:32:16 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:34:20 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:34:27 INFO juju.worker.uniter.operation skipped "update-status" hook (missing)
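
To make the pacing in the log above easier to see, here is a minimal sketch that measures the time between consecutive `ran "monitors-relation-changed"` entries. It assumes the `juju debug-log` line layout exactly as quoted above and a log that does not cross midnight. The names `LINE_RE`, `hook_gaps`, and `SAMPLE` are illustrative and not part of Juju. Note that the gaps are intervals between hook completions, so they include any time the unit spent waiting, not only hook execution time.

```python
import re
from datetime import datetime

# Matches lines such as:
# unit-nagios-0: 12:12:48 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (...)
LINE_RE = re.compile(
    r'^(?P<unit>\S+): (?P<time>\d{2}:\d{2}:\d{2}) \w+ \S+ '
    r'(?P<action>ran|skipped) "(?P<hook>[^"]+)" hook'
)


def hook_gaps(lines, hook="monitors-relation-changed"):
    """Return the timedeltas between consecutive 'ran' entries for `hook`."""
    times = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("action") == "ran" and m.group("hook") == hook:
            # Only a time of day is logged; assume all entries share one day.
            times.append(datetime.strptime(m.group("time"), "%H:%M:%S"))
    return [later - earlier for earlier, later in zip(times, times[1:])]


# The first few entries from the log in Example 1.
SAMPLE = '''\
unit-nagios-0: 12:11:04 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:12:48 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:12:49 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:14:34 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
unit-nagios-0: 12:14:40 INFO juju.worker.uniter.operation skipped "monitors-relation-joined" hook (missing)
unit-nagios-0: 12:16:27 INFO juju.worker.uniter.operation ran "monitors-relation-changed" hook (via explicit, bespoke hook script)
'''

gaps = hook_gaps(SAMPLE.splitlines())
print([int(g.total_seconds()) for g in gaps])  # [106, 113]
```

Run against the full log, the same function shows that each `monitors-relation-changed` run on unit-nagios-0 completes roughly two minutes after the previous one.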

Example 2:

`juju add-relation woodpecker prometheus` between 100 woodpecker units and prometheus took more than 24 hours. I don't have logs for this one, but I clearly remember that even after 24 hours prometheus still did not list all 100 targets.

Pen Gale (pengale) wrote :

I tried to do a fairly naïve reproduction of this on a localhost lxd cloud. The following commands did not reproduce the bug.

    juju deploy ubuntu
    juju add-unit -n 10 ubuntu
    juju deploy nrpe
    juju deploy nagios
    # wait for things to settle
    juju relate ubuntu nrpe && juju relate nrpe:monitors nagios:monitors

The last step above took 2 minutes.

What cloud is this running on? Are the machines in question heavily loaded with workload-related tasks? Do you have any advice on putting together a minimal repro case that differs from the naïve one above?

Przemyslaw Hausman (phausman) wrote :

Hi Pete, thanks for looking into this!

The cloud from Example 1 is deployed onto bare metal nodes. It is a disaggregated Ussuri-Focal deployment with separate compute, storage, control and network nodes. The LMA stack is deployed into KVMs on the MAAS (infrastructure) nodes. The Juju controller machines are also KVMs on MAAS nodes.

Example 2 was deployed on top of an OpenStack cloud. There was a single juju controller VM on top of OpenStack, plus the juju model VMs: 100x woodpecker, 1x prometheus, 1x grafana. The VMs used ephemeral storage backed by the hosts' local bcache devices.

John A Meinel (jameinel) wrote :

One thing to look at would be /var/log/juju/machine-lock.log

That will tell you if you have lots of agents running on the machine that are competing for runtime, and how long they are waiting for each other.

It is plausible that enough things are going on in the system that the unit agents aren't able to respond quickly.

    juju relate ubuntu nrpe && juju relate nrpe:monitors nagios:monitors

    The last step above took 2 minutes.

Does that mean the 'relate' command itself took 2 minutes, or that it returned quickly but the model took 2 minutes to settle?

Also note that relating can mean installing a charm (as with a subordinate such as nrpe), which may itself install packages, etc.

Changed in juju:
status: New → Triaged
importance: Undecided → High
status: Triaged → Incomplete
Przemyslaw Hausman (phausman) wrote :

I have just run the same scenario as in Example 2, on similar infrastructure, with juju 2.9.5, and I no longer see the problem.

Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired