<name>_relation_departed is sometimes called randomly or skipped

Bug #1945514 reported by Simon Déziel
36
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Expired
High
Unassigned

Bug Description

The problem is twofold:

1) randomly calling relation_departed:

When adding an additional unit, the existing unit had it's relation_departed hook triggered out of the blue, apparently when I added an extra unit. Here's what I've done:

# deploy initial unit (https-client/33):
$ juju deploy ./https-client_ubuntu-20.04-amd64.charm
Located local charm "https-client", revision 30
Deploying "https-client" from local charm "https-client", revision 30

# establish relation with lxd:
$ juju relate https-client lxd

# so far so good so let's add one more unit (https-client/34):
$ juju add-unit https-client

# check the juju logs:

unit-https-client-34: 16:31:28 INFO juju Starting unit workers for "https-client/34"
unit-https-client-34: 16:31:28 INFO juju.worker.apicaller [5c4e7f] "unit-https-client-34" successfully connected to "[2602:fc62:b:1018:1::a025]:17070"
unit-https-client-34: 16:31:28 INFO juju.worker.apicaller [5c4e7f] password changed for "unit-https-client-34"
unit-https-client-34: 16:31:28 INFO juju.worker.apicaller [5c4e7f] "unit-https-client-34" successfully connected to "172.17.18.217:17070"
unit-https-client-34: 16:31:28 INFO juju.worker.migrationminion migration phase is now: NONE
unit-https-client-34: 16:31:28 INFO juju.worker.logger logger worker started
unit-https-client-34: 16:31:28 INFO juju.worker.upgrader no waiter, upgrader is done
unit-https-client-33: 16:31:29 DEBUG unit.https-client/33.juju-log https:39: Operator Framework 1.2.0 up and running.
unit-https-client-33: 16:31:29 DEBUG unit.https-client/33.juju-log https:39: Legacy hooks/https-relation-departed does not exist.
unit-https-client-33: 16:31:29 DEBUG unit.https-client/33.juju-log https:39: Using local storage: /var/lib/juju/agents/unit-https-client-33/charm/.unit-state.db already exists
unit-https-client-33: 16:31:29 DEBUG unit.https-client/33.juju-log https:39: Emitting Juju event https_relation_departed.
unit-lxd-21: 16:31:29 DEBUG unit.lxd/21.juju-log https:39: Operator Framework 1.2.0 up and running.
unit-lxd-21: 16:31:29 DEBUG unit.lxd/21.juju-log https:39: Legacy hooks/https-relation-departed does not exist.
unit-lxd-21: 16:31:29 DEBUG unit.lxd/21.juju-log https:39: Using local storage: /var/lib/juju/agents/unit-lxd-21/charm/.unit-state.db already exists
unit-lxd-21: 16:31:29 DEBUG unit.lxd/21.juju-log https:39: Emitting Juju event https_relation_departed.
unit-lxd-21: 16:31:29 DEBUG unit.lxd/21.juju-log https:39: _on_https_relation_departed
unit-lxd-21: 16:31:29 INFO unit.lxd/21.juju-log https:39: Deleting juju-relation-test-https-client/33:autoremove from trusted list

In the above, we see https-client/33's departed hook was called for no reason.

2) not calling relation_departed:

# after the unexpected departure of https-client/33, I added another unit (https-client/35):
$ juju add-unit https-client

# this one behaved fine (joined and didn't cause any disruption)

# now to break the relation:
$ juju remove-relation https-client lxd

# check the juju logs:

unit-lxd-21: 16:35:36 DEBUG unit.lxd/21.juju-log Operator Framework 1.2.0 up and running.
unit-lxd-21: 16:35:36 DEBUG unit.lxd/21.juju-log Legacy hooks/update-status does not exist.
unit-lxd-21: 16:35:36 DEBUG unit.lxd/21.juju-log Using local storage: /var/lib/juju/agents/unit-lxd-21/charm/.unit-state.db already exists
unit-lxd-21: 16:35:36 DEBUG unit.lxd/21.juju-log Emitting Juju event update_status.
unit-https-client-34: 16:35:47 DEBUG unit.https-client/34.juju-log Operator Framework 1.2.0 up and running.
unit-https-client-34: 16:35:47 DEBUG unit.https-client/34.juju-log Legacy hooks/update-status does not exist.
unit-https-client-34: 16:35:47 DEBUG unit.https-client/34.juju-log Using local storage: /var/lib/juju/agents/unit-https-client-34/charm/.unit-state.db already exists
unit-https-client-34: 16:35:47 DEBUG unit.https-client/34.juju-log Emitting Juju event update_status.
unit-https-client-35: 16:35:50 DEBUG unit.https-client/35.juju-log https:39: Operator Framework 1.2.0 up and running.
unit-https-client-34: 16:35:50 DEBUG unit.https-client/34.juju-log https:39: Operator Framework 1.2.0 up and running.
unit-https-client-34: 16:35:50 DEBUG unit.https-client/34.juju-log https:39: Legacy hooks/https-relation-broken does not exist.
unit-https-client-35: 16:35:50 DEBUG unit.https-client/35.juju-log https:39: Legacy hooks/https-relation-broken does not exist.
unit-https-client-33: 16:35:50 DEBUG unit.https-client/33.juju-log https:39: Operator Framework 1.2.0 up and running.
unit-https-client-33: 16:35:50 DEBUG unit.https-client/33.juju-log https:39: Legacy hooks/https-relation-broken does not exist.
unit-https-client-35: 16:35:50 DEBUG unit.https-client/35.juju-log https:39: Using local storage: /var/lib/juju/agents/unit-https-client-35/charm/.unit-state.db already exists
unit-https-client-33: 16:35:50 DEBUG unit.https-client/33.juju-log https:39: Using local storage: /var/lib/juju/agents/unit-https-client-33/charm/.unit-state.db already exists
unit-https-client-34: 16:35:50 DEBUG unit.https-client/34.juju-log https:39: Using local storage: /var/lib/juju/agents/unit-https-client-34/charm/.unit-state.db already exists
unit-https-client-35: 16:35:50 DEBUG unit.https-client/35.juju-log https:39: Emitting Juju event https_relation_broken.
unit-https-client-33: 16:35:50 DEBUG unit.https-client/33.juju-log https:39: Emitting Juju event https_relation_broken.
unit-https-client-34: 16:35:50 DEBUG unit.https-client/34.juju-log https:39: Emitting Juju event https_relation_broken.
unit-lxd-21: 16:35:51 DEBUG unit.lxd/21.juju-log https:39: Operator Framework 1.2.0 up and running.
unit-lxd-21: 16:35:51 DEBUG unit.lxd/21.juju-log https:39: Legacy hooks/https-relation-departed does not exist.
unit-lxd-21: 16:35:51 DEBUG unit.lxd/21.juju-log https:39: Using local storage: /var/lib/juju/agents/unit-lxd-21/charm/.unit-state.db already exists
unit-lxd-21: 16:35:51 DEBUG unit.lxd/21.juju-log https:39: Emitting Juju event https_relation_departed.
unit-lxd-21: 16:35:51 DEBUG unit.lxd/21.juju-log https:39: _on_https_relation_departed
unit-lxd-21: 16:35:51 INFO unit.lxd/21.juju-log https:39: Deleting juju-relation-test-https-client/34:autoremove from trusted list

In the above, none of the https-client units saw their relation_departed hook invoked, only their (non-existent) relation_broken hook. For some reason, the lxd unit received a relation_departed for https-client/34 (but no other).

Additional information:

$ juju --version
2.9.15-ubuntu-amd64

$ juju status --relations --color
Model Controller Cloud/Region Version SLA Timestamp
test overlord maas/default 2.9.14 unsupported 16:44:05Z

App Version Status Scale Charm Store Channel Rev OS Message
https-client active 3 https-client local 30 ubuntu
lxd active 1 lxd local 30 ubuntu

Unit Workload Agent Machine Public address Ports Message
https-client/33* active idle 68 2602:fc62:b:1018:0:1:0:c
https-client/34 active idle 69 2602:fc62:b:1018:0:1:0:f
https-client/35 active idle 72 2602:fc62:b:1018:0:1:0:e
lxd/21* active idle 57 2602:fc62:b:1018:0:1:0:b

Machine State DNS Inst id Series AZ Message
57 started 2602:fc62:b:1018:0:1:0:b cloud-vm08 focal default Deployed
68 started 2602:fc62:b:1018:0:1:0:c cloud-vm03 focal default Deployed
69 started 2602:fc62:b:1018:0:1:0:f cloud-vm04 focal default Deployed
71 started 2602:fc62:b:1018:0:1:0:7 cloud-vm05 focal default Deployed
72 started 2602:fc62:b:1018:0:1:0:e cloud-vm06 focal default Deployed
73 started 2602:fc62:b:1018:0:1:0:2 cloud-vm07 focal default Deployed
74 started 2602:fc62:b:1018:0:1:0:3 cloud-vm10 focal default Deployed
75 started 2602:fc62:b:1018:0:1:0:d cloud-vm11 focal default Deployed
76 started 2602:fc62:b:1018:0:1:0:12 cloud-vm14 focal default Deployed
77 started 2602:fc62:b:1018:0:1:: cloud-vm01 focal default Deployed
78 started 2602:fc62:b:1018:0:1:0:13 cloud-vm12 focal default Deployed
79 started 2602:fc62:b:1018:0:1:0:14 cloud-vm13 focal default Deployed
81 started 2602:fc62:b:1018:0:1:0:5 cloud-vm09 focal default Deployed
82 pending pending focal

Revision history for this message
Simon Déziel (sdeziel) wrote :
description: updated
Revision history for this message
John A Meinel (jameinel) wrote :

Certainly randomly running relation-departed when a unit hasn't actually been removed (and then not running it when the relation *is* removed) is a critical function. I can't say we've seen this elsewhere, but I'll dig and see if we can reproduce

Changed in juju:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
Simon Déziel (sdeziel) wrote :
Revision history for this message
Simon Déziel (sdeziel) wrote :
Revision history for this message
Simon Déziel (sdeziel) wrote :

$ juju model-config logging-config
<root>=WARNING;unit=DEBUG

Revision history for this message
John A Meinel (jameinel) wrote :

so one bit that doesn't quite fit in the picture. Looking at https-client_34/var/log/juju/machine-lock.log we see:
2021-09-29 16:31:28 === agent unit-https-client-34 started ===
2021-09-29 16:31:28 unit-https-client-34: meterstatus (meter-status-changed), waited 0s, held 0s
2021-09-29 16:31:33 unit-https-client-34: https-client/34 uniter (run install hook), waited 0s, held 1s
2021-09-29 16:31:33 unit-https-client-34: https-client/34 uniter (run relation-created (39; app: lxd) hook), waited 0s, held 1s
2021-09-29 16:31:34 unit-https-client-34: https-client/34 uniter (run leader-settings-changed hook), waited 0s, held 1s
2021-09-29 16:31:35 unit-https-client-34: https-client/34 uniter (run config-changed hook), waited 0s, held 1s
2021-09-29 16:31:35 unit-https-client-34: https-client/34 uniter (run start hook), waited 0s, held 1s
2021-09-29 16:35:48 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 16:35:51 unit-https-client-34: https-client/34 uniter (run relation-broken (39; app: lxd) hook), waited 0s, held 1s
2021-09-29 16:40:19 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 16:46:02 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 16:50:32 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 16:56:10 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 17:01:01 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 17:06:29 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 17:10:47 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 17:16:28 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 17:22:22 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 17:27:54 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 17:32:06 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s

That clearly points to it running relation-created, start and update-status, but never running relation-joined, and never running relation-departed.
I *would* expect that if a unit doesn't join a relation then it won't depart the relation, and thus you would get broken without departed.

however, I just noticed that the original bug was about unit 33, will check there.

Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (5.2 KiB)

2021-09-29 16:30:50 === agent unit-https-client-33 started ===
2021-09-29 16:30:51 unit-https-client-33: meterstatus (meter-status-changed), waited 0s, held 0s
2021-09-29 16:30:56 unit-https-client-33: https-client/33 uniter (run install hook), waited 0s, held 1s
2021-09-29 16:30:56 unit-https-client-33: https-client/33 uniter (run leader-elected hook), waited 0s, held 1s
2021-09-29 16:30:57 unit-https-client-33: https-client/33 uniter (run config-changed hook), waited 0s, held 1s
2021-09-29 16:30:58 unit-https-client-33: https-client/33 uniter (run start hook), waited 0s, held 1s
2021-09-29 16:31:10 unit-https-client-33: https-client/33 uniter (run relation-created (39; app: lxd) hook), waited 0s, held 1s
2021-09-29 16:31:11 unit-https-client-33: https-client/33 uniter (run relation-joined (39; unit: lxd/21) hook), waited 0s, held 1s
2021-09-29 16:31:12 unit-https-client-33: https-client/33 uniter (run relation-changed (39; unit: lxd/21) hook), waited 0s, held 1s
2021-09-29 16:31:13 unit-https-client-33: https-client/33 uniter (run relation-changed (39; unit: lxd/21) hook), waited 0s, held 1s
2021-09-29 16:31:29 unit-https-client-33: https-client/33 uniter (run relation-departed (39; unit: lxd/21, departee: lxd/21) hook), waited 0s, held 1s
2021-09-29 16:35:51 unit-https-client-33: https-client/33 uniter (run relation-broken (39; app: lxd) hook), waited 0s, held 1s
2021-09-29 16:35:54 unit-https-client-33: https-client/33 uniter (run update-status hook), waited 0s, held 1s
...

Note that this correlates with:
lxd_21/var/log/juju/machine-lock.log
2021-09-29 16:30:25 unit-lxd-21: lxd/21 uniter (run relation-departed (38; unit: https-client/30, departee: https-client/30) hook), waited 0s, held 1s
2021-09-29 16:30:26 unit-lxd-21: lxd/21 uniter (run relation-departed (38; unit: https-client/31, departee: https-client/31) hook), waited 0s, held 1s
2021-09-29 16:30:27 unit-lxd-21: lxd/21 uniter (run relation-departed (38; unit: https-client/32, departee: https-client/32) hook), waited 0s, held 1s
2021-09-29 16:30:28 unit-lxd-21: lxd/21 uniter (run relation-broken (38; app: https-client) hook), waited 0s, held 1s
2021-09-29 16:31:10 unit-lxd-21: lxd/21 uniter (run relation-created (39; app: https-client) hook), waited 0s, held 1s
2021-09-29 16:31:11 unit-lxd-21: lxd/21 uniter (run relation-joined (39; unit: https-client/33) hook), waited 0s, held 1s
2021-09-29 16:31:13 unit-lxd-21: lxd/21 uniter (run relation-changed (39; unit: https-client/33) hook), waited 0s, held 1s
2021-09-29 16:31:22 unit-lxd-21: lxd/21 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 16:31:30 unit-lxd-21: lxd/21 uniter (run relation-departed (39; unit: https-client/33, departee: https-client/33) hook), waited 0s, held 1s
2021-09-29 16:31:34 unit-lxd-21: lxd/21 uniter (run relation-joined (39; unit: https-client/34) hook), waited 0s, held 1s
2021-09-29 16:31:35 unit-lxd-21: lxd/21 uniter (run relation-changed (39; unit: https-client/34) hook), waited 0s, held 1s
2021-09-29 16:33:53 unit-lxd-21: lxd/21 uniter (run relation-joined (39; unit: https-client/35) hook), waited 0s, held 1s
2021-09-29 16:33:54 unit-lxd-21: lxd/21 uniter (run relation-changed (39; unit: ht...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

So what this *appears* to be is that somehow we ended up with the 'juju add-unit https-client' suddenly treating the relation as having no units in the relation (which is why https-client/33 shows departed being called, and lxd/21 departs, and then why 34 and 35 show -created but never -joined or -changed, and why they then only show -broken and never -departed.)

I don't have any idea how Juju would see an established relation, and then remove the active units due to an add-unit.

Revision history for this message
John A Meinel (jameinel) wrote :

This may be a similar cause to:
https://bugs.launchpad.net/bugs/1940983

though in that case cross model relations were involved.

Revision history for this message
John A Meinel (jameinel) wrote :

To give a rough timeline:
lxd/21 is up and running
https-client/30, 31, 32 have just gone away

2021-09-29 16:30:27 unit-lxd-21: lxd/21 uniter (run relation-departed (38; unit: https-client/32, departee: https-client/32) hook), waited 0s, held 1s
2021-09-29 16:30:28 unit-lxd-21: lxd/21 uniter (run relation-broken (38; app: https-client) hook), waited 0s, held 1s

https-client/33 shows up and joins the relation:
2021-09-29 16:31:10 unit-lxd-21: lxd/21 uniter (run relation-created (39; app: https-client) hook), waited 0s, held 1s
2021-09-29 16:31:11 unit-lxd-21: lxd/21 uniter (run relation-joined (39; unit: https-client/33) hook), waited 0s, held 1s
2021-09-29 16:31:13 unit-lxd-21: lxd/21 uniter (run relation-changed (39; unit: https-client/33) hook), waited 0s, held 1s
2021-09-29 16:31:22 unit-lxd-21: lxd/21 uniter (run update-status hook), waited 0s, held 1s

https-client/34 and 35 are added, somehow removing 33:
2021-09-29 16:31:30 unit-lxd-21: lxd/21 uniter (run relation-departed (39; unit: https-client/33, departee: https-client/33) hook), waited 0s, held 1s
2021-09-29 16:31:34 unit-lxd-21: lxd/21 uniter (run relation-joined (39; unit: https-client/34) hook), waited 0s, held 1s
2021-09-29 16:31:35 unit-lxd-21: lxd/21 uniter (run relation-changed (39; unit: https-client/34) hook), waited 0s, held 1s
2021-09-29 16:33:53 unit-lxd-21: lxd/21 uniter (run relation-joined (39; unit: https-client/35) hook), waited 0s, held 1s
2021-09-29 16:33:54 unit-lxd-21: lxd/21 uniter (run relation-changed (39; unit: https-client/35) hook), waited 0s, held 1s
2021-09-29 16:35:37 unit-lxd-21: lxd/21 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 16:35:51 unit-lxd-21: lxd/21 uniter (run relation-departed (39; unit: https-client/34, departee: https-client/34) hook), waited 0s, held 1s

but also in some sort of fashion that 34 and 35 never see lxd/21
2021-09-29 16:31:35 unit-https-client-34: https-client/34 uniter (run start hook), waited 0s, held 1s
2021-09-29 16:35:48 unit-https-client-34: https-client/34 uniter (run update-status hook), waited 0s, held 1s
2021-09-29 16:35:51 unit-https-client-34: https-client/34 uniter (run relation-broken (39; app: lxd) hook), waited 0s, held 1s

Revision history for this message
John A Meinel (jameinel) wrote :

This may also be the case that was seen in
https://bugs.launchpad.net/juju/+bug/1931632

Revision history for this message
Ian Booth (wallyworld) wrote :

Ideally we'd turn on TRACE debugging for a couple of key modules so we can see in more detail what's happening when it goes wrong:

juju.state.relationunits
juju.worker.uniter.relation
juju.worker.uniter.relationunits

And also get a "juju dump-db" which produces a yaml db dump.
(you'd need to set JUJU_DEV_FEATURE_FLAGS=developer-mode on the client to get this command)

Revision history for this message
Simon Déziel (sdeziel) wrote :

@wallyworld, thanks. I made the following change and will provide the DB dump if/when I run into the issue again.

$ juju model-config logging-config
<root>=WARNING;unit=DEBUG;juju.state.relationunits=TRACE;juju.worker.uniter.relation=TRACE;juju.worker.uniter.relationunits=TRACE

Revision history for this message
Ian Booth (wallyworld) wrote :

Adding to a milestone so we can track the bug; we can better investigate once we get the dump-db output.

Changed in juju:
milestone: none → 2.9.17
Changed in juju:
milestone: 2.9.17 → 2.9.18
John A Meinel (jameinel)
Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9.18 → 2.9.19
Changed in juju:
milestone: 2.9.19 → 2.9.20
Changed in juju:
milestone: 2.9.20 → 2.9.21
Changed in juju:
milestone: 2.9.21 → 2.9.22
Changed in juju:
milestone: 2.9.22 → 2.9.23
Changed in juju:
milestone: 2.9.23 → 2.9.24
Changed in juju:
milestone: 2.9.24 → 2.9.25
Changed in juju:
milestone: 2.9.25 → 2.9.26
Changed in juju:
milestone: 2.9.26 → 2.9.27
Changed in juju:
milestone: 2.9.27 → 2.9.28
Changed in juju:
milestone: 2.9.28 → 2.9.29
Revision history for this message
Ian Booth (wallyworld) wrote :

I'll mark as incomplete until we get told of another case where it's happened again and we can get the necessary debug info.

Changed in juju:
assignee: Ian Booth (wallyworld) → nobody
status: Triaged → Incomplete
Changed in juju:
milestone: 2.9.29 → 2.9.30
John A Meinel (jameinel)
Changed in juju:
importance: Critical → High
milestone: 2.9.30 → 2.9-next
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9-next → none
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.