Unit fails to start complaining there are members in the relation

Bug #1910958 reported by Andrey Grebennikov
This bug affects 2 people
Affects          Status        Importance  Assigned to  Milestone
Canonical Juju   Fix Released  High        Ian Booth

Bug Description

juju 2.8.7 on focal.

Deployed CDK, then later added Ceph units and relations between them. Something went wrong and an attempt was made to remove the relations. Now all kubernetes-master units are in the error state with a failed "update-status" hook, and the logs show this recurring message:
> ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-kubernetes-master-0": cannot create relation state tracker: cannot remove persisted state, relation 25 has members

Attaching a crashdump and the dump_db.
https://drive.google.com/file/d/1u-ldXBTrIZ08rBluigTs5r1gTgwlJl2G/view?usp=sharing

Andrey Grebennikov (agrebennikov) wrote :
Ian Booth (wallyworld) wrote :

The database dump shows that the kubernetes-master/0 unit has recorded some stale relation information. It is currently in relations with these ids:

2, 3, 6, 10, 11, 13, 14, 15, 16

but the unit state claims that these relations are in play:

2, 3, 6, 10, 11, 13, 14, 15, 16, 25, 26

Relation ids 25 and 26 were for ceph-mon units 3, 4 and 5.
But currently there are only ceph-mon units 9, 10 and 11.
So units 3, 4 and 5 got deleted and the unit agent for kubernetes-master/0 either was not notified to clean up or, if it was, the cleanup failed.
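
The stale ids fall out of a simple set difference between the two lists of relation ids. A minimal Python sketch, for illustration only (Juju itself is written in Go, and the variable names here are made up):

```python
# Relation ids the unit is actually in, per the database dump.
current_relations = {2, 3, 6, 10, 11, 13, 14, 15, 16}

# Relation ids recorded in the unit's persisted state.
persisted_relations = {2, 3, 6, 10, 11, 13, 14, 15, 16, 25, 26}

# Anything persisted but no longer current is stale.
stale = sorted(persisted_relations - current_relations)
print(stale)  # [25, 26]
```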

The unit agent start up needs to be made more robust so that if it sees relation ids that no longer exist, it purges those from its state without complaining.

The next thing to figure out is how a relation got deleted without the unit agent cleaning up.

Changed in juju:
milestone: none → 2.9-rc4
importance: Undecided → High
status: New → Triaged
Ian Booth (wallyworld) wrote :

I may be missing it, but I can't see logs for either the controller or the kubernetes-master/0 unit in the crashdump. That makes it hard to find out how things got messed up in the first place.
But we can still make the unit agent more robust to bad data.

Ian Booth (wallyworld) wrote :

Correction to comment #2: if the unit agent holds relation state that needs to be deleted because the relation no longer exists, and that state contains units that have already been removed, those units should not stop the relation state from being cleaned up.
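
As a rough illustration of that rule, here is a minimal Python sketch. It is not the Juju implementation (which is in Go); the function name, the data shapes and the choice to still raise when a vanished relation has live members are all assumptions made for this example.

```python
def purge_stale_relations(persisted, live_relation_ids, live_units):
    """Split persisted relation state into kept state and purged ids.

    persisted: dict mapping relation id -> set of member unit names
               (hypothetical shape, not Juju's real state format).
    live_relation_ids: relation ids that still exist in the model.
    live_units: unit names that still exist in the model.
    """
    kept = {}
    purged = []
    for rel_id, members in sorted(persisted.items()):
        if rel_id in live_relation_ids:
            # The relation still exists, so its state is left alone.
            kept[rel_id] = set(members)
            continue
        # The relation is gone. Members that were already removed are
        # orphans and must not block the cleanup.
        still_alive = set(members) & set(live_units)
        if still_alive:
            # Assumption: a vanished relation with live members is
            # still treated as an error in this sketch.
            raise RuntimeError(
                f"relation {rel_id} is gone but has live members: "
                f"{sorted(still_alive)}"
            )
        purged.append(rel_id)
    return kept, purged


# Illustrative data only; the member sets are invented for the example.
state = {
    2: {"kubernetes-master/0"},
    25: {"ceph-mon/3", "ceph-mon/4", "ceph-mon/5"},
    26: {"ceph-mon/3", "ceph-mon/4", "ceph-mon/5"},
}
kept, purged = purge_stale_relations(
    state,
    live_relation_ids={2},
    live_units={"kubernetes-master/0", "ceph-mon/9"},
)
print(purged)  # [25, 26]
```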

Changed in juju:
milestone: 2.9-rc4 → 2.8.8
Andrey Grebennikov (agrebennikov) wrote :

Included logs from the k8s-master units and one of the 3 controllers.

Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld) wrote :

I can't reproduce the issue no matter how hard I try: manually stopping unit agents, removing relations, restarting agents, etc. The logs don't shed any light on exactly what happened.

The best we can do is make juju more defensive by checking for orphaned units instead of complaining.

Ian Booth (wallyworld) wrote :

This PR makes the unit agent more robust to orphaned units:
https://github.com/juju/juju/pull/12507

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Trent Lloyd (lathiat) wrote :

Ran into this bug in an environment yesterday.

The symptoms were "ERROR permission denied" during relation-get or relation-set requests.

However, the tell-tale log line for this error only appeared after jujud was restarted. Before the restart, neither it nor any other obvious error about the issue had appeared at the jujud level:
2021-05-31 07:20:55 ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-rabbitmq-server-0": cannot create relation state tracker: cannot remove persisted state, relation 233 has members

Noting this here to help others diagnose the issue.
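
For anyone scanning unit logs for this signature, a minimal Python sketch that pulls the unit name and relation id out of matching lines. The regular expression is based only on the two log lines quoted in this report, and the function name is made up for the example.

```python
import re

# Matches the error text quoted in this bug report.
STALE_RELATION_ERROR = re.compile(
    r'failed to initialize uniter for "(?P<unit>unit-[^"]+)": '
    r"cannot create relation state tracker: "
    r"cannot remove persisted state, "
    r"relation (?P<relation>\d+) has members"
)


def find_stale_relation_errors(lines):
    """Return (unit, relation_id) pairs for each matching log line."""
    hits = []
    for line in lines:
        match = STALE_RELATION_ERROR.search(line)
        if match:
            hits.append((match.group("unit"), int(match.group("relation"))))
    return hits


log_line = (
    '2021-05-31 07:20:55 ERROR juju.worker.dependency engine.go:671 '
    '"uniter" manifold worker returned unexpected error: failed to '
    'initialize uniter for "unit-rabbitmq-server-0": cannot create '
    'relation state tracker: cannot remove persisted state, '
    'relation 233 has members'
)
print(find_stale_relation_errors([log_line]))
# [('unit-rabbitmq-server-0', 233)]
```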
