Thanks, John, for doing the additional analysis; I was also heading in the direction of a gap between is_leader() and leader_set().
The code that you quoted looks like it ought to complete within 30 seconds (its only call-out is very quick). The code below, the only other place leader_set() is called, also has no path that should take longer than 30 seconds between is_leader() and leader_set():
    if not is_leader():
        log('Deferring leader-setting updates to the leader unit')
        return

    curr_fsid = leader_get('fsid')
    curr_secret = leader_get('monitor-secret')
    for relid in relation_ids('bootstrap-source'):
        for unit in related_units(relid=relid):
            mon_secret = relation_get('monitor-secret', unit, relid)
            fsid = relation_get('fsid', unit, relid)

            if not (mon_secret and fsid):
                log('Relation data is not ready as the fsid or the '
                    'monitor-secret are missing from the relation: '
                    'mon_secret = {} and fsid = {} '.format(mon_secret, fsid))
                continue

            if not (curr_fsid or curr_secret):
                curr_fsid = fsid
                curr_secret = mon_secret
            else:
                # The fsids and secrets need to match or the local monitors
                # will fail to join the mon cluster. If they don't,
                # bail because something needs to be investigated.
                assert curr_fsid == fsid, \
                    "bootstrap fsid '{}' != current fsid '{}'".format(
                        fsid, curr_fsid)
                assert curr_secret == mon_secret, \
                    "bootstrap secret '{}' != current secret '{}'".format(
                        mon_secret, curr_secret)

            opts = {
                'fsid': fsid,
                'monitor-secret': mon_secret,
            }

            log('Updating leader settings for fsid and monitor-secret '
                'from remote relation data: {}'.format(opts))
            leader_set(opts)
This code is ONLY used when upgrading from the 'ceph'-only charm, which isn't the case here, so we can discount it. It does look like this code *could* call leader_set() once for every unit in the 'bootstrap-source' relation, which may be a problem.
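On the per-unit leader_set() point: a rough sketch of one way the loop could be restructured so that leader_set() is called at most once. This is not the charm's code; collect_bootstrap_settings() and its relation_data argument are hypothetical names, and the relation data is passed in as plain (mon_secret, fsid) pairs so the example runs without charmhelpers.

```python
def collect_bootstrap_settings(relation_data, curr_fsid=None, curr_secret=None):
    """Return one settings dict for a single leader_set() call, or None.

    relation_data is a hypothetical stand-in for the relation_get() calls:
    an iterable of (mon_secret, fsid) pairs, one per remote unit.
    """
    seen_ready_unit = False
    for mon_secret, fsid in relation_data:
        if not (mon_secret and fsid):
            # Relation data is not ready for this unit; skip it.
            continue

        if not (curr_fsid or curr_secret):
            curr_fsid = fsid
            curr_secret = mon_secret
        else:
            # Same consistency checks as the quoted code.
            assert curr_fsid == fsid, \
                "bootstrap fsid '{}' != current fsid '{}'".format(
                    fsid, curr_fsid)
            assert curr_secret == mon_secret, \
                "bootstrap secret '{}' != current secret '{}'".format(
                    mon_secret, curr_secret)
        seen_ready_unit = True

    if not seen_ready_unit:
        return None
    return {'fsid': curr_fsid, 'monitor-secret': curr_secret}
```

The hook would then build the pairs from the relation, call this once, and only call leader_set(opts) if the result is not None.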
So the only is_leader() ... leader_set() code path is the one that was quoted, and its only call-out that isn't Juju related is ceph.generate_monitor_secret(), which is very quick.
So, I'm wondering if the charm does manage to call leader_set() in time, but the controller doesn't 'see' the leader_set until several minutes later?
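One way to test that theory would be to log how long the charm itself spends between is_leader() returning True and leader_set() returning. A minimal sketch, assuming the hook functions are passed in as callables; timed_leader_set(), build_settings and the clock parameter are hypothetical names and not part of charmhelpers. The 30 second constant is the window discussed in this thread.

```python
import time

# The window discussed in this thread, in seconds.
LEADER_WINDOW_SECONDS = 30


def timed_leader_set(is_leader, leader_set, build_settings,
                     log=print, clock=time.monotonic):
    """Call leader_set() if we are leader and report the charm-side timings.

    Returns (seconds from is_leader() to just before leader_set(),
             seconds spent inside leader_set()), or None if not leader.
    """
    if not is_leader():
        log('Deferring leader-setting updates to the leader unit')
        return None
    checked_at = clock()

    settings = build_settings()   # e.g. generating the monitor secret
    before_set = clock()
    leader_set(settings)
    after_set = clock()

    prepare = before_set - checked_at
    in_set = after_set - before_set
    log('is_leader() -> leader_set() took {:.3f}s, leader_set() itself '
        'took {:.3f}s (window is {}s)'.format(
            prepare, in_set, LEADER_WINDOW_SECONDS))
    return prepare, in_set
```

If both numbers come out small in the unit log while the controller still records the setting minutes later, that would point at the controller side rather than the charm.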