Comment 5 for bug 1760138

Alex Kavanagh (ajkavanagh) wrote :

Thanks John for doing the additional analysis; I was also heading in the direction of a gap between is_leader() and leader_set().

The code that you quoted looks like it ought to complete well within 30 seconds (its only call-out is very quick). The only other use of leader_set() is in the code below, which likewise has no path that should take longer than 30 seconds between is_leader() and leader_set():

    if not is_leader():
        log('Deferring leader-setting updates to the leader unit')
        return

    curr_fsid = leader_get('fsid')
    curr_secret = leader_get('monitor-secret')
    for relid in relation_ids('bootstrap-source'):
        for unit in related_units(relid=relid):
            mon_secret = relation_get('monitor-secret', unit, relid)
            fsid = relation_get('fsid', unit, relid)

            if not (mon_secret and fsid):
                log('Relation data is not ready as the fsid or the '
                    'monitor-secret are missing from the relation: '
                    'mon_secret = {} and fsid = {} '.format(mon_secret, fsid))
                continue

            if not (curr_fsid or curr_secret):
                curr_fsid = fsid
                curr_secret = mon_secret
            else:
                # The fsids and secrets need to match or the local monitors
                # will fail to join the mon cluster. If they don't,
                # bail because something needs to be investigated.
                assert curr_fsid == fsid, \
                    "bootstrap fsid '{}' != current fsid '{}'".format(
                        fsid, curr_fsid)
                assert curr_secret == mon_secret, \
                    "bootstrap secret '{}' != current secret '{}'".format(
                        mon_secret, curr_secret)

            opts = {
                'fsid': fsid,
                'monitor-secret': mon_secret,
            }

            log('Updating leader settings for fsid and monitor-secret '
                'from remote relation data: {}'.format(opts))
            leader_set(opts)

This code is ONLY used when upgrading from the 'ceph' only charm, which isn't the case here, so we can discount it. It does look like this code *could* call leader_set() once for every unit in the 'bootstrap-source' relation, because the call sits inside the inner loop, which may be a problem.
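As an aside on the per-unit leader_set() point, here is a rough sketch of how the loop could be restructured so that leader_set() is called at most once. It is not the charm's code: leader_set() is a recording stand-in for the real hook helper, collect_bootstrap_settings() and update_leader_settings() are hypothetical names, the relation data is passed in as plain dicts, and the comparison against the current leader_get() values is left out for brevity.

```python
# Recording stand-in for the real leader_set() hook helper (assumption:
# it only needs to accept a dict of settings).
leader_set_calls = []


def leader_set(settings):
    leader_set_calls.append(dict(settings))


def collect_bootstrap_settings(relation_data):
    """Return the one fsid/monitor-secret pair shared by all ready units.

    relation_data is a list of dicts, one per remote unit (hypothetical
    shape). Returns None if no unit has published both values yet.
    """
    agreed = None
    for data in relation_data:
        fsid = data.get('fsid')
        mon_secret = data.get('monitor-secret')
        if not (mon_secret and fsid):
            continue  # this unit's relation data is not ready yet
        candidate = {'fsid': fsid, 'monitor-secret': mon_secret}
        if agreed is None:
            agreed = candidate
        elif agreed != candidate:
            # Same intent as the asserts in the quoted code: bail out.
            raise ValueError(
                'bootstrap-source units disagree on fsid or monitor-secret')
    return agreed


def update_leader_settings(relation_data):
    """Call leader_set() once, after every unit has been inspected."""
    opts = collect_bootstrap_settings(relation_data)
    if opts is not None:
        leader_set(opts)
```

With three related units, two of them ready and agreeing, this makes a single leader_set() call instead of two.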

So the only relevant is_leader() ... leader_set() code path is the one that was quoted, and the only call-out in it that isn't Juju related is "ceph.generate_monitor_secret()", which is very quick.

So, I'm wondering if the charm does manage to call leader_set() in time, but the controller doesn't 'see' the leader_set until several minutes later?
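One way to test that theory from the charm side would be to log how long the unit actually spends between is_leader() and the return of leader_set(). The following is only a sketch under assumptions: timed_leader_update() is a hypothetical wrapper, the three callables are passed in rather than imported so it can be run outside a hook, and LEASE_SECONDS simply encodes the 30 second figure discussed above.

```python
import time

LEASE_SECONDS = 30  # the window discussed above


def timed_leader_update(is_leader, leader_set, make_settings,
                        clock=time.monotonic, log=print):
    """Run the is_leader() ... leader_set() path and report its duration.

    Returns the elapsed seconds, or None if this unit is not the leader.
    """
    if not is_leader():
        log('Deferring leader-setting updates to the leader unit')
        return None
    start = clock()
    settings = make_settings()  # e.g. generating the monitor secret
    leader_set(settings)
    elapsed = clock() - start
    if elapsed > LEASE_SECONDS:
        log('leader_set() returned {:.1f}s after is_leader(); over the '
            '{}s window'.format(elapsed, LEASE_SECONDS))
    else:
        log('leader_set() returned {:.1f}s after is_leader()'.format(elapsed))
    return elapsed
```

If the logged time is consistently small while the failure still occurs, that would point away from the charm and towards a delay on the controller side.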