Comment 3 for bug 1760138

John A Meinel (jameinel) wrote:

This seems to be some sort of load issue, as this is what I'm seeing:
$ grep -rnI "is-leader\|leader-set" .
./unit-ceph-mon-1.log:1196:2018-03-30 01:54:36 DEBUG worker.uniter.jujuc server.go:181 running hook tool "is-leader"
./unit-ceph-mon-1.log:1206:2018-03-30 01:56:50 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-set"
./unit-ceph-mon-1.log:1230:2018-03-30 01:58:27 DEBUG config-changed subprocess.CalledProcessError: Command '['leader-set', 'fsid=9a557a74-33bd-11e8-aedb-00163ef9463f', 'monitor-secret=AQBemb1ajb3qLRAAlPxv893HqNK1oos2V1jUzg==']' returned non-zero exit status 1

It isn't clear how long "is-leader" itself takes to return, but the gap between "is-leader" and "leader-set" being invoked is a full 2m14s (01:54:36 => 01:56:50), and "leader-set" doesn't return its non-zero exit status for another 1m37s (01:56:50 => 01:58:27).

When we nominate a unit as leader, we grant it a lease that lasts for 1 minute and that the Juju agent refreshes every 30s. If we do not see a "yes, I'm still alive" refresh within that minute, leadership is handed over to another unit that is responding.
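For illustration only, here is a minimal sketch of that expiry behaviour (this is not Juju's actual lease implementation; the 30s/60s values are just the ones described above):

    import time

    LEASE_DURATION = 60    # lease lasts 1 minute (from the description above)
    REFRESH_INTERVAL = 30  # the leader's agent refreshes roughly every 30s

    class Lease:
        def __init__(self):
            self.expires_at = time.monotonic() + LEASE_DURATION

        def refresh(self):
            # Called by the leader's agent about every REFRESH_INTERVAL seconds.
            self.expires_at = time.monotonic() + LEASE_DURATION

        def still_leader(self):
            # If no refresh arrives within the lease window, leadership can be
            # handed to another unit that is still responding.
            return time.monotonic() < self.expires_at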

Now, is the problem that the Jujud controller itself is so overloaded that it can't respond to requests in a timely manner?

AFAICT the code in question is:
    if is_leader():
        if not config('no-bootstrap'):
            if not leader_get('fsid') or not leader_get('monitor-secret'):
                if config('fsid'):
                    fsid = config('fsid')
                else:
                    fsid = "{}".format(uuid.uuid1())
                if config('monitor-secret'):
                    mon_secret = config('monitor-secret')
                else:
                    mon_secret = "{}".format(ceph.generate_monitor_secret())
                status_set('maintenance', 'Creating FSID and Monitor Secret')
                opts = {
                    'fsid': fsid,
                    'monitor-secret': mon_secret,
                }
                log("Settings for the cluster are: {}".format(opts))
                leader_set(opts)
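
For context, leader_set() in charmhelpers is essentially a thin wrapper that shells out to the leader-set hook tool, which is why a hook-tool failure surfaces as a subprocess.CalledProcessError in the config-changed traceback above. Roughly (a simplified sketch, not the exact charmhelpers source):

    import subprocess

    def leader_set(settings=None, **kwargs):
        # Build "leader-set key=value ..." and run it; check_call raises
        # CalledProcessError on a non-zero exit, which is what the
        # config-changed hook logged above.
        cmd = ['leader-set']
        settings = dict(settings or {})
        settings.update(kwargs)
        for key, value in settings.items():
            cmd.append('{}={}'.format(key, '' if value is None else value))
        subprocess.check_call(cmd)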

$ grep -rnI "is-leader\|leader-set\|leader-get\|config-get" .
./unit-ceph-mon-1.log:1196:2018-03-30 01:54:36 DEBUG worker.uniter.jujuc server.go:181 running hook tool "is-leader"
./unit-ceph-mon-1.log:1200:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-get"
./unit-ceph-mon-1.log:1201:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "config-get"
./unit-ceph-mon-1.log:1202:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "config-get"
./unit-ceph-mon-1.log:1206:2018-03-30 01:56:50 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-set"
./unit-ceph-mon-1.log:1230:2018-03-30 01:58:27 DEBUG config-changed subprocess.CalledProcessError: Command '['leader-set', 'fsid=9a557a74-33bd-11e8-aedb-00163ef9463f', 'monitor-secret=AQBemb1ajb3qLRAAlPxv893HqNK1oos2V1jUzg==']' returned non-zero exit status 1

That seems to say leader-get and config-get are both very fast (sub-second), and there is no more than 10s between is-leader returning and the subsequent commands.
But maybe "is-leader" itself is swamped for some reason. Just looking at every hook tool invocation:
$ grep -rnI "running hook tool" .

./unit-ceph-mon-1.log:1196:2018-03-30 01:54:36 DEBUG worker.uniter.jujuc server.go:181 running hook tool "is-leader"
./unit-ceph-mon-1.log:1200:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-get"
./unit-ceph-mon-1.log:1201:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "config-get"
./unit-ceph-mon-1.log:1202:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "config-get"
./unit-ceph-mon-1.log:1203:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "status-set"
./unit-ceph-mon-1.log:1204:2018-03-30 01:56:50 DEBUG worker.uniter.jujuc server.go:181 running hook tool "juju-log"
./unit-ceph-mon-1.log:1206:2018-03-30 01:56:50 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-set"

The crashdump appears to have all the log files from all the machines running in the model, but no logs from the controller itself.
I was hoping to poke around and see whether the controller was reporting any issues during the 01:54:36 => 01:58:27 timeframe.