This seems to be some sort of load issue, as this is what I'm seeing:
$ grep -rnI "is-leader\|leader-set" .
./unit-ceph-mon-1.log:1196:2018-03-30 01:54:36 DEBUG worker.uniter.jujuc server.go:181 running hook tool "is-leader"
./unit-ceph-mon-1.log:1206:2018-03-30 01:56:50 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-set"
./unit-ceph-mon-1.log:1230:2018-03-30 01:58:27 DEBUG config-changed subprocess.CalledProcessError: Command '['leader-set', 'fsid=9a557a74-33bd-11e8-aedb-00163ef9463f', 'monitor-secret=AQBemb1ajb3qLRAAlPxv893HqNK1oos2V1jUzg==']' returned non-zero exit status 1
It isn't clear how long it is taking for "is-leader" to return, but we can see that the time between "is-leader" and "leader-set" being called is a full 2m14s, and that "leader-set" doesn't return non-zero for another 1m37s.
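For reference, the two gaps above fall straight out of the log timestamps. A quick check (timestamps copied from the unit-ceph-mon-1.log lines quoted above):

```python
from datetime import datetime

# Timestamps taken from the grep output above.
fmt = "%Y-%m-%d %H:%M:%S"
is_leader = datetime.strptime("2018-03-30 01:54:36", fmt)   # "is-leader" called
leader_set = datetime.strptime("2018-03-30 01:56:50", fmt)  # "leader-set" called
failure = datetime.strptime("2018-03-30 01:58:27", fmt)     # CalledProcessError logged

print(leader_set - is_leader)  # 0:02:14
print(failure - leader_set)    # 0:01:37
```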
When we nominate a node as leader, we give it a lease that lasts for 1 minute and that the Juju agent refreshes every 30s. If we cannot get a "yes, I'm still alive" refresh within that minute, then we hand leadership over to someone else that is responding.
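The lease rule described above can be sketched as follows (the names here are illustrative, not Juju's actual implementation):

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the lease rule: a leadership lease lasts
# 1 minute, and the agent is expected to refresh it every 30s.
LEASE_DURATION = timedelta(minutes=1)

def still_leader(last_refresh, now):
    """Leadership holds only while the last refresh is within the lease."""
    return now - last_refresh <= LEASE_DURATION

now = datetime(2018, 3, 30, 1, 56, 50)
print(still_leader(now - timedelta(seconds=30), now))  # True: refreshed 30s ago
print(still_leader(now - timedelta(seconds=90), now))  # False: lease expired
```

So if the unit's agent cannot reach the controller for more than a minute (e.g. because the controller is overloaded), leadership can change hands mid-hook, which would explain "leader-set" failing.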
Now, is the problem the Jujud controller itself is so overloaded that it can't respond to requests in a timely manner?
AFAICT the code in question is:
if is_leader():
    if not config('no-bootstrap'):
        if not leader_get('fsid') or not leader_get('monitor-secret'):
            if config('fsid'):
                fsid = config('fsid')
            else:
                fsid = "{}".format(uuid.uuid1())
            if config('monitor-secret'):
                mon_secret = config('monitor-secret')
            else:
                mon_secret = "{}".format(ceph.generate_monitor_secret())
            status_set('maintenance', 'Creating FSID and Monitor Secret')
            opts = {
                'fsid': fsid,
                'monitor-secret': mon_secret,
            }
            log("Settings for the cluster are: {}".format(opts))
            leader_set(opts)
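As I understand it, charmhelpers' leader_set() ultimately invokes the "leader-set" hook tool as a subprocess, so a non-zero exit surfaces as exactly the CalledProcessError in the traceback above. A minimal sketch of that failure mode, using `false` as a stand-in for a failing hook tool:

```python
import subprocess

# check_call raises CalledProcessError on a non-zero exit, which is
# what the config-changed traceback shows for 'leader-set'.
try:
    subprocess.check_call(['false'])  # stand-in for a failing hook tool
except subprocess.CalledProcessError as err:
    rc = err.returncode

print(rc)  # 1
```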
$ grep -rnI "is-leader\|leader-set\|leader-get\|config-get" .
./unit-ceph-mon-1.log:1196:2018-03-30 01:54:36 DEBUG worker.uniter.jujuc server.go:181 running hook tool "is-leader"
./unit-ceph-mon-1.log:1200:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-get"
./unit-ceph-mon-1.log:1201:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "config-get"
./unit-ceph-mon-1.log:1202:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "config-get"
./unit-ceph-mon-1.log:1206:2018-03-30 01:56:50 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-set"
./unit-ceph-mon-1.log:1230:2018-03-30 01:58:27 DEBUG config-changed subprocess.CalledProcessError: Command '['leader-set', 'fsid=9a557a74-33bd-11e8-aedb-00163ef9463f', 'monitor-secret=AQBemb1ajb3qLRAAlPxv893HqNK1oos2V1jUzg==']' returned non-zero exit status 1
That seems to say leader-get and config-get are both very fast (sub-second), and from the code there should be no more than 10s between "is-leader" and the subsequent commands.
But maybe "is-leader" itself is swamped for some reason. Just looking at all of the hook tool calls:
$ grep -rnI "running hook tool" .
./unit-ceph-mon-1.log:1196:2018-03-30 01:54:36 DEBUG worker.uniter.jujuc server.go:181 running hook tool "is-leader"
./unit-ceph-mon-1.log:1200:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-get"
./unit-ceph-mon-1.log:1201:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "config-get"
./unit-ceph-mon-1.log:1202:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "config-get"
./unit-ceph-mon-1.log:1203:2018-03-30 01:56:46 DEBUG worker.uniter.jujuc server.go:181 running hook tool "status-set"
./unit-ceph-mon-1.log:1204:2018-03-30 01:56:50 DEBUG worker.uniter.jujuc server.go:181 running hook tool "juju-log"
./unit-ceph-mon-1.log:1206:2018-03-30 01:56:50 DEBUG worker.uniter.jujuc server.go:181 running hook tool "leader-set"
The crashdump appears to have all the log files from all the machines running in the model, but no logs from the controller itself.
I was hoping to peek around and see if the controller is reporting any issues running around the 01:54:36 => 01:58:27 timeframe.