Unit falls out of quorum into leader-settings-changed loop

Bug #1768241 reported by Michael Skalka on 2018-05-01
This bug affects 3 people
Affects: OpenStack ceph-mon charm
Status: Triaged
Importance: High
Assigned to: Alex Kavanagh

Bug Description

Ceph-mon charm rev 24

Had a Ceph-Mon cluster in quorum and healthy on Friday. Sometime the following Sunday the unit fell out of quorum, following what looks like a leader election. The charm fired off a 'config-changed' hook, then started to loop in 'leader-settings-changed' for a number of days.

`ceph status` on the stuck mon reports healthy, all three mons in quorum, all OSDs and PGs healthy.

Michael Skalka (mskalka) wrote :

Some status logs while the crashdump churns:

ceph-mon/0 (the troublesome unit) https://pastebin.canonical.com/p/3rQf8CD9sv/

ceph-mon/2 (the new leader) https://pastebin.canonical.com/p/PcMYCpJWnB/

Michael Skalka (mskalka) wrote :

In attempting to debug what was happening, I fired off the `leader-settings-changed` hook in the debug-hooks environment. After approximately four hours the same hook was still running. It looks like it's looping through every OSD in the cluster, taking ~15 minutes per OSD to do... something:

...
Reading state information... Done
The following package was automatically installed and is no longer required:
  libfreetype6
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
exported keyring for client.radosgw.gateway
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
obtained 'cephx.groups.objects'
value stored
exported keyring for client.nova-compute-kvm
...
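
The shape of the log above (one keyring export per client, then repeated "obtained ... / value stored" pairs) suggests the hook is re-walking every related client and access group on each invocation. Below is a minimal, hypothetical sketch of that shape; export_keyring(), sync_group() and leader_settings_changed() are invented names, not the charm's actual functions. The point is only that one slow mon call per item multiplies into hours when the handler starts from scratch every time:

# Hypothetical sketch of the looping shape seen above; NOT the ceph-mon
# charm's actual code. Helper names are invented for illustration.
import subprocess

def export_keyring(client):
    # One synchronous mon call per related client
    # ("exported keyring for client.radosgw.gateway", ...).
    return subprocess.check_output(['ceph', 'auth', 'get', client])

def sync_group(group_key):
    # The kind of read-modify-write the "obtained ... / value stored" pairs
    # suggest: fetch a stored group definition, then write it back.
    current = subprocess.check_output(
        ['ceph', 'config-key', 'get', group_key]).decode()
    subprocess.check_output(
        ['ceph', 'config-key', 'put', group_key, current])

def leader_settings_changed(clients, groups):
    # If every hook run re-walks all clients and all groups, and each mon
    # call is slow while the cluster is busy, runtime grows as
    # len(clients) * len(groups), so a single hook can run for hours.
    for client in clients:
        export_keyring(client)
        for group_key in groups:
            sync_group(group_key)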

Changed in charm-ceph-mon:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Alex Kavanagh (ajkavanagh)
John A Meinel (jameinel) wrote :

I don't seem to see the unit-ceph-mon-0 log file, which is a bit surprising.

However, if I grep around for "resolver loop error", there seem to be two possible causes:
./5/var/log/juju/unit-filebeat-53.log:15079:2018-04-19 20:14:05 ERROR juju.worker.uniter agent.go:28 resolver loop error: could not acquire lock: cancelled acquiring mutex

./5/var/log/juju/unit-keystone-ldap-citrix-1.log:4180:2018-04-19 20:21:42 ERROR juju.worker.uniter agent.go:28 resolver loop error: preparing operation "run update-status hook": connection is shut down

The former message happens in the logs about 10x more often.
"cancelled acquiring mutex" is triggered in github.com/juju/mutex. Most likely in mutex_flock.go:
func acquire(spec Spec, timeout <-chan time.Time) (Releaser, error) {
    done := make(chan struct{})
    defer close(done)
    select {
    case result := <-acquireFlock(spec.Name, done):
        // The flock attempt finished first: either we hold the lock or it failed.
        if result.err != nil {
            return nil, errors.Trace(result.err)
        }
        return result.m, nil
    case <-timeout:
        return nil, ErrTimeout
    case <-spec.Cancel:
        // Cancellation wins: this is the path that yields "cancelled acquiring mutex".
        return nil, ErrCancelled
    }
}

Looking at how we're using the lock in github.com/juju/juju's worker/uniter/uniter.go, we see:
func (u *Uniter) acquireExecutionLock() (mutex.Releaser, error) {
    // We want to make sure we don't block forever when locking, but take the
    // Uniter's catacomb into account.
    spec := mutex.Spec{
        Name:   u.hookLockName,
        Clock:  u.clock,
        Delay:  250 * time.Millisecond,
        Cancel: u.catacomb.Dying(),
    }
    logger.Debugf("acquire lock %q for uniter hook execution", u.hookLockName)
    releaser, err := mutex.Acquire(spec)
    if err != nil {
        return nil, errors.Trace(err)
    }
    logger.Debugf("lock %q acquired", u.hookLockName)
    return releaser, nil
}

u.catacomb.Dying() should only be firing if we are shutting down this worker for some reason.

Again digging into the log files for "resolver loop error" reveals:
grep -rnI -C 2 "resolver loop error" .

./5/var/log/juju/unit-filebeat-53.log-15077-2018-04-19 20:14:05 ERROR juju.api monitor.go:59 health ping timed out after 30s
./5/var/log/juju/unit-filebeat-53.log-15078-2018-04-19 20:14:05 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
./5/var/log/juju/unit-filebeat-53.log:15079:2018-04-19 20:14:05 ERROR juju.worker.uniter agent.go:28 resolver loop error: could not acquire lock: cancelled acquiring mutex
./5/var/log/juju/unit-filebeat-53.log-15080-2018-04-19 20:14:05 ERROR juju.worker.uniter agent.go:31 updating agent status: connection is shut down

./5/var/log/juju/unit-nrpe-57.log-583-2018-04-19 20:14:04 ERROR juju.api monitor.go:59 health ping timed out after 30s
./5/var/log/juju/unit-nrpe-57.log-584-2018-04-19 20:14:04 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
./5/var/log/juju/unit-nrpe-57.log:585:2018-04-19 20:14:04 ERROR juju.worker.uniter agent.go:28 resolver loop error: could not acquire lock: cancelled acquiring mutex
./5/var/log/juju/unit-nrpe-57.log-586-2018-04-19 20:14:04 ERROR juju.worker.uniter agent.go:31 updating ag...

Read more...

Alex Kavanagh (ajkavanagh) wrote :

Unfortunately, the crashdump file doesn't show when the leader changed; there's no leader-elected hook for it in the logs. mysql, on the other hand, is trying to do something weird with leadership every 5 minutes. However, there are no mysql charm unit logs in the crashdump:

tar xJvf juju-crashdump-c16a6243-ad52-4eb6-838c-5fbf0fd71c6b.tar.xz | grep mysql
0/lxd/6/var/log/mysql.log
0/lxd/6/var/log/juju/unit-hacluster-mysql-1.log
0/lxd/6/var/log/mysql.err
0/lxd/6/var/log/mysql/
0/lxd/6/var/log/mysql/mysql-bin.index
0/lxd/6/var/log/mysql/error.log
1/lxd/7/var/log/mysql.log
1/lxd/7/var/log/juju/unit-hacluster-mysql-0.log
1/lxd/7/var/log/mysql.err
1/lxd/7/var/log/mysql/
1/lxd/7/var/log/mysql/mysql-bin.000001
1/lxd/7/var/log/mysql/mysql-bin.index
1/lxd/7/var/log/mysql/error.log
12/lxd/7/var/log/mysql.log
12/lxd/7/var/log/juju/unit-hacluster-mysql-2.log
12/lxd/7/var/log/mysql.err
12/lxd/7/var/log/mysql/
12/lxd/7/var/log/mysql/mysql-bin.000001
12/lxd/7/var/log/mysql/mysql-bin.index
12/lxd/7/var/log/mysql/error.log
hacluster-mysql_0
hacluster-mysql_1
hacluster-mysql_2
mysql_0
mysql_1
mysql_2

And ceph-mon/0 is missing:
tar xJvf juju-crashdump-c16a6243-ad52-4eb6-838c-5fbf0fd71c6b.tar.xz | grep ceph-mon
0/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-0-lxd-0.log.3.gz
0/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-0-lxd-0.log.6.gz
0/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-0-lxd-0.log.4.gz
0/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-0-lxd-0.log.7.gz
0/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-0-lxd-0.log.5.gz
0/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-0-lxd-0.log.2.gz
0/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-0-lxd-0.log.1.gz
0/lxd/0/var/lib/charm/ceph-mon/
0/lxd/0/var/lib/charm/ceph-mon/ceph.conf
1/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-1-lxd-0.log.5.gz
1/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-1-lxd-0.log.3.gz
1/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-1-lxd-0.log.2.gz
1/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-1-lxd-0.log.7.gz
1/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-1-lxd-0.log.1.gz
1/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-1-lxd-0.log.6.gz
1/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-1-lxd-0.log
1/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-1-lxd-0.log.4.gz
1/lxd/0/var/log/juju/unit-ceph-mon-1.log
1/lxd/0/var/lib/charm/ceph-mon/
1/lxd/0/var/lib/charm/ceph-mon/ceph.conf
12/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-12-lxd-0.log.6.gz
12/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-12-lxd-0.log
12/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-12-lxd-0.log.3.gz
12/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-12-lxd-0.log.1.gz
12/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-12-lxd-0.log.7.gz
12/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-12-lxd-0.log.2.gz
12/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-12-lxd-0.log.5.gz
12/lxd/0/var/log/ceph/ceph-mon.juju-33f5dc-12-lxd-0.log.4.gz
12/lxd/0/var/log/juju/unit-ceph-mon-2.log
12/lxd/0/var/lib/charm/ceph-mon/
12/lxd/0/var/lib/charm/ceph-mon/ceph.conf
ceph-mon_0
ceph-mon_1
ceph-mon_2

I think, sadly, the juju crashdump utility didn't pick them up because the log file is huge.

(The crashdump was shared via Google Drive as it was too big for Launchpad.)

One thing that is dying is mysql (from debug_log.txt):

unit-mysql-2: 04:51:49 DEBUG unit.mysql/2.leader-elected ERROR cannot write leadership ...

Read more...

Corey Bryant (corey.bryant) wrote :

Hi Michael,

Can you add any controller logs to this bug, per John's comments regarding the unresponsive controller, to help correlate the controller status with his findings?

It might also be interesting to see what the actual failure is on mysql/2, mentioned in Alex's traceback above, for the failing 'leader-set leader-ip=10.14.244.133' call.

Thanks,
Corey

Michael Skalka (mskalka) wrote :

Missing ceph-mon-0.log

Michael Skalka (mskalka) wrote :

Controller model juju-crashdump

Alex Kavanagh (ajkavanagh) wrote :

This is my comment from: https://bugs.launchpad.net/charm-ceph-mon/+bug/1760138/comments/5

The relevance is:

"So the only is_leader() ... leader_set() code path is the one that was quoted, and the only call out, that isn't Juju related, is "ceph.generate_monitor_secret()" which is very quick.

So, I'm wondering if the charm does manage to call leader_set() in time, but the controller doesn't 'see' the leader_set until several minutes later?"

I'll also take a look at the log files/crash-dump.
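
For reference, here is a minimal sketch of the is_leader() -> leader_set() path being described. is_leader(), leader_get() and leader_set() are the charmhelpers wrappers around the Juju hook tools; the 'monitor-secret' key and the generate_monitor_secret() stub are placeholders for illustration, not the charm's exact code:

# Hedged sketch of the code path discussed above, not the charm's exact code.
import base64
import os

from charmhelpers.core.hookenv import is_leader, leader_get, leader_set

def generate_monitor_secret():
    # Stand-in for ceph.generate_monitor_secret(); the real helper is quick,
    # which is the point of the quoted comment.
    return base64.b64encode(os.urandom(16)).decode()

def maybe_publish_monitor_secret():
    # Only the current leader may publish shared leader settings.
    if not is_leader():
        return
    if leader_get('monitor-secret'):
        return
    # leader_set() itself returns promptly; the open question above is how
    # long the controller takes to propagate the new value, since that is
    # what the other units wait on in leader-settings-changed.
    leader_set({'monitor-secret': generate_monitor_secret()})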

Sandor Zeestraten (szeestraten) wrote :

Hitting the same issue here upgrading from rev. 9 to 24. Just keeps looping.

$ juju status ceph-mon
Model Controller Cloud/Region Version
openstack prodcont1 prodmaas 2.3.4

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 10.2.6 maintenance 3 ceph-mon jujucharms 24 ubuntu
ceph-mon-nrpe active 3 nrpe jujucharms 41 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 maintenance executing 3 10.aa.b.213 (leader-settings-changed) Bootstrapping MON cluster
  ceph-mon-nrpe/0 active idle 10.aa.b.213 ready
ceph-mon/1* active idle 4 10.aa.b.109 Unit is ready and clustered
  ceph-mon-nrpe/1* active idle 10.aa.b.109 ready
ceph-mon/2 maintenance executing 5 10.aa.b.89 (leader-settings-changed) Bootstrapping MON cluster
  ceph-mon-nrpe/2 active idle 10.aa.b.89 ready

Correction for #9: the issue cleared after about an hour of retrying.
