I don't see the unit-ceph-mon-0 log file, which is a bit surprising.
However, grepping around for "resolver loop error" turns up two possible causes:
./5/var/log/juju/unit-filebeat-53.log:15079:2018-04-19 20:14:05 ERROR juju.worker.uniter agent.go:28 resolver loop error: could not acquire lock: cancelled acquiring mutex
./5/var/log/juju/unit-keystone-ldap-citrix-1.log:4180:2018-04-19 20:21:42 ERROR juju.worker.uniter agent.go:28 resolver loop error: preparing operation "run update-status hook": connection is shut down
The former message appears in the logs about 10x more often.
"cancelled acquiring mutex" is triggered in github.com/juju/mutex, most likely in mutex_flock.go:
func acquire(spec Spec, timeout <-chan time.Time) (Releaser, error) {
    done := make(chan struct{})
    defer close(done)
    select {
    case result := <-acquireFlock(spec.Name, done):
        if result.err != nil {
            return nil, errors.Trace(result.err)
        }
        return result.m, nil
    case <-timeout:
        return nil, ErrTimeout
    case <-spec.Cancel:
        return nil, ErrCancelled
    }
}
Looking at how we use the lock in github.com/juju/juju (worker/uniter/uniter.go), we see:
func (u *Uniter) acquireExecutionLock() (mutex.Releaser, error) {
    // We want to make sure we don't block forever when locking, but take the
    // Uniter's catacomb into account.
    spec := mutex.Spec{
        Name:   u.hookLockName,
        Clock:  u.clock,
        Delay:  250 * time.Millisecond,
        Cancel: u.catacomb.Dying(),
    }
    logger.Debugf("acquire lock %q for uniter hook execution", u.hookLockName)
    releaser, err := mutex.Acquire(spec)
    if err != nil {
        return nil, errors.Trace(err)
    }
    logger.Debugf("lock %q acquired", u.hookLockName)
    return releaser, nil
}
u.catacomb.Dying() should only fire if we are shutting down this worker for some reason.
Again digging into the log files for "resolver loop error" reveals:
grep -rnI -C 2 "resolver loop error" .
./5/var/log/juju/unit-filebeat-53.log-15077-2018-04-19 20:14:05 ERROR juju.api monitor.go:59 health ping timed out after 30s
./5/var/log/juju/unit-filebeat-53.log-15078-2018-04-19 20:14:05 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
./5/var/log/juju/unit-filebeat-53.log:15079:2018-04-19 20:14:05 ERROR juju.worker.uniter agent.go:28 resolver loop error: could not acquire lock: cancelled acquiring mutex
./5/var/log/juju/unit-filebeat-53.log-15080-2018-04-19 20:14:05 ERROR juju.worker.uniter agent.go:31 updating agent status: connection is shut down
./5/var/log/juju/unit-nrpe-57.log-583-2018-04-19 20:14:04 ERROR juju.api monitor.go:59 health ping timed out after 30s
./5/var/log/juju/unit-nrpe-57.log-584-2018-04-19 20:14:04 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
./5/var/log/juju/unit-nrpe-57.log:585:2018-04-19 20:14:04 ERROR juju.worker.uniter agent.go:28 resolver loop error: could not acquire lock: cancelled acquiring mutex
./5/var/log/juju/unit-nrpe-57.log-586-2018-04-19 20:14:04 ERROR juju.worker.uniter agent.go:31 updating agent status: connection is shut down
./5/var/log/juju/unit-nrpe-57.log-587-2018-04-19 20:21:38 ERROR juju.api monitor.go:59 health ping timed out after 30s
./5/var/log/juju/unit-nrpe-57.log-588-2018-04-19 20:21:38 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
./5/var/log/juju/unit-nrpe-57.log:589:2018-04-19 20:21:38 ERROR juju.worker.uniter agent.go:28 resolver loop error: could not acquire lock: cancelled acquiring mutex
./5/var/log/juju/unit-nrpe-57.log-590-2018-04-19 20:21:38 ERROR juju.worker.uniter agent.go:31 updating agent status: connection is shut down
./5/var/log/juju/unit-nrpe-57.log-591-2018-04-19 20:21:45 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
./5/var/log/juju/unit-hacluster-keystone-0.log-685-2018-04-29 04:50:53 ERROR juju.worker.dependency engine.go:551 "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
./5/var/log/juju/unit-hacluster-keystone-0.log-686-2018-04-29 04:50:53 WARNING juju.worker.uniter.operation leader.go:115 we should run a leader-deposed hook here, but we can't yet
./5/var/log/juju/unit-hacluster-keystone-0.log:687:2018-04-29 04:50:53 ERROR juju.worker.uniter agent.go:28 resolver loop error: could not acquire lock: cancelled acquiring mutex
./5/var/log/juju/unit-hacluster-keystone-0.log-688-2018-04-29 05:11:39 INFO juju-log Making dir /usr/lib/ocf/resource.d/ceph root:root 555
./5/var/log/juju/unit-hacluster-keystone-0.log-689-2018-04-29 05:11:39 INFO juju-log Making dir /usr/lib/ocf/resource.d/maas root:root 555
That seems to say that the lease manager on the controller is dying and getting restarted for some reason (which is unclear, because we don't have the logs for the Controller model).
The other failures all seem to be related to the controller being unresponsive (the agents were unable to let the controller know that they were still running).
It would seem that a lot of this comes back to an unresponsive controller in some fashion, so it would be good to correlate what the Controller thinks is going on at the various times that these failures are seen.