leadership worker wakes up too often

Bug #1728902 reported by John A Meinel
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: 2.3-beta2

Bug Description

On a local deployment that has only the controller and default models, with 1 application (3 units) deployed in the default model, we seem to be waking up *way* too often to deal with lease issues. If you "juju bootstrap --debug lxd" and then run
"juju debug-log -m controller --include-module juju.worker.lease --replay"

I see:
15:47:22.674 DEBUG juju.worker.lease waking to check leases at 2017-10-31 11:47:52.680141764 +0000 UTC
15:47:22.764 DEBUG juju.worker.lease waking to check leases at 2017-10-31 11:48:22.725722344 +0000 UTC m=+96834.321437694
15:47:22.805 DEBUG juju.worker.lease waking to check leases at 2017-10-31 11:48:22.767452117 +0000 UTC m=+96834.363167338
15:47:25.140 DEBUG ... at 2017-10-31 11:48:25.092790999 m=...
15:47:45.744 DEBUG ... at 2017-10-31 11:48:45.744120828 m=...
15:47:52.716 DEBUG ... at 2017-10-31 11:48:22.725722344
15:47:52.815 DEBUG ... at 2017-10-31 11:48:52.767118576 m=...
15:47:52.850 DEBUG ... at 2017-10-31 11:48:52.807188433 m=...
15:47:55.139 DEBUG ... at 2017-10-31 11:48:55.092921836 m=...
15:48:00.183 DEBUG ... at 2017-10-31 11:49:00.183642493 m=...
15:48:22.761 DEBUG ... at 2017-10-31 11:48:52.767118576
15:48:22.869 DEBUG ... at 2017-10-31 11:49:22.81749278 m=+...
15:48:22.895 DEBUG ... at 2017-10-31 11:49:22.852232168 m=...
15:48:25.138 DEBUG ... at 2017-10-31 11:49:25.093142689 m=...
15:48:45.746 DEBUG ... at 2017-10-31 11:49:45.746100232 m=...
15:48:52.803 DEBUG ... at 2017-10-31 11:49:22.81749278
15:48:52.921 DEBUG ... at 2017-10-31 11:49:52.871913825 m=...
15:48:52.950 DEBUG ... at 2017-10-31 11:49:52.896919271 m=...
15:48:55.138 DEBUG ... at 2017-10-31 11:49:55.093818589 m=...
15:49:00.185 DEBUG ... at 2017-10-31 11:50:00.185937368 m=...
15:49:22.854 DEBUG ... at 2017-10-31 11:49:52.871913825
15:49:22.967 DEBUG ... at 2017-10-31 11:50:22.923342048 m=...
15:49:22.999 DEBUG ... at 2017-10-31 11:50:22.952223102 m=...
15:49:25.135 DEBUG ... at 2017-10-31 11:50:25.093480281 m=...
15:49:45.748 DEBUG ... at 2017-10-31 11:50:45.748549245 m=...
15:49:52.909 DEBUG ... at 2017-10-31 11:50:22.923342048
15:49:53.022 DEBUG ... at 2017-10-31 11:50:52.969434594 m=...
15:49:53.045 DEBUG ... at 2017-10-31 11:50:53.00087437 m=+...
15:49:55.137 DEBUG ... at 2017-10-31 11:50:55.094103739 m=...
15:50:00.187 DEBUG ... at 2017-10-31 11:51:00.187840411 m=...
15:50:22.959 DEBUG ... at 2017-10-31 11:50:52.969434594
15:50:23.077 DEBUG ... at 2017-10-31 11:51:23.024836933 m=...
15:50:23.093 DEBUG ... at 2017-10-31 11:51:23.046794216 m=...
15:50:25.136 DEBUG ... at 2017-10-31 11:51:25.094046756 m=...
15:50:45.751 DEBUG ... at 2017-10-31 11:51:45.751657032 m=...
15:50:53.004 DEBUG ... at 2017-10-31 11:51:23.024836933
15:50:53.138 DEBUG ... at 2017-10-31 11:51:53.079786742 m=...
15:50:53.144 DEBUG ... at 2017-10-31 11:51:53.094451491 m=...
15:50:55.142 DEBUG ... at 2017-10-31 11:51:55.095020156 m=...
15:51:00.222 DEBUG ... at 2017-10-31 11:52:00.222685919 m=...
15:51:23.057 DEBUG ... at 2017-10-31 11:51:53.079786742
15:51:23.201 DEBUG ... at 2017-10-31 11:52:23.140325923 m=...
15:51:23.203 DEBUG ... at 2017-10-31 11:52:23.14629786 m=+...
15:51:25.138 DEBUG ... at 2017-10-31 11:52:25.096069006 m=...
15:51:45.786 DEBUG ... at 2017-10-31 11:52:45.786542186 m=...
15:51:53.109 DEBUG ... at 2017-10-31 11:52:23.140325923
15:51:53.262 DEBUG ... at 2017-10-31 11:52:53.203701951 m=...
15:51:53.267 DEBUG ... at 2017-10-31 11:52:53.206208631 m=...
15:51:55.141 DEBUG ... at 2017-10-31 11:52:55.095308182 m=...
15:52:00.258 DEBUG ... at 2017-10-31 11:53:00.258229456 m=...

So in the span of 5 minutes, we have approximately 50 log lines issued.

Now it is possible that the fact that some of those lines have an "m=" and some don't means there are multiple code paths we're running into. Some of them are also surprisingly close together in time (11:33:21.27 and 11:33:21.28, but then a gap until 11:33:25.08).

However, 50 in 5 minutes is 10 per minute, or 1 every 6 seconds. That isn't terrible in itself; the real concern is the times when we say "waking" and it has only been 10ms since the last time we "woke".

Tags: leadership
John A Meinel (jameinel)
tags: added: leadership
Revision history for this message
John A Meinel (jameinel) wrote :

There is only one place where we log "waking to check", and it is based on a time.Time value, so it seems the difference in whether "m=+" appears is just down to how the Time object was created. Maybe sometimes we're creating the time object in 2 different ways?
    // Sleep at most MaxSleep, but wake earlier if any lease expires before then.
    now := manager.config.Clock.Now()
    nextTick := now.Add(manager.config.MaxSleep)
    for _, info := range manager.config.Client.Leases() {
        if info.Expiry.After(nextTick) {
            continue
        }
        nextTick = info.Expiry
    }

I wonder if it is a case where
nextTick := now.Add(MaxSleep)
carries the monotonic clock reading, but
nextTick = info.Expiry
does not.
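
For reference: since Go 1.9, time.Now() returns a value carrying a monotonic clock reading, Add() preserves it, and String() prints it as the "m=+..." suffix; a value that has been serialized and parsed back (as a lease expiry read from the database would be) has no monotonic reading. A minimal standalone sketch, not Juju code, showing the difference:

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // time.Now() carries a monotonic clock reading; Add() preserves it,
        // so String() prints an "m=+..." suffix.
        now := time.Now()
        fmt.Println(now.Add(time.Minute)) // prints with an m=+ suffix

        // A time formatted and parsed back (like a lease expiry loaded from
        // the database) has no monotonic reading, so no "m=" suffix.
        parsed, err := time.Parse(time.RFC3339Nano, now.Format(time.RFC3339Nano))
        if err != nil {
            panic(err)
        }
        fmt.Println(parsed.Add(time.Minute)) // prints without an m= suffix
    }

That would explain why the lines computed from now.Add(MaxSleep) show m=+ while the ones taken from info.Expiry do not.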

Looking at the code, it seems:
    manager, err := lease.NewManager(lease.ManagerConfig{
        Secretary: secretary,
        Client:    client,
        Clock:     st.clock(),
        MaxSleep:  time.Minute,
    })

MaxSleep is being set to 1 minute. So it looks like we should be set up to "wake up before the next lease that is about to expire, or after at most 1 minute".
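
As a rough sketch of that behaviour (standalone code with made-up lease data, not the actual Juju implementation): the wake-up time is the earlier of "now + MaxSleep" and the earliest lease expiry.

    package main

    import (
        "fmt"
        "time"
    )

    // nextWake mirrors the nextTick logic quoted above: sleep at most maxSleep,
    // but wake earlier if any lease expires before then.
    func nextWake(now time.Time, maxSleep time.Duration, expiries []time.Time) time.Time {
        next := now.Add(maxSleep)
        for _, expiry := range expiries {
            if expiry.After(next) {
                continue
            }
            next = expiry
        }
        return next
    }

    func main() {
        now := time.Now()
        expiries := []time.Time{
            now.Add(30 * time.Second), // a lease about to expire
            now.Add(5 * time.Minute),  // a lease well in the future
        }
        wake := nextWake(now, time.Minute, expiries)
        fmt.Println("sleeping for", wake.Sub(now)) // ~30s, not the full minute
    }

So with a one-minute MaxSleep and a lease expiring 30s out, the manager should sleep roughly 30s rather than the full minute.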

I'll be updating this bug with debug-log output at ms precision.

description: updated
description: updated
description: updated
Revision history for this message
John A Meinel (jameinel) wrote :

The other particularly interesting bit is that we see a clear pattern of 1 log message that *doesn't* have m=+ followed by 4 that *do*, and that pattern repeats over and over again.

Revision history for this message
John A Meinel (jameinel) wrote :

All of the ones that don't have m=+ seem to be exactly 30s later than the current timestamp.
15:47:52.716 ... 11:48:22.725722344
15:48:22.761 ... 11:48:52.767118576
15:48:52.803 ... 11:49:22.81749278
15:49:22.854 ... 11:49:52.871913825

Compare that with the other ones:
15:47:52.815 ... 11:48:52.767118576 m=...
15:47:52.850 ... 11:48:52.807188433 m=...
15:47:55.139 ... 11:48:55.092921836 m=...
15:48:00.183 ... 11:49:00.183642493 m=...

15:48:22.869 ... 11:49:22.81749278 m=+...
15:48:22.895 ... 11:49:22.852232168 m=...
15:48:25.138 ... 11:49:25.093142689 m=...
15:48:45.746 ... 11:49:45.746100232 m=...

The hour difference is just because I'm in UTC+4 and debug-log defaults to local time, while log entries default to UTC.

But all of those look specifically like 1 min time deltas. That fits the idea that
nextTick := now.Add(manager.config.MaxSleep)
  - generates a time.Time object that carries the monotonic reading, and
nextTick = info.Expiry
  - does not.

If we look at just the ones without m=, it does look like we're only waking up at the expected rate:

15:47:52.716 DEBUG ... at 2017-10-31 11:48:22.725722344
15:48:22.761 DEBUG ... at 2017-10-31 11:48:52.767118576
15:48:52.803 DEBUG ... at 2017-10-31 11:49:22.81749278
15:49:22.854 DEBUG ... at 2017-10-31 11:49:52.871913825
15:49:52.909 DEBUG ... at 2017-10-31 11:50:22.923342048
15:50:22.959 DEBUG ... at 2017-10-31 11:50:52.969434594
15:50:53.004 DEBUG ... at 2017-10-31 11:51:23.024836933
15:51:23.057 DEBUG ... at 2017-10-31 11:51:53.079786742
15:51:53.109 DEBUG ... at 2017-10-31 11:52:23.140325923

Those all match up reasonably well: each one says when it "will wake up", and the next log entry comes just slightly after that time.

So the real question is: while every manager has a list of leases that it could handle, why are we seeing so many managers operating concurrently when there are only 2 models?

I wonder if we're not stopping managers correctly. Given that all of those managers are operating on the MaxSleep polling, it sounds like they're not actually doing anything, so why are we running them? Taking it a step further, if they really are all running at 1 min intervals, we actually have 1 manager that is doing something at 30s poll intervals, plus another 9 managers just sleeping on 1 min rotations.

Revision history for this message
John A Meinel (jameinel) wrote :

I'm trying to run with this patch to figure out why we're creating these managers:

diff --git a/worker/lease/manager.go b/worker/lease/manager.go
index b5de658..9e18ba0 100644
--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -4,6 +4,7 @@
 package lease

 import (
+       "runtime/debug"
        "sort"
        "time"

@@ -63,6 +64,7 @@ func NewManager(config ManagerConfig) (*Manager, error) {
        if err != nil {
                return nil, errors.Trace(err)
        }
+       logger.Debugf("creating manager %p for:\n%s", manager, string(debug.Stack()))
        return manager, nil
 }

@@ -232,12 +234,13 @@ func (manager *Manager) nextTick() <-chan time.Time {
        now := manager.config.Clock.Now()
        nextTick := now.Add(manager.config.MaxSleep)
-       for _, info := range manager.config.Client.Leases() {
+       leases := manager.config.Client.Leases()
+       for _, info := range leases {
                if info.Expiry.After(nextTick) {
                        continue
                }
                nextTick = info.Expiry
        }
-       logger.Debugf("waking to check leases at %s", nextTick)
+       logger.Debugf("[%p] waking to check (%d) leases at %s", manager, len(leases), nextTick)
        return clock.Alarm(manager.config.Clock, nextTick)
 }

I ended up needing to rebootstrap because I upgraded the source tree in the meantime. Logging again and trimming it up a bit, I see:
19:30:15.372 ... [128460] wak... (1) 15:31:15.3720
19:30:19.933 ... [2a0960] wak... (0) 15:31:19.9339
19:30:20.652 ... [3de3c0] wak... (0) 15:31:20.6528
19:30:36.085 ... [1881e0] wak... (1) 15:31:36.0857
19:30:36.094 ... [1281e0] wak... (2) 15:31:06.0859
19:30:36.578 ... [1268c0] wak... (2) 15:31:06.5696
19:30:36.670 ... [1281e0] wak... (2) 15:31:36.6708
19:30:45.377 ... [128460] wak... (1) 15:31:45.3776
19:31:06.139 ... [1281e0] wak... (2) 15:31:37.1316
19:31:06.145 ... [1881e0] wak... (1) 15:32:06.1456
19:31:06.573 ... [1268c0] wak... (2) 15:31:36.5645
19:31:06.712 ... [1281e0] wak... (2) 15:32:05.7122
19:31:15.380 ... [128460] wak... (1) 15:32:15.3808
19:31:19.936 ... [2a0960] wak... (0) 15:32:19.9369
19:31:20.656 ... [3de3c0] wak... (0) 15:32:20.6560
19:31:36.194 ... [1281e0] wak... (2) 15:32:07.1841
19:31:36.198 ... [1881e0] wak... (1) 15:32:36.1980
19:31:36.567 ... [1268c0] wak... (2) 15:32:06.5577
19:31:36.754 ... [1281e0] wak... (2) 15:32:35.7545
19:31:45.381 ... [128460] wak... (1) 15:32:45.3811
19:32:06.244 ... [1281e0] wak... (2) 15:32:37.2340
19:32:06.250 ... [1881e0] wak... (1) 15:33:06.2503
19:32:06.561 ... [1268c0] wak... (2) 15:32:37.5508
19:32:06.792 ... [1281e0] wak... (2) 15:33:06.7926
19:32:15.381 ... [128460] wak... (1) 15:33:15.3812
19:32:19.940 ... [2a0960] wak... (0) 15:33:19.9400
19:32:20.693 ... [3de3c0] wak... (0) 15:33:20.6931
19:32:36.298 ... [1281e0] wak... (2) 15:33:06.2892
19:32:36.303 ... [1881e0] wak... (1) 15:33:36.3031
19:32:36.835 ... [1281e0] wak... (2) 15:33:36.8352
19:32:37.554 ... [1268c0] wak... (2) 15:33:36.5540
19:32:45.375 ... [128460] wak... (1) 15:33:45.3756
19:33:06.347 ... [1281e0] wak... (2) 15:33:37.3377
19:33:06.350 ... [1881e0] wak... (1) 15:34:06.3503
19:33:06.876 ... [1281e0]...


Revision history for this message
John A Meinel (jameinel) wrote :

I cleaned up the logging a bit more (the 2.3-leadership-logging branch at github.com/jameinel/juju is a work in progress).

With a slightly munged log (trimmed to fit Launchpad's limits) it looks like this. The first bit in [] is the first 6 digits of the model UUID or controller UUID, and the other is the %p of the manager instance.
I also updated my deployment with 6 applications of ubuntu-lite (so a total of 7 applications in the model).

One thing that is blatantly obvious is that we are waking up the same manager many times: exactly the same model UUID and manager address. (A quick way to count these wake-ups is sketched after the log excerpt.)
21:14:15.442 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.991s
21:14:15.656 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.991s
21:14:15.864 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.991s
21:14:15.896 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 30.991s
21:14:16.450 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 57.998s
21:14:25.973 ... [b6b503 0xc4202a83c0] waking to check (2) leases in 35.991s
21:14:31.471 ... [b6b503 0xc4202366c0] waking to check (1) leases in 60.000s
21:14:31.785 ... [b6b503 0xc4202a83c0] waking to check (2) leases in 53.998s
21:14:45.119 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 30.992s
21:14:45.214 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 30.992s
21:14:45.446 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 30.992s
21:14:45.647 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.992s
21:14:45.868 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.992s
21:14:45.908 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 30.992s
21:14:46.451 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 59.000s
21:14:53.322 ... [6200f8 0xc420236180] waking to check (0) leases in 60.000s
21:14:53.442 ... [6200f8 0xc420236540] waking to check (0) leases in 60.000s
21:14:55.222 ... [b6b503 0xc4202360c0] waking to check (2) leases in 30.992s
21:14:56.042 ... [b6b503 0xc4202a83c0] waking to check (2) leases in 35.993s
21:15:01.509 ... [b6b503 0xc4202366c0] waking to check (1) leases in 60.000s
21:15:01.820 ... [b6b503 0xc4202a83c0] waking to check (2) leases in 53.998s
21:15:15.114 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.991s
21:15:15.212 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.991s
21:15:15.452 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 30.992s
21:15:15.657 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 30.992s
21:15:15.867 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.991s
21:15:15.899 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.991s
21:15:16.448 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 59.000s
21:15:26.072 ... [b6b503 0xc4202a83c0] waking to check (2) leases in 35.993s
21:15:26.218 ... [b6b503 0xc4202360c0] waking to check (2) leases in 35.993s
21:15:31.545 ... [b6b503 0xc4202366c0] waking to check (1) leases in 60.000s
21:15:31.861 ... [b6b503 0xc4202a83c0] waking to check (2) leases in 54.998s
21:15:45.119 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.992s
21:15:45.214 ... [3647c1 0xc4205d03c0] waking to check (7) leases in 29.992s
21:15:45.441 ... [3647c1 ...

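To quantify how often each manager wakes, a throwaway counter over the debug-log output might look like the sketch below. This is a hypothetical helper (the filename countwakes.go and the log format are assumptions based on the "[uuid 0xptr] waking to check" lines from the logging patch above); adjust the regexp if the format differs.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "regexp"
    )

    func main() {
        // Match lines like "[3647c1 0xc4205d03c0] waking to check (7) leases in 29.991s".
        re := regexp.MustCompile(`\[([0-9a-f]+) (0x[0-9a-f]+)\] waking to check`)
        counts := map[string]int{}
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            if m := re.FindStringSubmatch(scanner.Text()); m != nil {
                counts[m[1]+" "+m[2]]++ // key: model UUID prefix + manager pointer
            }
        }
        for key, n := range counts {
            fmt.Printf("%s woke %d times\n", key, n)
        }
    }

For example: juju debug-log -m controller --replay --include-module juju.worker.lease | go run countwakes.go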

Revision history for this message
John A Meinel (jameinel) wrote :
Revision history for this message
Anastasia (anastasia-macmood) wrote :

It looks to me that this bug was not discovered/fixed on Juju 1.25.

I am re-targeting this to the correct 'juju' project on Launchpad and adding it to the series where it was fixed, Juju 2.3+.

no longer affects: juju-core
Changed in juju:
status: New → Fix Released
Revision history for this message
Anastasia (anastasia-macmood) wrote :

The above commit went into 2.3-beta2, so I am marking this bug as Fix Released, since it would have been included in the already-released 2.3.0, 2.3.1, 2.3.2, etc.

Changed in juju:
milestone: none → 2.3-beta2