cannot create leadership lease client: end of read window preceded beginning

Bug #1662202 reported by Aaron Bentley
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
John A Meinel
2.1
Fix Released
Critical
Ian Booth

Bug Description

As seen here:
http://reports.vapour.ws/releases/issue/58988e19749a567526e815d3

Juju dies with "cannot create leadership lease client: end of read window preceded beginning".

This is a new intermittent failure, first seen here.
http://reports.vapour.ws/releases/4805/job/azure-arm-deploy-xenial-amd64/attempt/1037

We saw it again today, three days later. And that was over a weekend, when there is less testing. As such, I'm classifying it as a regression even though it is intermittent.

Full example:
2017-02-03 10:59:35 INFO juju --show-log list-models -c azure-arm-deploy-xenial-amd64
10:59:35 INFO juju.cmd supercommand.go:63 running juju [2.1-beta5 gc go1.6.2]
10:59:35 INFO juju.juju api.go:72 connecting to API addresses: [13.64.249.110:17070 192.168.16.4:17070]
10:59:35 INFO juju.api apiclient.go:590 dialing "wss://13.64.249.110:17070/api"
10:59:35 INFO juju.api apiclient.go:590 dialing "wss://192.168.16.4:17070/api"
10:59:36 INFO juju.api apiclient.go:521 connection established to "wss://13.64.249.110:17070/api"
10:59:36 INFO juju.juju api.go:72 connecting to API addresses: [13.64.249.110:17070 192.168.16.4:17070]
10:59:36 INFO juju.api apiclient.go:590 dialing "wss://13.64.249.110:17070/api"
10:59:36 INFO juju.api apiclient.go:590 dialing "wss://192.168.16.4:17070/api"
10:59:36 INFO juju.api apiclient.go:521 connection established to "wss://13.64.249.110:17070/api"
10:59:38 ERROR cmd supercommand.go:458 cannot get model details: getting model a9d5bd58-1337-4c45-8d47-8f607a2d6595 ("controller") info: cannot create standard state workers: cannot create leadership lease manager: cannot create leadership lease client: end of read window preceded beginning (88.378216ms)

Aaron Bentley (abentley)
summary: - annot create leadership lease client: end of read window preceded
+ cannot create leadership lease client: end of read window preceded
beginning
Changed in juju:
importance: Critical → High
Revision history for this message
Ian Booth (wallyworld) wrote :

This looks like a case of clock skew between the client used to bootstrap and the machines started in the cloud. ie the client machine's time is wrong.

The client says that bootstrap finished at 10:59:35.
it then goes to run list-models

The machine 0 logs (unfortunately only at INFO) show that bootstrap appears to complete at 10:59:42

So the bootstrap client machine is 7 seconds behind what it should be. Juju requires all machines to have accurate clocks. This is not a regression since that invariant has always been the case.

Revision history for this message
Ian Booth (wallyworld) wrote :

More investigation needed to see where the root cause it though. ie confirm that client clock skew causes the issue, and if so how.

Changed in juju:
milestone: none → 2.2.0-alpha1
Revision history for this message
Ian Booth (wallyworld) wrote :

Another possibility is NTP kicking in and rolling back the clock, which would also have the same effect. A check of the code with this theory shows that the lease manager does allow for this possibility, but only allows 10ms wiggle room. That seems a little too small to me.

I propose that the allowed skew be increased a bit. I'll unmark this as a regression because these's no change in juju behaviour here - its just that when there's large NTP corrections, things will get tripped up because only 10ms is allowed for.

tags: removed: regression
Revision history for this message
Anastasia (anastasia-macmood) wrote :
Revision history for this message
Ian Booth (wallyworld) wrote :

This PR introduces a monotonic clock implementation which the lease manager uses. This will remove the possibility of time going backwards just at the wrong time due to NTP adjustments. Juju was accounting for NTP adjustments, but its allowed fudge factor was too low. Best just to eliminate the possibility.

https://github.com/juju/juju/pull/6940

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Merged in develop (2.2) as part of the bigger merge: https://github.com/juju/juju/pull/6953

Changed in juju:
status: In Progress → Fix Committed
assignee: Ian Booth (wallyworld) → John A Meinel (jameinel)
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.