repeated ErrInvalid can cause lease.Manager to be stopped

Bug #1815719 reported by John A Meinel on 2019-02-13
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju
Critical
John A Meinel
2.5
Critical
John A Meinel

Bug Description

We saw this in the wild. It used to be that ClaimLease spun in a tight loop whenever it got ErrInvalid and never stopped retrying.
We changed it so that it would keep trying but using the same exponential backoff that we use for ErrTimeout.
However, ErrTimeout only retried 5 times, backing off 2x each step starting at 50ms.
So
 50ms
 100ms
 200ms
 400ms
 800ms
# final failure.

It seems that we aren't able to get the Leases tracked on *this* node to be in-sync within that time, such that we would generate a ErrClaimDenied (we see that someone else does own this lease).

Probably if we get N attempts to claim a lease, and always get ErrInvalid, we should finish with ErrClaimDenied, even if we can't tell who does hold the lease.

It shouldn't be too risky, as the next thing the client should do is call BlockUntilLeadershipReleased, and if we still haven't gotten in sync by that point, then our Leases collection won't have the lease record, and will immediately unblock them, and they will try to Claim again.

ErrInvalid should *definitely* not kill the Manager loop.

I think we also have a different fundamental issue if we are seeing failure to get convergence in raft letting us know who the *master* thinks has the Lease after ~1.5s.

We can easily bump the number of retries, and/or change it from 2x backoff (retry.Exponential has a Factor we can set.))

Note that for ErrTimeout, its already a bit strange. Specifically, Timeout means it took 5s to attempt a request. But then if we fail after 5s, we then only sleep for 50ms before trying again. The scale of the times don't seem quite correct.

John A Meinel (jameinel) on 2019-02-13
Changed in juju:
milestone: none → 2.6-beta1
John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/9745 is a PR against 2.5 to fix this behavior.

John A Meinel (jameinel) wrote :

Note that this seems to raise the priority of bug #1814735

John A Meinel (jameinel) wrote :
Changed in juju:
status: In Progress → Fix Committed
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers