juju.state.leadership manager.go:72 stopping leadership manager with error: state changing too quickly; try again soon

Bug #1729930 reported by Jorge Niedbalski
This bug affects 2 people
Affects: juju-core
Status: Fix Released
Importance: High
Assigned to: Andrew Wilkins

Bug Description

[Environment]

1.25.13
Trusty

[Description]

Juju operates normally until the following entry appears in the logs. From that
point on it is no longer possible to operate Juju (deploy/upgrade-charm, etc).

machine-0: 2017-11-03 19:02:59 ERROR juju.state.leadership manager.go:72 stopping leadership manager with error: state changing too quickly; try again soon

Preceded by the following sequence:

machine-0: 2017-11-03 19:02:59 TRACE state.lease.service-leadership.machine-0 client.go:147 expiring lease "xxx" (attempt 1)
machine-0: 2017-11-03 19:02:59 TRACE state.lease.service-leadership.machine-0 client.go:179 refreshing
machine-0: 2017-11-03 19:02:59 DEBUG juju.state.leadership manager.go:214 waking to check leases at 2017-11-03 19:02:58.784391924 +0000 UTC
machine-0: 2017-11-03 19:02:59 TRACE juju.state.leadership manager.go:227 refreshing leases...
machine-0: 2017-11-03 19:02:59 TRACE state.lease.service-leadership.machine-0 client.go:179 refreshing
machine-0: 2017-11-03 19:02:59 TRACE juju.state.leadership manager.go:241 expiring leases...
machine-0: 2017-11-03 19:02:59 TRACE state.lease.service-leadership.machine-0 client.go:147 expiring lease "xxx" (attempt 0)
machine-0: 2017-11-03 19:02:59 TRACE juju.state txns.go:164 rewrote transaction: []txn.Op{txn.Op{C:"leases", Id:"xxx-fee1-44d3-83a3-bd9c78913db8:clock#service-leadership#", Assert:bson.M{"$or":[]bson.M{bson.M{"writers.machine-0":bson.M{"$lte":1509735779211032608}}, bson.M{"writers.machine-0":bson.M{"$exists":false}}}}, Insert:interface {}(nil), Update:bson.M{"$set":bson.M{"writers.machine-0":1509735779211032608}}, Remove:false}, txn.Op{C:"leases", Id:"xxx-fee1-44d3-83a3-bd9c78913db8:lease#service-leadership#xxx#", Assert:bson.M{"writer":"machine-0", "holder":"xxx/1", "expiry":1509735778784391924}, Insert:interface {}(nil), Update:interface {}(nil), Remove:true}}
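
For readers decoding the transaction above: the large integers appear to be Unix timestamps in nanoseconds. That is an assumption, but the lease expiry value decodes to the same instant as the "waking to check leases at" log entry. The sketch below converts the two values with the standard Go time package; the helper names nsToUTC and expiredBy are hypothetical and not part of Juju.

```go
package main

import "time"

// nsToUTC interprets ns as nanoseconds since the Unix epoch.
// Assumption: the integers in the logged transaction use this encoding.
func nsToUTC(ns int64) time.Time {
	return time.Unix(0, ns).UTC()
}

// expiredBy reports how far the writer's clock value is past the
// lease expiry value; a positive result means the lease has expired.
func expiredBy(clockNS, expiryNS int64) time.Duration {
	return nsToUTC(clockNS).Sub(nsToUTC(expiryNS))
}
```

With the values from the log, nsToUTC(1509735778784391924) gives 2017-11-03 19:02:58.784391924 +0000 UTC, and expiredBy(1509735779211032608, 1509735778784391924) is roughly 427ms, which is consistent with the manager trying to remove an already expired lease.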

The only possible workaround is to restart jujud on machine-0.

Tags: sts
Andrew Wilkins (axwalk) wrote :

The lease docs and transactions look OK, so I suspect there really is a lot of contention on the lease docs.

The Juju 1.25 branch has a few issues that would cause this:
 - when the leadership manager dies, it remains dead and is not restarted automatically
 - there are potentially many leadership managers running concurrently, within the same process, trying to maintain leases

It's possible that those many leadership managers are each making changes to the same clock document; there's a single clock document for all applications/services.
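
The contention described here can be illustrated with a small, deterministic sketch. It is not Juju's code: clockDoc, writeClock, and the beforeCommit hook are hypothetical names, and the attempt limit is an assumed parameter. The only things taken from the report are the error message and the shape of the assertion in the logged transaction (the writer's entry is either absent or not later than the new timestamp). A second manager using the same writer name keeps moving the entry forward between the first manager's read and commit, so every attempt fails its assertion.

```go
package main

import "errors"

// errExcessiveContention carries the message seen in the log.
var errExcessiveContention = errors.New("state changing too quickly; try again soon")

// clockDoc stands in for the single shared clock document:
// writer name -> last timestamp written by that writer.
type clockDoc map[string]int64

// writeClock tries to record the current time for writer, giving up
// after maxAttempts failed assertions. beforeCommit simulates whatever
// other managers do between building and applying the transaction.
func writeClock(doc clockDoc, writer string, clock func() int64,
	maxAttempts int, beforeCommit func()) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		now := clock()
		if beforeCommit != nil {
			beforeCommit()
		}
		// Assertion: the entry is missing, or is <= the new value.
		if last, ok := doc[writer]; ok && last > now {
			continue // someone else got in first; rebuild and retry
		}
		doc[writer] = now
		return nil
	}
	return errExcessiveContention
}
```

Calling writeClock with a nil beforeCommit succeeds on the first attempt. Passing a hook that writes a later timestamp under the same writer name makes every attempt fail, and the function returns the "state changing too quickly" error once the attempts run out.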

Tim Penhey (thumper)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Andrew Wilkins (axwalk)
Andrew Wilkins (axwalk)
Changed in juju-core:
status: Triaged → In Progress
milestone: none → 1.25.14
Andrew Wilkins (axwalk) wrote :

The underlying errors will not have been fixed, as that involves much more invasive changes than we can hope to backport. However, 1.25.14 will restart the leadership manager automatically now, so hopefully the need to restart the agent will be obviated.
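
As a rough illustration of the "restart automatically" approach (not the actual 1.25.14 patch), a supervisor loop can rerun the manager whenever it stops with an error. Everything in the sketch below is hypothetical: runWithRestart, flakyManager, and the fixed delay between restarts are assumptions made for the example.

```go
package main

import (
	"errors"
	"time"
)

// errExcessiveContention carries the message seen in the log.
var errExcessiveContention = errors.New("state changing too quickly; try again soon")

// runWithRestart runs work until it returns nil (clean shutdown) or
// stop is closed. After each failure it waits for delay and runs work
// again. It returns the number of restarts performed.
func runWithRestart(work func() error, delay time.Duration, stop <-chan struct{}) int {
	restarts := 0
	for {
		if err := work(); err == nil {
			return restarts
		}
		select {
		case <-stop:
			return restarts
		case <-time.After(delay):
			restarts++
		}
	}
}

// flakyManager returns a stand-in for a leadership manager that dies
// with a contention error the first `failures` times it is run.
func flakyManager(failures int) func() error {
	return func() error {
		if failures > 0 {
			failures--
			return errExcessiveContention
		}
		return nil
	}
}
```

For example, runWithRestart(flakyManager(2), time.Millisecond, nil) reruns the stand-in manager twice and then returns 2. This only masks the contention; as the comment above says, the underlying errors remain.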

Changed in juju-core:
status: In Progress → Fix Committed
Anastasia (anastasia-macmood) wrote :

@Jorge Niedbalski,

I have tried to verify that with Andrew's patch we can recover from the situation you've described above. The patch is on the 1.25 tip at the moment.

I could not reproduce the condition where you'd get a "state changing too quickly" error. Do you have an environment where you can easily verify that the problem can be recovered from? If you can verify it, we can push for an official 1.25.14 release.

(I have bootstrapped, deployed 3 units of ubuntu initially, then added 2 more and let this combination run for a week).

Peter Sabaini (peter-sabaini) wrote :

Fwiw, we're seeing this in production too - however cannot repro at will. Hoping for a quick fix release :-)

Changed in juju-core:
status: Fix Committed → Fix Released