juju-core

juju.state.leadership manager.go:72 stopping leadership manager with error: state changing too quickly; try again soon

Bug #1729930 reported by Jorge Niedbalski on 2017-11-03

This bug affects 2 people

Affects    Status        Importance  Assigned to     Milestone
juju-core  Fix Released  High        Andrew Wilkins  juju-core 1.25.14

Bug Description

[Environment]

1.25.13
Trusty

[Description]

Juju operates normally until the following entry appears in the logs. From that
point on it is no longer possible to operate Juju (deploy, upgrade-charm, etc.).

machine-0: 2017-11-03 19:02:59 ERROR juju.state.leadership manager.go:72 stopping leadership manager with error: state changing too quickly; try again soon

Preceded by the following sequence:

machine-0: 2017-11-03 19:02:59 TRACE state.lease.service-leadership.machine-0 client.go:147 expiring lease "xxx" (attempt 1)
machine-0: 2017-11-03 19:02:59 TRACE state.lease.service-leadership.machine-0 client.go:179 refreshing
machine-0: 2017-11-03 19:02:59 DEBUG juju.state.leadership manager.go:214 waking to check leases at 2017-11-03 19:02:58.784391924 +0000 UTC
machine-0: 2017-11-03 19:02:59 TRACE juju.state.leadership manager.go:227 refreshing leases...
machine-0: 2017-11-03 19:02:59 TRACE state.lease.service-leadership.machine-0 client.go:179 refreshing
machine-0: 2017-11-03 19:02:59 TRACE juju.state.leadership manager.go:241 expiring leases...
machine-0: 2017-11-03 19:02:59 TRACE state.lease.service-leadership.machine-0 client.go:147 expiring lease "xxx" (attempt 0)
machine-0: 2017-11-03 19:02:59 TRACE juju.state txns.go:164 rewrote transaction: []txn.Op{txn.Op{C:"leases", Id:"xxx-fee1-44d3-83a3-bd9c78913db8:clock#service-leadership#", Assert:bson.M{"$or":[]bson.M{bson.M{"writers.machine-0":bson.M{"$lte":1509735779211032608}}, bson.M{"writers.machine-0":bson.M{"$exists":false}}}}, Insert:interface {}(nil), Update:bson.M{"$set":bson.M{"writers.machine-0":1509735779211032608}}, Remove:false}, txn.Op{C:"leases", Id:"xxx-fee1-44d3-83a3-bd9c78913db8:lease#service-leadership#xxx#", Assert:bson.M{"writer":"machine-0", "holder":"xxx/1", "expiry":1509735778784391924}, Insert:interface {}(nil), Update:interface {}(nil), Remove:true}}

The only possible workaround is to restart jujud on machine-0.

Tags: sts

Jorge Niedbalski (niedbalski) on 2017-11-03

tags:

added: sts

Andrew Wilkins (axwalk) wrote on 2017-11-07:

The lease docs and transactions look OK, so I suspect there really is a lot of contention on the lease docs.

The Juju 1.25 branch has a few issues that would cause this:
- when the leadership manager dies, it remains dead and is not restarted automatically
- there are potentially many leadership managers running concurrently, within the same process, trying to maintain leases

It's possible that those many leadership managers are each making changes to the same clock document; there's a single clock document for all applications/services.
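To illustrate how several writers sharing one document could end in this error, here is a minimal toy model in Go. It is not Juju's code: `clockDoc`, `compareAndSet`, `updateClock` and the `interfere` hook are hypothetical names, and a simple version counter stands in for the real assertion on `writers.machine-0` seen in the rewritten transaction. Only the error text is taken from the log above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errExcessiveContention reuses the message from the log; the variable name
// is an assumption made for this sketch.
var errExcessiveContention = errors.New("state changing too quickly; try again soon")

// clockDoc is a stand-in for the single shared clock document.
type clockDoc struct {
	mu      sync.Mutex
	version int64
}

// read returns the version a writer currently sees.
func (d *clockDoc) read() int64 {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.version
}

// compareAndSet applies a write only if the document still has the version
// the caller read, much like an asserted transaction.
func (d *clockDoc) compareAndSet(seen int64) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.version != seen {
		return false
	}
	d.version++
	return true
}

// updateClock retries the read-then-write cycle up to maxAttempts times.
// interfere, if non-nil, runs between the read and the write and simulates
// another manager touching the same document.
func updateClock(d *clockDoc, maxAttempts int, interfere func()) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		seen := d.read()
		if interfere != nil {
			interfere()
		}
		if d.compareAndSet(seen) {
			return nil
		}
	}
	return errExcessiveContention
}

func main() {
	doc := &clockDoc{}
	fmt.Println("uncontended:", updateClock(doc, 3, nil))

	// A rival writer that always gets in first makes every attempt fail.
	rival := func() { doc.compareAndSet(doc.read()) }
	fmt.Println("contended:", updateClock(doc, 3, rival))
}
```

In this model a lone writer succeeds on its first attempt, while a writer that is beaten on every attempt exhausts its retries and gives up with the contention error.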

Tim Penhey (thumper) on 2017-11-09

Changed in juju-core:
status:	New → Triaged
importance:	Undecided → High
assignee:	nobody → Andrew Wilkins (axwalk)

Andrew Wilkins (axwalk) on 2017-11-09

Changed in juju-core:
status:	Triaged → In Progress
milestone:	none → 1.25.14

Andrew Wilkins (axwalk) wrote on 2017-11-10:

https://github.com/juju/juju/pull/8046

Andrew Wilkins (axwalk) wrote on 2017-11-10:

The underlying errors will not have been fixed, as that involves much more invasive changes than we can hope to backport. However, 1.25.14 will restart the leadership manager automatically now, so hopefully the need to restart the agent will be obviated.
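The general idea of restarting a worker that has died, instead of leaving it dead, can be sketched as below. This is a generic illustration and not the change made in the pull request above: `runWithRestart`, its parameters and `errContention` are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errContention stands in for the error that stops the worker; the name is
// an assumption made for this sketch.
var errContention = errors.New("state changing too quickly; try again soon")

// runWithRestart runs worker and, whenever it returns an error, waits for
// delay and starts it again. It stops when worker returns nil or when
// maxRestarts restarts have been used up, and reports how many restarts
// happened together with the last error.
func runWithRestart(worker func() error, delay time.Duration, maxRestarts int) (int, error) {
	restarts := 0
	for {
		err := worker()
		if err == nil {
			return restarts, nil
		}
		if restarts >= maxRestarts {
			return restarts, err
		}
		restarts++
		time.Sleep(delay)
	}
}

func main() {
	// A worker that fails twice with the contention error, then runs cleanly.
	failures := 2
	worker := func() error {
		if failures > 0 {
			failures--
			return errContention
		}
		return nil
	}

	restarts, err := runWithRestart(worker, 10*time.Millisecond, 5)
	fmt.Println("restarts:", restarts, "error:", err)
}
```

A supervisor like this hides transient failures from the operator, but it does nothing about the contention itself, which matches the caveat in the comment above.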

Changed in juju-core:
status:	In Progress → Fix Committed

Anastasia (anastasia-macmood) wrote on 2018-01-24:

@Jorge Niedbalski,

I have tried to verify that with Andrew's patch we can recover from the situation you've described above. The patch is on the 1.25 tip at the moment.

I could not reproduce the condition that produces the "state changing too quickly" error. Do you have an environment where you can easily verify that the problem is now recoverable? If you can verify it, we can push for an official 1.25.14 release.

(I have bootstrapped, deployed 3 units of ubuntu initially, then added 2 more and let this combination run for a week).

Peter Sabaini (peter-sabaini) wrote on 2018-02-07:

Fwiw, we're seeing this in production too - however cannot repro at will. Hoping for a quick fix release :-)

Canonical Juju QA Bot (juju-qa-bot) on 2018-04-20

Changed in juju-core:
status:	Fix Committed → Fix Released
