2.2.1 runs 2 leadership workers

Bug #1702309 reported by John A Meinel
This bug affects 1 person
Affects          Status      Importance   Assigned to   Milestone
Canonical Juju   Won't Fix   Low          Unassigned
2.2              Won't Fix   Low          Unassigned

Bug Description

I was testing out other issues related to transactions and to having a broken transaction in the queue.

I explicitly injected an invalid transaction token in order to test how mgopurge handles large transaction queues.

$ juju bootstrap lxd
$ juju deploy -m controller ubuntu --to 0
$ lxc exec juju-xxx bash
$$ dialmgo
> db.leases.find().pretty()
> db.leases.update({"_id": "5a6a6d6a-3b06-4148-8578-95df1fca410e:lease#application-leadership#ul#"}, {"$addToSet": {"txn-queue": "595b9ba58f3ec60b36cadead_deadbeef"}})
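
(As an aside, here is a minimal Go sketch of how such a dangling token could be detected afterwards. This is purely illustrative and is not mgopurge's code; the dial address, database name, and collection names are assumptions based on the session above, and the "<txn-objectid>_<nonce>" token layout is inferred from the injected value.)

package main

import (
	"fmt"
	"log"
	"strings"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	// Dial the controller's MongoDB (address/credentials are assumptions; in
	// practice you would go through dialmgo as above).
	session, err := mgo.Dial("localhost:37017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	db := session.DB("juju")
	leases := db.C("leases")
	txns := db.C("txns")

	var doc struct {
		ID       string   `bson:"_id"`
		TxnQueue []string `bson:"txn-queue"`
	}
	iter := leases.Find(nil).Iter()
	for iter.Next(&doc) {
		for _, token := range doc.TxnQueue {
			// A queue token looks like "<txn-objectid-hex>_<nonce>", e.g. the
			// injected "595b9ba58f3ec60b36cadead_deadbeef" above.
			idHex := strings.SplitN(token, "_", 2)[0]
			if !bson.IsObjectIdHex(idHex) {
				fmt.Printf("%s: malformed token %q\n", doc.ID, token)
				continue
			}
			n, err := txns.FindId(bson.ObjectIdHex(idHex)).Count()
			if err != nil {
				log.Fatal(err)
			}
			if n == 0 {
				// This is exactly the state that makes the resumer fail with
				// "cannot find transaction ObjectIdHex(...)".
				fmt.Printf("%s: dangling token %q\n", doc.ID, token)
			}
		}
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}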

I was then watching 'juju debug-log' and saw this:
machine-0: 17:48:55 INFO juju.worker start "leadership"
machine-0: 17:48:55 INFO juju.worker start "leadership"
machine-0: 17:48:55 INFO juju.worker stopped "leadership", err: cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:55 ERROR juju.worker exited "leadership": cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:55 INFO juju.worker restarting "leadership" in 1s
machine-0: 17:48:55 INFO juju.worker stopped "leadership", err: cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:55 ERROR juju.worker exited "leadership": cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:55 INFO juju.worker restarting "leadership" in 1s
machine-0: 17:48:55 ERROR juju.worker.dependency "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:56 INFO juju.worker start "leadership"
machine-0: 17:48:56 INFO juju.worker start "leadership"
machine-0: 17:48:56 INFO juju.worker stopped "leadership", err: cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:56 ERROR juju.worker exited "leadership": cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:56 INFO juju.worker restarting "leadership" in 1s
machine-0: 17:48:56 INFO juju.worker stopped "leadership", err: cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:56 ERROR juju.worker exited "leadership": cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:56 INFO juju.worker restarting "leadership" in 1s
machine-0: 17:48:57 INFO juju.worker start "leadership"
machine-0: 17:48:57 INFO juju.worker start "leadership"
unit-ul-0: 17:48:57 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul2-2: 17:48:57 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
machine-0: 17:48:57 INFO juju.worker stopped "leadership", err: cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:57 ERROR juju.worker exited "leadership": cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:57 INFO juju.worker restarting "leadership" in 1s
machine-0: 17:48:57 INFO juju.worker stopped "leadership", err: cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:57 ERROR juju.worker exited "leadership": cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")
machine-0: 17:48:57 INFO juju.worker restarting "leadership" in 1s

To start with, I thought we restarted at 3s, not 1s. But even setting that aside, you can see that there are two messages about the 'leadership' worker starting within the same second, such as:
machine-0: 17:48:57 INFO juju.worker start "leadership"
machine-0: 17:48:57 INFO juju.worker start "leadership"

This may only be because I'm messing with the 'controller' model. I have the feeling we might be running all the workers for the Controller *and* for the controller Model.
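
(To illustrate what the log pattern above implies, here is a minimal, self-contained Go sketch -- not Juju's actual worker.Runner -- of two runners each restarting a permanently failing "leadership" worker on a fixed 1s delay. With one runner per state, every start/stopped/restarting line shows up twice per second, matching the doubled messages. Names and the attempt limit are illustrative assumptions.)

package main

import (
	"errors"
	"log"
	"sync"
	"time"
)

// restartLoop keeps (re)starting a worker, sleeping a fixed delay after each
// failure, mimicking the start/stopped/restarting triplets in the log above.
// It gives up after maxAttempts so the demo terminates.
func restartLoop(name string, delay time.Duration, maxAttempts int, start func() error) {
	for i := 0; i < maxAttempts; i++ {
		log.Printf("INFO juju.worker start %q", name)
		err := start()
		if err == nil {
			return
		}
		log.Printf("INFO juju.worker stopped %q, err: %v", name, err)
		log.Printf("ERROR juju.worker exited %q: %v", name, err)
		log.Printf("INFO juju.worker restarting %q in %s", name, delay)
		time.Sleep(delay)
	}
}

func main() {
	failing := func() error {
		return errors.New(`cannot find transaction ObjectIdHex("595b9ba58f3ec60b36cadead")`)
	}
	var wg sync.WaitGroup
	// One runner per state (the controller and the controller model), each with
	// its own "leadership" worker: every message appears twice per second.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			restartLoop("leadership", time.Second, 3, failing)
		}()
	}
	wg.Wait()
}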

Revision history for this message
Tim Penhey (thumper) wrote :

Yes, we run the leadership workers for each state. I believe the leadership worker is only interested in docs for its own model.
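
(A rough sketch of what "only interested in docs for its own model" could look like at the collection level, assuming the _id layout visible in the update command in the description, i.e. "<model-uuid>:lease#...". This is an illustration, not Juju's lease code; the dial address, database, and collection names are assumptions.)

package main

import (
	"fmt"
	"log"
	"regexp"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// leasesForModel returns only the lease documents whose _id starts with the
// given model UUID, mirroring the "<model-uuid>:lease#..." _id used in the
// update command in the bug description.
func leasesForModel(db *mgo.Database, modelUUID string) ([]bson.M, error) {
	var docs []bson.M
	query := bson.M{"_id": bson.M{"$regex": "^" + regexp.QuoteMeta(modelUUID) + ":"}}
	err := db.C("leases").Find(query).All(&docs)
	return docs, err
}

func main() {
	session, err := mgo.Dial("localhost:37017") // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// UUID taken from the example update in the description.
	docs, err := leasesForModel(session.DB("juju"), "5a6a6d6a-3b06-4148-8578-95df1fca410e")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("this model owns %d lease document(s)\n", len(docs))
}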

Revision history for this message
John A Meinel (jameinel) wrote :

So it's possible that this is one for the 'default' model and one for the 'controller'; I'll need a bit more debugging.
However, I'm also seeing a lot of other churn that surprises me. It seems the restarts might now be out of sync after the machines ran overnight. For example:

machine-0: 07:53:31 INFO juju.worker restarting "leadership" in 1s
unit-ul2-2: 07:53:31 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul2-0: 07:53:31 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
machine-0: 07:53:32 INFO juju.worker start "leadership"
unit-ul-0: 07:53:32 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul-1: 07:53:32 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul2-1: 07:53:33 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul-2: 07:53:33 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul2-2: 07:53:34 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul2-0: 07:53:34 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul-0: 07:53:35 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul-1: 07:53:35 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul2-1: 07:53:36 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul-2: 07:53:36 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul2-2: 07:53:37 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul2-0: 07:53:37 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul-0: 07:53:38 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul-1: 07:53:38 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul2-1: 07:53:39 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
unit-ul-2: 07:53:39 ERROR juju.worker.dependency "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopp...

Revision history for this message
John A Meinel (jameinel) wrote :

It sounds like it's by design (for now) that we run multiple leadership workers, and it shouldn't be harmful as long as they coordinate via the database: nobody should get a lease they shouldn't, and one of those workers just won't actually be managing anything. I'll note that we do still see things like:
machine-0: 15:42:24 DEBUG juju.worker.lease waking to check leases at 2017-07-11 11:43:24.55755107 +0000 UTC
machine-0: 15:42:24 DEBUG juju.worker.lease waking to check leases at 2017-07-11 11:43:24.561946773 +0000 UTC
machine-0: 15:42:24 DEBUG juju.worker.lease waking to check leases at 2017-07-11 11:43:24.778777446 +0000 UTC
machine-0: 15:42:24 DEBUG juju.worker.lease waking to check leases at 2017-07-11 11:43:24.799138555 +0000 UTC

That is 4 wakeups in the same second to check leases. It may be because of multiple leadership workers (maybe 2 workers each for 2 models, which would be 4?).
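
(For illustration only -- this is not Juju's lease store, just a generic Go sketch of the kind of database-level coordination described above: a claim is an update whose selector requires the lease to be unheld, expired, or already ours, so even with multiple lease managers running, at most one claimant can win. Field names, the collection name, the example lease _id, and the dial address are assumptions.)

package main

import (
	"fmt"
	"log"
	"time"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// claimLease atomically records holder/expiry on the named lease document, but
// only if no unexpired holder exists (or we already hold it). It reports
// whether this caller won the claim.
func claimLease(leases *mgo.Collection, leaseID, holder string, now time.Time, d time.Duration) (bool, error) {
	selector := bson.M{
		"_id": leaseID,
		"$or": []bson.M{
			{"holder": ""},                 // never held
			{"expiry": bson.M{"$lt": now}}, // previous lease expired
			{"holder": holder},             // extending our own lease
		},
	}
	update := bson.M{"$set": bson.M{"holder": holder, "expiry": now.Add(d)}}
	err := leases.Update(selector, update)
	if err == mgo.ErrNotFound {
		// Someone else holds an unexpired lease (or the doc does not exist).
		return false, nil
	}
	return err == nil, err
}

func main() {
	session, err := mgo.Dial("localhost:37017") // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	won, err := claimLease(session.DB("juju").C("leases"),
		"example-model-uuid:lease#application-leadership#ul#", "ul/0",
		time.Now(), time.Minute)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("claim won:", won)
}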

Revision history for this message
Anastasia (anastasia-macmood) wrote :

We will not be fixing this in 2.2 as there are no further point releases planned in this series.

Revision history for this message
Joseph Phillips (manadart) wrote :

Leadership has been overhauled since this report.

Changed in juju:
status: Triaged → Won't Fix
importance: High → Low