1.24.0: Lots of "agent is lost, sorry!" messages
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | High | Unassigned | |
Bug Description
Attaching juju status tabular output; not sure what else is needed. From what I can tell, the agents on the machines are alive.
Showing that the processes are running on one of the systems in question:
ubuntu@
jujud-machine-
ubuntu@
jujud-machine-
ubuntu@
jujud-unit-
ubuntu@
root 846 1 0 Jun11 ? 00:00:29 /var/lib/
ubuntu 1909 1868 0 14:12 pts/1 00:00:00 grep --color=auto 846
ubuntu@
root 887 1 0 Jun11 ? 00:00:43 /var/lib/
ubuntu 1911 1868 0 14:12 pts/1 00:00:00 grep --color=auto 887
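For reference, something like the following will show the same thing (the exact commands may vary; on trusty the agents run as upstart jobs named jujud-machine-N and jujud-unit-*):
$ sudo initctl list | grep jujud    # upstart jobs for the machine and unit agents
$ ps -ef | grep jujud               # the corresponding jujud processes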
Status history command output, not sure if this is useful or not:
dpb@helo:slaves[0]$ juju status-history trusty-client/0
TIME TYPE STATUS MESSAGE
11 Jun 2015 12:19:25-06:00 workload unknown Waiting for agent initialization to finish
11 Jun 2015 12:19:25-06:00 agent allocating
11 Jun 2015 13:06:57-06:00 workload maintenance installing charm software
11 Jun 2015 13:07:01-06:00 agent executing running install hook
11 Jun 2015 13:17:50-06:00 agent error hook failed: "install"
11 Jun 2015 13:19:38-06:00 workload maintenance installing charm software
11 Jun 2015 13:19:39-06:00 agent executing running install hook
11 Jun 2015 14:52:40-06:00 agent executing running leader-elected hook
11 Jun 2015 14:52:48-06:00 agent executing running config-changed hook
11 Jun 2015 14:53:17-06:00 agent executing running start hook
11 Jun 2015 14:53:21-06:00 workload unknown
11 Jun 2015 14:53:25-06:00 agent idle
12 Jun 2015 00:54:11-06:00 agent executing running leader-elected hook
12 Jun 2015 00:54:33-06:00 agent executing running config-changed hook
12 Jun 2015 00:55:00-06:00 agent idle
14 Jun 2015 00:50:10-06:00 agent executing running leader-elected hook
14 Jun 2015 00:50:35-06:00 agent executing running config-changed hook
14 Jun 2015 00:50:47-06:00 agent idle
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → 1.25.0 |
| David Britton (davidpbritton) wrote : | #2 |
| Curtis Hovey (sinzui) wrote : | #3 |
David reports beta5 was fine. The regression might have been introduced in beta6 or in the 1.24.0 changes.
https:/
https:/
| tags: | added: blocker regression |
| David Britton (davidpbritton) wrote : | #4 |
| David Britton (davidpbritton) wrote : | #5 |
Sanitized all-machines log
| Nate Finch (natefinch) wrote : | #6 |
So far, I have not been able to reproduce this bug by simply deploying the ubuntu charm a whole bunch... but I also deployed it by hand, so it's probably not as fast and furious as deploying via a script or bundle.
| Nate Finch (natefinch) wrote : | #7 |
I'll take a look at this some more later tonight.
| John A Meinel (jameinel) wrote : | #9 |
So it starts with:
2015-06-11 18:57:06 ERROR juju.apiserver images.go:47 GET(/environmen
but that is followed by:
2015-06-11 19:20:18 ERROR juju.apiserver debuglog.go:110 debug-log handler error: write tcp 10.172.
That looks a whole lot like networking broke, because it was trying to write a message to 10.172.68.236 at a random port, which looks a lot like a client-side connection. Though that is 30 minutes later, and the next errors, about the lease manager, are from 11h30m after that.
More failures followed, and then:
2015-06-14 18:12:28 ERROR juju.worker.
I'll do a quick spin up to see if I can reproduce, but it certainly looks strange.
| John A Meinel (jameinel) wrote : | #10 |
If you want a non-deployer way to request a lot of stuff at once:
$ juju bootstrap --debug --constraints=
$ juju deploy ubuntu -n 3
$ for j in `seq 10`; do for i in `seq 0 3`; do juju add-unit ubuntu --to lxc:$i; done & time wait; done
That will do a parallel AddUnit of an lxc onto each machine, tell you how long it takes, and then do it again 10 times. (I was seeing 10s per loop, though it slowed down a bit near the end.)
I ran into trouble a couple of times (first by not passing mem=4G, so it ended up on m1.smalls; second by not passing root-disk and running out of disk space on the 8G default disk size).
But it did end up running and I do have 42 units running.
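To then check for the symptom, grepping the tabular status for the message from the bug title should be enough, e.g.:
$ juju status --format=tabular | grep "agent is lost"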
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| no longer affects: | juju-core/1.24 |
| Changed in juju-core: | |
| milestone: | 1.25.0 → 1.25.1 |
| tags: | removed: blocker |
| Manoj Iyer (manjo) wrote : | #11 |
I can confirm that juju-core 1.26-alpha1.1 fixes this for me on ARM64
| Cheryl Jennings (cherylj) wrote : | #12 |
David, are you able to reproduce this problem with 1.26-alpha1?
| Changed in juju-core: | |
| milestone: | 1.25.1 → 1.25.2 |
| Changed in juju-core: | |
| milestone: | 1.25.2 → 1.25.3 |
| Vahid Ashrafian (vahid-arn) wrote : | #13 |
Sorry! It was a mistake that I changed the state!
| Changed in juju-core: | |
| status: | Incomplete → Fix Released |
| Changed in juju-core: | |
| status: | Fix Released → Incomplete |
| Andreas Hasenack (ahasenack) wrote : | #14 |
I'm seeing something similar in 1.25.0. Juju status shows that all agents are lost:
[Services]
NAME STATUS EXPOSED CHARM
haproxy unknown false cs:trusty/
landscape-server unknown false cs:trusty/
postgresql active false cs:trusty/
rabbitmq-server active false cs:trusty/
[Units]
ID WORKLOAD-STATE AGENT-STATE VERSION MACHINE PORTS PUBLIC-ADDRESS MESSAGE
haproxy/0 unknown lost 1.25.0 0/lxc/0 80/tcp,
landscape-server/0 unknown lost 1.25.0 0/lxc/1 10.96.8.165 agent is lost, sorry! See 'juju status-history landscape-server/0'
postgresql/0 unknown lost 1.25.0 0/lxc/2 5432/tcp 10.96.9.108 agent is lost, sorry! See 'juju status-history postgresql/0'
rabbitmq-server/0 unknown lost 1.25.0 0/lxc/3 5672/tcp 10.96.3.82 agent is lost, sorry! See 'juju status-history rabbitmq-server/0'
[Machines]
ID STATE VERSION DNS INS-ID SERIES HARDWARE
0 started 1.25.0 lds-ci.scapestack /MAAS/api/
On machine 0, this is the last machine-0.log entry:
2016-01-05 12:12:24 ERROR juju.state.
all-machines.log is quite big (244 MB) and is full of these at the end, for all units:
unit-landscape-
unit-rabbitmq-
And this is the first message about tomb dying:
unit-rabbitmq-
It kind of matches the timestamp of the last entry in the machine-0.log file.
I'm attaching these logs to the bug.
| Andreas Hasenack (ahasenack) wrote : | #15 |
| tags: | added: kanban-cross-team |
| tags: | removed: kanban-cross-team |
| Changed in juju-core: | |
| milestone: | 1.25.3 → 1.25.4 |
| Paul Gear (paulgear) wrote : | #16 |
We're seeing this on 1.24.5.1 in one of our clouds.
| Paul Gear (paulgear) wrote : | #17 |
I'm seeing this on a very simple environment which I installed in Canonistack: provisioned it, left it overnight, and in the morning the debug log is full of the same message https:/
| Changed in juju-core: | |
| status: | Incomplete → New |
| Anastasia (anastasia-macmood) wrote : | #18 |
Please also provide `juju status-history` output for any of the units that report a lost agent.
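For example, for one of the units shown in comment #14 (the unit name is only illustrative):
$ juju status-history landscape-server/0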
| Cheryl Jennings (cherylj) wrote : | #19 |
@paulgear - we'll need the unit logs for the units that are failing to come up, and the machine-0.log for the environment to debug this issue.
| Changed in juju-core: | |
| status: | New → Incomplete |
| Paul Gear (paulgear) wrote : | #20 |
@cherylj Thanks - will do. Just have my hands full with CVE-2015-7547 at the moment. :-)
| Paul Gear (paulgear) wrote : | #21 |
Status history on units for my test environment: https:/
Logs for all units attached.
| Paul Gear (paulgear) wrote : | #22 |
| Paul Gear (paulgear) wrote : | #23 |
| Changed in juju-core: | |
| status: | Incomplete → New |
| Cheryl Jennings (cherylj) wrote : | #24 |
This is a duplicate of bug #1539656 (the fix for which will be released with 1.25.4). I see the same "leadership manager stopped" messages in the unit logs, and evidence in machine-0.log that the mongo connection dropped around the same time.
| Cheryl Jennings (cherylj) wrote : | #25 |
You can work around this by restarting the jujud process on the state servers.
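For example, assuming machine 0 is the only state server (as in the environments above; see also comment #33 below):
$ juju ssh 0 sudo service jujud-machine-0 restart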
| tags: | added: canonical-bootstack |
| Suchitra Venugopal (suchvenu) wrote : | #26 |
I am using version 1.25.5-trusty-amd64 and getting this error when I deploy charms. Is this issue resolved?
| Anastasia (anastasia-macmood) wrote : | #27 |
@Suchitra
It depends which issue you are referring to specifically...
This bug has morphed over time.
If you are referring to the "dependency"-related log message: this has been resolved in 1.25.6 by lowering the log level from ERROR. Later Juju versions, like 2.0, provide more detail about dependencies.
If you are referring to the "agent lost" message: I believe this has been fixed in 1.25.6.
Could you please try 1.25.6 and if there are similar failures that affect you, create a separate bug with:
1. steps that you followed to reproduce;
2. what provider you are using;
3. logs.
| Suchitra Venugopal (suchvenu) wrote : | #28 |
@Anastasia,
I installed Juju today and got 1.25.5. Is 1.25.6 already available?
| Anastasia (anastasia-macmood) wrote : | #29 |
@Suchitra
1.25.6 was in proposed and came out a couple of hours ago. You should be able to install it now as the latest version.
| Suchitra Venugopal (suchvenu) wrote : | #30 |
@Anastasia,
I installed 1.25.6, but I am still getting the following errors in the log. However, juju status doesn't show the error as before.
unit-ibm-
unit-ibm-
But I am unable to remove the service once I see this error in the logs.
| William Reade (fwereade) wrote : | #31 |
@Suchitra
Please open a new bug for the "tomb: dying" errors; it's clearly a problem, but I don't think it's the same as the "agent is lost" issue.
(You may be able to work around the tomb issue by SSHing in and restarting affected unit agents -- it might unblock service teardown -- but it's still clearly a bug, so please do report it, and link it here for continuity's sake.)
| Benedikt Troester (btroester) wrote : | #32 |
I'm having the described issue (WARNING juju.worker.
| Paul Gear (paulgear) wrote : | #33 |
@btroester: 'juju ssh 0 sudo service jujud-machine-0 restart', assuming machine 0 is your bootstrap node and you can still juju ssh to it - you may need to 'ssh ubuntu@IPaddress' instead of 'juju ssh 0' if the machine 0 jujud is unresponsive.


Pretty sure this is the cause (from one of the "failed" units):
2015-06-15 15:51:51 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
[... several more "juju.worker.uniter.filter filter.go:137 tomb: dying" entries, plus one "juju.worker.uniter.operation leader.go:111 we should run a leader-deposed hook here, but we can't yet" ...]
2015-06-15 15:51:51 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 15:54:00 ERROR juju.worker.
2015-06-15 15:54:00 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 15:54:24 WARNING juju.worker.
2015-06-15 15:54:24 ERROR juju.worker.
2015-06-15 15:54:24 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 15:54:48 ERROR juju.worker.
2015-06-15 15:54:48 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 16:00:04 ERROR juju.worker.
2015-06-15 16:00:04 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 17:08:00 ERROR juju.worker.
2015-06-15 17:08:00 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 17:08:07 ERROR juju.worker.
2015-06-15 17:08:07 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: no active lease manager