1.24.0: Lots of "agent is lost, sorry!" messages

Bug #1465307 reported by David Britton
This bug affects 10 people
Affects     Status   Importance   Assigned to   Milestone
juju-core   New      High         Unassigned

Bug Description

Attaching juju status tabular output; not sure what else is needed. From what I can tell, the agents on the machines are alive.

Showing that the processes are running on one of the systems in question:

ubuntu@juju-machine-0-lxc-2:/etc/init$ ls juju*
jujud-machine-0-lxc-2.conf jujud-unit-trusty-client-0.conf
ubuntu@juju-machine-0-lxc-2:/etc/init$ sudo service jujud-machine-0-lxc-2 status
jujud-machine-0-lxc-2 start/running, process 846
ubuntu@juju-machine-0-lxc-2:/etc/init$ sudo service jujud-unit-trusty-client-0 status
jujud-unit-trusty-client-0 start/running, process 887
ubuntu@juju-machine-0-lxc-2:/etc/init$ ps -ef |grep 846
root 846 1 0 Jun11 ? 00:00:29 /var/lib/juju/tools/machine-0-lxc-2/jujud machine --data-dir /var/lib/juju --machine-id 0/lxc/2 --debug
ubuntu 1909 1868 0 14:12 pts/1 00:00:00 grep --color=auto 846
ubuntu@juju-machine-0-lxc-2:/etc/init$ ps -ef |grep 887
root 887 1 0 Jun11 ? 00:00:43 /var/lib/juju/tools/unit-trusty-client-0/jujud unit --data-dir /var/lib/juju --unit-name trusty-client/0 --debug
ubuntu 1911 1868 0 14:12 pts/1 00:00:00 grep --color=auto 887

Output of the status-history command, not sure if this is useful or not:

dpb@helo:slaves[0]$ juju status-history trusty-client/0
TIME TYPE STATUS MESSAGE
11 Jun 2015 12:19:25-06:00 workload unknown Waiting for agent initialization to finish
11 Jun 2015 12:19:25-06:00 agent allocating
11 Jun 2015 13:06:57-06:00 workload maintenance installing charm software
11 Jun 2015 13:07:01-06:00 agent executing running install hook
11 Jun 2015 13:17:50-06:00 agent error hook failed: "install"
11 Jun 2015 13:19:38-06:00 workload maintenance installing charm software
11 Jun 2015 13:19:39-06:00 agent executing running install hook
11 Jun 2015 14:52:40-06:00 agent executing running leader-elected hook
11 Jun 2015 14:52:48-06:00 agent executing running config-changed hook
11 Jun 2015 14:53:17-06:00 agent executing running start hook
11 Jun 2015 14:53:21-06:00 workload unknown
11 Jun 2015 14:53:25-06:00 agent idle
12 Jun 2015 00:54:11-06:00 agent executing running leader-elected hook
12 Jun 2015 00:54:33-06:00 agent executing running config-changed hook
12 Jun 2015 00:55:00-06:00 agent idle
14 Jun 2015 00:50:10-06:00 agent executing running leader-elected hook
14 Jun 2015 00:50:35-06:00 agent executing running config-changed hook
14 Jun 2015 00:50:47-06:00 agent idle


Revision history for this message
David Britton (dpb) wrote :
description: updated
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25.0
Revision history for this message
David Britton (dpb) wrote :

Pretty sure this is the cause (from one of the "failed" units):

2015-06-15 15:51:51 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 15:51:51 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 15:54:00 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 15:54:00 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 15:54:24 WARNING juju.worker.uniter.operation leader.go:111 we should run a leader-deposed hook here, but we can't yet
2015-06-15 15:54:24 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 15:54:24 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 15:54:48 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 15:54:48 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 16:00:04 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 16:00:04 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 17:08:00 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 17:08:00 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 17:08:07 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 17:08:07 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: no active lease manager

Revision history for this message
Curtis Hovey (sinzui) wrote :

David reports beta5 was fine. The regression might have been introduced by the beta6 or 1.24.0 changes.
    https://launchpad.net/juju-core/+milestone/1.24-beta6
    https://launchpad.net/juju-core/+milestone/1.24.0

Curtis Hovey (sinzui)
tags: added: blocker regression
Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :

Sanitized all-machines log

Revision history for this message
Nate Finch (natefinch) wrote :

So far, I have not been able to reproduce this bug by simply deploying the ubuntu charm a whole bunch... but I also deployed it by hand, so it's probably not as fast and furious as deploying via a script or bundle.

Revision history for this message
Nate Finch (natefinch) wrote :

I'll take a look at this some more later tonight.

Revision history for this message
John A Meinel (jameinel) wrote :

So it starts with:
2015-06-11 18:57:06 ERROR juju.apiserver images.go:47 GET(/environment/fc4d4b2e-d601-4303-8ff9-c309ddc26fdc/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz?%3Aarch=amd64&%3Aenvuuid=fc4d4b2e-d601-4303-8ff9-c309ddc26fdc&%3Afilename=ubuntu-14.04-server-cloudimg-amd64-root.tar.gz&%3Akind=lxc&%3Aseries=trusty&) failed: while streaming image: write tcp 127.0.0.1:50781: broken pipe

but that is followed by:
2015-06-11 19:20:18 ERROR juju.apiserver debuglog.go:110 debug-log handler error: write tcp 10.172.68.236:58562: broken pipe

That looks a whole lot like networking broke, because it was trying to write a message to 10.172.68.236 at a random port, which looks a lot like a client-side connection. Though that is 30 minutes later, and the next errors about the lease manager are from 11h30m after that.

More failures follow, and then:
2015-06-14 18:12:28 ERROR juju.worker.charmrevisionworker revisionupdater.go:86 cannot process charms: finding charm revision info: cannot get metadata from the charm store: cannot unmarshal error response "<html><body><h1>503 Service Unavailable</h1>\nNo server is available to handle this request.\n</body></html>\n": invalid character '<' looking for beginning of value.

I'll do a quick spin up to see if I can reproduce, but it certainly looks strange.

Revision history for this message
John A Meinel (jameinel) wrote :

If you want a non-deployer way to request a lot of stuff at once:

$ juju bootstrap --debug --constraints="mem=4G root-disk=20000"
$ juju deploy ubuntu -n 3
$ for j in `seq 10`; do for i in `seq 0 3`; do juju add-unit ubuntu --to lxc:$i; done & time wait; done

That will do a parallel AddUnit of an lxc onto each machine, tell you how long it takes, and then do it again 10 times. (I was seeing 10s per loop, though it slowed down a bit near the end.)
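
Note that, as written, the inner loop is backgrounded as a single job, so the add-unit calls within a round still run one after the other. A purely illustrative variant that backgrounds each call individually (same assumed machines 0-3) would be:

$ for j in `seq 10`; do for i in `seq 0 3`; do juju add-unit ubuntu --to lxc:$i & done; time wait; done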

I ran into trouble a couple of times (first not passing mem=4G, so it ended up on m1.smalls; second not passing root-disk and running out of disk space on the 8G default disk size).

But it did end up running and I do have 42 units running.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Triaged → Incomplete
no longer affects: juju-core/1.24
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.0 → 1.25.1
Curtis Hovey (sinzui)
tags: removed: blocker
Revision history for this message
Manoj Iyer (manjo) wrote :

I can confirm that juju-core 1.26-alpha1.1 fixes this for me on ARM64.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

David, are you able to reproduce this problem with 1.26-alpha1?

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.1 → 1.25.2
Changed in juju-core:
milestone: 1.25.2 → 1.25.3
Revision history for this message
Vahid Ashrafian (vahid-arn) wrote :

Sorry! I changed the status by mistake!

Changed in juju-core:
status: Incomplete → Fix Released
Changed in juju-core:
status: Fix Released → Incomplete
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'm seeing something similar in 1.25.0. Juju status shows that all agents are lost:
[Services]
NAME STATUS EXPOSED CHARM
haproxy unknown false cs:trusty/haproxy-15
landscape-server unknown false cs:trusty/landscape-server-13
postgresql active false cs:trusty/postgresql-31
rabbitmq-server active false cs:trusty/rabbitmq-server-42

[Units]
ID WORKLOAD-STATE AGENT-STATE VERSION MACHINE PORTS PUBLIC-ADDRESS MESSAGE
haproxy/0 unknown lost 1.25.0 0/lxc/0 80/tcp,443/tcp,10000/tcp 10.96.9.128 agent is lost, sorry! See 'juju status-history haproxy/0'
landscape-server/0 unknown lost 1.25.0 0/lxc/1 10.96.8.165 agent is lost, sorry! See 'juju status-history landscape-server/0'
postgresql/0 unknown lost 1.25.0 0/lxc/2 5432/tcp 10.96.9.108 agent is lost, sorry! See 'juju status-history postgresql/0'
rabbitmq-server/0 unknown lost 1.25.0 0/lxc/3 5672/tcp 10.96.3.82 agent is lost, sorry! See 'juju status-history rabbitmq-server/0'

[Machines]
ID STATE VERSION DNS INS-ID SERIES HARDWARE
0 started 1.25.0 lds-ci.scapestack /MAAS/api/1.0/nodes/node-5348ce0e-9df3-11e5-ab31-2c59e54ace76/ trusty arch=amd64 cpu-cores=2 mem=4096M

On machine 0, this is the last machine-0.log entry:
2016-01-05 12:12:24 ERROR juju.state.leadership manager.go:72 stopping leadership manager with error: read tcp 10.96.10.219:37017: i/o timeout

all-machines.log is quite big, 244 MB, and at the end is full of these, for all units:
unit-landscape-server-0[955]: 2016-01-19 17:54:08 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-rabbitmq-server-0[943]: 2016-01-19 17:54:08 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

And this is the first message about tomb dying:
unit-rabbitmq-server-0[943]: 2016-01-05 12:12:31 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying

It kind of matches the timestamp of the last entry in the machine-0.log file.

I'm attaching these logs to the bug.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.3 → 1.25.4
Revision history for this message
Paul Gear (paulgear) wrote :

We're seeing this on 1.24.5.1 in one of our clouds.

Revision history for this message
Paul Gear (paulgear) wrote :

I'm seeing this on a very simple environment which I installed in Canonistack: provisioned, left overnight, and in the morning the debug log was full of the same message (https://pastebin.canonical.com/150077/) and status showed several lost agents: https://pastebin.canonical.com/150078/

Changed in juju-core:
status: Incomplete → New
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Please also provide `juju status-history` output for any of the units that report a lost agent.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

@paulgear - we'll need the unit logs for the units that are failing to come up, as well as the machine-0.log for the environment, to debug this issue.

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Paul Gear (paulgear) wrote :

@cherylj Thanks - will do. Just have my hands full with CVE-2015-7547 at the moment. :-)

Revision history for this message
Paul Gear (paulgear) wrote :

Status history on units for my test environment: https://pastebin.canonical.com/150772/

Logs for all units attached.

Revision history for this message
Paul Gear (paulgear) wrote :
Revision history for this message
Paul Gear (paulgear) wrote :
Paul Gear (paulgear)
Changed in juju-core:
status: Incomplete → New
Revision history for this message
Cheryl Jennings (cherylj) wrote :

This is a duplicate of bug #1539656 (the fix for which will be released with 1.25.4). I see the same "leadership manager stopped" messages in the unit logs, and evidence in machine-0.log that the mongo connection dropped around the same time.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

You can work around this by restarting the jujud process on the state servers.
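
For example (a sketch only; this assumes machine 0 is the state server and is running upstart, so the service is named jujud-machine-0 as in the listings above):

# Sketch: restart the machine agent on the state server (assumed to be machine 0)
$ juju ssh 0 sudo service jujud-machine-0 restart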

tags: added: canonical-bootstack
Revision history for this message
Suchitra Venugopal (suchvenu) wrote :

I am using version 1.25.5-trusty-amd64 and am getting this error when I deploy charms. Is this issue resolved?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Suchitra

It depends on which issue you are referring to specifically...

This bug has morphed over time.

If you are referring to the "dependency"-related log message: this has been resolved in 1.25.6 by lowering the log level from ERROR. Later Juju versions, like 2.0, provide more detail about the dependencies.

If you are referring to the "agent lost" message: I believe this has been fixed in 1.25.6.

Could you please try 1.25.6 and, if there are similar failures that affect you, create a separate bug with:
1. steps that you followed to reproduce;
2. what provider you are using;
3. logs.

Revision history for this message
Suchitra Venugopal (suchvenu) wrote :

@Anastasia,
 I installed Juju today and got 1.25.5. Is 1.25.6 already available?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Suchitra
1.25.6 was in proposed and came out a couple of hours ago. You should be able to install it now as the latest version.

Revision history for this message
Suchitra Venugopal (suchvenu) wrote :

@Anastasia,
  I installed 1.25.6, but I am still getting the following errors in the log. However, juju status doesn't report the error as before.

unit-ibm-mobilefirst-server-5[30749]: 2016-08-01 04:38:24 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
unit-ibm-mobilefirst-server-5[30749]: 2016-08-01 04:38:28 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying

But I am unable to remove the service once I see this error in the logs.

Revision history for this message
William Reade (fwereade) wrote :

@Suchitra

Please open a new bug for the "tomb: dying" errors; it's clearly a problem, but I don't think it's the same as the "agent is lost" issue.

(You may be able to work around the tomb issue by SSHing in and restarting affected unit agents -- it might unblock service teardown -- but it's still clearly a bug, so please do report it, and link it here for continuity's sake.)

Revision history for this message
Benedikt Troester (btroester) wrote :

I'm having the described issue (WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available, plus lots of lost agents) on Juju 1.25.6-trusty-amd64. Can you give me a hint as to how to restart the Juju service on the main machine, if that is a workaround?

Revision history for this message
Paul Gear (paulgear) wrote :

@btroester: 'juju ssh 0 sudo service jujud-machine-0 restart', assuming machine 0 is your bootstrap node and you can still juju ssh to it; you may need to 'ssh ubuntu@IPaddress' instead of 'juju ssh 0' if the machine 0 jujud is unresponsive.
