1.24.0: Lots of "agent is lost, sorry!" messages

Bug #1465307 reported by David Britton
This bug affects 10 people
Affects     Status   Importance   Assigned to   Milestone
juju-core   New      High         Unassigned

Bug Description

Attaching juju status tabular output; not sure what else is needed. From what I can tell, the agents on the machines are alive.

Showing that the processes are running on one of the systems in question:

ubuntu@juju-machine-0-lxc-2:/etc/init$ ls juju*
jujud-machine-0-lxc-2.conf jujud-unit-trusty-client-0.conf
ubuntu@juju-machine-0-lxc-2:/etc/init$ sudo service jujud-machine-0-lxc-2 status
jujud-machine-0-lxc-2 start/running, process 846
ubuntu@juju-machine-0-lxc-2:/etc/init$ sudo service jujud-unit-trusty-client-0 status
jujud-unit-trusty-client-0 start/running, process 887
ubuntu@juju-machine-0-lxc-2:/etc/init$ ps -ef |grep 846
root 846 1 0 Jun11 ? 00:00:29 /var/lib/juju/tools/machine-0-lxc-2/jujud machine --data-dir /var/lib/juju --machine-id 0/lxc/2 --debug
ubuntu 1909 1868 0 14:12 pts/1 00:00:00 grep --color=auto 846
ubuntu@juju-machine-0-lxc-2:/etc/init$ ps -ef |grep 887
root 887 1 0 Jun11 ? 00:00:43 /var/lib/juju/tools/unit-trusty-client-0/jujud unit --data-dir /var/lib/juju --unit-name trusty-client/0 --debug
ubuntu 1911 1868 0 14:12 pts/1 00:00:00 grep --color=auto 887

Output of the status-history command, not sure if this is useful or not:

dpb@helo:slaves[0]$ juju status-history trusty-client/0
TIME TYPE STATUS MESSAGE
11 Jun 2015 12:19:25-06:00 workload unknown Waiting for agent initialization to finish
11 Jun 2015 12:19:25-06:00 agent allocating
11 Jun 2015 13:06:57-06:00 workload maintenance installing charm software
11 Jun 2015 13:07:01-06:00 agent executing running install hook
11 Jun 2015 13:17:50-06:00 agent error hook failed: "install"
11 Jun 2015 13:19:38-06:00 workload maintenance installing charm software
11 Jun 2015 13:19:39-06:00 agent executing running install hook
11 Jun 2015 14:52:40-06:00 agent executing running leader-elected hook
11 Jun 2015 14:52:48-06:00 agent executing running config-changed hook
11 Jun 2015 14:53:17-06:00 agent executing running start hook
11 Jun 2015 14:53:21-06:00 workload unknown
11 Jun 2015 14:53:25-06:00 agent idle
12 Jun 2015 00:54:11-06:00 agent executing running leader-elected hook
12 Jun 2015 00:54:33-06:00 agent executing running config-changed hook
12 Jun 2015 00:55:00-06:00 agent idle
14 Jun 2015 00:50:10-06:00 agent executing running leader-elected hook
14 Jun 2015 00:50:35-06:00 agent executing running config-changed hook
14 Jun 2015 00:50:47-06:00 agent idle


Revision history for this message
David Britton (dpb) wrote :
description: updated
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25.0
Revision history for this message
David Britton (dpb) wrote :

Pretty sure this is the cause (from one of the "failed" units):

2015-06-15 15:51:51 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 15:51:51 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 15:54:00 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 15:54:00 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 15:54:24 WARNING juju.worker.uniter.operation leader.go:111 we should run a leader-deposed hook here, but we can't yet
2015-06-15 15:54:24 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 15:54:24 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 15:54:48 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 15:54:48 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 16:00:04 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 16:00:04 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 17:08:00 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 17:08:00 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: worker stopped
2015-06-15 17:08:07 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-06-15 17:08:07 ERROR juju.worker runner.go:219 exited "uniter": leadership failure: unable to make a leadership claim: no active lease manager

Revision history for this message
Curtis Hovey (sinzui) wrote :

David reports beta5 was fine. The regression might have been introduced by the beta6 or 1.24.0 changes.
    https://launchpad.net/juju-core/+milestone/1.24-beta6
    https://launchpad.net/juju-core/+milestone/1.24.0

Curtis Hovey (sinzui)
tags: added: blocker regression
Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :

Sanitized all-machines log

Revision history for this message
Nate Finch (natefinch) wrote :

So far, I have not been able to reproduce this bug by simply deploying the ubuntu charm a whole bunch... but I also deployed it by hand, so it's probably not as fast and furious as deploying via a script or bundle.

Revision history for this message
Nate Finch (natefinch) wrote :

I'll take a look at this some more later tonight.

Revision history for this message
John A Meinel (jameinel) wrote :

So it starts with:
2015-06-11 18:57:06 ERROR juju.apiserver images.go:47 GET(/environment/fc4d4b2e-d601-4303-8ff9-c309ddc26fdc/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz?%3Aarch=amd64&%3Aenvuuid=fc4d4b2e-d601-4303-8ff9-c309ddc26fdc&%3Afilename=ubuntu-14.04-server-cloudimg-amd64-root.tar.gz&%3Akind=lxc&%3Aseries=trusty&) failed: while streaming image: write tcp 127.0.0.1:50781: broken pipe

but that is followed by:
2015-06-11 19:20:18 ERROR juju.apiserver debuglog.go:110 debug-log handler error: write tcp 10.172.68.236:58562: broken pipe

That looks a whole lot like networking broke, because it was trying to write a message to 10.172.68.236 at a random port, which looks a lot like a client-side connection. Though that is 30 minutes later, and the next errors about the lease manager are from 11h30m after that.

More failures follow, and then:
2015-06-14 18:12:28 ERROR juju.worker.charmrevisionworker revisionupdater.go:86 cannot process charms: finding charm revision info: cannot get metadata from the charm store: cannot unmarshal error response "<html><body><h1>503 Service Unavailable</h1>\nNo server is available to handle this request.\n</body></html>\n": invalid character '<' looking for beginning of value.

I'll do a quick spin up to see if I can reproduce, but it certainly looks strange.

Revision history for this message
John A Meinel (jameinel) wrote :

If you want a non-deployer way to request a lot of stuff at once:

$ juju bootstrap --debug --constraints="mem=4G root-disk=20000"
$ juju deploy ubuntu -n 3
$ for j in `seq 10`; do for i in `seq 0 3`; do juju add-unit ubuntu --to lxc:$i; done & time wait; done

That will do a parallel AddUnit of an lxc onto each machine, tell you how long it takes, and then do it again 10 times. (I was seeing 10s per loop, though it slowed down a bit near the end.)
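
Note that, as written, the inner loop is backgrounded as a single job, so the add-unit calls within a round still run one after the other. A purely illustrative variant that backgrounds each call individually (same assumed machines 0-3) would be:

$ for j in `seq 10`; do for i in `seq 0 3`; do juju add-unit ubuntu --to lxc:$i & done; time wait; done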

I ran into trouble a couple of times (first not passing mem=4G, so it ended up on m1.smalls; second not passing root-disk and running out of disk space on the 8G default disk size).

But it did end up running and I do have 42 units running.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Triaged → Incomplete
no longer affects: juju-core/1.24
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.0 → 1.25.1
Curtis Hovey (sinzui)
tags: removed: blocker
Revision history for this message
Manoj Iyer (manjo) wrote :

I can confirm that juju-core 1.26-alpha1.1 fixes this for me on ARM64.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

David, are you able to reproduce this problem with 1.26-alpha1?

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.1 → 1.25.2
Changed in juju-core:
milestone: 1.25.2 → 1.25.3
Revision history for this message
Vahid Ashrafian (vahid-arn) wrote :

Sorry! I changed the status by mistake!

Changed in juju-core:
status: Incomplete → Fix Released
Changed in juju-core:
status: Fix Released → Incomplete
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'm seeing something similar in 1.25.0. Juju status shows that all agents are lost:
[Services]
NAME STATUS EXPOSED CHARM
haproxy unknown false cs:trusty/haproxy-15
landscape-server unknown false cs:trusty/landscape-server-13
postgresql active false cs:trusty/postgresql-31
rabbitmq-server active false cs:trusty/rabbitmq-server-42

[Units]
ID WORKLOAD-STATE AGENT-STATE VERSION MACHINE PORTS PUBLIC-ADDRESS MESSAGE
haproxy/0 unknown lost 1.25.0 0/lxc/0 80/tcp,443/tcp,10000/tcp 10.96.9.128 agent is lost, sorry! See 'juju status-history haproxy/0'
landscape-server/0 unknown lost 1.25.0 0/lxc/1 10.96.8.165 agent is lost, sorry! See 'juju status-history landscape-server/0'
postgresql/0 unknown lost 1.25.0 0/lxc/2 5432/tcp 10.96.9.108 agent is lost, sorry! See 'juju status-history postgresql/0'
rabbitmq-server/0 unknown lost 1.25.0 0/lxc/3 5672/tcp 10.96.3.82 agent is lost, sorry! See 'juju status-history rabbitmq-server/0'

[Machines]
ID STATE VERSION DNS INS-ID SERIES HARDWARE
0 started 1.25.0 lds-ci.scapestack /MAAS/api/1.0/nodes/node-5348ce0e-9df3-11e5-ab31-2c59e54ace76/ trusty arch=amd64 cpu-cores=2 mem=4096M

On machine 0, this is the last machine-0.log entry:
2016-01-05 12:12:24 ERROR juju.state.leadership manager.go:72 stopping leadership manager with error: read tcp 10.96.10.219:37017: i/o timeout

all-machines.log is quite big, 244 MB, and at the end is full of these, for all units:
unit-landscape-server-0[955]: 2016-01-19 17:54:08 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-rabbitmq-server-0[943]: 2016-01-19 17:54:08 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

And this is the first message about tomb dying:
unit-rabbitmq-server-0[943]: 2016-01-05 12:12:31 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying

It kind of matches the timestamp of the last entry in the machine-0.log file.

I'm attaching these logs to the bug.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.3 → 1.25.4
Revision history for this message
Paul Gear (paulgear) wrote :

We're seeing this on 1.24.5.1 in one of our clouds.

Revision history for this message
Paul Gear (paulgear) wrote :

I'm seeing this on a very simple environment which I installed in Canonistack: provisioned, left overnight, and in the morning the debug log was full of the same message (https://pastebin.canonical.com/150077/) and status showed several lost agents: https://pastebin.canonical.com/150078/

Changed in juju-core:
status: Incomplete → New
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Please also provide `juju status-history` output for any of the units that report a lost agent.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

@paulgear - we'll need the unit logs for the units that are failing to come up, as well as the machine-0.log for the environment, to debug this issue.

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Paul Gear (paulgear) wrote :

@cherylj Thanks - will do. Just have my hands full with CVE-2015-7547 at the moment. :-)

Revision history for this message
Paul Gear (paulgear) wrote :

Status history on units for my test environment: https://pastebin.canonical.com/150772/

Logs for all units attached.

Revision history for this message
Paul Gear (paulgear) wrote :
Revision history for this message
Paul Gear (paulgear) wrote :
Paul Gear (paulgear)
Changed in juju-core:
status: Incomplete → New
Revision history for this message
Cheryl Jennings (cherylj) wrote :

This is a duplicate of bug #1539656 (the fix for which will be released with 1.25.4). I see the same "leadership manager stopped" messages in the unit logs, and evidence in machine-0.log that the mongo connection dropped around the same time.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

You can work around this by restarting the jujud process on the state servers.
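
For example (a sketch only; this assumes machine 0 is the state server and is running upstart, so the service is named jujud-machine-0 as in the listings above):

# Sketch: restart the machine agent on the state server (assumed to be machine 0)
$ juju ssh 0 sudo service jujud-machine-0 restart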

tags: added: canonical-bootstack
Revision history for this message
Suchitra Venugopal (suchvenu) wrote :

I am using version 1.25.5-trusty-amd64 and am getting this error when I deploy charms. Is this issue resolved?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Suchitra

It depends on which issue you are referring to specifically...

This bug has morphed over time.

If you are referring to the "dependency"-related log message: this has been resolved in 1.25.6 by lowering the log level from ERROR. Later Juju versions, like 2.0, provide more detail about the dependencies.

If you are referring to the "agent lost" message: I believe this has been fixed in 1.25.6.

Could you please try 1.25.6 and, if there are similar failures that affect you, create a separate bug with:
1. steps that you followed to reproduce;
2. what provider you are using;
3. logs.

Revision history for this message
Suchitra Venugopal (suchvenu) wrote :

@Anastasia,
 I installed Juju today and got 1.25.5. Is 1.25.6 already available?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Suchitra
1.25.6 was in proposed and came out a couple of hours ago. You should be able to install it now as the latest version.

Revision history for this message
Suchitra Venugopal (suchvenu) wrote :

@Anastasia,
  I installed 1.25.6, but I am still getting the following errors in the log. However, juju status doesn't report the error as before.

unit-ibm-mobilefirst-server-5[30749]: 2016-08-01 04:38:24 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
unit-ibm-mobilefirst-server-5[30749]: 2016-08-01 04:38:28 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying

But I am unable to remove the service once I see this error in the logs.

Revision history for this message
William Reade (fwereade) wrote :

@Suchitra

Please open a new bug for the "tomb: dying" errors; it's clearly a problem, but I don't think it's the same as the "agent is lost" issue.

(You may be able to work around the tomb issue by SSHing in and restarting affected unit agents -- it might unblock service teardown -- but it's still clearly a bug, so please do report it, and link it here for continuity's sake.)

Revision history for this message
Benedikt Troester (btroester) wrote :

I'm having the described issue (WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available, plus lots of lost agents) on Juju 1.25.6-trusty-amd64. Can you give me a hint as to how to restart the Juju service on the main machine, if that is a workaround?

Revision history for this message
Paul Gear (paulgear) wrote :

@btroester: 'juju ssh 0 sudo service jujud-machine-0 restart', assuming machine 0 is your bootstrap node and you can still juju ssh to it; you may need to 'ssh ubuntu@IPaddress' instead of 'juju ssh 0' if the machine 0 jujud is unresponsive.
