unit stuck executing update-status

Bug #1672306 reported by Laurent Sesquès
Affects: juju-core
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hi,

I found a long-running environment with three units stuck executing update-status and never getting out of it.
juju version: 1.25.10
cloud provider: openstack

Here are the relevant log excerpts. The failure apparently happens at 2017-03-13 06:49; I've added a few lines of context around it.

Juju unit logs (for one of the three units):
2017-03-08 12:26:45 INFO config-changed + service nagios-nrpe-server reload
2017-03-08 12:26:45 INFO config-changed * Reloading nagios-nrpe configuration files nagios-nrpe
2017-03-08 12:26:45 INFO config-changed ...done.
2017-03-08 17:45:50 WARNING juju.worker.uniter.operation leader.go:115 we should run a leader-deposed hook here, but we can't yet
2017-03-13 06:49:43 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2017-03-13 06:49:43 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-03-13 06:49:47 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-03-13 06:49:49 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
(and then the same message every few seconds for 1h+)
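
For anyone triaging a similar environment: below is a rough Go sketch (not part of Juju, just a triage aid) that counts this repeating warning in a unit log to flag a wedged agent. The log path and the threshold are assumptions; adjust them for the unit you're checking.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    // Assumed 1.25-era unit log location; substitute the affected unit's log.
    f, err := os.Open("/var/log/juju/unit-nrpe-0.log")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

    // Count occurrences of the warning that repeats every few seconds above.
    count := 0
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        if strings.Contains(scanner.Text(), `failed to start "uniter" manifold worker`) {
            count++
        }
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    // Arbitrary threshold: a handful of retries is normal, hundreds are not.
    if count > 10 {
        fmt.Printf("uniter failed to start %d times; the machine agent is likely wedged\n", count)
    }
}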

Machine 0 logs:
2017-03-11 13:01:19 WARNING juju.worker.instanceupdater updater.go:251 cannot get instance info for instance "32ee7228-7760-4c52-957b-91c0734f6908": failed to get list of server details
caused by: request (http://10.24.0.176:8774/v2/36aec9c4184a43fabb0185ab738858a1/servers/detail?name=juju-ps45-cdo-jujucharms-machine-%5Cd%2A) returned unexpected status: 500; error info: {"computeFault": {"message": "The server has either erred or is incapable of performing the requested operation.", "code": 500}}
2017-03-11 13:01:19 WARNING juju.worker.instanceupdater updater.go:251 cannot get instance info for instance "2e540899-5c25-4240-827d-94d0e4225f05": failed to get list of server details
caused by: request (http://10.24.0.176:8774/v2/36aec9c4184a43fabb0185ab738858a1/servers/detail?name=juju-ps45-cdo-jujucharms-machine-%5Cd%2A) returned unexpected status: 500; error info: {"computeFault": {"message": "The server has either erred or is incapable of performing the requested operation.", "code": 500}}
2017-03-13 06:49:42 ERROR juju.state.leadership manager.go:72 stopping leadership manager with error: state changing too quickly; try again soon
2017-03-13 08:51:50 ERROR juju.rpc server.go:573 error writing response: write tcp 10.25.8.154:17070->10.25.9.248:42440: write: broken pipe
2017-03-13 08:51:50 ERROR juju.rpc server.go:573 error writing response: write tcp 10.25.8.154:17070->10.25.9.248:42440: write: broken pipe
2017-03-13 08:52:40 INFO juju.cmd supercommand.go:37 running jujud [1.25.10-trusty-amd64 gc]
2017-03-13 08:52:40 DEBUG juju.agent agent.go:491 read agent config, format "1.18"
2017-03-13 08:52:40 INFO juju.cmd.jujud machine.go:419 machine agent machine-0 start (1.25.10-trusty-amd64 [gc])
2017-03-13 08:52:40 DEBUG juju.wrench wrench.go:112 couldn't read wrench directory: stat /var/lib/juju/wrench: no such file or directory
2017-03-13 08:52:40 INFO juju.cmd.jujud upgrade.go:88 no upgrade steps required or upgrade steps for 1.25.10 have already been run.

This is resolved by restarting jujud-machine-0 (as can be seen in the log above).
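
The timestamps suggest why this helps: the leadership manager on machine 0 stopped at 06:49:42 with "state changing too quickly" (machine 0 log above), and from 06:49:43 the units' uniter workers could no longer start because their leadership-tracker dependency was gone. For completeness, a minimal sketch of scripting the restart (assuming upstart on trusty, where the machine agent runs as the jujud-machine-0 service; this is just the programmatic equivalent of running "sudo service jujud-machine-0 restart" by hand):

package main

import (
    "fmt"
    "os"
    "os/exec"
)

func main() {
    // Restart the machine agent's upstart job; must be run as root.
    cmd := exec.Command("service", "jujud-machine-0", "restart")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Run(); err != nil {
        fmt.Fprintln(os.Stderr, "restart failed:", err)
        os.Exit(1)
    }
    fmt.Println("jujud-machine-0 restarted; units should leave the stuck state")
}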

Thanks,
Laurent

Tags: canonical-is
Revision history for this message
Laurent Sesquès (sajoupa) wrote:

I forgot to mention that IP 10.25.8.154 (seen in machine-0's logs) is the IP of the machine where the mentioned nrpe unit runs.

Revision history for this message
Anastasia (anastasia-macmood) wrote:

@Laurent Sesquès (sajoupa),
Since the problem is resolved by applying the workaround (restarting jujud), I will have to mark this as Won't Fix: 1.25 is only open to Critical bugs that have no workaround.
Thank you for your report. The problem is addressed in Juju 2.x.

Changed in juju-core:
status: New → Won't Fix
Revision history for this message
Haw Loeung (hloeung) wrote:

Is this actually fixed in Juju 2.x? If so, how easy or hard would it be to backport the fix?

Restarting isn't really a workaround, as we're back to seeing this after a couple of hours. FYI, this also affects the jujucharms.com environment.

Changed in juju-core:
status: Won't Fix → Confirmed
Haw Loeung (hloeung)
tags: added: canonical-is
Revision history for this message
Haw Loeung (hloeung) wrote:

Restarting basically just resets the state, lets all hooks fire, and then we're back with a bunch of them stuck and showing as "executing" with "(leader-elected)". Wouldn't this be classified as Critical?

Revision history for this message
Haw Loeung (hloeung) wrote:

Looks to be first reported in LP#1662272.

Revision history for this message
Anastasia (anastasia-macmood) wrote:

This seems to be a duplicate of bug #1662272 mentioned above. I am marking it as such.
