unit stuck executing update-status

Bug #1672306 reported by Laurent Sesquès
Affects: juju-core
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hi,

I found a long-running environment with three units stuck executing update-status and never getting out of it.
juju version: 1.25.10
cloud provider: openstack

Here are the relevant log excerpts. The failure apparently happens at 2017-03-13 06:49; I've added a few lines of context around it.

Juju unit logs (for one of the three units):
2017-03-08 12:26:45 INFO config-changed + service nagios-nrpe-server reload
2017-03-08 12:26:45 INFO config-changed * Reloading nagios-nrpe configuration files nagios-nrpe
2017-03-08 12:26:45 INFO config-changed ...done.
2017-03-08 17:45:50 WARNING juju.worker.uniter.operation leader.go:115 we should run a leader-deposed hook here, but we can't yet
2017-03-13 06:49:43 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2017-03-13 06:49:43 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-03-13 06:49:47 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-03-13 06:49:49 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
(and then the same message every few seconds for 1h+)
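
For anyone triaging a similar environment: below is a rough Go sketch (not part of Juju, just a triage aid) that counts this repeating warning in a unit log to flag a wedged agent. The log path and the threshold are assumptions; adjust them for the unit you're checking.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    // Assumed 1.25-era unit log location; substitute the affected unit's log.
    f, err := os.Open("/var/log/juju/unit-nrpe-0.log")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

    // Count occurrences of the warning that repeats every few seconds above.
    count := 0
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        if strings.Contains(scanner.Text(), `failed to start "uniter" manifold worker`) {
            count++
        }
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    // Arbitrary threshold: a handful of retries is normal, hundreds are not.
    if count > 10 {
        fmt.Printf("uniter failed to start %d times; the machine agent is likely wedged\n", count)
    }
}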

Machine 0 logs:
2017-03-11 13:01:19 WARNING juju.worker.instanceupdater updater.go:251 cannot get instance info for instance "32ee7228-7760-4c52-957b-91c0734f6908": failed to get list of server details
caused by: request (http://10.24.0.176:8774/v2/36aec9c4184a43fabb0185ab738858a1/servers/detail?name=juju-ps45-cdo-jujucharms-machine-%5Cd%2A) returned unexpected status: 500; error info: {"computeFault": {"message": "The server has either erred or is incapable of performing the requested operation.", "code": 500}}
2017-03-11 13:01:19 WARNING juju.worker.instanceupdater updater.go:251 cannot get instance info for instance "2e540899-5c25-4240-827d-94d0e4225f05": failed to get list of server details
caused by: request (http://10.24.0.176:8774/v2/36aec9c4184a43fabb0185ab738858a1/servers/detail?name=juju-ps45-cdo-jujucharms-machine-%5Cd%2A) returned unexpected status: 500; error info: {"computeFault": {"message": "The server has either erred or is incapable of performing the requested operation.", "code": 500}}
2017-03-13 06:49:42 ERROR juju.state.leadership manager.go:72 stopping leadership manager with error: state changing too quickly; try again soon
2017-03-13 08:51:50 ERROR juju.rpc server.go:573 error writing response: write tcp 10.25.8.154:17070->10.25.9.248:42440: write: broken pipe
2017-03-13 08:51:50 ERROR juju.rpc server.go:573 error writing response: write tcp 10.25.8.154:17070->10.25.9.248:42440: write: broken pipe
2017-03-13 08:52:40 INFO juju.cmd supercommand.go:37 running jujud [1.25.10-trusty-amd64 gc]
2017-03-13 08:52:40 DEBUG juju.agent agent.go:491 read agent config, format "1.18"
2017-03-13 08:52:40 INFO juju.cmd.jujud machine.go:419 machine agent machine-0 start (1.25.10-trusty-amd64 [gc])
2017-03-13 08:52:40 DEBUG juju.wrench wrench.go:112 couldn't read wrench directory: stat /var/lib/juju/wrench: no such file or directory
2017-03-13 08:52:40 INFO juju.cmd.jujud upgrade.go:88 no upgrade steps required or upgrade steps for 1.25.10 have already been run.

This is resolved by restarting jujud-machine-0 (as can be seen in the log above).
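
The timestamps suggest why this helps: the leadership manager on machine 0 stopped at 06:49:42 with "state changing too quickly" (machine 0 log above), and from 06:49:43 the units' uniter workers could no longer start because their leadership-tracker dependency was gone. For completeness, a minimal sketch of scripting the restart (assuming upstart on trusty, where the machine agent runs as the jujud-machine-0 service; this is just the programmatic equivalent of running "sudo service jujud-machine-0 restart" by hand):

package main

import (
    "fmt"
    "os"
    "os/exec"
)

func main() {
    // Restart the machine agent's upstart job; must be run as root.
    cmd := exec.Command("service", "jujud-machine-0", "restart")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Run(); err != nil {
        fmt.Fprintln(os.Stderr, "restart failed:", err)
        os.Exit(1)
    }
    fmt.Println("jujud-machine-0 restarted; units should leave the stuck state")
}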

Thanks,
Laurent

Tags: canonical-is
Revision history for this message
Laurent Sesquès (sajoupa) wrote:

I forgot to mention that IP 10.25.8.154 (seen in machine-0's logs) is the IP of the machine where the mentioned nrpe unit runs.

Revision history for this message
Anastasia (anastasia-macmood) wrote:

@Laurent Sesquès (sajoupa),
Since the problem is resolved by applying the workaround (restarting jujud), I will have to mark this as Won't Fix: 1.25 is only open to Critical bugs that have no workaround.
Thank you for your report. The problem is addressed in Juju 2.x.

Changed in juju-core:
status: New → Won't Fix
Revision history for this message
Haw Loeung (hloeung) wrote:

Is this actually fixed in Juju 2.x? If so, how easy or hard would it be to backport the fix?

Restarting isn't really a workaround, as we're back to seeing this after a couple of hours. FYI, this also affects the jujucharms.com environment.

Changed in juju-core:
status: Won't Fix → Confirmed
Haw Loeung (hloeung)
tags: added: canonical-is
Revision history for this message
Haw Loeung (hloeung) wrote:

Restarting basically just resets the state, lets all hooks fire, and then we're back with a bunch of them stuck and showing as "executing" with "(leader-elected)". Wouldn't this be classified as Critical?

Revision history for this message
Haw Loeung (hloeung) wrote:

Looks to be first reported in LP#1662272.

Revision history for this message
Anastasia (anastasia-macmood) wrote:

This seems to be a duplicate of bug #1662272 mentioned above. I am marking it as such.
