Agents stop running hooks and are hung

Bug #1662272 reported by Jacek Nykis
This bug affects 11 people
Affects: juju-core
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

In my environment some agents stopped running hooks. I see this in the logs:

2017-02-06 15:57:24 WARNING juju.worker.leadership tracker.go:211 error while landscape-client/41 waiting for landscape-client leadership release: error blocking on leadership release: leadership manager stopped
2017-02-06 15:57:25 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2017-02-06 15:57:25 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-02-06 15:57:29 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-02-06 15:57:32 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available

And it keeps going.

The timestamp of the first message matches the time when it stopped working.

I tried restarting unit agents but that did not help.

juju CLI 1.25.6 on trusty
juju agents are 1.25.9 on trusty

Jacek Nykis (jacekn)
description: updated
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Jacek,
On what provider is this environment running?

We may have addressed this issue in later releases and I am trying to track down a fix/workaround.

For example, there was a similar looking log in bug # 1628206.

Please also attach a full log, not just an extract :D

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Jacek Nykis (jacekn) wrote :

@Anastasia

> On what provider is this environment running?

Openstack

> Please also attach a full log, not just an extract :D

Please don't share outside of the company:
https://private-fileshare.canonical.com/~jacek/unit-landscape-client-6.log.gz

Changed in juju-core:
status: Incomplete → New
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Jacek,

Thank you for your patience. I am not seeing anything particularly exciting in the logs, certainly nothing that we have not seen before...

You said you have tried re-starting unit agents. Have you tried restarting the state server, machine-0?

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Jacek Nykis (jacekn) wrote :

Yes, I tried restarting the machine-0 agent and also juju-db (followed by a restart of the affected unit agents), but nothing helped.
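
For reference, on trusty these are upstart jobs, so the restarts were along the following lines (the unit job name here is only an example taken from the log extract above; yours will differ):

sudo service jujud-machine-0 restart
sudo service juju-db restart
sudo service jujud-unit-landscape-client-41 restart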

Changed in juju-core:
status: Incomplete → New
Revision history for this message
Anastasia (anastasia-macmood) wrote :

From our conversation on IRC, you are running this against OpenStack Icehouse, which is obsolete.

Also, as this is a staging environment and we do not have capacity to address non-Critical bugs in 1.25.x, I have to mark this as Won't Fix.

Changed in juju-core:
status: New → Won't Fix
Paul Gear (paulgear)
tags: added: canonical-is
Revision history for this message
Jacek Nykis (jacekn) wrote :

OpenStack Icehouse is still fully supported on Ubuntu:
https://www.ubuntu.com/info/release-end-of-life

The bug impacted staging, but the environment is no different from production, so there is a high chance it will impact production in the future; we would like to prevent that.

Changed in juju-core:
status: Won't Fix → New
Revision history for this message
Anastasia (anastasia-macmood) wrote :

I think that to properly address this issue we need to establish why agents stop running hooks, and how to shake up an environment where this happens so that the process is re-kicked.

I'll rename the bug and change its Importance.

summary: - Agents stuck with "dependency not available"
+ Agents stop running hooks and are hung
Changed in juju-core:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 1.25.11
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Jacek,
1. From the unit log you've attached, messages such as:

2017-02-02 17:50:24 ERROR juju.worker.uniter modes.go:44 updating agent status: cannot set status: cannot find transaction ObjectIdHex("589370345b0d552e731e7713")

are indicative of some problems that mgopurge can help you with - https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge (a sketch of a typical invocation is below)

2. Please attach the state server log. It's possible, purely as a hypothesis based on (1), that mongo has become clogged up and will require some attention. We have also seen situations where mongo was run into the ground by rsyslog. But without state server logs, it is hard to say.

3. Is this an HA environment?

4. Could you please share status output?
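
For reference, something along these lines should cover 1 and 4 (the mgopurge binary path, mongo admin password and service name are placeholders for a 1.25 trusty state server; run mgopurge with the state server's jujud stopped):

sudo service jujud-machine-0 stop      # on the state server
./mgopurge -password <mongo-admin-password>
sudo service jujud-machine-0 start

juju status --format=yaml > juju-status.yaml   # from the juju client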

Changed in juju-core:
milestone: 1.25.11 → none
importance: Critical → Undecided
status: Triaged → Incomplete
Revision history for this message
Xav Paice (xavpaice) wrote :

I'm now seeing this same issue - 1.25.11 all round, on Trusty, with the MAAS provider.

This is an HA environment; I've restarted juju-db and the state servers.

I notice a number of instances of "ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: cannot find transaction ObjectIdHex("<longhexid>")" so will go dig out the mgopurge tools and see where that takes it.

Revision history for this message
Xav Paice (xavpaice) wrote :

Just confirming: I saw the same errors as the OP for this bug, ran mgopurge as advised, and it fixed the problem.

Revision history for this message
Paul Gear (paulgear) wrote :

I saw this again in another environment on juju 1.25.6; juju was unable to run actions due to this error: https://pastebin.canonical.com/187167/

Restart of all machine agents only seemed to push juju into a worse state, with machine 0 unable to execute actions due to lack of memory: https://pastebin.canonical.com/187168/

A further restart of jujud-machine-0 restored the ability to juju run.

Changed in juju-core:
status: Incomplete → Confirmed
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Thank you for the update \o/

As there is a workaround which involves restarting jujud, I am marking this bug as Won't Fix.

We are not seeing similar behavior with later versions of Juju.

Changed in juju-core:
status: Confirmed → Won't Fix
Revision history for this message
Jacek Nykis (jacekn) wrote :

I've just hit this bug in a juju 1.25.10 environment. I tried agent restarts, including machine-0 and even juju-db but none of it worked.

This bug prevents changes to the environment; for example, in my specific case I am unable to remove applications and units.

Changed in juju-core:
status: Won't Fix → New
Revision history for this message
Tim Penhey (thumper) wrote :

@jacekn can we get some logs from this most recent problem?

Revision history for this message
Jacek Nykis (jacekn) wrote :

@thumper I can confirm that we restored the environment using these two commands, run in this order:
./mgopurge-latest -password XXXXXXXXXXXXXX -stages prune
./mgopurge-latest -password XXXXXXXXXXXXXX

However, it's worth pointing out that we initially tried this:
./mgopurge-latest -password XXXXXXXXXXXXXX -yes -stages purgemissing
and also this:
./mgopurge-latest -password XXXXXXXXXXXXXX

And neither of them fixed it.

It's not very clear from this bug what the workaround is; it seems that different environments need different actions for services to come back.

Revision history for this message
Tim Penhey (thumper) wrote :

Without at least the logs, we can't help diagnose further.
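
For 1.25 that would typically mean the state server log and the logs from the affected units, along these lines (default log paths assumed; adjust for your environment):

sudo tar czf juju-machine-logs.tgz /var/log/juju/machine-0.log    # on the state server(s)
sudo tar czf juju-unit-logs.tgz /var/log/juju/unit-*.log          # on each machine hosting an affected unit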

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Xav Paice (xavpaice) wrote :

I've just come across this again, in a xenial, 1.25.13 environment.

Let me know what logs you'd like.

Revision history for this message
Xav Paice (xavpaice) wrote :

machine-0 log:

<snip>
2017-10-26 22:48:45 WARNING juju.worker.instanceupdater updater.go:251 cannot get instance info for instance "/MAAS/api/1.0/nodes/node-e6d31b94-8ed8-11e6-a6e7-525400751d5d/": Get http://10.116.0.11/MAAS/api/1.0/nodes/?agent_name=41878cb2-e995-4bed-8554-8fdce39caa76&id=node-d3c63af4-8ed8-11e6-a6e7-525400751d5d&id=node-07a2e28c-8ed9-11e6-a748-525400751d5d&id=node-0aeaace0-8ed9-11e6-a748-525400751d5d&id=node-e6d31b94-8ed8-11e6-a6e7-525400751d5d&id=node-d6e896d2-8ed8-11e6-8450-525400751d5d&id=node-048a0846-8ed9-11e6-a748-525400751d5d&id=node-d09fd092-8ed8-11e6-a6e7-525400751d5d&op=list: dial tcp 10.116.0.11:80: getsockopt: no route to host
2017-10-26 22:48:45 WARNING juju.worker.instanceupdater updater.go:251 cannot get instance info for instance "/MAAS/api/1.0/nodes/node-d3c63af4-8ed8-11e6-a6e7-525400751d5d/": Get http://10.116.0.11/MAAS/api/1.0/nodes/?agent_name=41878cb2-e995-4bed-8554-8fdce39caa76&id=node-d3c63af4-8ed8-11e6-a6e7-525400751d5d&id=node-07a2e28c-8ed9-11e6-a748-525400751d5d&id=node-0aeaace0-8ed9-11e6-a748-525400751d5d&id=node-e6d31b94-8ed8-11e6-a6e7-525400751d5d&id=node-d6e896d2-8ed8-11e6-8450-525400751d5d&id=node-048a0846-8ed9-11e6-a748-525400751d5d&id=node-d09fd092-8ed8-11e6-a6e7-525400751d5d&op=list: dial tcp 10.116.0.11:80: getsockopt: no route to host

(note that the time of this collection was Fri Oct 27 01:58:58 UTC 2017, so nothing had hit that log for a while)

Logs from a 'failed' unit:
2017-10-27 01:59:26 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-10-27 01:59:29 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-10-27 01:59:32 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-10-27 01:59:35 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-10-27 01:59:38 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-10-27 01:59:41 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available
2017-10-27 01:59:44 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available

From the host machine running the 'failed' unit:

2017-10-26 22:55:19 ERROR juju.worker runner.go:212 fatal "upgrader": error receiving message: read tcp 10.116.0.34:36706->10.116.0.31:17070: read: connection reset by peer
2017-10-26 22:55:19 ERROR juju.worker.storageprovisioner storageprovisioner.go:184 error receiving message: read tcp 10.116.0.34:36706->10.116.0.31:17070: read: connection reset by peer
2017-10-26 22:55:19 ERROR juju.worker runner.go:212 fatal "proxyupdater": error receiving message: read tcp 10.116.0.34:36706->10.116.0.31:17070: read: connection reset by ...

tags: added: canonical-bootstack
Revision history for this message
Xav Paice (xavpaice) wrote :

https://pastebin.canonical.com/201793/ - logs from an mgopurge (v1.6) run with the state server stopped.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired
Revision history for this message
Xav Paice (xavpaice) wrote :

We're still seeing this on several sites running 1.25.13 - any chance of an update?

Changed in juju-core:
status: Expired → Confirmed
Revision history for this message
Xav Paice (xavpaice) wrote :

I've marked this back as Incomplete; please let me know what logs you need and we will try to collect them.

Changed in juju-core:
status: Confirmed → Incomplete
Xav Paice (xavpaice)
Changed in juju-core:
status: Incomplete → New
Revision history for this message
John A Meinel (jameinel) wrote :

We acknowledge the problem, but this seems to be the same as a symptom we fixed in 2.x; the fix is rather involved and not suitable for a backport. Instead we should work on providing a smoother experience around upgrading to the 2.x series.

Changed in juju-core:
status: New → Won't Fix
Revision history for this message
James Troup (elmo) wrote : Re: [Bug 1662272] Re: Agents stop running hooks and are hung

John A Meinel <email address hidden> writes:

> We acknowledge the problem, but this seems to be the same as a symptom
> we fixed in 2.x; the fix is rather involved and not suitable for a
> backport. Instead we should work on providing a smoother experience
> around upgrading to the 2.x series.

I'm fairly sure Xav is trying to get a 1.2x environment to a steady
state precisely so he can help you guys test the 1.x to 2.x upgrade
process ...

--
James
