Dependency engine logging should provide more information

Bug #1513667 reported by Charles Butler
This bug affects 16 people
Affects         Status        Importance  Assigned to      Milestone
Canonical Juju  Fix Released  Critical    Alexis Bruemmer
juju-core       Won't Fix     Undecided   Unassigned
1.25            Fix Released  Medium      William Reade

Bug Description

While revving charms, I encountered a situation that spammed the logs with failures to start the uniter ("dependency not available"); however, it gave no context into what was happening, or what restarted or caused the uniter to fail.

unit-swarm-6[3112]: 2015-11-06 00:00:46 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-etcd-0[3120]: 2015-11-06 00:00:47 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

I'll try to reproduce this and attach unit logs plus reproduction steps.

Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
tags: added: sts
Revision history for this message
David Lawson (deej) wrote :

I've encountered this as well; it appears that juju is confused about leader status, since this is a single unit, but this is the only thing in the logs. Note that this started, as far as I can tell, spontaneously; I wasn't logged on to the unit, nor was anyone else interacting with it.

2015-12-04 06:45:02 WARNING juju.worker.uniter.operation leader.go:115 we should run a leader-deposed hook here, but we can't yet
2015-12-04 06:45:04 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-12-04 06:45:04 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
...
That message repeats up until the current time.

Revision history for this message
David Lawson (deej) wrote :

Huh, it appears that this behavior infects an environment: spinning up a new unit of the affected service ends with the same behavior; juju gets stuck allocating the new agent and the log is full of the same message.

tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Revision history for this message
Adam Collard (adam-collard) wrote :

Restarting jujud on machine-0 seems to have resolved the situation:

# restart jujud-machine-0

I noticed whilst doing that that mongo seems to have burped: /var/log/upstart/juju-db.log.1.gz

https://pastebin.canonical.com/146061/

tags: removed: sts
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

We got another environment that is borked. The juju logs are just repeating this over and over:
2016-04-20 12:12:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

Since January 5th, 2016.

Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

I have had the same problem. MongoDB using all the CPU, grinding the whole server to a halt. After a few minutes, everything is back to normal again, and juju debug-log is just "failed to start uniter" over and over again.

Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

Note to people having the same problem: You restart the juju machine daemon by running `sudo restart jujud-machine-0` from machine 0

Revision history for this message
Florian Haas (fghaas) wrote :

Just seeing that same issue now; restarting jujud-machine-0 doesn't do a thing.

Log excerpt is here:

unit-cinder-api-1[893]: 2016-05-12 08:34:54 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-neutron-openvswitch-0[1149]: 2016-05-12 08:34:53 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-neutron-openvswitch-0[1149]: 2016-05-12 08:34:53 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-cinder-volume-0[11719]: 2016-05-12 08:34:54 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-openstack-dashboard-1[1167]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-openstack-dashboard-1[1167]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-juju-gui-0[1160]: 2016-05-12 08:34:55 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
unit-juju-gui-0[1160]: 2016-05-12 08:34:55 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
unit-juju-gui-0[1160]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-juju-gui-0[1160]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-neutron-gateway-0[5570]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

Revision history for this message
Florian Haas (fghaas) wrote :

As far as I can tell this issue, when it occurs, is a complete show stopper, meaning no changes can be made to existing charms, nor can new charms be deployed.

Could someone please comment on what the recommended course of action is if this issue occurs and restarting jujud-machine-0 doesn't help?

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Restarting jujud on machine 0 in the admin model should resolve the issue. If it doesn't, please set the logging level to DEBUG and attach machine-0.log from the admin model.

You can set the log level with this command: juju set-model-config logging-config="<root>=DEBUG" -m admin

tags: added: canonical-bootstack
Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

I'm also getting this.

    2016-05-30 10:01:35 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

The unit shows as failed, and unit operations (e.g. run, config-change) are unavailable.

The workaround (restart jujud-machine-0) does NOT help here. The log for machine-0 shows this (repeatedly):

    2016-05-30 10:04:52 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: cannot find transaction ObjectIdHex("57471fe2004d521d0679974a")

This is from a production system and I can't make logs available publicly. I've uploaded the machine-0 logs here: private-fileshare.canonical.com:~sabaini/lp-1513667-jujulogs-machine-0.tar.xz

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

I realize now that this is not spelt out too clearly -- but the affected units are basically unreachable from juju (e.g. for juju run, juju set).

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Peter,

Please confirm what version of juju you are using :D

Revision history for this message
Anastasia (anastasia-macmood) wrote :

This looks like the same issue as LP#1528261

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Peter,

Could you please clarify what units are affected - what services/charms are deployed? I'd like to ensure that I can reproduce this as close as possible to your actual setup.

Also, when you were restarting machine-0, have you also restarted agents on the machines of the affected units? I presume affected units are not on machine-0.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

It looks like the original bug has morphed:
uniter log verbosity has been and is being addressed. We are also working on making the log messages themselves more descriptive - e.g. giving details about which dependencies are missing (in the pipeline). The log message level has also been reduced from warning to info (beta8).
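
As a rough illustration of the "more descriptive" direction, here is a minimal Go sketch (hypothetical, not the actual Juju change) of an error that names the missing input instead of the bare "dependency not available":

    package main

    import (
        "fmt"
        "log"
    )

    // errMissing wraps the generic "dependency not available" condition with
    // the name of the resource that could not be obtained (hypothetical helper).
    func errMissing(resource string) error {
        return fmt.Errorf("dependency not available: %q", resource)
    }

    func main() {
        // Before: failed to start "uniter" manifold worker: dependency not available
        // After (sketch): the message also says which input was missing.
        err := errMissing("leadership-tracker")
        log.Printf("failed to start %q manifold worker: %v", "uniter", err)
    }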

Peter,
What you have seen in your scenario - "cannot find transaction" - is a mongo hiccup. Please open a separate bug with full logs (not just an extract), details of the juju version and provider, and, if possible, a reproducible scenario. If it's an existing customer, also state what outcome you would like to see - for example, whether you need our assistance to unblock this particular environment.

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Anastasia:

The customer system I mentioned above runs juju 1.25.3.1. Yes, the unit agents have also been restarted. I will open a separate bug about mongo.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Thank you, Peter!

Revision history for this message
Jacek Nykis (jacekn) wrote :

I've just hit this bug on trusty with juju 1.25.5-trusty-amd64 with local provider. I tried restarting juju-db-*-local but that did not help.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Jacek - you'll need to restart the jujud-machine-0 service.

Revision history for this message
Joey Stanford (joey) wrote :

Bug #1465307 is reported as a dup of this bug but it is in fact still an issue.

Revision history for this message
JuanJo Ciarlante (jjo) wrote :

I'm consistently hitting this on 1.25.6 (from proposed at the moment); restarting jujud-machine-0 does *not* fix it. I'd appreciate it if you could give this more priority, as it makes deploying pretty painful.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

JuanJo,

Thank you for your patience and input. Could you please clarify what you are consistently hitting?

The error message in the bug's title is really just a warning and will be/has been reduced in level in the next major release. Maybe the bug has morphed...

Please be explicit with:

1. what you are doing - reproducible steps;
2. what is the problem that you are encountering, i.e. besides/ignoring the log messages, what is the behavioral fallout?;
3. machine and unit logs from 1.25.6 would be very helpful.

I am happy to bump the priority once this is provided and there is a clear understanding of what failure is a show stopper for you.

Paul Gear (paulgear)
tags: added: canonical-is
Revision history for this message
Paul Gear (paulgear) wrote :

I'm seeing this now in the logs on one of my canonistack test environments. The environment had been operating fine, then status messages from the agents dried up, some agents (but not all) went into the lost state, and the log shows the same message constantly: https://pastebin.canonical.com/161815/

I haven't yet restarted jujud-machine-0. Please let me know if you'd like any debugging done before I do so.

Revision history for this message
Paul Gear (paulgear) wrote :

Logs of my failing environment uploaded to: https://private-fileshare.canonical.com/~paulgear/lp:1513667/

Revision history for this message
Paul Gear (paulgear) wrote :

Yesterday I upgraded my working (canonistack lcy01) test environment to 1.25.6. All seemed to be well, with clean status and a quiet juju debug-log. I left it overnight without making any changes, and this morning all but two units are in agent lost state: https://pastebin.canonical.com/161942/

I'm not sure whether this is part of this same bug, or whether a new one should be created to address this specific situation, but to my mind if a working juju env can turn into a non-working juju env overnight with no changes being made, that should be considered a critical bug.

I've just uploaded new logs from this environment to https://private-fileshare.canonical.com/~paulgear/lp:1513667/ - named juju-logs-*.tar.xz.

Revision history for this message
Paul Gear (paulgear) wrote :

To clarify a couple of points above:

- The env mentioned in https://bugs.launchpad.net/juju-core/+bug/1513667/comments/25 is the formerly-working sister env to that which failed in https://bugs.launchpad.net/juju-core/+bug/1513667/comments/23, which lives in canonistack lcy02. The latter is still working OK.

- The env with 7/9 agents lost is still working from a payload perspective, and juju ssh/scp still works. My guess is that a restart of jujud-machine-0 will clear these up.

Revision history for this message
Paul Gear (paulgear) wrote :

Confirmed that restart of jujud-machine-0 resolved these issues. IMO, this doesn't make it any less critical, because juju agents lost in a production env will generally result in people getting paged. (It is certainly that way by default on Canonical IS-run environments.)

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Paul,

I am looking into logs now.

When - if :D - it happens again, could you please pastebin status in yaml format?
It tends to have more details.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Further context:

lcy01 and lcy02 are two sister environments.
They are identical from Juju perspective: were deployed from the same mojo spec, with the same bootstrap constraints.
They only differ in public IPs: since they are using floating IPs, these were added manually.

 *** lcy01
 * [YESTERDAY] OK;
 * [YESTERDAY] was upgraded to 1.25.6;
 * [TODAY] showed symptoms;
 * [TODAY] fixed with restart of jujud-machine-0 + juju resolved --retry of failed units;
 * logs attached above are juju-log-0,1,2.

 *** lcy02
 * was deployed first: had some issues with one unit which was destroyed & replaced;
 * [YESTERDAY] showed symptoms;
 * [YESTERDAY] fixed with restart of jujud-machine-0 + juju resolved --retry of failed units;
 * [YESTERDAY] was upgraded to 1.25.6;
 * [TODAY] OK;
 * logs attached above are machine-[0,3].

Changed in juju-core:
importance: Medium → High
milestone: none → 2.0.0
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@paulgear

Is there a chance for you to run an environment with DEBUG or even TRACE logging in anticipation of more details in the logs?

Something like `logging-config=juju.apiserver=DEBUG` may shed more light.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Won't Fix for 1.25, because restarting jujud works around it.

Newer logs will be instrumental in fixing this for 2.0.

Tim Penhey (thumper)
summary: - Better error messaging around uniter failure
+ Dependency engine logging should provide more information
affects: juju-core → juju
Changed in juju:
milestone: 2.0.0 → none
milestone: none → 2.0.0
Changed in juju-core:
status: New → Won't Fix
Revision history for this message
Alvaro Uria (aluria) wrote :

Also hit this on Juju 1.25.6.1, using Xenial.

"""
failed to start "uniter" manifold worker: dependency not available
"""

Restarting the "failed" unit didn't change its state. After restarting jujud-machine-0, the issue was resolved.

-Alvaro.

Revision history for this message
Paul Gear (paulgear) wrote :

This seems like it's probably the same bug as https://bugs.launchpad.net/juju/+bug/1513667 - should it be marked as a duplicate? (Or vice versa, since this one was opened first?)

Revision history for this message
Paul Gear (paulgear) wrote :

Sorry - last comment should have been about bug #1587644

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Paul, it's not clear what is causing the CPU/memory growth yet, so I don't think we should mark it as a duplicate just yet. It's *possible* that the log spam is causing the growth, seeing as the logs are stored in MongoDB and routed through the state server. I haven't seen enough evidence to say with any confidence that that is the case, though.

The reason the logs are continuously spamming is that there's a failure scenario that we're not recovering from in Juju 1.25. What can happen is that a leadership-related worker running on machine-0 fails due to an I/O timeout communicating with MongoDB. In 1.25, that worker is not automatically restarted; in 2.0 it is.
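
As a minimal Go sketch (a simplified model, not Juju's actual code) of the difference: 2.0 supervises such workers and restarts them after a failure, whereas 1.25 effectively leaves them down, so everything that depends on them (e.g. the uniter) keeps reporting "dependency not available":

    package main

    import (
        "errors"
        "log"
        "time"
    )

    // startWorker stands in for starting a Juju worker (hypothetical helper);
    // it returns an error when the worker dies, e.g. on an I/O timeout to MongoDB.
    func startWorker() error {
        return errors.New("i/o timeout communicating with MongoDB")
    }

    // supervise restarts the worker after each failure, up to maxRestarts, with a
    // delay between attempts. Roughly the 2.0 behaviour; the 1.25 behaviour is
    // equivalent to giving up after the first error.
    func supervise(name string, maxRestarts int, delay time.Duration) {
        for i := 0; i <= maxRestarts; i++ {
            err := startWorker()
            if err == nil {
                return // worker stopped cleanly; nothing to restart
            }
            log.Printf("worker %q failed: %v; restarting in %s", name, err, delay)
            time.Sleep(delay)
        }
    }

    func main() {
        supervise("leadership-tracker", 3, time.Second)
    }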

Changed in juju:
importance: High → Critical
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
milestone: 2.0.0 → 2.0-beta18
Tim Penhey (thumper)
Changed in juju:
status: Triaged → Fix Committed
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Hello Tim,

Could you point us to the specific commit/PR that fixes this issue?

Thanks.

Revision history for this message
Anastasia (anastasia-macmood) wrote :
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released