Dependency engine logging should provide more information

Bug #1513667 reported by Charles Butler
This bug affects 16 people
Affects         Status        Importance  Assigned to      Milestone
Canonical Juju  Fix Released  Critical    Alexis Bruemmer
juju-core       Won't Fix     Undecided   Unassigned
1.25            Fix Released  Medium      William Reade

Bug Description

While revving charms, I encountered a situation that spammed the logs with failures to start the uniter ("dependency not available"); however, it gave no context into what was happening, or what restarted or caused the uniter to fail.

unit-swarm-6[3112]: 2015-11-06 00:00:46 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-etcd-0[3120]: 2015-11-06 00:00:47 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

I'll try to reproduce this and attach unit logs plus reproduction steps.

Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
tags: added: sts
Revision history for this message
David Lawson (deej) wrote :

I've encountered this as well; it appears that juju is confused about leader status, since this is a single unit, but this is the only thing in the logs. Note that this started, as far as I can tell, spontaneously; I wasn't logged on to the unit, nor was anyone else interacting with it.

2015-12-04 06:45:02 WARNING juju.worker.uniter.operation leader.go:115 we should run a leader-deposed hook here, but we can't yet
2015-12-04 06:45:04 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
2015-12-04 06:45:04 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
...
That message repeats up until the current time.

Revision history for this message
David Lawson (deej) wrote :

Huh, it appears that this behavior infects an environment: spinning up a new unit of the affected service ends with the same behavior; juju gets stuck allocating the new agent and the log is full of the same message.

tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Revision history for this message
Adam Collard (adam-collard) wrote :

Restarting jujud on machine-0 seems to have resolved the situation:

# restart jujud-machine-0

I noticed whilst doing that that mongo seems to have burped: /var/log/upstart/juju-db.log.1.gz

https://pastebin.canonical.com/146061/

tags: removed: sts
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

We got another environment that is borked. The juju logs are just repeating this over and over:
2016-04-20 12:12:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

Since January 5th, 2016.

Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

I have had the same problem. MongoDB using all the CPU, grinding the whole server to a halt. After a few minutes, everything is back to normal again, and juju debug-log is just "failed to start uniter" over and over again.

Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

Note to people having the same problem: You restart the juju machine daemon by running `sudo restart jujud-machine-0` from machine 0

Revision history for this message
Florian Haas (fghaas) wrote :

Just seeing that same issue now; restarting jujud-machine-0 doesn't do a thing.

Log excerpt is here:

unit-cinder-api-1[893]: 2016-05-12 08:34:54 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-neutron-openvswitch-0[1149]: 2016-05-12 08:34:53 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-neutron-openvswitch-0[1149]: 2016-05-12 08:34:53 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-cinder-volume-0[11719]: 2016-05-12 08:34:54 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-openstack-dashboard-1[1167]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-openstack-dashboard-1[1167]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-juju-gui-0[1160]: 2016-05-12 08:34:55 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
unit-juju-gui-0[1160]: 2016-05-12 08:34:55 ERROR juju.worker.uniter.filter filter.go:137 tomb: dying
unit-juju-gui-0[1160]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-juju-gui-0[1160]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available
unit-neutron-gateway-0[5570]: 2016-05-12 08:34:55 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

Revision history for this message
Florian Haas (fghaas) wrote :

As far as I can tell this issue, when it occurs, is a complete show stopper, meaning no changes can be made to existing charms, nor can new charms be deployed.

Could someone please comment on what the recommended course of action is if this issue occurs and restarting jujud-machine-0 doesn't help?

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Restarting jujud on machine 0 in the admin model should resolve the issue. If it doesn't, please set the logging level to DEBUG and attach machine-0.log from the admin model.

You can set the log level with this command: juju set-model-config logging-config="<root>=DEBUG" -m admin

tags: added: canonical-bootstack
Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

I'm also getting this.

    2016-05-30 10:01:35 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

The unit shows as failed, and unit operations (e.g. run, config-change) are unavailable.

The workaround (restart jujud-machine-0) does NOT help here. The log for machine-0 shows this (repeatedly):

    2016-05-30 10:04:52 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: cannot find transaction ObjectIdHex("57471fe2004d521d0679974a")

This is from a production system and I can't make logs available publicly. I've uploaded the machine-0 logs here: private-fileshare.canonical.com:~sabaini/lp-1513667-jujulogs-machine-0.tar.xz

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

I realize now that this is not spelt out too clearly -- but the affected units are basically unreachable from juju (e.g. for juju run, juju set).

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Peter,

Please confirm what version of juju you are using :D

Revision history for this message
Anastasia (anastasia-macmood) wrote :

This looks like the same issue as LP#1528261

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Peter,

Could you please clarify what units are affected - what services/charms are deployed? I'd like to ensure that I can reproduce this as close as possible to your actual setup.

Also, when you were restarting machine-0, have you also restarted agents on the machines of the affected units? I presume affected units are not on machine-0.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

It looks like the original bug has morphed:
uniter log verbosity has been and is being addressed. We are also working on making the log messages themselves more descriptive - e.g. giving details about which dependencies are missing (in the pipeline). The log message level has also been reduced from warning to info (beta8).
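
As a rough illustration of the "more descriptive" direction, here is a minimal Go sketch (hypothetical, not the actual Juju change) of an error that names the missing input instead of the bare "dependency not available":

    package main

    import (
        "fmt"
        "log"
    )

    // errMissing wraps the generic "dependency not available" condition with
    // the name of the resource that could not be obtained (hypothetical helper).
    func errMissing(resource string) error {
        return fmt.Errorf("dependency not available: %q", resource)
    }

    func main() {
        // Before: failed to start "uniter" manifold worker: dependency not available
        // After (sketch): the message also says which input was missing.
        err := errMissing("leadership-tracker")
        log.Printf("failed to start %q manifold worker: %v", "uniter", err)
    }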

Peter,
What you have seen in your scenario - "cannot find transaction" - is a mongo hiccup. Please open a separate bug with full logs (not just an extract), details of the juju version and provider, and, if possible, a reproducible scenario. If it's an existing customer, also state what outcome you would like to see - for example, whether you need our assistance to unblock this particular environment.

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Anastasia:

The customer system I mentioned above runs juju 1.25.3.1. Yes, the unit agents have also been restarted. I will open a separate bug about mongo.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Thank you, Peter!

Revision history for this message
Jacek Nykis (jacekn) wrote :

I've just hit this bug on trusty with juju 1.25.5-trusty-amd64 with local provider. I tried restarting juju-db-*-local but that did not help.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Jacek - you'll need to restart the jujud-machine-0 service.

Revision history for this message
Joey Stanford (joey) wrote :

Bug #1465307 is reported as a dup of this bug but it is in fact still an issue.

Revision history for this message
JuanJo Ciarlante (jjo) wrote :

I'm consistently hitting this on 1.25.6 (from proposed at the moment); restarting jujud-machine-0 does *not* fix it. I'd appreciate it if you could give this more priority, as it makes deploying pretty painful.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

JuanJo,

Thank you for your patience and input. Could you please clarify what you are consistently hitting?

The error message in the bug's title is really just a warning and will be/has been reduced in level in the next major release. Maybe the bug has morphed...

Please be explicit with:

1. what you are doing - reproducible steps;
2. what is the problem that you are encountering, i.e. besides/ignoring the log messages, what is the behavioral fallout?;
3. machine and unit logs from 1.25.6 would be very helpful.

I am happy to bump the priority once this is provided and there is a clear understanding of what failure is a show stopper for you.

Paul Gear (paulgear)
tags: added: canonical-is
Revision history for this message
Paul Gear (paulgear) wrote :

I'm seeing this now in the logs on one of my canonistack test environments. The environment had been operating fine, then status messages from the agents dried up, some agents (but not all) went into the lost state, and the log shows the same message constantly: https://pastebin.canonical.com/161815/

I haven't yet restarted jujud-machine-0. Please let me know if you'd like any debugging done before I do so.

Revision history for this message
Paul Gear (paulgear) wrote :

Logs of my failing environment uploaded to: https://private-fileshare.canonical.com/~paulgear/lp:1513667/

Revision history for this message
Paul Gear (paulgear) wrote :

Yesterday I upgraded my working (canonistack lcy01) test environment to 1.25.6. All seemed to be well, with clean status and a quiet juju debug-log. I left it overnight without making any changes, and this morning all but two units are in agent lost state: https://pastebin.canonical.com/161942/

I'm not sure whether this is part of this same bug, or whether a new one should be created to address this specific situation, but to my mind if a working juju env can turn into a non-working juju env overnight with no changes being made, that should be considered a critical bug.

I've just uploaded new logs from this environment to https://private-fileshare.canonical.com/~paulgear/lp:1513667/ - named juju-logs-*.tar.xz.

Revision history for this message
Paul Gear (paulgear) wrote :

To clarify a couple of points above:

- The env mentioned in https://bugs.launchpad.net/juju-core/+bug/1513667/comments/25 is the formerly-working sister env to that which failed in https://bugs.launchpad.net/juju-core/+bug/1513667/comments/23, which lives in canonistack lcy02. The latter is still working OK.

- The env with 7/9 agents lost is still working from a payload perspective, and juju ssh/scp still works. My guess is that a restart of jujud-machine-0 will clear these up.

Revision history for this message
Paul Gear (paulgear) wrote :

Confirmed that restart of jujud-machine-0 resolved these issues. IMO, this doesn't make it any less critical, because juju agents lost in a production env will generally result in people getting paged. (It is certainly that way by default on Canonical IS-run environments.)

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Paul,

I am looking into logs now.

When - if :D - it happens again, could you please pastebin status in yaml format?
It tends to have more details.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Further context:

lcy01 and lcy02 are two sister environments.
They are identical from Juju perspective: were deployed from the same mojo spec, with the same bootstrap constraints.
They only differ in public IPs: since they are using floating IPs, these were added manually.

 *** lcy01
 * [YESTERDAY] OK;
 * [YESTERDAY] was upgraded to 1.25.6;
 * [TODAY] showed symptoms;
 * [TODAY] fixed with restart of jujud-machine-0 + juju resolved --retry of failed units;
 * logs attached above are juju-log-0,1,2.

 *** lcy02
 * was deployed first: had some issues with one unit which was destroyed & replaced;
 * [YESTERDAY] showed symptoms;
 * [YESTERDAY] fixed with restart of jujud-machine-0 + juju resolved --retry of failed units;
 * [YESTERDAY] was upgraded to 1.25.6;
 * [TODAY] OK;
 * logs attached above are machine-[0,3].

Changed in juju-core:
importance: Medium → High
milestone: none → 2.0.0
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@paulgear

Is there a chance for you to run an environment with DEBUG or even TRACE logging in anticipation of more details in the logs?

Something like `logging-config=juju.apiserver=DEBUG` may shed more light.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Won't Fix for 1.25, because restarting jujud works around it.

Newer logs will be instrumental in fixing this for 2.0.

Tim Penhey (thumper)
summary: - Better error messaging around uniter failure
+ Dependency engine logging should provide more information
affects: juju-core → juju
Changed in juju:
milestone: 2.0.0 → none
milestone: none → 2.0.0
Changed in juju-core:
status: New → Won't Fix
Revision history for this message
Alvaro Uria (aluria) wrote :

Also hit this on Juju 1.25.6.1, using Xenial.

"""
failed to start "uniter" manifold worker: dependency not available
"""

Restarting the "failed" unit didn't change its state. After restarting jujud-machine-0, the issue was resolved.

-Alvaro.

Revision history for this message
Paul Gear (paulgear) wrote :

This seems like it's probably the same bug as https://bugs.launchpad.net/juju/+bug/1513667 - should it be marked as a duplicate? (Or vice versa, since this one was opened first?)

Revision history for this message
Paul Gear (paulgear) wrote :

Sorry - last comment should have been about bug #1587644

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Paul, it's not clear what is causing the CPU/memory growth yet, so I don't think we should mark it as a duplicate just yet. It's *possible* that the log spam is causing the growth, seeing as the logs are stored in MongoDB and routed through the state server. I haven't seen enough evidence to say with any confidence that that is the case, though.

The reason the logs are continuously spamming is that there's a failure scenario that we're not recovering from in Juju 1.25. What can happen is that a leadership-related worker running on machine-0 fails due to an I/O timeout communicating with MongoDB. In 1.25, that worker is not automatically restarted; in 2.0 it is.
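
As a minimal Go sketch (a simplified model, not Juju's actual code) of the difference: 2.0 supervises such workers and restarts them after a failure, whereas 1.25 effectively leaves them down, so everything that depends on them (e.g. the uniter) keeps reporting "dependency not available":

    package main

    import (
        "errors"
        "log"
        "time"
    )

    // startWorker stands in for starting a Juju worker (hypothetical helper);
    // it returns an error when the worker dies, e.g. on an I/O timeout to MongoDB.
    func startWorker() error {
        return errors.New("i/o timeout communicating with MongoDB")
    }

    // supervise restarts the worker after each failure, up to maxRestarts, with a
    // delay between attempts. Roughly the 2.0 behaviour; the 1.25 behaviour is
    // equivalent to giving up after the first error.
    func supervise(name string, maxRestarts int, delay time.Duration) {
        for i := 0; i <= maxRestarts; i++ {
            err := startWorker()
            if err == nil {
                return // worker stopped cleanly; nothing to restart
            }
            log.Printf("worker %q failed: %v; restarting in %s", name, err, delay)
            time.Sleep(delay)
        }
    }

    func main() {
        supervise("leadership-tracker", 3, time.Second)
    }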

Changed in juju:
importance: High → Critical
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
milestone: 2.0.0 → 2.0-beta18
Tim Penhey (thumper)
Changed in juju:
status: Triaged → Fix Committed
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Hello Tim,

Could you point us to the specific commit/PR that fixes this issue?

Thanks.

Revision history for this message
Anastasia (anastasia-macmood) wrote :
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released