Canonical Juju

Upgrade to 2.3.8 caused unit agents to become unresponsive

Bug #1778614 reported by Xav Paice on 2018-06-26


This bug affects 1 person

Affects         Status   Importance  Assigned to  Milestone
Canonical Juju  Expired  Undecided   Unassigned

Bug Description

After upgrading the controller and workload model from 2.3.5 to 2.3.8, in an environment with MAAS 2.3 and a number of machines hosting LXD containers, hooks in units are not running and even 'juju run --unit $thing' is unresponsive. 'juju run --machine X' works fine.

When I use 'juju run' on a unit, the unit agent reports:

2018-06-26 00:49:46 DEBUG juju.worker.uniter.remotestate watcher.go:376 got action change: [415ed5b4-6792-4ef3-8134-4b14cfea61b1] ok=true

juju debug-log also shows the same message, but nothing else.
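To check whether an agent log really stops after the action notification, the quoted line can be picked apart with a small script. This is only a sketch: the line layout is inferred from the single sample above, the function names are hypothetical, and splitting the bracketed list on whitespace is an assumption about how multiple action ids would be printed.

```python
import re
from typing import Iterable, Optional

# Layout inferred from the one sample line in this report:
# <date> <time> <LEVEL> <module> <file>:<line> <message>
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<module>\S+) (?P<source>\S+:\d+) (?P<message>.*)$"
)
ACTION_CHANGE = re.compile(
    r"^got action change: \[(?P<ids>[^\]]*)\] ok=(?P<ok>true|false)$"
)


def parse_action_change(line: str) -> Optional[dict]:
    """Return details of a 'got action change' line, or None for any other line."""
    entry = LOG_LINE.match(line.strip())
    if entry is None:
        return None
    change = ACTION_CHANGE.match(entry.group("message"))
    if change is None:
        return None
    return {
        "timestamp": f"{entry.group('date')} {entry.group('time')}",
        "module": entry.group("module"),
        # Assumption: several ids would be separated by whitespace.
        "action_ids": change.group("ids").split(),
        "ok": change.group("ok") == "true",
    }


def lines_after_last_action_change(lines: Iterable[str]) -> Optional[int]:
    """Count log lines following the last action change (None if there is none)."""
    count = None
    for line in lines:
        if parse_action_change(line) is not None:
            count = 0
        elif count is not None:
            count += 1
    return count
```

A result of 0 from lines_after_last_action_change on a saved unit log would match what is described here: the agent sees the action change and then logs nothing further.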

engine-report from the unit machine agent: https://pastebin.canonical.com/p/WCNMTS8cXk/

Machine log from the controller machine holding the PRIMARY mongo role (there are 2 more controllers): https://pastebin.canonical.com/p/xrqPMzyHcY/ (2000 lines from around the last time I ran 'juju run')

Complete machine log on the keystone/0 unit (lxc): https://pastebin.canonical.com/p/394YjpHNvC/
last 5000 lines of the unit log for keystone/0: https://pastebin.canonical.com/p/mR4Ccdy5Bm/



Xav Paice (xavpaice) wrote on 2018-06-26:

#1

When I restarted the subordinate agents on the unit, the log showed some change (restart was at 01:57): https://pastebin.canonical.com/p/fV4vBFnJSn/

Looks like hooks are running; maybe some log output was not being released until a service got restarted?


Richard Harding (rharding) wrote on 2018-07-11:

#2

Sounds like the unit agents didn't come back up when they restarted after the upgrade. Are there any log details in the controller or unit logs from around the time of the upgrade that might provide any hints about why they did not come up?

When you juju run with a unit target, it needs to speak to the unit agent, whereas if you target the machine, a different agent (which seems to have been back up and responsive) handles the request.

Changed in juju:
status:	New → Incomplete


Drew Freiberger (afreiberger) wrote on 2018-07-18:

#3

I'm wondering if this is somehow related to lp#1778727 and agent upgrade timing. If two units are getting upgraded at once on the same metal and both units are trying to rebuild the symlinks in the /var/lib/juju/tools/<new version> directory, it's conceivable that there could be a race condition. While I would have expected there to be ERRNO 2 No such file errors from calls like config-get or relation-get, it's conceivable some charms could capture these exceptions and silence these sorts of failures. Are some of those tools in that directory used for controller->agent action communications?
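To illustrate the kind of race described above and one common way to avoid it, here is a minimal sketch. It is not Juju's code; the function name is hypothetical, and it assumes a POSIX filesystem and that tools such as config-get are symlinks to a single agent binary. Each writer creates the link under a unique temporary name and then renames it into place, so a concurrent reader sees either the old link or the new one, never a missing file.

```python
import os
import uuid


def replace_symlink_atomically(target: str, link_path: str) -> None:
    """Point link_path at target without a window where link_path is missing."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # Unique temporary name in the same directory, so that the final rename
    # stays on one filesystem and two writers never share a temp name.
    tmp_name = f".{os.path.basename(link_path)}.{os.getpid()}.{uuid.uuid4().hex}"
    tmp_path = os.path.join(link_dir, tmp_name)
    os.symlink(target, tmp_path)
    try:
        # rename(2) swaps the directory entry in a single step on POSIX.
        os.replace(tmp_path, link_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

By contrast, a remove-then-create sequence leaves a short gap in which a hook calling the tool would fail with ENOENT, which is the failure mode speculated about here.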


John A Meinel (jameinel) wrote on 2018-07-18:

#4

The agent itself doesn't use the symlinks to connect back to the Juju
Controller. So having the symlinks removed in the middle wouldn't affect
it.



Launchpad Janitor (janitor) wrote on 2018-09-17:

#5

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status:	Incomplete → Expired
