Upgrade to 2.3.8 caused unit agents to become unresponsive

Bug #1778614 reported by Xav Paice
This bug affects 1 person
Affects: Canonical Juju
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

After upgrading the controller and workload model from 2.3.5 to 2.3.8, in an environment with MAAS 2.3 and a number of machines hosting lxd containers, hooks in units are not running, and even 'juju run --unit $thing' is unresponsive. 'juju run --machine X' works fine.

When I use 'juju run' on a unit, the unit agent reports:

2018-06-26 00:49:46 DEBUG juju.worker.uniter.remotestate watcher.go:376 got action change: [415ed5b4-6792-4ef3-8134-4b14cfea61b1] ok=true

juju debug-log also shows the same message, but nothing else.
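
For anyone filtering the attached logs, the quoted line can be split into fields with a short script. This is only a sketch: the field layout (date, time, level, module, source file and line, message) is inferred from the single line quoted above and is an assumption, and `parse_line` and `LogEntry` are hypothetical names, not part of Juju.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred only from the one log line quoted in this report:
#   <date> <time> <LEVEL> <module> <file>:<line> <message>
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<module>\S+) (?P<source>\S+:\d+) (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    date: str
    time: str
    level: str
    module: str
    source: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if the line does not fit the assumed layout."""
    match = LINE_RE.match(line.strip())
    return LogEntry(**match.groupdict()) if match else None


sample = (
    "2018-06-26 00:49:46 DEBUG juju.worker.uniter.remotestate "
    "watcher.go:376 got action change: "
    "[415ed5b4-6792-4ef3-8134-4b14cfea61b1] ok=true"
)
entry = parse_line(sample)
```

Grouping parsed entries by `module` makes it easier to see whether anything other than the remotestate watcher logs after the action change is received.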

engine-report from the unit machine agent: https://pastebin.canonical.com/p/WCNMTS8cXk/

Machine log on the controller machine with the PRIMARY mongo role (there are 2 more controllers): https://pastebin.canonical.com/p/xrqPMzyHcY/ (2000 lines from around the last time I ran a 'juju run')

Complete machine log on the keystone/0 unit (lxc): https://pastebin.canonical.com/p/394YjpHNvC/
Last 5000 lines of the unit log for keystone/0: https://pastebin.canonical.com/p/mR4Ccdy5Bm/

Xav Paice (xavpaice) wrote :

When I restarted the subordinate agents on the unit, the log showed some change (restart was at 01:57): https://pastebin.canonical.com/p/fV4vBFnJSn/

Looks like hooks are running; maybe some log output was being held back until a service got restarted?

Richard Harding (rharding) wrote :

Sounds like the unit agents didn't come back up when they restarted after the upgrade. Are there any log details from around the time of the upgrade, in the controller or unit logs, that might provide any information about why they did not come up?

When you juju run with a unit target, it needs to speak to the unit agent, whereas if you target the machine, a different agent (which seems to have been back up and responsive) handles the request.
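
The distinction above suggests a simple check: run a trivial command against both targets and compare the results. The following is a minimal sketch of that idea, not a Juju tool. It uses only the 'juju run --unit' and 'juju run --machine' forms quoted in the bug description; the function names, the 60 second timeout, and the verdict strings are assumptions made for illustration.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], bool]


def run_with_timeout(argv: List[str], timeout_s: float = 60.0) -> bool:
    """Return True if the command exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(argv, timeout=timeout_s, capture_output=True)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # A hang (timeout) or a missing juju binary both count as "not responsive".
        return False


def diagnose(unit: str, machine: str, runner: Runner = run_with_timeout) -> str:
    """Compare the machine-targeted and unit-targeted paths (hypothetical helper)."""
    machine_ok = runner(["juju", "run", "--machine", machine, "true"])
    unit_ok = runner(["juju", "run", "--unit", unit, "true"])
    if machine_ok and unit_ok:
        return "both responsive"
    if machine_ok:
        return "unit agent suspect"
    return "machine agent or connectivity suspect"


def fake_runner(argv: List[str]) -> bool:
    """Stand-in for a real juju client: only machine-targeted runs succeed."""
    return "--machine" in argv


# "0/lxd/1" is a made-up machine id used only for this demonstration.
verdict = diagnose("keystone/0", "0/lxd/1", runner=fake_runner)
```

With the default runner, `diagnose` shells out to the real juju client; passing a fake runner, as in the last lines, exercises the logic without a controller. The fake reproduces the symptom in this report, where the machine target responds and the unit target does not.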

Changed in juju:
status: New → Incomplete
Drew Freiberger (afreiberger) wrote :

I'm wondering if this is somehow related to lp#1778727 and agent upgrade timing. If two units are getting upgraded at once on the same metal and both units are trying to rebuild the symlinks in the /var/lib/juju/tools/<new version> directory, it's conceivable that there could be a race condition. While I would have expected there to be ERRNO 2 No such file errors from calls like config-get or relation-get, it's conceivable some charms could capture these exceptions and silence these sorts of failures. Are some of those tools in that directory used for controller->agent action communications?
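
To make the suspected race concrete: if a symlink is rebuilt by removing it and then recreating it, a concurrent caller can hit the gap in between and get ENOENT (errno 2). The sketch below shows a common way to avoid that gap, by creating the new link under a temporary name and renaming it over the old one. It illustrates the general technique only and makes no claim about how Juju builds its tools directory; the function name and the file names in the demonstration are hypothetical.

```python
import os
import tempfile
import uuid


def replace_symlink_atomically(target: str, link_path: str) -> None:
    """Point link_path at target with no moment at which link_path is missing."""
    directory = os.path.dirname(link_path) or "."
    # Build the new link under a unique temporary name in the same directory,
    # so the rename below stays on one filesystem.
    tmp_path = os.path.join(
        directory, f".{os.path.basename(link_path)}.{uuid.uuid4().hex}.tmp"
    )
    os.symlink(target, tmp_path)
    try:
        # On POSIX, rename replaces the destination entry in a single step.
        os.replace(tmp_path, link_path)
    except OSError:
        os.unlink(tmp_path)
        raise


# Demonstration in a scratch directory (names are illustrative only).
with tempfile.TemporaryDirectory() as tools_dir:
    link = os.path.join(tools_dir, "config-get")
    replace_symlink_atomically("jujud-old", link)
    replace_symlink_atomically("jujud", link)
    final_target = os.readlink(link)
    leftover_entries = sorted(os.listdir(tools_dir))
```

Two processes calling this at the same time may each win the final rename, but a reader never observes the link as absent, which is the failure mode the errno 2 expectation refers to.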

John A Meinel (jameinel) wrote :

The agent itself doesn't use the symlinks to connect back to the Juju
Controller. So having the symlinks removed in the middle wouldn't affect
it.

On Wed, Jul 18, 2018 at 2:17 PM, Drew Freiberger <email address hidden> wrote:

> I'm wondering if this is somehow related to lp#1778727 and agent upgrade
> timing. If two units are getting upgraded at once on the same metal and
> both units are trying to rebuild the symlinks in the
> /var/lib/juju/tools/<new version> directory, it's conceivable that there
> could be a race condition. While I would have expected there to be
> ERRNO 2 No such file errors from calls like config-get or relation-get,
> it's conceivable some charms could capture these exceptions and silence
> these sorts of failures. Are some of those tools in that directory used
> for controller->agent action communications?

Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired