Upgraded juju to 1.24 dies shortly after starting

Bug #1466565 reported by Adam Collard
This bug affects 3 people
Affects    Status     Importance  Assigned to         Milestone
juju-core  Invalid    Undecided   Unassigned
1.23       Won't Fix  High        Menno Finlay-Smits
1.24       Invalid    High        Menno Finlay-Smits

Bug Description

On a local provider that was upgraded from 1.23.3.1 to 1.24, jujud comes up briefly but then dies.

1.24.0-0ubuntu1~14.04.1~juju1

machine-0.log after "sudo restart juju-agent-ubuntu-local"

http://paste.ubuntu.com/11735989/

juju status just hangs, failing to connect to the websocket

2015-06-18 15:54:00 DEBUG juju.api apiclient.go:337 error dialing "wss://localhost:17070/environment/dafc88c6-e812-44f1-8c03-35274bbb6edf/api", will retry: websocket.Dial wss://localhost:17070/environment/dafc88c6-e812-44f1-8c03-35274bbb6edf/api: dial tcp 127.0.0.1:17070: connection refused

I haven't tested it on different providers, as generally I deploy from scratch.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.24.1
Adam Collard (adam-collard) wrote :

Note that the upgrade itself was ... a little tricky. I think I ended up running "juju upgrade-juju --version=1.24.0" a few times after having changed agent-stream to proposed (before 1.24.0 hit the mainstream)

Changed in juju-core:
assignee: nobody → Katherine Cox-Buday (cox-katherine-e)
status: Triaged → In Progress
Katherine Cox-Buday (cox-katherine-e) wrote :

I cannot reproduce this locally by trying the trivial case a few times:

juju bootstrap
juju deploy ubuntu

(wait)

juju upgrade-juju --version=1.24.0

Katherine Cox-Buday (cox-katherine-e) wrote :

I suggest trying to restart the service. From the logs it looks as though jujud didn't come back up after the upgrade.

Adam Collard (adam-collard) wrote :

Roger suggested that this could be because of a juju environment that was bootstrapped with uploaded tools. From poking around in the environment it certainly looks like that's the case.

I see a FORCE-VERSION file that pertains to 1.23.3 so I think we can close this bug down

Changed in juju-core:
status: In Progress → Invalid
Changed in juju-core:
status: Invalid → Confirmed
Katherine Cox-Buday (cox-katherine-e) wrote :

Unsure of the significance, but it looks like there is always a FORCE-VERSION file placed into the 1.23.3 folder. As I cannot reproduce the issue locally, it's possible that this file has nothing to do with the issue at hand.

tags: added: cts sts
Changed in juju-core:
assignee: Katherine Cox-Buday (cox-katherine-e) → nobody
milestone: 1.24.1 → 1.25.0
Katherine Cox-Buday (cox-katherine-e) wrote :

Adam, it looks like the log cuts off right after it goes down for the upgrade. Are there any log messages after this, or is jujud even running?

Jorge, do you have any logs you can furnish to help triangulate any issues?

Changed in juju-core:
status: Confirmed → Incomplete
Adam Collard (adam-collard) wrote :

jujud is still running, but you are correct that the logs just stop. Note that this log is just from the restart: juju is able to talk to jujud correctly for a few seconds before it dies.

strace'ing the jujud process shows a futex(); with -ff I see more things (epoll, read, write, lots of clock_gettime).

Katherine Cox-Buday (cox-katherine-e) wrote :

Interesting, so upon restarting jujud, you receive this series of log messages every time? That would imply that it's finding the update but is unable to apply it.

Adam Collard (adam-collard) wrote : Re: [Bug 1466565] Re: Upgraded juju to 1.24 dies shortly after starting

Correct.

On Fri, 19 Jun 2015 at 19:40 Katherine Cox-Buday <email address hidden>
wrote:

> Interesting, so upon restarting jujud, you receive these series of log
> messages every time? That would imply that it's finding the update but
> is unable to apply it.

Changed in juju-core:
status: Incomplete → Triaged
Menno Finlay-Smits (menno.smits) wrote :

What seems to be happening is the 1.23.3 jujud is failing to exit and restart to continue the upgrade. Please confirm this by restarting the agent, noting the jujud PID and once the API becomes unresponsive, check that the running jujud is still the same PID.
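
A minimal sketch of that PID check, assuming `pgrep` is available and the upstart job is named juju-agent-ubuntu-local as in the bug description. The helper names `jujud_pid` and `pid_verdict` are hypothetical and not part of Juju. On a local provider, agents inside containers may also match `jujud`, so treat the reported PID with care.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Juju.

# Print the PID of the oldest running jujud process (empty if none).
jujud_pid() {
    pgrep -o -x jujud
}

# Compare the PID noted right after the restart ($1) with the PID
# seen once the API has become unresponsive ($2).
pid_verdict() {
    if [ -z "$2" ]; then
        echo "jujud is not running"
    elif [ "$1" = "$2" ]; then
        echo "same PID: jujud never exited"
    else
        echo "PID changed: jujud restarted"
    fi
}

# Usage (run by hand on the state server):
#   sudo restart juju-agent-ubuntu-local
#   before=$(jujud_pid)
#   ... wait until "juju status" hangs ...
#   pid_verdict "$before" "$(jujud_pid)"
```

If the verdict is "same PID", the old agent never exited, which is what the theory above predicts.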

My guess is that a worker is failing to exit, preventing the agent from exiting so that the upgrade can begin. The fact that the new tools aren't used even when you manually restart the agent indicates that it could be the upgrader worker itself. It's the one responsible for downloading the tools and installing them.

Turning up the logging might tell us more. In the time where the environment is responsive please run this to increase the logging level:

    juju set-env logging-config='<root>=TRACE'

Then restart the agent and report back with the logs after the agent restarted.

You can put logging back with:

    juju set-env logging-config='<root>=WARNING;unit=DEBUG'

Can you also please report which OS and init system (upstart or systemd) are in use?

Finally can you please run:

   ls -l ~/.juju/local/tools

Thanks!

Adam Collard (adam-collard) wrote :

On Wed, 24 Jun 2015 at 04:40 Menno Smits <email address hidden> wrote:

> What seems to be happening is the 1.23.3 jujud is failing to exit and
> restart to continue the upgrade. Please confirm this by restarting the
> agent, noting the jujud PID and once the API becomes unresponsive, check
> that the running jujud is still the same PID.
>

Confirmed, there are no PID changes.

>
> My guess is that a worker is failing to exit, preventing the agent from
> exiting so that the upgrade can begin. The fact that the new tools
> aren't used even when you manually restart the agent indicates that it
> could be the upgrader worker itself. It's the one responsible for
> downloading the tools and installing them.
>
> Turning up the logging might tell us more. In the time where the
> environment is responsive please run this to increase the logging level:
>
> juju set-env logging-config='<root>=TRACE'
>
> Then restart the agent and report back with the logs after the agent
> restarted.
>
>
After verifying there were no logs after 08:51:20...

$ grep '2015-06-24 08:5' /var/log/juju-ubuntu-local/machine-0.log | pastebinit

http://paste.ubuntu.com/11766690/

> You can put logging back with:
>
> juju set-env logging-config='<root>=WARNING;unit=DEBUG'
>

Will leave as trace for now, thanks though.

>
> Can you also please report which OS and init system (upstart or systemd)
> are in use?
>

Ubuntu Trusty 14.04.2 with our good friend upstart

> Finally can you please run:
>
> ls -l ~/.juju/local/tools
>

FYI JUJU_HOME is set (so it's not in ~/.juju), that said

$ ls -al $JUJU_HOME/local/tools
total 16
drwxr-xr-x 4 root root 4096 Jun 16 11:35 .
drwxr-xr-x 9 ubuntu ubuntu 4096 Jun 24 08:50 ..
drwxr-xr-x 2 root root 4096 May 27 21:18 1.23.3.1-trusty-amd64
drwxr-xr-x 2 root root 4096 Jun 16 11:35 1.24.0-trusty-amd64
lrwxrwxrwx 1 root root 21 May 27 21:18 machine-0 -> 1.23.3.1-trusty-amd64

Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
Katherine Cox-Buday (cox-katherine-e) wrote :

Not confirmed, but just in case, this may be related to bug 1468653.

Menno Finlay-Smits (menno.smits) wrote :

I thought that it could be the same thing as bug 1468653 too, but it's not definite. We'll see.

Menno Finlay-Smits (menno.smits) wrote :

Thanks Adam for gathering up all the extra details... extremely useful.

There's some logging that's been added to 1.24 that would have made diagnosing this issue easier. That said, I can see that after just about everything has finished in preparation for the reboot, a watcher fires reporting a change to the leases DB collection. This indicates that the lease manager is dying.

The lease/leadership functionality in 1.23 is known to have issues where it can get stuck and not honour kill requests, so this is probably the problem here.

Menno Finlay-Smits (menno.smits) wrote :

I can reliably reproduce the problem myself like this:

## with juju 1.23.3

juju bootstrap
juju deploy ubuntu
juju add-unit ubuntu
# wait for machines and units to come up

## with juju 1.24.0
juju upgrade-juju --upload-tools

Menno Finlay-Smits (menno.smits) wrote :

To get the upgrade to complete you can manually set the symlink in the tools directory to point to the new tools version and then kill jujud. This will probably need to be done on every state server in a HA configuration.
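
A minimal sketch of that workaround, assuming the tools directory layout from Adam's listing (a `machine-0` symlink next to versioned directories such as `1.24.0-trusty-amd64`). The helper name `switch_tools_link` is hypothetical and not part of Juju, and the paths in the usage comment are assumptions to adjust for your environment.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of Juju.

# Repoint <tools_dir>/<agent> at <version_dir>.
# Refuses to touch the link if the new tools directory is missing.
switch_tools_link() {
    tools_dir=$1
    agent=$2
    version_dir=$3
    if [ ! -d "$tools_dir/$version_dir" ]; then
        echo "missing tools directory: $tools_dir/$version_dir" >&2
        return 1
    fi
    # -n replaces the symlink itself instead of descending into its target.
    ln -sfn "$version_dir" "$tools_dir/$agent"
}

# Usage (as root, on each state server; paths are assumptions):
#   switch_tools_link "$JUJU_HOME/local/tools" machine-0 1.24.0-trusty-amd64
#   pkill -x jujud    # then confirm the agent comes back, or restart its job
```

The existence check is there so that a typo in the version name does not leave the agent pointing at tools that do not exist.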

Menno Finlay-Smits (menno.smits) wrote :

I'm fairly sure this has been fixed in 1.24 but will check now.

no longer affects: juju-core
Menno Finlay-Smits (menno.smits) wrote :

The problem has been fixed for 1.24 already.

It's really unfortunate, but anyone on 1.23.x is likely to run into this problem and will have to manually switch the tools symlink to allow the upgrade to happen. I'm going to email the mailing list about this now.

There's no code fix that makes sense (you won't be able to upgrade to a fixed Juju version without manual intervention anyway) and we're not doing another 1.23 release, so I'm updating the ticket to reflect this.

Menno Finlay-Smits (menno.smits) wrote :

Also note: I'm pretty sure bug 1468653 is a different issue.

Changed in juju-core:
status: New → Invalid
Adam Collard (adam-collard) wrote :

Thanks Menno! Manually switching the symlink and restarting jujud worked as advertised.

David Britton (dpb)
tags: added: kanban-cross-team