Upgraded juju to 1.24 dies shortly after starting

Bug #1466565 reported by Adam Collard on 2015-06-18
This bug affects 3 people
Affects     Importance   Assigned to
juju-core   Undecided    Unassigned
1.23        High         Menno Finlay-Smits
1.24        High         Menno Finlay-Smits

Bug Description

On a local provider that was upgraded from 1.23.3.1 to 1.24, jujud comes up briefly but then dies.

1.24.0-0ubuntu1~14.04.1~juju1

machine-0.log after "sudo restart juju-agent-ubuntu-local"

http://paste.ubuntu.com/11735989/

juju status just hangs, failing to connect to the websocket

2015-06-18 15:54:00 DEBUG juju.api apiclient.go:337 error dialing "wss://localhost:17070/environment/dafc88c6-e812-44f1-8c03-35274bbb6edf/api", will retry: websocket.Dial wss://localhost:17070/environment/dafc88c6-e812-44f1-8c03-35274bbb6edf/api: dial tcp 127.0.0.1:17070: connection refused

I haven't tested it on different providers, as generally I deploy from scratch.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.24.1
Adam Collard (adam-collard) wrote :

Note that the upgrade itself was ... a little tricky. I think I ended up running "juju upgrade-juju --version=1.24.0" a few times after having changed agent-stream to proposed (before 1.24.0 hit the mainstream)

Changed in juju-core:
assignee: nobody → Katherine Cox-Buday (cox-katherine-e)
status: Triaged → In Progress

I cannot reproduce this locally by trying the trivial case a few times:

juju bootstrap
juju deploy ubuntu

(wait)

juju upgrade-juju --version=1.24.0

I suggest trying to restart the service. From the logs it looks as though jujud didn't come back up after the upgrade.

Adam Collard (adam-collard) wrote :

Roger suggested that this could be because of a juju environment that was bootstrapped with uploaded tools. From poking around in the environment it certainly looks like that's the case.

I see a FORCE-VERSION file that pertains to 1.23.3 so I think we can close this bug down

Changed in juju-core:
status: In Progress → Invalid
Changed in juju-core:
status: Invalid → Confirmed

Unsure of the significance, but it looks like there is always a FORCE-VERSION file placed into the 1.23.3 folder. As I cannot reproduce the issue locally, it's possible that this file has nothing to do with the issue at hand.

tags: added: cts sts
Changed in juju-core:
assignee: Katherine Cox-Buday (cox-katherine-e) → nobody
milestone: 1.24.1 → 1.25.0

Adam, it looks like the log cuts off right after it goes down for the upgrade. Are there any log messages after this, or is jujud even running?

Jorge, do you have any logs you can furnish to help triangulate any issues?

Changed in juju-core:
status: Confirmed → Incomplete
Adam Collard (adam-collard) wrote :

jujud is still running, but you are correct that the logs just stop. Note that this log is just from the restart: juju is able to talk to jujud correctly for a few seconds before it dies.

strace'ing the jujud process shows a futex(). With -ff I see more things (epoll, read, write, lots of clock_gettime).

Interesting, so upon restarting jujud, you receive this series of log messages every time? That would imply that it's finding the update but is unable to apply it.

Correct.

On Fri, 19 Jun 2015 at 19:40 Katherine Cox-Buday <email address hidden>
wrote:

> Interesting, so upon restarting jujud, you receive these series of log
> messages every time? That would imply that it's finding the update but
> is unable to apply it.

Changed in juju-core:
status: Incomplete → Triaged

What seems to be happening is the 1.23.3 jujud is failing to exit and restart to continue the upgrade. Please confirm this by restarting the agent, noting the jujud PID and once the API becomes unresponsive, check that the running jujud is still the same PID.
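
The PID check described above could be scripted roughly as follows. This is only a sketch: the helper names `jujud_pid` and `report_pid_change` are made up for illustration, and matching on the exact process name `jujud` is an assumption about how the agent shows up in the process table.

```shell
# Print the PID of the running jujud (empty if none is found).
# Assumption: the agent's process name is exactly "jujud".
jujud_pid() {
    pgrep -x jujud || true
}

# Compare two PID snapshots and say whether the process was replaced.
report_pid_change() {
    if [ "$1" = "$2" ]; then
        echo "same PID ($1): jujud has not restarted"
    else
        echo "PID changed ($1 -> $2): jujud restarted"
    fi
}

# Usage (not run here):
#   sudo restart juju-agent-ubuntu-local
#   before=$(jujud_pid)
#   ... wait until "juju status" starts hanging ...
#   report_pid_change "$before" "$(jujud_pid)"
```

If the second snapshot matches the first, the old jujud never exited, which is consistent with the failure mode suspected here.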

My guess is that a worker is failing to exit, preventing the agent from exiting so that the upgrade can begin. The fact that the new tools aren't used even when you manually restart the agent indicates that it could be the upgrader worker itself. It's the one responsible for downloading the tools and installing them.

Turning up the logging might tell us more. In the time where the environment is responsive please run this to increase the logging level:

    juju set-env logging-config='<root>=TRACE'

Then restart the agent and report back with the logs after the agent restarted.

You can put logging back with:

    juju set-env logging-config='<root>=WARNING;unit=DEBUG'

Can you also please report which OS and init system (upstart or systemd) are in use?

Finally can you please run:

   ls -l ~/.juju/local/tools

Thanks!

Adam Collard (adam-collard) wrote :

On Wed, 24 Jun 2015 at 04:40 Menno Smits <email address hidden> wrote:

> What seems to be happening is the 1.23.3 jujud is failing to exit and
> restart to continue the upgrade. Please confirm this by restarting the
> agent, noting the jujud PID and once the API becomes unresponsive, check
> that the running jujud is still the same PID.
>

Confirmed, there are no PID changes.

>
> My guess is that a worker is failing to exit, preventing the agent from
> exiting so that the upgrade can begin. The fact that the new tools
> aren't used even when you manually restart the agent indicates that it
> could be the upgrader worker itself. It's the one responsible for
> downloading the tools and installing them.
>
> Turning up the logging might tell us more. In the time where the
> environment is responsive please run this to increase the logging level:
>
> juju set-env logging-config='<root>=TRACE'
>
> Then restart the agent and report back with the logs after the agent
> restarted.
>
>
After verifying there were no logs after 08:51:20...

$ grep '2015-06-24 08:5' /var/log/juju-ubuntu-local/machine-0.log | pastebinit

http://paste.ubuntu.com/11766690/

> You can put logging back with:
>
> juju set-env logging-config='<root>=WARNING;unit=DEBUG'
>

Will leave as trace for now, thanks though.

>
> Can you also please report which OS and init system (upstart or systemd)
> are in use?
>

Ubuntu Trusty 14.04.2 with our good friend upstart

> Finally can you please run:
>
> ls -l ~/.juju/local/tools
>

FYI JUJU_HOME is set (so it's not in ~/.juju), that said

$ ls -al $JUJU_HOME/local/tools
total 16
drwxr-xr-x 4 root root 4096 Jun 16 11:35 .
drwxr-xr-x 9 ubuntu ubuntu 4096 Jun 24 08:50 ..
drwxr-xr-x 2 root root 4096 May 27 21:18 1.23.3.1-trusty-amd64
drwxr-xr-x 2 root root 4096 Jun 16 11:35 1.24.0-trusty-amd64
lrwxrwxrwx 1 root root 21 May 27 21:18 machine-0 -> 1.23.3.1-trusty-amd64
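
The listing shows the agent symlink still pointing at the old tools directory while a newer one sits next to it. A rough way to spot that state automatically is sketched below. The function name `stale_tools_links` is hypothetical, and picking the "newest" directory with a GNU `sort -V` version sort is an assumption that only holds when all tools directories share one series and architecture.

```shell
# List agent symlinks in a tools directory that do not point at the
# newest tools version found alongside them.
stale_tools_links() {
    tools_dir=$1
    # Real directories only (symlinks skipped), version-sorted, last wins.
    newest=$(
        for entry in "$tools_dir"/*; do
            if [ -d "$entry" ] && [ ! -L "$entry" ]; then
                basename "$entry"
            fi
        done | sort -V | tail -n 1
    )
    for link in "$tools_dir"/*; do
        [ -L "$link" ] || continue
        target=$(readlink "$link")
        if [ "$target" != "$newest" ]; then
            echo "$(basename "$link") -> $target (newest available: $newest)"
        fi
    done
}

# Usage (not run here):
#   stale_tools_links "$JUJU_HOME/local/tools"
```

An empty result means every agent link already points at the newest tools directory.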

Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)

Not confirmed, but just in case, this may be related to bug 1468653.

I thought that it could be the same thing as bug 1468653 too, but it's not definite. We'll see.

Thanks Adam for gathering up all the extra details... extremely useful.

There's some logging that's been added to 1.24 that would have made diagnosing this issue easier. That said, I can see that after just about everything has finished in preparation for the reboot, a watcher fires reporting a change to the leases DB collection. This indicates that the lease manager is dying.

The lease/leadership functionality in 1.23 is known to have issues where it can get stuck and not honour kill requests, so this is probably the problem here.

I can reliably reproduce the problem myself like this:

## with juju 1.23.3

juju bootstrap
juju deploy ubuntu
juju add-unit ubuntu
# wait for machines and units to come up

## with juju 1.24.0
juju upgrade-juju --upload-tools

To get the upgrade to complete you can manually set the symlink in the tools directory to point to the new tools version and then kill jujud. This will probably need to be done on every state server in a HA configuration.
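
A minimal sketch of that manual workaround, assuming the local-provider layout seen earlier in this report (a tools directory holding one directory per version plus a per-agent symlink such as machine-0). The function name `switch_agent_tools` is made up for illustration, and since the tools directory is owned by root it would need to run from a root shell.

```shell
# Repoint an agent's tools symlink at another version directory.
# Arguments: tools directory, agent link name, version directory name.
switch_agent_tools() {
    tools_dir=$1
    agent=$2
    version_dir=$3
    if [ ! -d "$tools_dir/$version_dir" ]; then
        echo "no such tools directory: $tools_dir/$version_dir" >&2
        return 1
    fi
    # -n replaces the symlink itself rather than creating a new link
    # inside the directory it currently points to.
    ln -sfn "$version_dir" "$tools_dir/$agent"
}

# Usage (not run here; paths and names taken from this report):
#   switch_agent_tools "$JUJU_HOME/local/tools" machine-0 1.24.0-trusty-amd64
#   # then kill the stuck jujud so it comes back up on the new tools
```

The link target is kept relative, matching how the existing machine-0 link is laid out.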

I'm fairly sure this has been fixed in 1.24 but will check now.

no longer affects: juju-core

The problem has been fixed for 1.24 already.

It's really unfortunate but anyone on 1.23.x is likely to run into this problem and will have to manually switch the tools symlink to allow the upgrade to happen. I'm going to email the mailing list about this now.

There's no code fix that makes sense (you won't be able to upgrade to a fixed Juju version without manual intervention anyway) and we're not doing another 1.23 release, so I'm updating the ticket to reflect this.

Also note: I'm pretty sure bug 1468653 is a different issue.

Changed in juju-core:
status: New → Invalid
Adam Collard (adam-collard) wrote :

Thanks Menno! Manually switching the symlink and restarting jujud worked as advertised.

tags: added: kanban-cross-team