Upgraded juju to 1.24 dies shortly after starting
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | Undecided | Unassigned | |
| 1.23 | | High | Menno Finlay-Smits | |
| 1.24 | | High | Menno Finlay-Smits | |
Bug Description
On a local provider that was upgraded from 1.23.3.1 to 1.24, jujud comes up briefly but then dies.
1.24.0-
machine-0.log after "sudo restart juju-agent-
http://
juju status just hangs, failing to connect to the websocket
2015-06-18 15:54:00 DEBUG juju.api apiclient.go:337 error dialing "wss://
I haven't tested it on different providers, as generally I deploy from scratch.
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → 1.24.1 |
| Adam Collard (adam-collard) wrote : | #1 |
| Changed in juju-core: | |
| assignee: | nobody → Katherine Cox-Buday (cox-katherine-e) |
| status: | Triaged → In Progress |
I cannot reproduce this locally by trying the trivial case a few times:
juju bootstrap
juju deploy ubuntu
(wait)
juju upgrade-juju --version=1.24.0
I suggest trying to restart the service. From the logs it looks as though jujud didn't come back up after the upgrade.
| Adam Collard (adam-collard) wrote : | #4 |
Roger suggested that this could be because of a juju environment that was bootstrapped with uploaded tools. From poking around in the environment it certainly looks like that's the case.
I see a FORCE-VERSION file that pertains to 1.23.3, so I think we can close this bug down.
| Changed in juju-core: | |
| status: | In Progress → Invalid |
| Changed in juju-core: | |
| status: | Invalid → Confirmed |
Unsure of the significance, but it looks like there is always a FORCE-VERSION file placed into the 1.23.3 folder. As I cannot reproduce the issue locally, it's possible that this file has nothing to do with the issue at hand.
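To check which tools versions carry a FORCE-VERSION file, something like the following could be used. This is only a sketch: `force_version_dirs` is a hypothetical helper name, and it assumes the file sits directly inside each version directory under the tools directory, as the comments above suggest.

```shell
# Print the name of every tools version directory under $1 that
# contains a FORCE-VERSION file. Agent symlinks such as machine-0
# are skipped so each version is reported only once.
force_version_dirs() {
    tools_dir=$1
    for marker in "$tools_dir"/*/FORCE-VERSION; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$marker" ] || continue
        dir=$(dirname "$marker")
        # Skip symlinks that merely point at a version directory.
        if [ -L "$dir" ]; then
            continue
        fi
        basename "$dir"
    done
}
```

It would be called with the tools directory as its only argument, for example `force_version_dirs ~/.juju/local/tools`.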
| tags: | added: cts sts |
| Changed in juju-core: | |
| assignee: | Katherine Cox-Buday (cox-katherine-e) → nobody |
| milestone: | 1.24.1 → 1.25.0 |
Adam, it looks like the log cuts off right after it goes down for the upgrade. Are there any log messages after this, or is jujud even running?
Jorge, do you have any logs you can furnish to help triangulate any issues?
| Changed in juju-core: | |
| status: | Confirmed → Incomplete |
| Adam Collard (adam-collard) wrote : | #7 |
jujud is still running, but you are correct that the logs just stop. Note that this log is just from the restart: juju is able to talk to jujud correctly for a few seconds before it dies.
strace'ing the jujud process shows a futex(); with -ff I see more things (epoll, read, write, lots of clock_gettime).
Interesting, so upon restarting jujud, you receive this series of log messages every time? That would imply that it's finding the update but is unable to apply it.
| Adam Collard (adam-collard) wrote : Re: [Bug 1466565] Re: Upgraded juju to 1.24 dies shortly after starting | #9 |
Correct.
On Fri, 19 Jun 2015 at 19:40 Katherine Cox-Buday <email address hidden>
wrote:
> Interesting, so upon restarting jujud, you receive this series of log
> messages every time? That would imply that it's finding the update but
> is unable to apply it.
>
> Bug description:
> On a local provider that was upgraded from 1.23.3.1 to 1.24, jujud
> comes up briefly but then dies.
>
> 1.24.0-
>
> machine-0.log after "sudo restart juju-agent-
>
> http://
>
> juju status just hangs, failing to connect to the websocket
>
> 2015-06-18 15:54:00 DEBUG juju.api apiclient.go:337 error dialing
>
> "wss://
> will retry: websocket.Dial
>
> wss://localhost
> dial tcp 127.0.0.1:17070: connection refused
>
>
> I haven't tested it on different providers, as generally I deploy from
> scratch.
| Changed in juju-core: | |
| status: | Incomplete → Triaged |
| Menno Finlay-Smits (menno.smits) wrote : | #10 |
What seems to be happening is the 1.23.3 jujud is failing to exit and restart to continue the upgrade. Please confirm this by restarting the agent, noting the jujud PID and once the API becomes unresponsive, check that the running jujud is still the same PID.
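The PID check described above could be sketched roughly as follows. The helper names `jujud_pids` and `pids_unchanged` are hypothetical, and the sketch assumes the agent process is named exactly `jujud` so that `pgrep -x` finds it.

```shell
# Print the PIDs of all running jujud processes on one line
# (prints nothing if jujud is not running).
jujud_pids() {
    pgrep -x jujud | sort -n | tr '\n' ' '
}

# Succeed only if both snapshots are non-empty and identical,
# i.e. the same jujud process is still running.
pids_unchanged() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Intended use (after restarting the agent):
#   before=$(jujud_pids)
#   ... wait until the API becomes unresponsive ...
#   after=$(jujud_pids)
#   pids_unchanged "$before" "$after" && echo "jujud did not restart"
```

If the snapshots match, the old jujud never exited, which is consistent with a worker blocking the shutdown.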
My guess is that a worker is failing to exit, preventing the agent from exiting so that the upgrade can begin. The fact that the new tools aren't used even when you manually restart the agent indicates that it could be the upgrader worker itself. It's the one responsible for downloading the tools and installing them.
Turning up the logging might tell us more. In the time where the environment is responsive please run this to increase the logging level:
juju set-env logging-
Then restart the agent and report back with the logs after the agent restarted.
You can put logging back with:
juju set-env logging-
Can you also please report which OS and init system (upstart or systemd) are in use?
Finally can you please run:
ls -l ~/.juju/local/tools
Thanks!
| Adam Collard (adam-collard) wrote : | #11 |
On Wed, 24 Jun 2015 at 04:40 Menno Smits <email address hidden> wrote:
> What seems to be happening is the 1.23.3 jujud is failing to exit and
> restart to continue the upgrade. Please confirm this by restarting the
> agent, noting the jujud PID and once the API becomes unresponsive, check
> that the running jujud is still the same PID.
>
Confirmed, there are no PID changes.
>
> My guess is that a worker is failing to exit, preventing the agent from
> exiting so that the upgrade can begin. The fact that the new tools
> aren't used even when you manually restart the agent indicates that it
> could be the upgrader worker itself. It's the one responsible for
> downloading the tools and installing them.
>
> Turning up the logging might tell us more. In the time where the
> environment is responsive please run this to increase the logging level:
>
> juju set-env logging-
>
> Then restart the agent and report back with the logs after the agent
> restarted.
>
>
After verifying there were no logs after 08:51:20...
$ grep '2015-06-24 08:5' /var/log/
pastebinit
http://
> You can put logging back with:
>
> juju set-env logging-
>
Will leave as trace for now, thanks though.
>
> Can you also please report which OS and init system (upstart or systemd)
> are in use?
>
Ubuntu Trusty 14.04.2 with our good friend upstart
> Finally can you please run:
>
> ls -l ~/.juju/local/tools
>
FYI JUJU_HOME is set (so it's not in ~/.juju), that said
$ ls -al $JUJU_HOME/
total 16
drwxr-xr-x 4 root root 4096 Jun 16 11:35 .
drwxr-xr-x 9 ubuntu ubuntu 4096 Jun 24 08:50 ..
drwxr-xr-x 2 root root 4096 May 27 21:18 1.23.3.
drwxr-xr-x 2 root root 4096 Jun 16 11:35 1.24.0-trusty-amd64
lrwxrwxrwx 1 root root 21 May 27 21:18 machine-0 -> 1.23.3.
| Changed in juju-core: | |
| assignee: | nobody → Menno Smits (menno.smits) |
Not confirmed, but just in case, this may be related to bug 1468653.
| Menno Finlay-Smits (menno.smits) wrote : | #13 |
I thought that it could be the same thing as bug 1468653 too, but it's not definite. We'll see.
| Menno Finlay-Smits (menno.smits) wrote : | #14 |
Thanks Adam for gathering up all the extra details... extremely useful.
There's some logging that's been added to 1.24 that would have made diagnosing this issue easier. That said, I can see that after just about everything has finished in preparation for the reboot, a watcher fires reporting a change to the leases DB collection. This indicates that the lease manager is dying.
The lease/leadership functionality in 1.23 is known to have issues where it can get stuck and not honour kill requests, so this is probably the problem here.
| Menno Finlay-Smits (menno.smits) wrote : | #15 |
I can reliably reproduce the problem myself like this:
## with juju 1.23.3
juju bootstrap
juju deploy ubuntu
juju add-unit ubuntu
# wait for machines and units to come up
## with juju 1.24.0
juju upgrade-juju --upload-tools
| Menno Finlay-Smits (menno.smits) wrote : | #16 |
To get the upgrade to complete you can manually set the symlink in the tools directory to point to the new tools version and then kill jujud. This will probably need to be done on every state server in a HA configuration.
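The symlink part of this workaround could be sketched as a small shell helper. This is a minimal sketch, not an official procedure: `switch_tools` and its parameters are hypothetical names, and the example directory and symlink names are taken from the `ls` listing in comment #11.

```shell
# Repoint an agent's tools symlink at a different tools version.
#   $1  tools directory
#   $2  name of the new version directory, e.g. 1.24.0-trusty-amd64
#   $3  name of the agent symlink, e.g. machine-0
switch_tools() {
    tools_dir=$1
    new_version=$2
    agent=$3
    # Refuse to point the agent at tools that are not present.
    if [ ! -d "$tools_dir/$new_version" ]; then
        echo "no such tools directory: $tools_dir/$new_version" >&2
        return 1
    fi
    # -f replaces the existing link; -n treats the link itself as the
    # destination instead of descending into the directory it points to.
    ln -sfn "$new_version" "$tools_dir/$agent"
}
```

In the listing from comment #11 the tools directory is owned by root, so the `ln` step would likely need to run with root privileges. jujud still has to be killed afterwards so that it comes back up with the new tools.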
| Menno Finlay-Smits (menno.smits) wrote : | #17 |
I'm fairly sure this has been fixed in 1.24 but will check now.
| no longer affects: | juju-core |
| Menno Finlay-Smits (menno.smits) wrote : | #18 |
The problem has been fixed for 1.24 already.
It's really unfortunate, but anyone on 1.23.x is likely to run into this problem and will have to manually switch the tools symlink to allow the upgrade to happen. I'm going to email the mailing list about this now.
There's no code fix that makes sense (you won't be able to upgrade to a fixed Juju version without manual intervention anyway) and we're not doing another 1.23 release, so I'm updating the ticket to reflect this.
| Menno Finlay-Smits (menno.smits) wrote : | #19 |
Also note: I'm pretty sure bug 1468653 is a different issue.
| Changed in juju-core: | |
| status: | New → Invalid |
| Adam Collard (adam-collard) wrote : | #20 |
Thanks Menno! Manually switching the symlink and restarting jujud worked as advertised.
| tags: | added: kanban-cross-team |


Note that the upgrade itself was ... a little tricky. I think I ended up running "juju upgrade-juju --version=1.24.0" a few times after having changed agent-stream to proposed (before 1.24.0 hit the mainstream).