`juju upgrade-juju --upload-tools` leaves local environment unusable

Bug #1457728 reported by Adam Israel on 2015-05-22
This bug affects 1 person
Affects      Importance   Assigned to
juju-core    High         Andrew Wilkins
1.24         Critical     Andrew Wilkins

Bug Description

I've been running the devel releases of juju 1.24. Each upgrade, from beta1 -> beta2, beta2 -> beta3, and beta3 -> beta4, has left the local environment unusable.

My environment:

Trusty, running inside a Vagrant VM
Juju 1.24-beta3
provider: local

Steps to reproduce:

1) apt-get update && apt-get upgrade
2) verify new version with `juju version`
3) run `juju upgrade-juju --upload-tools`

Once the above steps are run, juju commands become non-responsive. The `juju status --debug` output shows a connection refused: https://gist.github.com/AdamIsrael/0c67c8553bb0a485e9ca

I restarted `juju-agent-vagrant-local`, and there is a short window that I'm able to run `juju status` before it hangs with the same connection refused error. Here's the output of `juju status` in that window: https://gist.github.com/AdamIsrael/b97ef47ebecb82219dce

The machine-0.log: https://gist.github.com/AdamIsrael/aa18a0d651f3f7735dda

I've been able to recreate this reliably with each beta upgrade. The only solution I've found is to `juju destroy-environment --force` and re-bootstrap.

Changed in juju-core:
milestone: none → 1.24-beta5
importance: Undecided → High
Curtis Hovey (sinzui) on 2015-05-22
Changed in juju-core:
status: New → Triaged
tags: added: upgrade-juju vagrant
description: updated
tags: added: local-provider
Andrew Wilkins (axwalk) wrote :

I can repro, but it doesn't happen all the time for me. Looking into it.

Changed in juju-core:
milestone: 1.24-beta5 → 1.25.0
Andrew Wilkins (axwalk) wrote :

Unassigning myself for the minute, as the bug I was working on isn't actually fixed yet.

Jesse Meek (waigani) wrote :

This may be due to a bug in the golang.org/x/net package where a websocket was not being closed correctly. We hit similar error messages when we discovered this bug, in particular: "error closing codec: EOF".

The bug has been fixed in the upstream net package and in 1.24-beta4, which uses the new revision (bb64f4dc73). This appears to fix the issue on my box, but as it is intermittent, could others also test and verify? To try it, update dependencies.tsv:

golang.org/x/net git bb64f4dc73d4ab97978d5e1cb34515dcc570361b

Andrew Wilkins (axwalk) wrote :

@waigani: I'm pretty sure I was testing on head of 1.25 yesterday, but will confirm later on. Also, this is new; our usage of websockets is not.

Andrew Wilkins (axwalk) wrote :

err sorry, s/1.25/1.24/

Andrew Wilkins (axwalk) wrote :

After much printf debugging, it appears that a call to LeadershipService.BlockUntilLeadershipReleased is, um, blocking; and that is preventing the apiserver from shutting down.

Andrew Wilkins (axwalk) wrote :

So this turns out to be quite an insidious bug, related to lease/leadership. BlockUntilLeadershipReleased will hang forever if it subscribes but isn't notified before the lease manager worker exits. We need to ensure that:
 - BlockUntilLeadershipReleased can't subscribe if there's no lease manager running.
 - Subscribers are notified (of failure) when the lease manager exits.
or something like that.
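
The two conditions above can be illustrated with a minimal Go sketch. This is not juju's lease or leadership code: the `manager` type, its methods, and `errStopped` are hypothetical names invented for illustration. It only shows the general pattern of refusing new subscribers once the manager has stopped, and of waking existing subscribers with an error when it exits, so that a blocking call cannot hang forever.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStopped is a hypothetical error handed to callers when the
// manager is not (or no longer) running.
var errStopped = errors.New("lease manager stopped")

// manager is a toy stand-in for a lease manager worker.
// It is an assumption for illustration, not juju's actual type.
type manager struct {
	mu      sync.Mutex
	stopped bool
	subs    map[string][]chan error
}

func newManager() *manager {
	return &manager{subs: make(map[string][]chan error)}
}

// subscribe registers interest in a lease being released.
// Condition 1: it refuses to subscribe if the manager has stopped.
func (m *manager) subscribe(lease string) (<-chan error, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.stopped {
		return nil, errStopped
	}
	// Buffered so that notifying never blocks the manager.
	ch := make(chan error, 1)
	m.subs[lease] = append(m.subs[lease], ch)
	return ch, nil
}

// release notifies subscribers of a lease that it has been released.
func (m *manager) release(lease string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, ch := range m.subs[lease] {
		ch <- nil
	}
	delete(m.subs, lease)
}

// stop shuts the manager down.
// Condition 2: every remaining subscriber is notified of failure.
func (m *manager) stop() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.stopped = true
	for lease, chans := range m.subs {
		for _, ch := range chans {
			ch <- errStopped
		}
		delete(m.subs, lease)
	}
}

// blockUntilReleased blocks until the lease is released or the
// manager stops, whichever happens first.
func (m *manager) blockUntilReleased(lease string) error {
	ch, err := m.subscribe(lease)
	if err != nil {
		return err
	}
	return <-ch
}

func main() {
	m := newManager()
	done := make(chan error)
	go func() { done <- m.blockUntilReleased("some-service") }()

	// Stopping the manager unblocks the waiter whether it subscribed
	// before or after the stop.
	m.stop()
	fmt.Println("waiter returned:", <-done)
}
```

Because the subscribe check and the shutdown notification happen under the same mutex, a caller either subscribes before the stop and is woken by it, or arrives afterwards and is rejected immediately.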

Ian Booth (wallyworld) on 2015-06-05
Changed in juju-core:
assignee: nobody → Andrew Wilkins (axwalk)
status: Triaged → Fix Committed
Curtis Hovey (sinzui) on 2015-06-05
Changed in juju-core:
status: Fix Committed → Fix Released