`juju upgrade-juju --upload-tools` leaves local environment unusable

Bug #1457728 reported by Adam Israel
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
juju-core  Fix Released  High        Andrew Wilkins
1.24       Fix Released  Critical    Andrew Wilkins

Bug Description

I've been running the devel releases of juju 1.24. Each upgrade (beta1 -> beta2, beta2 -> beta3, and beta3 -> beta4) has left the local environment unusable.

My environment:

Trusty, running inside a Vagrant VM
Juju 1.24-beta3
provider: local

Steps to reproduce:

1) apt-get update && apt-get upgrade
2) verify new version with `juju version`
3) run `juju upgrade-juju --upload-tools`

Once the above steps are run, juju commands become unresponsive. The `juju status --debug` output shows a "connection refused" error: https://gist.github.com/AdamIsrael/0c67c8553bb0a485e9ca

I restarted `juju-agent-vagrant-local`, and there is a short window in which I'm able to run `juju status` before it hangs with the same "connection refused" error. Here's the output of `juju status` in that window: https://gist.github.com/AdamIsrael/b97ef47ebecb82219dce

The machine-0.log: https://gist.github.com/AdamIsrael/aa18a0d651f3f7735dda

I've been able to reproduce this reliably with each beta upgrade. The only workaround I've found is to `juju destroy-environment --force` and re-bootstrap.

Changed in juju-core:
milestone: none → 1.24-beta5
importance: Undecided → High
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
tags: added: upgrade-juju vagrant
description: updated
tags: added: local-provider
Andrew Wilkins (axwalk) wrote :

I can repro, but it doesn't happen all the time for me. Looking into it.

Changed in juju-core:
milestone: 1.24-beta5 → 1.25.0
Andrew Wilkins (axwalk) wrote :

Unassigning myself for the minute, as the bug I was working on isn't actually fixed yet.

Jesse Meek (waigani) wrote :

This may be due to a bug in the golang.org/x/net package where a websocket was not being closed correctly. We hit similar error messages when we discovered this bug, in particular: "error closing codec: EOF".

The bug has been fixed in the upstream net package and in 1.24-beta4, which uses the new revision (bb64f4dc73). This appears to fix the issue on my box, but as it is intermittent, could others also test and verify? To test, update dependencies.tsv to:

golang.org/x/net git bb64f4dc73d4ab97978d5e1cb34515dcc570361b

Andrew Wilkins (axwalk) wrote :

@waigani: I'm pretty sure I was testing on head of 1.25 yesterday, but will confirm later on. Also, this bug is new; our usage of websockets is not.

Andrew Wilkins (axwalk) wrote :

err sorry, s/1.25/1.24/

Andrew Wilkins (axwalk) wrote :

After much printf debugging, it appears that a call to LeadershipService.BlockUntilLeadershipReleased is, um, blocking; and that is preventing the apiserver from shutting down.

Andrew Wilkins (axwalk) wrote :

So this turns out to be quite an insidious bug, related to lease/leadership. BlockUntilLeadershipReleased will hang forever if it subscribes but isn't notified before the lease manager worker exits. We need to ensure that:
 - BlockUntilLeadershipReleased can't subscribe if there's no lease manager running.
 - Subscribers are notified (of failure) when the lease manager exits.
or something like that.
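
A minimal sketch of the two guarantees listed above, not the actual juju-core fix. The names `leaseManager`, `BlockUntilReleased`, `Release`, `Stop` and `ErrManagerStopped` are hypothetical stand-ins for illustration only; the real lease and leadership code is structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrManagerStopped is a hypothetical error handed to callers when the
// manager is not running.
var ErrManagerStopped = errors.New("lease manager stopped")

// leaseManager is a hypothetical stand-in for the lease manager worker.
type leaseManager struct {
	mu      sync.Mutex
	stopped bool
	waiters map[string][]chan error
}

func newLeaseManager() *leaseManager {
	return &leaseManager{waiters: make(map[string][]chan error)}
}

// BlockUntilReleased blocks until the named lease is released or the
// manager stops. It refuses to subscribe if the manager has already stopped.
func (m *leaseManager) BlockUntilReleased(name string) error {
	m.mu.Lock()
	if m.stopped {
		m.mu.Unlock()
		return ErrManagerStopped
	}
	// Buffered so that the single notification never blocks the sender.
	ch := make(chan error, 1)
	m.waiters[name] = append(m.waiters[name], ch)
	m.mu.Unlock()
	return <-ch
}

// Release wakes every subscriber waiting on the named lease.
func (m *leaseManager) Release(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, ch := range m.waiters[name] {
		ch <- nil
	}
	delete(m.waiters, name)
}

// Stop marks the manager as stopped and notifies all remaining
// subscribers of the failure, so that none of them hangs forever.
func (m *leaseManager) Stop() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.stopped {
		return
	}
	m.stopped = true
	for name, chans := range m.waiters {
		for _, ch := range chans {
			ch <- ErrManagerStopped
		}
		delete(m.waiters, name)
	}
}

func main() {
	m := newLeaseManager()
	done := make(chan error, 1)
	go func() { done <- m.BlockUntilReleased("mysql") }()

	// Whether the goroutine subscribes before or after Stop, it gets an
	// error back instead of blocking.
	m.Stop()
	fmt.Println(<-done)
	fmt.Println(m.BlockUntilReleased("mysql"))
}
```

Because the stopped check and the subscription happen under the same mutex as `Stop`, a caller either subscribes before shutdown and is notified by it, or arrives after shutdown and is rejected immediately. In neither case can it block the apiserver from shutting down.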

Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Andrew Wilkins (axwalk)
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released