Upgrade in progress reported, but panic happening behind scenes
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | Medium | Eric Snow | |
| 1.24 | | Medium | Eric Snow | |
| 1.25 | | Medium | Eric Snow | |
Bug Description
During an Autopilot deploy, calls to service-deploy and add-machine failed with the message "upgrade in progress - Juju functionality is limited".
The juju log shows a panic on the state server.
Logs attached
Things seem to go wrong around here:
machine-0: 2015-09-07 15:27:47 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>
machine-0: 2015-09-07 15:27:47 ERROR juju.state machine.go:1433 cannot update supported containers of machine 0: EOF
machine-0: 2015-09-07 15:27:47 WARNING juju.apiserver.
machine-0: 2015-09-07 15:27:47 WARNING juju.worker.
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "diskmanager": EOF
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "api-post-upgrade": setting up container support: setting supported containers for machine-0: EOF
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "rsyslog": failed to write rsyslog certificates: cannot write settings: EOF
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "machiner": machine-0 failed to set status started: cannot get machine 0: EOF
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "certupdater": EOF
machine-0: panic: runtime error: send on closed channel
machine-0: goroutine 672 [running]:
machine-0: runtime.
machine-0: #011/usr/
machine-0: github.
machine-0: #011/build/
machine-0: github.
machine-0: #011/build/
machine-0: github.
machine-0: #011/build/
machine-0: github.
machine-0: #011/build/
machine-0: github.
machine-0: #011/build/
machine-0: github.
machine-0: #011/build/
machine-0: github.
machine-0: #011/build/
machine-0: github.
machine-0: #011/build/
machine-0: created by github.
machine-0: #011/build/
machine-0: goroutine 1 [chan receive]:
| Alberto Donato (ack) wrote : | #1 |
| tags: | added: landscape-release-29 |
| Curtis Hovey (sinzui) wrote : | #2 |
| tags: | added: upgrade-juju |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → Medium |
| Curtis Hovey (sinzui) wrote : | #3 |
This issue is partially fixed by bug 1455260 which prevents automatic upgrade. The panic needs to be fixed (if it is not already fixed) and autopilot should not be changing the env when an upgrade is in progress.
| David Britton (davidpbritton) wrote : | #4 |
all-machines log showing the panic attached
| description: | updated |
| no longer affects: | cloud-installer |
| David Britton (davidpbritton) wrote : | #5 |
FWIW, we pin the agent version, so we should not be seeing upgrades. I think the Go panic *might* be the real error to look into on this one; there shouldn't have been a real upgrade in progress at all.
| summary: |
- OSA run: failed to deploy services
+ Upgrade in progress reported, but panic happening behind scenes |
| Changed in juju-core: | |
| assignee: | nobody → Eric Snow (ericsnowcurrently) |
| status: | Triaged → In Progress |
| milestone: | none → 1.26-alpha1 |
| Eric Snow (ericsnowcurrently) wrote : | #6 |
FTR, the logs indicate juju 1.24.5.
| Eric Snow (ericsnowcurrently) wrote : | #7 |
The several EOF-related log messages indicate that the mongo connection is not valid at that point. There are a number of possible reasons for that (e.g. mongo was not ready yet). From the messages at the end of the log file (after the stack trace), it looks like the flaky connection gets worked out.
All the workers should handle a flaky DB/API connection correctly. I expect that the "send on closed channel" is coming from a worker that is not doing the right thing. It's just a matter of tracking it down...
| Eric Snow (ericsnowcurrently) wrote : | #8 |
There may be a relationship here with lp:1472729.
| Eric Snow (ericsnowcurrently) wrote : | #9 |
Hmm. The EOF side of things should have been addressed in lp:1468581.
| Eric Snow (ericsnowcurrently) wrote : | #10 |
Yeah, looks like the same panics we were getting with lp:1472729. It's almost as though the logs are from 1.24.2, not 1.24.5...
| Eric Snow (ericsnowcurrently) wrote : | #11 |
Ah, the difference is that we're getting the panic at a different spot (trying to send to rather than re-close a closed channel). The problem appears to be here:
https:/
Which would imply that the channel was already closed in CertificateUpda
https:/
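To illustrate the distinction drawn in comment #11, here is a minimal standalone Go sketch (not juju code) showing that both sending on a closed channel and closing it a second time panic, each with its own runtime message. The helper name `panicMessage` is made up for this example, and the exact wording of the messages may vary between Go versions.

```go
package main

import "fmt"

// panicMessage runs f and returns the text of any panic it raises,
// or "" if f returns normally. Hypothetical helper for illustration.
func panicMessage(f func()) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	f()
	return ""
}

func main() {
	// A buffered channel, so a send would not block if it were still open.
	ch := make(chan struct{}, 1)
	close(ch)

	// Sending on the closed channel panics, even though buffer space is free.
	fmt.Println("send: ", panicMessage(func() { ch <- struct{}{} }))

	// Closing the channel a second time also panics, with a different message.
	fmt.Println("close:", panicMessage(func() { close(ch) }))
}
```

Guarding only against a double close therefore does not protect a later send on the same channel.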
| Eric Snow (ericsnowcurrently) wrote : | #12 |
At this point I don't see how TearDown could be called (and the channel closed) before the offending function gets called. The latter is called only once in each loop of NotifyWorker (via CertificateUpda
| Eric Snow (ericsnowcurrently) wrote : | #13 |
I may have it. If the worker is restarted (by the runner) then you could get a call to CertificateUpda
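The restart hypothesis can be sketched in isolation. The following is a simplified model, not the juju implementation: the `handler` type, its `changed` channel, and `runOnce` are hypothetical stand-ins, and it assumes the same handler value (and therefore the same channel) is reused when the runner restarts the worker.

```go
package main

import "fmt"

// handler is a hypothetical stand-in for a worker handler that signals
// on a channel while running and closes that channel on teardown.
type handler struct {
	changed chan struct{}
}

func newHandler() *handler {
	return &handler{changed: make(chan struct{}, 1)}
}

// Handle sends a notification. The panic is recovered here only so the
// example can report it instead of crashing.
func (h *handler) Handle() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("handle panicked: %v", r)
		}
	}()
	h.changed <- struct{}{}
	return nil
}

// TearDown closes the channel, as the teardown discussed in the thread does.
func (h *handler) TearDown() { close(h.changed) }

// runOnce models one lifetime of the worker: handle one event, then tear down.
func runOnce(h *handler) error {
	if err := h.Handle(); err != nil {
		return err
	}
	<-h.changed // drain so the next send has buffer space
	h.TearDown()
	return nil
}

func main() {
	h := newHandler()
	fmt.Println("first run: ", runOnce(h)) // channel still open
	fmt.Println("restarted: ", runOnce(h)) // same channel, already closed

	// One possible way to avoid the problem: give each run its own channel.
	fmt.Println("fresh handler:", runOnce(newHandler()))
}
```

In this model the second run fails because the channel closed by the first teardown is never recreated. Whether juju reuses the handler in this way is the assumption being illustrated, not something the sketch establishes.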
| Changed in juju-core: | |
| status: | In Progress → Triaged |
| assignee: | Eric Snow (ericsnowcurrently) → nobody |
| Changed in juju-core: | |
| assignee: | nobody → Eric Snow (ericsnowcurrently) |
| status: | Triaged → In Progress |
| Eric Snow (ericsnowcurrently) wrote : | #14 |
The original patch to close the channel was added for lp:1396099 / lp:1403721 (https:/
| Changed in juju-core: | |
| status: | In Progress → Fix Committed |
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |


This issue is caused by juju's design of automatically upgrading the agent, per the master bug. Juju should allow us to opt out of automatic upgrades, and all clients also need to ask whether the server is accepting commands.