Upgrade in progress reported, but panic happening behind scenes

Bug #1493123 reported by Alberto Donato on 2015-09-07
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
juju-core
Medium
Eric Snow
1.24
Medium
Eric Snow
1.25
Medium
Eric Snow

Bug Description

During an Autopilot deploy, calls to service-deploy and add-machine failed with "upgrade in progress - Juju functionality is limited" message.

juju log show a panic on the state server.

Logs attached

Things seem to go wrong around here:

machine-0: 2015-09-07 15:27:47 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
machine-0: 2015-09-07 15:27:47 ERROR juju.state machine.go:1433 cannot update supported containers of machine 0: EOF
machine-0: 2015-09-07 15:27:47 WARNING juju.apiserver.metricsmanager metricsmanager.go:131 failed to send metrics: EOF
machine-0: 2015-09-07 15:27:47 WARNING juju.worker.metricworker.sender sender.go:29 failed to send metrics failed to send metrics: EOF - will retry later
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "diskmanager": EOF
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "api-post-upgrade": setting up container support: setting supported containers for machine-0: EOF
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "rsyslog": failed to write rsyslog certificates: cannot write settings: EOF
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "machiner": machine-0 failed to set status started: cannot get machine 0: EOF
machine-0: 2015-09-07 15:27:47 ERROR juju.worker runner.go:223 exited "certupdater": EOF
machine-0: panic: runtime error: send on closed channel
machine-0: goroutine 672 [running]:
machine-0: runtime.panic(0x18ad1c0, 0x3fcb289)
machine-0: #011/usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
machine-0: github.com/juju/juju/cmd/jujud/agent.func·045(0x7fc04676ff10, 0xc21005a870, 0xc21005a870, 0x7fc04676ff10)
machine-0: #011/build/juju-core-dJ7Ywe/juju-core-1.24.5/src/github.com/juju/juju/cmd/jujud/agent/machine.go:1019 +0x180
machine-0: github.com/juju/juju/cmd/jujud/agent.(*agentConf).ChangeConfig(0xc2101c8570, 0xc210463700, 0x0, 0x0)
machine-0: #011/build/juju-core-dJ7Ywe/juju-core-1.24.5/src/github.com/juju/juju/cmd/jujud/agent/agent.go:125 +0xbe
machine-0: github.com/juju/juju/cmd/jujud/agent.(*MachineAgent).ChangeConfig(0xc21008d840, 0xc210463700, 0x0, 0x0)
machine-0: #011/build/juju-core-dJ7Ywe/juju-core-1.24.5/src/github.com/juju/juju/cmd/jujud/agent/machine.go:495 +0x39
machine-0: github.com/juju/juju/cmd/jujud/agent.func·046(0x42ae, 0x9099, 0xc2105c9800, 0x398, 0xc2107c4800, ...)
machine-0: #011/build/juju-core-dJ7Ywe/juju-core-1.24.5/src/github.com/juju/juju/cmd/jujud/agent/machine.go:1025 +0xc6
machine-0: github.com/juju/juju/worker/certupdater.(*CertificateUpdater).updateCertificate(0xc210075cb0, 0xc2104e6340, 0x1, 0x1, 0xc210808ea0, ...)
machine-0: #011/build/juju-core-dJ7Ywe/juju-core-1.24.5/src/github.com/juju/juju/worker/certupdater/certupdater.go:174 +0x868
machine-0: github.com/juju/juju/worker/certupdater.(*CertificateUpdater).SetUp(0xc210075cb0, 0x0, 0xc2103278c0, 0x4, 0x4)
machine-0: #011/build/juju-core-dJ7Ywe/juju-core-1.24.5/src/github.com/juju/juju/worker/certupdater/certupdater.go:101 +0x342
machine-0: github.com/juju/juju/worker.(*notifyWorker).loop(0xc2104e6100, 0x0, 0x0)
machine-0: #011/build/juju-core-dJ7Ywe/juju-core-1.24.5/src/github.com/juju/juju/worker/notifyworker.go:83 +0x45
machine-0: github.com/juju/juju/worker.func·003()
machine-0: #011/build/juju-core-dJ7Ywe/juju-core-1.24.5/src/github.com/juju/juju/worker/notifyworker.go:55 +0x54
machine-0: created by github.com/juju/juju/worker.NewNotifyWorker
machine-0: #011/build/juju-core-dJ7Ywe/juju-core-1.24.5/src/github.com/juju/juju/worker/notifyworker.go:56 +0xa1
machine-0: goroutine 1 [chan receive]:

Alberto Donato (ack) wrote :
tags: added: landscape-release-29
Curtis Hovey (sinzui) wrote :

This issue is caused by juju's design to automatically upgrade the agent. per the master bug. Juju should allow us to not automatically update, but also, all clients need to ask if the server is accepting commands.

tags: added: upgrade-juju
Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
Curtis Hovey (sinzui) wrote :

This issue is partially fixed by bug 1455260 which prevents automatic upgrade. The panic needs to be fixed (if it is not already fixed) and autopilot should not be changing the env when an upgrade is in progress.

David Britton (davidpbritton) wrote :

all-machines log showing the panic attached

description: updated
no longer affects: cloud-installer
David Britton (davidpbritton) wrote :

FWIW, we pin the agent version, so we should not be seeing upgrades. I think the go panic *might* be the real error to look into on this one, there shouldn't have been a real upgrade in progress at all.

summary: - OSA run: failed to deploy services
+ Upgrade in progress reported, but panic happening behind scenes
Changed in juju-core:
assignee: nobody → Eric Snow (ericsnowcurrently)
status: Triaged → In Progress
milestone: none → 1.26-alpha1
Eric Snow (ericsnowcurrently) wrote :

FTR, the logs indicate juju 1.24.5.

Eric Snow (ericsnowcurrently) wrote :

The several EOF-related log message indicate that the mongo connection is not valid at that point. There are a number of possible reasons for that (e.g. mongo was not ready yet). From the messages at the end of the log file (after the stack trace), it looks like the flaky connection gets worked out.

All the workers should handle a flaky DB/API connection correctly. I expect that the "send on closed channel" is coming from a worker that is not doing the right thing. It's just a matter of tracking it down...

Eric Snow (ericsnowcurrently) wrote :

There may be a relationship here with lp:1472729.

Eric Snow (ericsnowcurrently) wrote :

Hmm. The EOF side of things should have been addressed in lp:1468581.

Eric Snow (ericsnowcurrently) wrote :

Yeah, looks like the same panics we were getting with lp:1472729. It's almost as though the logs are from 1.24.2, not 1.24.5...

Eric Snow (ericsnowcurrently) wrote :

Ah, the difference is that we're getting the panic at a different spot (trying to send to rather than re-close a closed channel). The problem appears to be here:

  https://github.com/juju/juju/blob/1.24/cmd/jujud/agent/machine.go#L1030

Which would imply that the channel was already closed in CertificateUpdater.TearDown:

  https://github.com/juju/juju/blob/1.24/worker/certupdater/certupdater.go#L200

Eric Snow (ericsnowcurrently) wrote :

At this point I don't see how TearDown could be called (and the channel closed) before the offending function gets called. The latter is called only one in each loop of NotifyWorker (via CertificateUpdater.Handle), while the former is only called once we've exited the loop. So perhaps I'm barking up the wrong tree...

Eric Snow (ericsnowcurrently) wrote :

I may have it. If the worker is restarted (by the runner) then you could get a call to CertificateUpdater.Handle *after* TearDown is called. This is because the restarted worker is using the same channel (which it already closed the first time around).

Changed in juju-core:
status: In Progress → Triaged
assignee: Eric Snow (ericsnowcurrently) → nobody
Changed in juju-core:
assignee: nobody → Eric Snow (ericsnowcurrently)
status: Triaged → In Progress
Eric Snow (ericsnowcurrently) wrote :

The original patch to close the channel was added for lp:1396099 / lp:1403721 (https://github.com/juju/juju/pull/1346).

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2015-10-16
Changed in juju-core:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers