Bootstrap node occasionally panicking with "not a valid unit name"

Bug #1437266 reported by William Grant on 2015-03-27
This bug affects 5 people
Affects: juju-core (Importance: High, Assigned to: Frank Mueller)
Affects: 1.24 (Importance: High, Assigned to: Frank Mueller)

Bug Description

Since upgrading from 1.22.something to 1.23-beta1 on Tuesday, my vivid local provider bootstrap node's jujud has been panicking several times a day. It usually happens on destroy-service, but I've seen it on a deploy as well. It is not reliably reproducible.

The tail of the log after a destroy-service panic:

2015-03-26 22:30:49 WARNING juju.lease lease.go:301 A notification timed out after 1m0s.
2015-03-27 00:25:28 ERROR juju.apiserver debuglog.go:110 debug-log handler error: write tcp 127.0.0.1:56076: broken pipe
2015-03-27 01:01:36 ERROR juju.rpc server.go:573 error writing response: write tcp 10.0.3.153:59431: broken pipe
panic: cannot retrieve unit "m#15#n#juju-public": "m#15#n#juju-public" is not a valid unit name

Dimiter Naydenov (dimitern) wrote :

Can you paste some logs (preferably at TRACE level) and explain what commands you've run?

Dimiter Naydenov (dimitern) wrote :

At first glance this looks related to a recent change in the megawatcher (backingOpenedPorts.remove method), which cleans up opened ports for units when the unit goes away. What units have you deployed?
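
To illustrate why the panic message complains about the name, here is a minimal sketch of a unit-name check. The pattern is a simplified assumption about the "<service>/<number>" shape of unit names and is not copied from Juju; isUnitName is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// unitNamePattern is a simplified, assumed approximation of the
// "<service>/<number>" shape of a unit name; it is not Juju's own pattern.
var unitNamePattern = regexp.MustCompile(`^[a-z][a-z0-9-]*/[0-9]+$`)

// isUnitName (hypothetical helper) reports whether s looks like a unit
// name such as "haproxy/0".
func isUnitName(s string) bool {
	return unitNamePattern.MatchString(s)
}

func main() {
	// The second value is the id taken from the panic message in this report.
	for _, s := range []string{"haproxy/0", "m#15#n#juju-public"} {
		fmt.Printf("%q looks like a unit name: %v\n", s, isUnitName(s))
	}
}
```

Under this assumption, the string from the panic has no "<service>/<number>" part at all, so any lookup that treats it as a unit name would be rejected.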

William Grant (wgrant) wrote :

I'll get logs next time it falls over. I've been deploying, redeploying, upgrading, relating, destroying, unrelating (and everything else you can think of) a selection of apache2, haproxy, gunicorn, nrpe, storage, and a couple of private charms. On one particular occasion it panicked right as I destroy-service'd a live instance of lp:~canonical-launchpad-branches/charms/trusty/turnipcake/devel

Curtis Hovey (sinzui) on 2015-03-27
Changed in juju-core:
status: New → Triaged
importance: Undecided → Critical
importance: Critical → Medium
tags: added: deploy destroy-service
Curtis Hovey (sinzui) on 2015-04-22
tags: added: destroy-machine
William Grant (wgrant) wrote :

Reproduced with trace logging. I "juju destroy-service"'d all of the services in the environment, and watched "juju status" until it started hanging.

William Grant (wgrant) wrote :

machine-0: panic: cannot retrieve unit "m#3#n#juju-public": "m#3#n#juju-public" is not a valid unit name
machine-0: goroutine 1464 [running]:
machine-0: runtime.panic(0x131f0c0, 0xc20b9e04f0)
machine-0:     /usr/lib/go/src/pkg/runtime/panic.c:279 +0xf5
machine-0: github.com/juju/juju/state.(*backingOpenedPorts).removed(0xc20bf0c180, 0xc20814ae80, 0xc2085e1840, 0x10bd720, 0xc20b73d160)
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/megawatcher.go:585 +0x1e6
machine-0: github.com/juju/juju/state.(*allWatcherStateBacking).Changed(0xc20890b410, 0xc2085e1840, 0xc20b73d130, 0xb, 0x10bd720, 0xc20b73d160, 0xffffffffffffffff, 0x0, 0x0)
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/megawatcher.go:815 +0x46a
machine-0: github.com/juju/juju/state.(*storeManager).loop(0xc208728190, 0x0, 0x0)
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/multiwatcher.go:189 +0x2d5
machine-0: github.com/juju/juju/state.func·028()
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/multiwatcher.go:158 +0x65
machine-0: created by github.com/juju/juju/state.newStoreManager
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/multiwatcher.go:167 +0x80

Stuart Bishop (stub) wrote :

The test suite in lp:~stub/charms/postgresql/enable-integration-tests seems to be reliably triggering this with the 1.23 release and the local provider. To reproduce, run 'make integration_test_93' against a bootstrapped environment. all-machines.log attached.

Stuart Bishop (stub) wrote :

I can repeat this using:

juju bootstrap
juju deploy cs:postgresql
juju deploy cs:postgresql-psql psql
juju add-relation postgresql:db psql:db
juju wait
juju-deployer -T

I haven't reproduced this using 'juju destroy-service'.

Martin Packman (gz) on 2015-05-11
Changed in juju-core:
importance: Medium → High
Curtis Hovey (sinzui) on 2015-05-11
Changed in juju-core:
milestone: none → 1.24.0
Ian Booth (wallyworld) wrote :

Frank, the recent work to add the removed() method to backingOpenedPorts does not properly process the incoming id. See the updated() method and how it calls backingEntityIdForOpenedPortsKey() for how to do it.
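
As a rough illustration of the point above, the sketch below splits an id like the one in the panic into its parts before anything is looked up. The "m#<machine>#n#<network>" layout is inferred from the panic messages in this report, and parsePortsKey is a hypothetical helper; it is not the real backingEntityIdForOpenedPortsKey() and makes no claim about how that function works.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePortsKey (hypothetical helper) splits an id of the assumed form
// "m#<machine>#n#<network>" into its machine id and network name.
// It returns an error instead of panicking when the id has another shape.
func parsePortsKey(key string) (machineID, network string, err error) {
	parts := strings.Split(key, "#")
	if len(parts) != 4 || parts[0] != "m" || parts[2] != "n" ||
		parts[1] == "" || parts[3] == "" {
		return "", "", fmt.Errorf("%q is not a valid opened-ports key", key)
	}
	return parts[1], parts[3], nil
}

func main() {
	// Id taken from the panic message in the trace-level log above.
	machineID, network, err := parsePortsKey("m#3#n#juju-public")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("machine %s, network %s\n", machineID, network)
}
```

Returning an error for an unexpected id, rather than passing it on as a unit name, is the design choice being sketched here; the actual fix may look quite different.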

Changed in juju-core:
assignee: nobody → Frank Mueller (themue)
milestone: 1.24.0 → 1.24-beta2
Frank Mueller (themue) on 2015-05-12
Changed in juju-core:
status: Triaged → In Progress
Curtis Hovey (sinzui) on 2015-05-12
Changed in juju-core:
milestone: 1.24-beta2 → 1.24-beta3
Ian Booth (wallyworld) on 2015-05-12
Changed in juju-core:
milestone: 1.24-beta3 → 1.25.0
Antonio Rosales (arosales) wrote :

Note, this bug is also affecting Charm CI as reported in https://bugs.launchpad.net/juju-core/+bug/1454359. If at all possible, I suggest this be targeted at 1.24.

-thanks,
Antonio

Frank Mueller (themue) on 2015-05-14
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2015-05-20
Changed in juju-core:
status: Fix Committed → Fix Released