Bootstrap node occasionally panicking with "not a valid unit name"

Bug #1437266 reported by William Grant
This bug affects 5 people
Affects      Status        Importance  Assigned to    Milestone
juju-core    Fix Released  High        Frank Mueller
1.24         Fix Released  High        Frank Mueller

Bug Description

Since upgrading from 1.22.something to 1.23-beta1 on Tuesday, my vivid local provider bootstrap node's jujud has been panicking several times a day. It usually happens on destroy-service, but I've seen it on a deploy as well. It is not reliably reproducible.

The tail of the log after a destroy-service panic:

2015-03-26 22:30:49 WARNING juju.lease lease.go:301 A notification timed out after 1m0s.
2015-03-27 00:25:28 ERROR juju.apiserver debuglog.go:110 debug-log handler error: write tcp 127.0.0.1:56076: broken pipe
2015-03-27 01:01:36 ERROR juju.rpc server.go:573 error writing response: write tcp 10.0.3.153:59431: broken pipe
panic: cannot retrieve unit "m#15#n#juju-public": "m#15#n#juju-public" is not a valid unit name

Dimiter Naydenov (dimitern) wrote :

Can you paste some logs (preferably at TRACE level) and explain what commands you've run?

Dimiter Naydenov (dimitern) wrote :

At first glance this looks related to a recent change in the megawatcher (the backingOpenedPorts.removed method), which cleans up opened ports for units when the unit goes away. What units have you deployed?

William Grant (wgrant) wrote :

I'll get logs next time it falls over. I've been deploying, redeploying, upgrading, relating, destroying, unrelating (and everything else you can think of) a selection of apache2, haproxy, gunicorn, nrpe, storage, and a couple of private charms. On one particular occasion it panicked right as I destroy-service'd a live instance of lp:~canonical-launchpad-branches/charms/trusty/turnipcake/devel

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → Critical
importance: Critical → Medium
tags: added: deploy destroy-service
Curtis Hovey (sinzui)
tags: added: destroy-machine
William Grant (wgrant) wrote :

Reproduced with trace logging. I "juju destroy-service"'d all of the services in the environment, and watched "juju status" until it started hanging.

William Grant (wgrant) wrote :

machine-0: panic: cannot retrieve unit "m#3#n#juju-public": "m#3#n#juju-public" is not a valid unit name
machine-0: goroutine 1464 [running]:
machine-0: runtime.panic(0x131f0c0, 0xc20b9e04f0)
machine-0:     /usr/lib/go/src/pkg/runtime/panic.c:279 +0xf5
machine-0: github.com/juju/juju/state.(*backingOpenedPorts).removed(0xc20bf0c180, 0xc20814ae80, 0xc2085e1840, 0x10bd720, 0xc20b73d160)
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/megawatcher.go:585 +0x1e6
machine-0: github.com/juju/juju/state.(*allWatcherStateBacking).Changed(0xc20890b410, 0xc2085e1840, 0xc20b73d130, 0xb, 0x10bd720, 0xc20b73d160, 0xffffffffffffffff, 0x0, 0x0)
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/megawatcher.go:815 +0x46a
machine-0: github.com/juju/juju/state.(*storeManager).loop(0xc208728190, 0x0, 0x0)
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/multiwatcher.go:189 +0x2d5
machine-0: github.com/juju/juju/state.func·028()
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/multiwatcher.go:158 +0x65
machine-0: created by github.com/juju/juju/state.newStoreManager
machine-0:     /build/buildd/juju-core-1.23-beta4/src/github.com/juju/juju/state/multiwatcher.go:167 +0x80

Stuart Bishop (stub) wrote :

The test suite in lp:~stub/charms/postgresql/enable-integration-tests seems to trigger this reliably with the 1.23 release and the local provider: run 'make integration_test_93' against a bootstrapped environment. all-machines.log attached.

Stuart Bishop (stub) wrote :

I can repeat this using:

juju bootstrap
juju deploy cs:postgresql
juju deploy cs:postgresql-psql psql
juju add-relation postgresql:db psql:db
juju wait
juju-deployer -T

I haven't reproduced this using 'juju destroy-service'.

Martin Packman (gz)
Changed in juju-core:
importance: Medium → High
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.24.0
Ian Booth (wallyworld) wrote :

Frank, the recent work to add the removed() method to backingOpenedPorts does not properly process the incoming id. See the updated() method and how it calls backingEntityIdForOpenedPortsKey() for how to do it.
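
To illustrate the point above, here is a minimal Go sketch. It is not the juju code: the key layout "m#<machine>#n#<network>" is only inferred from the panic message, the unit-name pattern is a simplified assumption, and the names portsKeyRe, unitNameRe, and machineIDFromPortsKey are hypothetical. It shows why the raw opened-ports key fails a unit-name check and how the key could be parsed first to recover the entity it refers to.

```go
package main

import (
	"fmt"
	"regexp"
)

// portsKeyRe is a hypothetical pattern for opened-ports keys shaped like
// "m#15#n#juju-public" (machine id, then network name). The layout is
// assumed from the panic message, not taken from the juju source.
var portsKeyRe = regexp.MustCompile(`^m#([0-9]+(?:/[a-z]+/[0-9]+)*)#n#(.*)$`)

// unitNameRe is a simplified, assumed pattern for unit names such as
// "apache2/0"; the real validation rules may be stricter.
var unitNameRe = regexp.MustCompile(`^[a-z][a-z0-9-]*/[0-9]+$`)

// machineIDFromPortsKey extracts the machine id from an opened-ports key.
// It reports false instead of panicking when the key does not parse.
func machineIDFromPortsKey(key string) (string, bool) {
	m := portsKeyRe.FindStringSubmatch(key)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	key := "m#15#n#juju-public"

	// Using the raw key as a unit name fails, which matches the panic text.
	fmt.Println(unitNameRe.MatchString(key)) // false

	// Parsing the key first yields an id that can be looked up safely.
	id, ok := machineIDFromPortsKey(key)
	fmt.Println(id, ok) // 15 true
}
```

Returning a boolean rather than panicking on an unparseable key keeps a single bad document from taking down the whole watcher loop.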

Changed in juju-core:
assignee: nobody → Frank Mueller (themue)
milestone: 1.24.0 → 1.24-beta2
Frank Mueller (themue)
Changed in juju-core:
status: Triaged → In Progress
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-beta2 → 1.24-beta3
Ian Booth (wallyworld)
Changed in juju-core:
milestone: 1.24-beta3 → 1.25.0
Antonio Rosales (arosales) wrote :

Note that this bug is also affecting Charm CI, as reported in https://bugs.launchpad.net/juju-core/+bug/1454359. If at all possible, I suggest this be targeted at 1.24.

-thanks,
Antonio

Frank Mueller (themue)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released