upgrading 1.18 to 1.19 breaks agent.conf

Bug #1333682 reported by John A Meinel
This bug affects 2 people
Affects: juju-core
Status: Fix Released
Importance: Critical
Assigned to: Ian Booth

Bug Description

After running "juju upgrade" on a 1.18 environment to take it to 1.19, we end up missing "apiaddresses" in the agent.conf file, though we do have "stateaddresses" listed.

This causes a panic() (attached).

We seem to be missing an upgrade step to fix the content of agent.conf. We also have a bug in our code: a value that can be nil is used without first being checked for nil.
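
To illustrate the missing nil check, here is a minimal sketch. The types agentConfig and connectionDetails, the method apiAddresses, and the error errNoAPIAddresses are hypothetical stand-ins made up for this example; they are not the actual juju-core code. The idea is only that a missing "apiaddresses" entry should surface as an error rather than a panic.

```go
package main

import (
	"errors"
	"fmt"
)

// connectionDetails and agentConfig are hypothetical stand-ins,
// not the real juju-core types.
type connectionDetails struct {
	addresses []string
}

type agentConfig struct {
	// apiDetails is assumed to be nil when agent.conf has no
	// "apiaddresses" entry.
	apiDetails *connectionDetails
}

var errNoAPIAddresses = errors.New("agent config has no API addresses")

// apiAddresses checks for nil before dereferencing, so a missing entry
// is reported as an error instead of causing a nil pointer panic.
func (c *agentConfig) apiAddresses() ([]string, error) {
	if c.apiDetails == nil || len(c.apiDetails.addresses) == 0 {
		return nil, errNoAPIAddresses
	}
	return c.apiDetails.addresses, nil
}

func main() {
	broken := &agentConfig{} // simulates an agent.conf without apiaddresses
	if _, err := broken.apiAddresses(); err != nil {
		fmt.Println("cannot open API:", err)
	}
}
```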

We also have a bug in our CI tests: I have gotten 2 reports of this happening in the field, but CI itself seems to think that upgrading is just fine.

We appear to have 0 upgrade steps to 1.20 (though we would also have to fix some code so that any such steps could be applied for an upgrade targeting 1.19).

Both reports of this issue involved the local provider, but given where it happens in the code, I don't think it is specific to that provider.

John A Meinel (jameinel) wrote :

Ok, this is weird. I just bootstrapped directly 1.18.1 and I see:
apiaddresses:
- localhost:17070

So I have the feeling that the actual bug is that 1.16 might not have put the data in there, and 1.18 didn't add it, so 1.19/1.20 isn't adding it either.

John A Meinel (jameinel) wrote :

The plot thickens.
I edited agent.conf to remove that line and restarted the 1.18.1 agent, which now panics:
2014-06-24 13:08:53 INFO juju.cmd supercommand.go:297 running juju-1.18.1.1-trusty-amd64 [gc]
2014-06-24 13:08:53 INFO juju.cmd.jujud machine.go:127 machine agent machine-0 start (1.18.1.1-trusty-amd64 [gc])
2014-06-24 13:08:53 DEBUG juju.agent agent.go:384 read agent config, format "1.18"
2014-06-24 13:08:53 INFO juju.cmd.jujud machine.go:155 Starting StateWorker for machine-0
2014-06-24 13:08:53 INFO juju runner.go:262 worker: start "state"
2014-06-24 13:08:53 INFO juju.state open.go:81 opening state; mongo addresses: ["localhost:37017"]; entity "machine-0"
2014-06-24 13:08:53 INFO juju runner.go:262 worker: start "api"
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x46686b]

goroutine 9 [running]:
runtime.panic(0xcdad60, 0x1a86648)
        /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
launchpad.net/juju-core/agent.(*configInternal).OpenAPI(0xc2100da000, 0x0, 0x0, 0x0, 0x0, ...)
        /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/agent/agent.go:589 +0x3db
main.openAPIState(0x7f1326061cb8, 0xc2100da000, 0x7f1326061b88, 0xc21004fb40, 0xc2100da000, ...)
        /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/cmd/jujud/agent.go:182 +0x55
main.(*MachineAgent).APIWorker(0xc21004fb40, 0xc210043d60, 0x10, 0x7f1325ee4ec8, 0x1, ...)
        /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/cmd/jujud/machine.go:181 +0x10b
main.func·006(0xeb72d0, 0x10, 0x7f1325ee4ec8, 0x1)
        /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/cmd/jujud/machine.go:159 +0x35
launchpad.net/juju-core/worker.(*runner).runWorker(0xc210037d80, 0x0, 0xe0cb10, 0x3, 0xc210073be0)
        /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/worker/runner.go:263 +0x306
created by launchpad.net/juju-core/worker.(*runner).run
        /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/worker/runner.go:179 +0x356

So potentially it is a different bug, which is that 1.19 is actually *removing* the line that used to be there.

John A Meinel (jameinel) wrote :

A thought: the 1.19 code expects Login to return a list of known hosts to connect to, and is probably trying to cache those addresses (to handle HA, etc).
However, if a machine-1 agent came up as 1.19 and machine-0 was still 1.18, is it possible that 1.19 is calling Login, getting a response that doesn't have any apiHostPorts in it, and then triggering a rewrite of agent.conf and *stripping out* the data?

Ian Booth (wallyworld) wrote :

After calling Login(), the client does cache the host addresses, but only if they exist. See cacheChangedAPIInfo(). The rough flow is:
- call Login
- get addresses from the result, store them in the client state's hostPorts attribute
- at some later time, cacheChangedAPIInfo() is called, but this ignores empty or nil addresses

Login gets the addresses to return by reading the apiHostPorts value from state (see below).

The mechanism to update agent.conf is the APIAddressUpdater worker, which uses a watcher to listen for changes in the apiHostPorts value recorded in state. If something sets this value to nil or empty, the change is propagated to each machine agent and written out to agent.conf, erasing the current non-empty value.
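
A rough sketch of that failure mode, under loud assumptions: agentConf and applyChanges are hypothetical names invented here, a channel stands in for the watcher, and an in-memory struct stands in for agent.conf. This is not the APIAddressUpdater code, only the shape of the problem.

```go
package main

import "fmt"

// agentConf is a hypothetical in-memory stand-in for agent.conf.
type agentConf struct {
	APIAddresses []string
}

// applyChanges mimics an updater loop: every value received from the
// (assumed) watcher channel overwrites the stored addresses, including
// nil or empty ones.
func applyChanges(conf *agentConf, changes <-chan []string) {
	for addrs := range changes {
		conf.APIAddresses = addrs
	}
}

func main() {
	conf := &agentConf{APIAddresses: []string{"localhost:17070"}}

	changes := make(chan []string, 1)
	changes <- nil // state reports an empty apiHostPorts value
	close(changes)

	applyChanges(conf, changes)
	fmt.Println(len(conf.APIAddresses)) // prints 0: the address was erased
}
```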

So, any nil or empty apiHostPort addresses used by login or the address updater come from reading that value from state.

What sets the apiHostPorts value in state is the publishAPIServers() method in the peergrouper worker. This is called when triggered by a timer or when machine info changes in state.

Wild guess: going from 1.18 to 1.19 is going from a non-HA to an HA setup. Perhaps the peergrouper worker is triggered to publish API server info while the replica set is still being initialised, and it calls publish with an empty set of API servers.

I'll add a check to stop empty api servers from being published and log a warning so we can see how often it might be happening.
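
The kind of check described above might look roughly like the following. It is only a sketch of the idea: hostPort, stateStore, and this publishAPIServers signature are assumptions for illustration, not the actual peergrouper code or the committed fix.

```go
package main

import "log"

// hostPort and stateStore are hypothetical; they are not juju's real types.
type hostPort struct {
	Host string
	Port int
}

type stateStore struct {
	apiHostPorts [][]hostPort
}

// publishAPIServers refuses to overwrite the stored value with an empty
// set and logs a warning instead. It reports whether anything was published.
func publishAPIServers(st *stateStore, servers [][]hostPort) bool {
	if len(servers) == 0 {
		log.Printf("WARNING: not publishing empty API server list")
		return false
	}
	st.apiHostPorts = servers
	return true
}

func main() {
	st := &stateStore{apiHostPorts: [][]hostPort{{{Host: "localhost", Port: 17070}}}}
	publishAPIServers(st, nil) // logs a warning; st is left unchanged
	log.Printf("still have %d API server(s)", len(st.apiHostPorts))
}
```

Guarding at the publisher keeps the empty value out of state, so neither Login nor the address updater would ever see it.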

Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Ian Booth (wallyworld)
Ian Booth (wallyworld) wrote :

I am not sure whether my fix works, as the issue seems difficult to reproduce. I hope we can get feedback from whoever raised the issue.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Nick Moffitt (nick-moffitt) wrote :

Ian, we have encountered this in a 1.18 → 1.20 upgrade in bug #1444912 just now.

Changed in juju-core:
status: Fix Released → Confirmed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Confirmed → Fix Released