LXD cluster endpoint doesn't update

Bug #1838780 reported by Chris Sanders
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

After bootstrapping a LXD cloud and cluster, the node that was provided as the LXD remote during bootstrap failed. After updating the local config for the cloud to point to one of the other machines, it appears the controller isn't updating its endpoint and continues to try to reach the original machine.

Here is a debug log while trying to ssh to a machine.
juju ssh --debug -vvv salt-master/0
10:08:11 INFO juju.cmd supercommand.go:57 running juju [2.6.6 gc go1.10.4]
10:08:11 DEBUG juju.cmd supercommand.go:58 args: []string{"/snap/juju/8594/bin/juju", "ssh", "--debug", "-vvv", "salt-master/0"}
10:08:11 INFO juju.juju api.go:67 connecting to API addresses: [192.168.1.242:17070]
10:08:11 DEBUG juju.api apiclient.go:1092 successfully dialed "wss://192.168.1.242:17070/model/ea27db4d-1292-4108-831d-ae64cb5815ea/api"
10:08:11 INFO juju.api apiclient.go:624 connection established to "wss://192.168.1.242:17070/model/ea27db4d-1292-4108-831d-ae64cb5815ea/api"
10:08:11 DEBUG juju.cmd.juju.commands ssh_common.go:260 proxy-ssh is false
10:08:16 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: Get https://192.168.0.224:8443/1.0: Unable to connect to: 192.168.0.224:8443 (retrying)
10:08:16 DEBUG juju.api monitor.go:35 RPC connection died
ERROR opening environment: Get https://192.168.0.224:8443/1.0: Unable to connect to: 192.168.0.224:8443
10:08:16 DEBUG cmd supercommand.go:496 error stack:
opening environment: Get https://192.168.0.224:8443/1.0: Unable to connect to: 192.168.0.224:8443
/build/juju/parts/juju/go/src/github.com/juju/juju/rpc/client.go:178:
/build/juju/parts/juju/go/src/github.com/juju/juju/api/apiclient.go:1187:
/build/juju/parts/juju/go/src/github.com/juju/juju/api/sshclient/facade.go:52:
/build/juju/parts/juju/go/src/github.com/juju/juju/cmd/juju/commands/ssh_common.go:405:
/build/juju/parts/juju/go/src/github.com/juju/juju/cmd/juju/commands/ssh_common.go:385:

You'll see it's trying to access 192.168.0.224, which is the machine that died.
I've changed the config for the cloud in clouds.yaml, as you can verify here:

juju show-cloud prod-lxd
defined: local
type: lxd
description: LXD Container Hypervisor
auth-types: [certificate]
endpoint: https://192.168.0.107:8443
regions:
  default: {}
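
Note that this show-cloud output is read from the client's local cloud definitions; the controller keeps its own copy of the cloud (endpoint included) in its database, which is why editing clouds.yaml on the client by itself doesn't change what a running controller dials. As a rough check of what the client has on disk (the path may differ if JUJU_DATA is set or under snap confinement):

# Client-side cloud definition as stored by the juju CLI
grep -A 6 'prod-lxd:' ~/.local/share/juju/clouds.yaml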

I also updated the address in bootstrap-config.yaml:
    controller-model-uuid: a10036c1-cb0c-443d-8316-b7158fe90a83
    credential: prod-lxd-creds
    cloud: prod-lxd
    type: lxd
    region: default
    endpoint: https://192.168.0.107:8443

I understand that juju doesn't accept a list of IP addresses, but there should be a way to move the endpoint when a single machine in the cluster is lost.

What can I do from here to restore this model's controller to a functioning state?
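
For reference, later Juju client releases grew a way to push an updated cloud definition to a controller via update-cloud; it is unclear whether the 2.6.6 client here supports the --controller form, so treat this as a sketch of the intended path rather than a verified fix for this version:

# Refresh the client copy from a cloud definition file, then push it to the controller
# (requires a client where update-cloud accepts -f and --controller)
juju update-cloud prod-lxd -f ~/.local/share/juju/clouds.yaml
juju update-cloud prod-lxd --controller <controller-name>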

Revision history for this message
Chris Sanders (chris.sanders) wrote :

I've logged in and edited the MongoDB database with:
db.clouds.updateOne({"_id":"prod-lxd"},{$set: {"endpoint":"https://192.168.0.107:8443", "regions": { "default" : { "endpoint" : "https://192.168.0.107:8443" } }}})
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
juju:PRIMARY> db.clouds.find({"_id": "prod-lxd"})
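
For anyone needing to reproduce this, the controller's MongoDB can typically be reached from the controller machine along these lines; the mongo client location and port can vary between Juju releases, so treat this as a sketch:

# Run on the controller machine: log into Juju's database with the machine agent's credentials
agent=$(cd /var/lib/juju/agents && echo machine-*)
pw=$(sudo grep statepassword /var/lib/juju/agents/$agent/agent.conf | cut -d' ' -f2)
# the mongo client may live under /usr/lib/juju/mongo*/bin/ depending on the release
mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$agent" --password "$pw"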

However, now I'm getting a 'not authorized' error.

juju ssh --debug -vvv salt-master/0
10:55:18 INFO juju.cmd supercommand.go:57 running juju [2.6.6 gc go1.10.4]
10:55:18 DEBUG juju.cmd supercommand.go:58 args: []string{"/snap/juju/8594/bin/juju", "ssh", "--debug", "-vvv", "salt-master/0"}
10:55:18 INFO juju.juju api.go:67 connecting to API addresses: [192.168.1.242:17070]
10:55:18 DEBUG juju.api apiclient.go:1092 successfully dialed "wss://192.168.1.242:17070/model/ea27db4d-1292-4108-831d-ae64cb5815ea/api"
10:55:18 INFO juju.api apiclient.go:624 connection established to "wss://192.168.1.242:17070/model/ea27db4d-1292-4108-831d-ae64cb5815ea/api"
10:55:19 DEBUG juju.cmd.juju.commands ssh_common.go:260 proxy-ssh is false
10:55:19 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:19 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:20 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:20 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:21 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:22 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:22 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:23 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:23 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:24 DEBUG juju.cmd.juju.commands ssh_common.go:377 getting target "salt-master/0" address(es) failed: opening environment: not authorized (retrying)
10:55:24 DEBUG juju.api monitor.go:35 RPC connection died
ERROR opening environment: not authorized
10:55:24 DEBUG cmd supercommand.go:496 error stack:
opening environment: not authorized
/build/juju/parts/juju/go/src/github.com/juju/juju/rpc/client.go:178:
/build/juju/parts/juju/go/src/github.com/juju/juju/api/apiclient.go:1187:
/build/juju/parts/juju/go/src/github.com/juju/juju/api/sshclient/facade.go:52:
/build/juju/parts/juju/go/src/github.com/juju/juju/cmd/juju/commands/ssh_common.go:405:
/build/juju/parts/juju/go/src/github.com/juju/juju/cmd/juju/commands/ssh_common.go:385:

Revision history for this message
Chris Sanders (chris.sanders) wrote :

Poking around in the database some more, I found a server certificate, which I assumed was from the old server. I pulled the one off the new server x.x.x.107 and poked it into mongo with:

db.cloudCredentials.updateOne({"_id":"prod-lxd#admin#prod-lxd-creds"},{$set: {"attributes.server-cert" : "server.cert here" }})

Now the error message I get when trying to ssh to a unit is:

ERROR opening environment: Get https://192.168.0.107:8443/1.0: x509: certificate is valid for 192.168.0.226, not 192.168.0.107

I think I'm closer, but this is madness for what should be an expected failure mode for a cluster. I guess I'll do some more database spelunking to see if I can figure out what's going on. Is there any advice or documentation I'm missing, by chance?

Revision history for this message
Chris Sanders (chris.sanders) wrote :

This is all I've found, and at this point I don't know enough about how it's supposed to work to fix it beyond making some guesses.

I verified that the certificate returned when the status is queried at 192.168.0.107:8443 is in fact issued for 192.168.0.226. This, it turns out, is the cluster.crt found on the LXD unit, not the server.crt. However, 192.168.0.226 isn't in the cluster anymore, and it appears all units in the cluster have this same cluster certificate. Since the machine that I replaced was 192.168.0.224, I can only guess that as a cluster member it had this same cert as well. I'm not clear why 192.168.0.224 was usable by the controller (or juju client?) with this certificate but 192.168.0.107 returning it is an issue.
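
A quick way to double-check which addresses the certificate presented on 8443 actually covers (plain openssl, nothing Juju-specific) is something like:

# Fetch the cert served by the LXD endpoint and show the names it is valid for
openssl s_client -connect 192.168.0.107:8443 </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'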

Revision history for this message
Chris Sanders (chris.sanders) wrote :

Setting the certificate to the contents of /var/lib/lxd/cluster.crt seems to be working. What's really strange is that, comparing that to the original value from the database, they appear to be the same with only a whitespace difference (one extra line ending), which doesn't change the contents of the actual certificate.
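
One way to compare the two values without being tripped up by trailing-newline differences is to compare fingerprints instead of diffing the PEM text; assuming the old database value was saved out to a file such as /tmp/old-server-cert.pem (hypothetical path):

# Whitespace-insensitive comparison: identical certs give identical fingerprints
openssl x509 -in /var/lib/lxd/cluster.crt -noout -fingerprint -sha256
openssl x509 -in /tmp/old-server-cert.pem -noout -fingerprint -sha256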

I guess it's possible that after changing the endpoint I needed to restart the LXD unit, as both the controller and the LXD unit were restarted during my testing.

I think the best solution here is for Juju, when working with a LXD cluster, to have a command to update the endpoints from the cluster, and also to try contacting all cluster members before failing.

Revision history for this message
John A Meinel (jameinel) wrote :

I think if we have connected to an LXD cluster, polling its configuration to find out what endpoints should be available and keeping that up to date internally would be reasonable. Do we know if the LXD API itself exposes this information?
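
For what it's worth, LXD does expose cluster membership, including each member's API URL, both through the CLI and the REST API (assuming a clustering-capable LXD release, 3.x or later):

# Names, URLs and status of all cluster members
lxc cluster list
# Same data over the REST API
lxc query /1.0/cluster/members
# each member record includes a "url" field
lxc query /1.0/cluster/members/<member-name>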

Changed in juju:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot