Not able to deploy on a remote LXD cluster model using Kubernetes Juju controller

Bug #1866623 reported by Dominik Fleischmann
Affects          Status        Importance  Assigned to     Milestone
Canonical Juju   Fix Released  Critical    John A Meinel
2.7              Fix Released  Critical    John A Meinel

Bug Description

I'm trying to use a microk8s controller to manage at the same time a remote LXD cluster.

I executed the following commands:
sudo snap install juju --classic
sudo snap install microk8s --classic
sudo snap install lxd --classic
lxd init --auto
lxc network set lxdbr0 ipv6.address none
juju bootstrap microk8s
juju add-cloud --controller microk8s-localhost lxd-remote -f clouds.yaml --force
lxc config trust add lxd-client.crt
juju add-credential lxd-remote --controller microk8s -f lxd-remote-credentials.yaml
juju add-model test lxd-remote
juju deploy ubuntu

With lxd-client.crt being my own generated lxc client certificate, which is also referenced in lxd-remote-credentials.yaml (see attachment). The server-cert is the one located in /var/lib/lxd/cluster.crt.

When following these steps, a model is created and Juju tries to deploy the charm, but the machine is never allocated and the following errors appear in the controller model:

controller-0: 11:57:08 ERROR juju.worker.dependency "instance-mutater" manifold worker returned unexpected error: cannot start machine instancemutater worker: Tag not valid
controller-0: 11:57:19 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: machine 0 not provisioned
controller-0: 11:57:30 ERROR juju.worker.dependency "compute-provisioner" manifold worker returned unexpected error: no controller machines found

clouds.yaml:
clouds:
  lxd-remote:
    type: lxd
    auth-types: [certificate]
    endpoint: https://172.31.84.136:8443

Revision history for this message
Dominik Fleischmann (dominik.f) wrote :

Adding juju debug-log also.

John A Meinel (jameinel)
Changed in juju:
status: New → In Progress
importance: Undecided → High
importance: High → Critical
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.8-beta1
Revision history for this message
John A Meinel (jameinel) wrote :

So the basic errors that you see are:
controller-0: 17:21:25 ERROR juju.worker.dependency "instance-mutater" manifold worker returned unexpected error: cannot start machine instancemutater worker: Tag not valid
controller-0: 17:21:38 ERROR juju.worker.dependency "compute-provisioner" manifold worker returned unexpected error: no controller machines found

Digging into "Tag not valid": that one is because the InstanceMutater wants a machine agent.
worker/instancemutater/worker.go:

func (config Config) Validate() error {
    ...
    if config.Tag == nil {
        return errors.NotValidf("nil Tag")
    }
    if _, ok := config.Tag.(names.MachineTag); !ok {
        return errors.NotValidf("Tag")
    }
    ...
}

However, controller agents in K8s models have tags of type names.ControllerAgentTag.

That said, what are we running here anyway? Why would the InstanceMutater need a machine, given that it is probing a *remote* LXD service? It might be for this:

// NewContainerWorker returns a worker that keeps track of
// the containers in the state for this machine agent and
// polls their instance for addition or removal changes.
func NewContainerWorker(config Config) (worker.Worker, error) {
    m, err := config.Facade.Machine(config.Tag.(names.MachineTag))

However, this is an entire cloud, not a single machine that we are provisioning. So while we *do* want to be monitoring in case we need to set profiles on the instances that we spawn, we *definitely* shouldn't be doing so within the context of the Controller's machine.

(Even without K8s in play, it doesn't make sense for my local MAAS-based controller to be inspecting its own machine's containers while it is provisioning machines in a remote LXD cluster.)

I'm guessing we just shouldn't be running the InstanceMutater the way that we are, but I can't quite figure out whether it is enough to allow ControllerAgent tags and simply never start a NewContainerWorker, since we aren't provisioning containers inside the controller machine.
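
To make that concrete, here is a minimal standalone sketch of what relaxing the check could look like, assuming only the names and errors packages Juju already uses. The validateTag helper and the main function are made up for illustration; this is not the actual patch:

package main

import (
    "fmt"

    "github.com/juju/errors"
    "gopkg.in/juju/names.v3"
)

// validateTag mirrors the tag check in Config.Validate, relaxed to accept
// controller-agent tags as well as machine tags. Hypothetical helper only.
func validateTag(tag names.Tag) error {
    if tag == nil {
        return errors.NotValidf("nil Tag")
    }
    switch tag.(type) {
    case names.MachineTag, names.ControllerAgentTag:
        return nil
    default:
        return errors.NotValidf("Tag")
    }
}

func main() {
    fmt.Println(validateTag(names.NewMachineTag("0")))         // <nil>
    fmt.Println(validateTag(names.NewControllerAgentTag("0"))) // <nil>
    fmt.Println(validateTag(names.NewUnitTag("ubuntu/0")))     // Tag not valid
}

The container-worker question is separate: even with a relaxed check, a controller-agent tag would still have to skip the NewContainerWorker path, since there is no machine to track containers on.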

The 'compute-provisioner' failing with 'no controller machines found' is more of a problem. Looking for that error string leads me to:
func (st *State) controllerAddresses() ([]string, error) {
...
        err = machines.Find(bson.D{{"jobs", JobManageModel}}).All(&allAddresses)
        if err != nil {
                return nil, err
        }
        if len(allAddresses) == 0 {
                return nil, errors.New("no controller machines found")
        }

That seems to be assuming that the API addresses for the controller will be found by searching for machines in the Controller model (st.ControllerInfo().ModelTag) that have JobManageModel, and then finding ScopeMatchCloudLocal. However,

a) That completely ignores Spaces modeling and the "juju-mgmt-space".
b) There are no Machines to find in K8s models.

I have the feeling that func controllerAddresses just needs to die and be replaced by GetAPIHostPorts.
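
For what it's worth, a rough standalone sketch of that direction: build the address list from API host/port records (the kind of data GetAPIHostPorts holds) rather than from machine documents, so the lookup also works when the controller has no machines at all. The function name and the [][]string input are invented for the example; the real change would live in the state package:

package main

import (
    "errors"
    "fmt"
    "net"
)

// controllerAddressesFromAPIHostPorts derives controller addresses from
// recorded API host/ports instead of from machines with JobManageModel.
// Hypothetical sketch of the idea only, not the real state code.
func controllerAddressesFromAPIHostPorts(hostPortsPerServer [][]string) ([]string, error) {
    seen := make(map[string]bool)
    var addrs []string
    for _, hostPorts := range hostPortsPerServer {
        for _, hp := range hostPorts {
            host, _, err := net.SplitHostPort(hp)
            if err != nil {
                return nil, err
            }
            if !seen[host] {
                seen[host] = true
                addrs = append(addrs, host)
            }
        }
    }
    if len(addrs) == 0 {
        return nil, errors.New("no controller API addresses found")
    }
    return addrs, nil
}

func main() {
    addrs, err := controllerAddressesFromAPIHostPorts([][]string{
        {"10.0.0.2:17070", "10.0.0.3:17070"},
    })
    fmt.Println(addrs, err) // [10.0.0.2 10.0.0.3] <nil>
}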

Revision history for this message
John A Meinel (jameinel) wrote :

So I have a branch here:
https://github.com/jameinel/juju/tree/2.7-k8s-controlling-lxd-1866623

Using that, I can do:
make install
make JUJU_SKIP_DEP=true JUJU_BUILD_NUMBER=5 JUJUD_STAGING_DIR=$GOPATH/tmp/jujud-operator microk8s-operator-update

(you have to build Jujud at least once so that the second build can figure out what version number it is installing, see bug #1866658).

With that, I can then do:
juju bootstrap microk8s micro

juju add-cloud -c micro bio ./lxd-bio-cloud.yaml --force
juju add-credential -c micro bio -f lxd-bio-cred.yaml
juju add-model bio-test bio

juju debug-log -m controller --replay --tail

lxd-bio-cloud.yaml is just:
clouds:
  bio:
    type: lxd
    auth-types: [certificate]
    endpoint: "https://192.168.185.105:8443"
    config:
      ssl-hostname-verification: false

where the endpoint is an IP that is accessible from my local machine.
lxd-bio-cred.yaml is:
credentials:
    bio:
        admin:
            auth-type: certificate
            server-cert: |
<copied from the credential section of 'lxd info'>
            client-cert: |
...
            client-key: |
...

client-cert and client-key were generated with openssl:
openssl genrsa -out private.key 1024
openssl req -new -x509 -key private.key -out publickey.cer -days 365

I then told LXD to trust that certificate with:
lxc config trust add local: ./publickey.cer

My patch above takes the two code paths that are 'obviously wrong' and makes them tolerate missing information that they don't actually use. (stateAddresses are only used for controller provisioning, bug #1866643, and a MachineTag is only needed on a machine that can run local containers, not for the 'Environ' provisioner that is bringing them up as external machines.)

After those patches, it now fails with:
$ juju deploy cs:~jameinel/ubuntu-lite
$ juju status
Model     Controller  Cloud/Region  Version  SLA          Timestamp
bio-test  micro       bio/default   2.7.4    unsupported  21:37:52+04:00

App          Version  Status   Scale  Charm        Store       Rev  OS      Notes
ubuntu-lite           waiting  0/1    ubuntu-lite  jujucharms  7    ubuntu

Unit           Workload  Agent       Machine  Public address  Ports  Message
ubuntu-lite/0  waiting   allocating  0                               waiting for machine

Machine  State  DNS  Inst id  Series  AZ  Message
0        down        pending  bionic      no matching agent binaries available

So it is preparing to start an instance, but needs to seed cloud-init with the agent binary that the machine should download from the controller. Presumably K8s-based controllers don't store machine agent binaries for it to provide (because agents are assumed to be seeded in the operator images).

However, we already have all the ability we need to populate Mongo with updated agents (it is how 'juju upgrade-controller' works).
The only thing I'm not sure about is whether a K8s controller would intentionally prevent you from uploading agents. I'll have to dig into that tomorrow. But I did, at least, get past the bad workers.

Revision history for this message
Ian Booth (wallyworld) wrote :

The controller database does not need to be seeded with the agent binaries ahead of time. As long as the agent version is in the simplestreams metadata, the agent binaries will be downloaded; any agents stored in the controller simply short-circuit the download, acting as a local cache.

The issue with "waiting for machine" is that the test is being done with the tip of the 2.7 branch, which reports Juju version 2.7.4, and no simplestreams agent metadata has been published for that version yet.

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Harry Pidcock (hpidcock)
Changed in juju:
status: Fix Committed → Fix Released