Not able to deploy on a remote LXD cluster model using Kubernetes Juju controller

Bug #1866623 reported by Dominik Fleischmann
Affects          Status        Importance  Assigned to     Milestone
Canonical Juju   Fix Released  Critical    John A Meinel
2.7              Fix Released  Critical    John A Meinel

Bug Description

I'm trying to use a microk8s controller to manage at the same time a remote LXD cluster.

I executed the following commands:
sudo snap install juju --classic
sudo snap install microk8s --classic
sudo snap install lxd --classic
lxd init --auto
lxc network set lxdbr0 ipv6.address none
juju bootstrap microk8s
juju add-cloud --controller microk8s-localhost lxd-remote -f clouds.yaml --force
lxc config trust add lxd-client.crt
juju add-credential lxd-remote --controller microk8s -f lxd-remote-credentials.yaml
juju add-model test lxd-remote
juju deploy ubuntu

With lxd-client.crt being my own generated lxc client certificate, which is also referenced in lxd-remote-credentials.yaml (see attachment). The server-cert is the one located in /var/lib/lxd/cluster.crt.

When following these steps, a model is created and Juju tries to deploy the charm, but the machine is never allocated and the following errors appear in the controller model:

controller-0: 11:57:08 ERROR juju.worker.dependency "instance-mutater" manifold worker returned unexpected error: cannot start machine instancemutater worker: Tag not valid
controller-0: 11:57:19 ERROR juju.worker.dependency "firewaller" manifold worker returned unexpected error: machine 0 not provisioned
controller-0: 11:57:30 ERROR juju.worker.dependency "compute-provisioner" manifold worker returned unexpected error: no controller machines found

clouds.yaml:
clouds:
  lxd-remote:
    type: lxd
    auth-types: [certificate]
    endpoint: https://172.31.84.136:8443

Revision history for this message
Dominik Fleischmann (dominik.f) wrote :

Adding juju debug-log also.

John A Meinel (jameinel)
Changed in juju:
status: New → In Progress
importance: Undecided → High
importance: High → Critical
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.8-beta1
Revision history for this message
John A Meinel (jameinel) wrote :

So the basic errors that you see are:
controller-0: 17:21:25 ERROR juju.worker.dependency "instance-mutater" manifold worker returned unexpected error: cannot start machine instancemutater worker: Tag not valid
controller-0: 17:21:38 ERROR juju.worker.dependency "compute-provisioner" manifold worker returned unexpected error: no controller machines found

Digging into "Tag not valid": that one is because the InstanceMutater wants a machine agent.
worker/instancemutater/worker.go:

func (config Config) Validate() error {
    ...
    if config.Tag == nil {
        return errors.NotValidf("nil Tag")
    }
    if _, ok := config.Tag.(names.MachineTag); !ok {
        return errors.NotValidf("Tag")
    }
    ...
}

However, controller agents in K8s models have tags of type names.ControllerAgentTag.

That said, what are we running here anyway? Why would the InstanceMutater need a machine, given that it is probing a *remote* LXD service? It might be for this:

// NewContainerWorker returns a worker that keeps track of
// the containers in the state for this machine agent and
// polls their instance for addition or removal changes.
func NewContainerWorker(config Config) (worker.Worker, error) {
    m, err := config.Facade.Machine(config.Tag.(names.MachineTag))

However, this is an entire cloud, not a single machine that we are provisioning. So while we *do* want to be monitoring in case we need to set profiles on the instances that we spawn, we *definitely* shouldn't be doing so within the context of the Controller's machine.

(Even without K8s in play, it doesn't make sense for my local MAAS-based controller to be inspecting its own machine's containers while it is provisioning machines in a remote LXD cluster.)

I'm guessing we just shouldn't be running the InstanceMutater the way that we are, but I can't quite figure out whether it is enough to allow ControllerAgent tags and simply never start a NewContainerWorker, since we aren't provisioning containers inside the controller machine.
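
To make that concrete, here is a minimal standalone sketch of what relaxing the check could look like, assuming only the names and errors packages Juju already uses. The validateTag helper and the main function are made up for illustration; this is not the actual patch:

package main

import (
    "fmt"

    "github.com/juju/errors"
    "gopkg.in/juju/names.v3"
)

// validateTag mirrors the tag check in Config.Validate, relaxed to accept
// controller-agent tags as well as machine tags. Hypothetical helper only.
func validateTag(tag names.Tag) error {
    if tag == nil {
        return errors.NotValidf("nil Tag")
    }
    switch tag.(type) {
    case names.MachineTag, names.ControllerAgentTag:
        return nil
    default:
        return errors.NotValidf("Tag")
    }
}

func main() {
    fmt.Println(validateTag(names.NewMachineTag("0")))         // <nil>
    fmt.Println(validateTag(names.NewControllerAgentTag("0"))) // <nil>
    fmt.Println(validateTag(names.NewUnitTag("ubuntu/0")))     // Tag not valid
}

The container-worker question is separate: even with a relaxed check, a controller-agent tag would still have to skip the NewContainerWorker path, since there is no machine to track containers on.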

The 'compute-provisioner' failing with 'no controller machines found' is more of a problem. Looking for that error string leads me to:
func (st *State) controllerAddresses() ([]string, error) {
...
        err = machines.Find(bson.D{{"jobs", JobManageModel}}).All(&allAddresses)
        if err != nil {
                return nil, err
        }
        if len(allAddresses) == 0 {
                return nil, errors.New("no controller machines found")
        }

That seems to be assuming that the API addresses for the controller will be found by searching for machines in the Controller model (st.ControllerInfo().ModelTag) that have JobManageModel, and then finding ScopeMatchCloudLocal. However,

a) That completely ignores Spaces modeling and the "juju-mgmt-space".
b) There are no Machines to find in K8s models.

I have the feeling that func controllerAddresses just needs to die and be replaced by GetAPIHostPorts.
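
For what it's worth, a rough standalone sketch of that direction: build the address list from API host/port records (the kind of data GetAPIHostPorts holds) rather than from machine documents, so the lookup also works when the controller has no machines at all. The function name and the [][]string input are invented for the example; the real change would live in the state package:

package main

import (
    "errors"
    "fmt"
    "net"
)

// controllerAddressesFromAPIHostPorts derives controller addresses from
// recorded API host/ports instead of from machines with JobManageModel.
// Hypothetical sketch of the idea only, not the real state code.
func controllerAddressesFromAPIHostPorts(hostPortsPerServer [][]string) ([]string, error) {
    seen := make(map[string]bool)
    var addrs []string
    for _, hostPorts := range hostPortsPerServer {
        for _, hp := range hostPorts {
            host, _, err := net.SplitHostPort(hp)
            if err != nil {
                return nil, err
            }
            if !seen[host] {
                seen[host] = true
                addrs = append(addrs, host)
            }
        }
    }
    if len(addrs) == 0 {
        return nil, errors.New("no controller API addresses found")
    }
    return addrs, nil
}

func main() {
    addrs, err := controllerAddressesFromAPIHostPorts([][]string{
        {"10.0.0.2:17070", "10.0.0.3:17070"},
    })
    fmt.Println(addrs, err) // [10.0.0.2 10.0.0.3] <nil>
}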

Revision history for this message
John A Meinel (jameinel) wrote :

So I have a branch here:
https://github.com/jameinel/juju/tree/2.7-k8s-controlling-lxd-1866623

Using that, I can do:
make install
make JUJU_SKIP_DEP=true JUJU_BUILD_NUMBER=5 JUJUD_STAGING_DIR=$GOPATH/tmp/jujud-operator microk8s-operator-update

(you have to build Jujud at least once so that the second build can figure out what version number it is installing, see bug #1866658).

With that, I can then do:
juju bootstrap microk8s micro

juju add-cloud -c micro bio ./lxd-bio-cloud.yaml --force
juju add-credential -c micro bio -f lxd-bio-cred.yaml
juju add-model bio-test bio

juju debug-log -m controller --replay --tail

lxd-bio-cloud.yaml is just:
clouds:
  bio:
    type: lxd
    auth-types: [certificate]
    endpoint: "https://192.168.185.105:8443"
    config:
      ssl-hostname-verification: false

where the endpoint is an IP that is accessible from my local machine.
lxd-bio-cred.yaml is:
credentials:
    bio:
        admin:
            auth-type: certificate
            server-cert: |
<copied from the credential section of 'lxd info'>
            client-cert: |
...
            client-key: |
...

client-cert and client-key were generated with openssl:
openssl genrsa -out private.key 1024
openssl req -new -x509 -key private.key -out publickey.cer -days 365

I then told LXD to trust that certificate with:
lxc config trust add local: ./publickey.cer

My patch above takes the two code paths that are 'obviously wrong' and makes them tolerate missing information that they don't actually use. (stateAddresses are only used for controller provisioning, bug #1866643, and a MachineTag is only needed on a machine that can run local containers, not for the 'Environ' provisioner that is bringing them up as external machines.)

After those patches, it now fails with:
$ juju deploy cs:~jameinel/ubuntu-lite
$ juju status
Model     Controller  Cloud/Region  Version  SLA          Timestamp
bio-test  micro       bio/default   2.7.4    unsupported  21:37:52+04:00

App          Version  Status   Scale  Charm        Store       Rev  OS      Notes
ubuntu-lite           waiting  0/1    ubuntu-lite  jujucharms  7    ubuntu

Unit           Workload  Agent       Machine  Public address  Ports  Message
ubuntu-lite/0  waiting   allocating  0                               waiting for machine

Machine  State  DNS  Inst id  Series  AZ  Message
0        down        pending  bionic      no matching agent binaries available

So it is preparing to start an instance, but needs to seed cloud-init with the agent binary that the machine should download from the controller. Presumably K8s-based controllers don't store machine agent binaries for it to provide (because agents are assumed to be seeded in the operator images).

However, we already have all the ability we need to populate Mongo with updated agents (it is how 'juju upgrade-controller' works).
The only thing I'm not sure about is whether a K8s controller would intentionally prevent you from uploading agents. I'll have to dig into that tomorrow. But I did, at least, get past the bad workers.

Revision history for this message
Ian Booth (wallyworld) wrote :

The controller database does not need to be seeded with the agent binaries ahead of time. As long as the agent version is in the simplestreams metadata, the agent binaries will be downloaded; any agents stored in the controller simply short-circuit the download, acting as a local cache.

The issue with "waiting for machine" is that the test is being done with the tip of the 2.7 branch, which reports Juju version 2.7.4, and no simplestreams agent metadata has been published for that version yet.

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Harry Pidcock (hpidcock)
Changed in juju:
status: Fix Committed → Fix Released