No nodes available in cluster

Bug #1815034 reported by Tyler Treat
This bug affects 1 person
Affects: Google Cloud Platform Integrator Charm
Status: Invalid
Importance: High
Assigned to: Cory Johns

Bug Description

I'm attempting to run a k8s cluster on GCE, but it appears the worker node is not starting appropriately. Here are the commands I've run after setting up GCP credentials:

$ juju bootstrap google
$ juju deploy kubernetes-core
$ juju deploy cs:~containers/gcp-integrator
$ juju trust gcp-integrator
$ juju relate gcp-integrator kubernetes-master
$ juju relate gcp-integrator kubernetes-worker

These commands appear to all run successfully. After waiting for the cluster to quiesce, juju status shows the following output: https://paste.ubuntu.com/p/RgPYwVZ3Yb/

As you can see, both the k8s master and worker have status "waiting." Also, after scp'ing the kube config to my machine, `kubectl get nodes` shows: "No resources found."
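
For reference, the kube config was pulled down and checked roughly like this (the unit name and paths are the usual CDK defaults and may differ in your deployment):

$ juju scp kubernetes-master/0:config ~/.kube/config
$ kubectl get nodes
No resources found.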

When I attempt to deploy mariadb to the k8s cluster, the status shows "blocked" with the message "no nodes available to schedule pods" since there appear to be no nodes running.

I have attached the output of cdk-field-agent. I'm not sure how to appropriately debug or resolve this issue.

Tags: real-kinetic
Revision history for this message
Tyler Treat (tylertreat) wrote :
Revision history for this message
Cory Johns (johnsca) wrote :

It looks like the cdk-field-agent was run against the Kubernetes model. Can you please run `juju switch default` and then run cdk-field-agent against that model?

Revision history for this message
Tyler Treat (tylertreat) wrote :

Apologies, see the attached tar for results after switching to the default model.

Revision history for this message
Tyler Treat (tylertreat) wrote :

I just saw that the output appears to be the same. I should just have to run `juju switch default` and then run cdk-field-agent, correct? It seems it's still pulling from myk8smodel, however. These are my models:

$ juju models
Controller: google-us-east1

Model       Cloud/Region     Status     Machines  Cores  Access  Last connection
controller  google/us-east1  available  1         4      admin   just now
default*    google/us-east1  available  4         17     admin   6 minutes ago
myk8smodel  myk8scloud       available  0         -      admin   16 minutes ago

Revision history for this message
Tyler Treat (tylertreat) wrote :

I must have screwed up getting the collector results. I verified THIS tar's status.out is for the default model. :)

tags: added: real-kinetic
Revision history for this message
Cory Johns (johnsca) wrote :

I got as far as determining that the start_worker handler on kubernetes-worker never ran because it was missing the cni.available flag. I'm not super familiar with the CNI implementation and I have to call it an evening. I'll pick back up and possibly bring in help tracking down what is going wrong.

Changed in charm-gcp-integrator:
status: New → In Progress
assignee: nobody → Cory Johns (johnsca)
importance: Undecided → High
Revision history for this message
Cory Johns (johnsca) wrote :

So the issue is that flannel is unable to talk to etcd over the fan network. We were unable to reproduce this, so it seems likely that it is an edge case involving something in the configuration of the Google account. The fan network is not required, so you can disable it with:

juju model-config fan-config= container-networking-method=local

This needs to be done before deploying the cluster into the model.
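To make the ordering concrete, a minimal sketch of the sequence on a fresh controller (using the stock bundle; substitute whatever you're actually deploying):

juju bootstrap google
juju model-config fan-config= container-networking-method=local
juju deploy kubernetes-core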

At this point, I don't know enough about how Juju utilizes and configures the fan network to figure out what aspect of the cloud configuration would be leading to this issue, but it does seem like something we should get assistance with to track down.

During testing, I also ran into an issue with our workaround for https://github.com/kubernetes/kubernetes/issues/44254: it seems something changed upstream such that our workaround was no longer being applied. I've fixed that in the charms; the fix should be available in the edge channel shortly and can be deployed with:

juju deploy cs:~containers/kubernetes-core --channel=edge

Specifically, it's rev 591 or later of kubernetes-master that you need, which is in rev 565 or later of the kubernetes-core bundle.
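
Once deployed, one way to confirm which revision you got is to check the standard status output:

juju status kubernetes-master
# the App section's Rev column should show 591 or later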

Revision history for this message
Tyler Treat (tylertreat) wrote :

Cory, to clarify, are these the commands I should be running?

juju model-config fan-config= container-networking-method=local
juju deploy cs:~containers/kubernetes-core --channel=edge

I'm still hitting issues with the master/worker starting up when running this on a newly bootstrapped controller: https://paste.ubuntu.com/p/2PxrTHHGXd/

Maybe we could set up a pairing session to work through the issues?

Revision history for this message
Tyler Treat (tylertreat) wrote :

Attaching the field agent output in case it's helpful.

Revision history for this message
Cory Johns (johnsca) wrote :

Upon further debugging with Tyler, it turns out that the fan involvement was in fact a red herring; the issue ended up being that the GCP firewall rules were blocking traffic between instances on the internal network. We were able to confirm this by manually running `open-port` on the etcd instance followed by `juju expose etcd`, which allowed that bit of traffic to go through and the flannel service to start up. We thought we might be able to unblock things by doing the same for all of the other required ports, but Konstantinos pointed out that a healthy cluster also needs some dynamically allocated ports, so that workaround isn't really tenable.
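
For reference, the manual check looked roughly like this (unit name and port are illustrative; 2379 is etcd's usual client port):

juju run --unit etcd/0 -- open-port 2379/tcp
juju expose etcd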

We looked at the firewall, VPC, and subnet configuration in the Google Cloud Console, and nothing jumped out as the obvious culprit. Rick is going to do more research into GCP VPC / firewall / subnet configuration to determine what settings could result in this behavior, so that we can take a more informed look at the config and figure out what's going on.

Revision history for this message
Tyler Treat (tylertreat) wrote :

Follow-up on this: we discovered that a default firewall rule had at some point been removed from the VPC. `default-allow-internal` is needed to allow VMs to talk to each other in GCE. Once this rule was added back, the k8s cluster spun up properly. Here is the relevant bug: https://bugs.launchpad.net/juju/+bug/1816108
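
For anyone who hits the same thing, the rule can be recreated along these lines (the source range is GCE's default for auto-mode VPCs and may differ on a custom network):

gcloud compute firewall-rules create default-allow-internal \
    --network=default \
    --allow=tcp:0-65535,udp:0-65535,icmp \
    --source-ranges=10.128.0.0/9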

This bug can now be closed.

Changed in charm-gcp-integrator:
status: In Progress → Invalid