No nodes available in cluster

Bug #1815034 reported by Tyler Treat
This bug affects 1 person
Affects: Google Cloud Platform Integrator Charm
Status: Invalid
Importance: High
Assigned to: Cory Johns

Bug Description

I'm attempting to run a k8s cluster on GCE, but it appears the worker node is not starting appropriately. Here are the commands I've run after setting up GCP credentials:

$ juju bootstrap google
$ juju deploy kubernetes-core
$ juju deploy cs:~containers/gcp-integrator
$ juju trust gcp-integrator
$ juju relate gcp-integrator kubernetes-master
$ juju relate gcp-integrator kubernetes-worker

These commands appear to all run successfully. After waiting for the cluster to quiesce, juju status shows the following output: https://paste.ubuntu.com/p/RgPYwVZ3Yb/

As you can see, both the k8s master and worker have status "waiting." Also, after scp'ing the kube config to my machine, `kubectl get nodes` shows: "No resources found."
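
For reference, the kube config was pulled down and checked roughly like this (the unit name and paths are the usual CDK defaults and may differ in your deployment):

$ juju scp kubernetes-master/0:config ~/.kube/config
$ kubectl get nodes
No resources found.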

When I attempt to deploy mariadb to the k8s cluster, the status shows "blocked" with the message "no nodes available to schedule pods" since there appear to be no nodes running.

I have attached the output of cdk-field-agent. I'm not sure how to appropriately debug or resolve this issue.

Tags: real-kinetic
Revision history for this message
Tyler Treat (tylertreat) wrote :
Revision history for this message
Cory Johns (johnsca) wrote :

It looks like the cdk-field-agent was run against the Kubernetes model. Can you please run `juju switch default` and then run cdk-field-agent against that model?

Revision history for this message
Tyler Treat (tylertreat) wrote :

Apologies, see the attached tar for results after switching to the default model.

Revision history for this message
Tyler Treat (tylertreat) wrote :

I just saw that the output appears to be the same. I should just have to run `juju switch default` and then run cdk-field-agent, correct? It seems it's still pulling from myk8smodel, however. These are my models:

$ juju models
Controller: google-us-east1

Model       Cloud/Region     Status     Machines  Cores  Access  Last connection
controller  google/us-east1  available  1         4      admin   just now
default*    google/us-east1  available  4         17     admin   6 minutes ago
myk8smodel  myk8scloud       available  0         -      admin   16 minutes ago

Revision history for this message
Tyler Treat (tylertreat) wrote :

I must have screwed up getting the collector results. I verified THIS tar's status.out is for the default model. :)

tags: added: real-kinetic
Revision history for this message
Cory Johns (johnsca) wrote :

I got as far as determining that the start_worker handler on kubernetes-worker never ran because it was missing the cni.available flag. I'm not super familiar with the CNI implementation and I have to call it an evening. I'll pick back up and possibly bring in help tracking down what is going wrong.

Changed in charm-gcp-integrator:
status: New → In Progress
assignee: nobody → Cory Johns (johnsca)
importance: Undecided → High
Revision history for this message
Cory Johns (johnsca) wrote :

So the issue is that flannel is unable to talk to etcd over the fan network. We were unable to reproduce this, so it seems likely that it is an edge case involving something in the configuration of the Google account. The fan network is not required, so you can disable it with:

juju model-config fan-config= container-networking-method=local

This needs to be done before deploying the cluster into the model.
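To make the ordering concrete, a minimal sketch of the sequence on a fresh controller (using the stock bundle; substitute whatever you're actually deploying):

juju bootstrap google
juju model-config fan-config= container-networking-method=local
juju deploy kubernetes-core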

At this point, I don't know enough about how Juju utilizes and configures the fan network to figure out what aspect of the cloud configuration would be leading to this issue, but it does seem like something we should get assistance with to track down.

During testing, I also ran into an issue with our workaround for https://github.com/kubernetes/kubernetes/issues/44254: it seems something changed upstream such that our workaround was no longer being applied. I've fixed that in the charms; the fix should be available in the edge channel shortly and can be deployed with:

juju deploy cs:~containers/kubernetes-core --channel=edge

Specifically, it's rev 591 or later of kubernetes-master that you need, which is in rev 565 or later of the kubernetes-core bundle.
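
Once deployed, one way to confirm which revision you got is to check the standard status output:

juju status kubernetes-master
# the App section's Rev column should show 591 or later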

Revision history for this message
Tyler Treat (tylertreat) wrote :

Cory, to clarify, are these the commands I should be running?

juju model-config fan-config= container-networking-method=local
juju deploy cs:~containers/kubernetes-core --channel=edge

I'm still hitting issues with the master/worker starting up when running this on a newly bootstrapped controller: https://paste.ubuntu.com/p/2PxrTHHGXd/

Maybe we could set up a pairing session to work through the issues?

Revision history for this message
Tyler Treat (tylertreat) wrote :

Attaching the field agent output in case it's helpful.

Revision history for this message
Cory Johns (johnsca) wrote :

Upon further debugging with Tyler, it turns out that the fan involvement was in fact a red herring; the issue ended up being that the GCP firewall rules were blocking traffic between instances on the internal network. We were able to confirm this by manually running `open-port` on the etcd instance followed by `juju expose etcd`, which allowed that bit of traffic to go through and the flannel service to start up. We thought we might be able to unblock things by doing the same for all of the other required ports, but Konstantinos pointed out that a healthy cluster also needs some dynamically allocated ports, so that workaround isn't really tenable.
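
For reference, the manual check looked roughly like this (unit name and port are illustrative; 2379 is etcd's usual client port):

juju run --unit etcd/0 -- open-port 2379/tcp
juju expose etcd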

We looked at the firewall, VPC, and subnet configuration in the Google Cloud Console, and nothing jumped out as the obvious culprit. Rick is going to do more research into GCP VPC / firewall / subnet configuration to determine what settings could result in this behavior, so that we can take a more informed look at the config and figure out what's going on.

Revision history for this message
Tyler Treat (tylertreat) wrote :

Follow-up on this: we discovered that a default firewall rule had at some point been removed from the VPC. `default-allow-internal` is needed to allow VMs to talk to each other in GCE. Once this rule was added back, the k8s cluster spun up properly. Here is the relevant bug: https://bugs.launchpad.net/juju/+bug/1816108
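
For anyone who hits the same thing, the rule can be recreated along these lines (the source range is GCE's default for auto-mode VPCs and may differ on a custom network):

gcloud compute firewall-rules create default-allow-internal \
    --network=default \
    --allow=tcp:0-65535,udp:0-65535,icmp \
    --source-ranges=10.128.0.0/9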

This bug can now be closed.

Changed in charm-gcp-integrator:
status: In Progress → Invalid