deploy never finishes

Bug #1849349 reported by Marian Gasparovic
Affects: Kubernetes Control Plane Charm
Status: Invalid
Importance: Undecided
Assigned to: George Kraft

Bug Description

DEBUG collect-metrics No resources found
DEBUG collect-metrics No resources found in default namespace.

This repeats for over two hours, and then the build is aborted.

charm rev 754
juju_2.6.10+2.6-9f8a13f

Tags: cdo-qa
Revision history for this message
Marian Gasparovic (marosg) wrote:
tags: added: cdo-qa
Mike Wilson (knobby) wrote:

Is this an intermittent failure?

Marian Gasparovic (marosg) wrote:

We saw it in two test runs over two days.

Mike Wilson (knobby) wrote:

Was this two runs out of two, or two out of 50? Is it OpenStack only? When this happens, if you kick the kubelet and kube-proxy services, does the node come up properly?
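
For reference, "kicking" the services on a stuck worker unit would look roughly like the following. This is a sketch, not a documented procedure: the snap.*.daemon unit names assume the snap-packaged kubelet and kube-proxy that Charmed Kubernetes installs, so adjust them if your deployment differs. The loop only prints the commands; run them on the worker unit itself.

```shell
# Hypothetical recovery step: restart the snap-packaged kubelet and
# kube-proxy daemons on the affected kubernetes-worker unit.
# This loop echoes the commands rather than running them, so you can
# review them before executing on the unit.
for svc in snap.kubelet.daemon snap.kube-proxy.daemon; do
    echo "sudo systemctl restart $svc"
done
```

Afterwards, `kubectl get nodes` should show whether the node registers.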

Also, this crashdump is helpful, but not as helpful as it could be. Make sure you are using the edge version of juju-crashdump so it picks up the k8s-specific debug information from our nodes:

```
sudo snap install juju-crashdump --channel edge --classic
juju-crashdump -a debug-layer -a config
```

That will give us more information to help.

Jason Hobbs (jason-hobbs) wrote:

The edge and stable versions of the crashdump snap are the same, but we may not be using those options.

Joshua Genet (genet022) wrote:

This occurs in roughly 1 in 5 runs in our CI right now. So far OpenStack only: two bare-metal deploys of OpenStack and one on Serverstack.

Just to add a bit more information: on the subordinate charms of each kubernetes-master unit, I'm seeing:

juju-run[17250]: ERROR cannot write leadership settings: cannot write settings: not the leader

apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=5993 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0

John George (jog) wrote:

Solutions QA is hitting this daily. The test runs and their artifacts can be found here:
https://solutions.qa.canonical.com/#/qa/bug/1849349

The requested crashdump options appear to be set in the more recent test runs; could you please take another look?

George Kraft (cynerva)
Changed in charm-kubernetes-master:
assignee: nobody → George Kraft (cynerva)
George Kraft (cynerva) wrote:

Looking at 5 recent runs, I saw 2 failure cases.

---- Case 1: Chicken-and-egg condition between kube-apiserver and Calico CNI?

Kubelet is logging failures to start pods:

CreatePodSandbox for pod "metrics-server-v0.3.4-589dbbf5f8-pkz7n_kube-system(27d37795-2455-403d-ac7d-cf2d74f64406)" failed: rpc error: code = Unknown desc = failed to setup network for sandbox "fde6812a692275407da36a68f89a4541c753bacab96730e3b5bdb830336c0146": Get https://10.5.0.31:443/api/v1/namespaces/kube-system: Service Unavailable

Meanwhile, kube-apiserver is logging a Go panic and a vague "service unavailable" message on the metrics API:

I1127 15:19:59.646222 30165 httplog.go:90] GET /apis/metrics.k8s.io/v1beta1?timeout=32s: (606.249µs) 503
goroutine 225999 [running]:
... stacktrace ...
logging error output: "service unavailable\n"
[kubectl/v1.16.2 (linux/amd64) kubernetes/c97fe50 127.0.0.1:42580]

At first glance it looks like a chicken-and-egg condition: kube-apiserver responds with Service Unavailable because the metrics-server service is unavailable, but metrics-server is unavailable because its pod is never created, since Calico CNI keeps getting a Service Unavailable response from kube-apiserver.

I will look into this further.

Runs where I saw this: 2f54a8a1-992b-411c-8f65-3fe3a77f527a, 0e23bbdc-a323-434b-947b-73ae8261487d, b7e9ab4f-bf47-4211-b241-bd38269d28d1

---- Case 2: Calico charm is misconfigured with cidr=FCE_TEMPLATE

Various charms are encountering problems because the Calico charm's cidr config appears to be set to 'FCE_TEMPLATE', which is not a valid CIDR.

The calico charm logs this:

Failed to execute command: error with the following fields:
- CIDR = 'FCE_TEMPLATE' (Reason: failed to validate Field: CIDR because of Tag: net )
- IPpool.CIDR = 'FCE_TEMPLATE' (IPPool CIDR must be a valid subnet)

kube-proxy is failing to start:

F1126 06:26:14.739448 14578 server.go:439] failed validate: KubeProxyConfiguration.ClusterCIDR: Invalid value: "FCE_TEMPLATE": must be a valid CIDR block (e.g. 10.100.0.0/16 or FD02::0:0:0/96)
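
A pre-flight sanity check would catch a value like this before it ever reaches kube-proxy. A minimal sketch (check_cidr is a hypothetical helper, not part of any charm; it leans on python3's stdlib ipaddress module, which applies roughly the same "valid subnet" rule kube-proxy enforces):

```shell
# check_cidr: print "valid" or "invalid" for a candidate CIDR block.
# Hypothetical pre-flight check using python3's stdlib ipaddress module;
# ip_network() raises an error for anything that is not a proper subnet.
check_cidr() {
    if python3 -c "import ipaddress, sys; ipaddress.ip_network(sys.argv[1])" "$1" 2>/dev/null; then
        echo "valid"
    else
        echo "invalid"
    fi
}

check_cidr 10.100.0.0/16   # prints "valid"
check_cidr FCE_TEMPLATE    # prints "invalid"
```

You can also inspect what the charm actually received with `juju config calico cidr`.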

I'm guessing this is coming from the FCE tool, which our team doesn't use or maintain. We can't offer any further assistance here.

Runs where I saw this: 967f1841-dc12-4d6c-b4fc-bd783284f725, 34851d27-d013-40e0-9db7-f9e3f762db11

George Kraft (cynerva) wrote:

Disregard case 1 that was mentioned in my last comment. It has been reported as a separate issue and will be addressed there: https://bugs.launchpad.net/charm-kubernetes-master/+bug/1854520

None of the 5 runs I looked at showed the same symptoms from the original crashdump, where one of the kubernetes-worker units was stuck with a "Waiting for kubelet,kube-proxy to start." status. We'll need to follow up on that, but...

> The requested crashdump options appear to be set in the more recent test runs, if you could please take another look?

The crashdumps from recent runs still don't have the requested info. I see in foundation.log that you are indeed running the debug-layer and config addons, but they're not running properly. I'm able to reproduce the issue using juju-crashdump locally. I've opened an issue about it: https://github.com/juju/juju-crashdump/issues/50

George Kraft (cynerva) wrote:

I'm closing this issue. The failures seen here are being addressed in other issues:
https://bugs.launchpad.net/charm-kubernetes-worker/+bug/1850176
https://bugs.launchpad.net/charm-kubernetes-master/+bug/1854520

Changed in charm-kubernetes-master:
status: New → Invalid