images are being pulled from k8s.gcr.io without going through the proxy

Bug #1841438 reported by Jason Hobbs
This bug affects 2 people
Affects: Kubernetes Worker Charm
Status: Fix Released
Importance: Undecided
Assigned to: Kevin W Monroe
Milestone: 1.16

Bug Description

A deployment in a firewalled environment is failing because the proxy isn't being used.

Aug 25 14:28:56 ralts kubelet.daemon[6913]: E0825 19:28:56.404395 6913 pod_workers.go:190] Error syncing pod b20dc1b5-b040-408f-9174-d4abad3f5b8e ("nginx-ingress-controller-kubernetes-worker-4vr6p_ingress-nginx-kubernetes-worker(b20dc1b5-b040-408f-9174-d4abad3f5b8e)"), skipping: failed to "CreatePodSandbox" for "nginx-ingress-controller-kubernetes-worker-4vr6p_ingress-nginx-kubernetes-worker(b20dc1b5-b040-408f-9174-d4abad3f5b8e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"nginx-ingress-controller-kubernetes-worker-4vr6p_ingress-nginx-kubernetes-worker(b20dc1b5-b040-408f-9174-d4abad3f5b8e)\" failed: rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.1\": failed to pull image \"k8s.gcr.io/pause:3.1\": failed to resolve image \"k8s.gcr.io/pause:3.1\": no available registry endpoint: failed to do request: Head https://k8s.gcr.io/v2/pause/manifests/3.1: dial tcp 74.125.68.82:443: i/o timeout"

model_config shows juju_http_proxy and juju_https_proxy are set, so we should not be going to k8s.gcr.io directly.

http://paste.ubuntu.com/p/wKnwJzQrPf/
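
For reference, the proxy-related model settings can be confirmed with a standard juju command (an illustrative check, not part of the original paste):

  juju model-config | grep -i proxy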

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Until the 1.16 release, containerd proxy settings have to be made directly on the charm; it does not respect the juju model-level ones.

https://github.com/charmed-kubernetes/charm-containerd/blob/master/config.yaml#L26-L39
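
For illustration, the workaround amounts to setting the proxy options on the containerd application directly (option names as in the linked config.yaml; the proxy URL below is a placeholder):

  juju config containerd \
      http_proxy=http://proxy.example:3128 \
      https_proxy=http://proxy.example:3128 \
      no_proxy=localhost,127.0.0.1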

Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

I don't think the lack of proxy support is causing this issue. We should only ever try to fetch k8s.gcr.io/pause if there is no image-registry configured on the k8s-master. I think what's happening here is that kubelet is starting before that registry info is available on the kube-control relation.

Fix that by always reconfiguring kubelet if/when the image registry changes:

https://github.com/charmed-kubernetes/charm-kubernetes-worker/pull/29

There may still be a proxy issue here, but the failure should at least manifest itself as a problem pulling the image from image-registry.canonical.com (not k8s.gcr.io).
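
On a containerd-backed worker such as this one, a quick way to see which sandbox ("pause") image actually got configured is to inspect containerd's CRI config (assuming the default path):

  grep sandbox_image /etc/containerd/config.toml

If the registry from the kube-control relation was picked up, this should point at image-registry.canonical.com rather than k8s.gcr.io.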

Changed in charm-kubernetes-worker:
assignee: nobody → Kevin W Monroe (kwmonroe)
status: New → In Progress
Changed in charm-kubernetes-worker:
milestone: none → 1.16
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

In the above PR, George pointed me at some good info about the use of the pause container in docker vs containerd. That PR addresses docker deployments, but the related fix for containerd (which is what jhobbs has) is actually bug 1841701.

Both bugs will be fixed to make sure we cover both container runtimes.

Changed in charm-kubernetes-worker:
status: In Progress → Fix Committed
Changed in charm-kubernetes-worker:
status: Fix Committed → Fix Released
Revision history for this message
Joshua Genet (genet022) wrote :

I'm not convinced this is fixed all the way. The deployment stands up just fine, but it fails when running the k8s team's integration tests (validation.py pytests). Using k8s 1.16.2, my pod fails to pull an image during test_audit_webhook when the proxy is set only in the juju model config, but the test passes when the proxy is set in the containerd config.

Here are 2 supporting runs.

Setting proxy in juju model config:
https://solutions.qa.canonical.com/#/qa/testRun/b52ec348-8296-4910-b1d3-0d8711f7f3b1

Setting proxy in containerd config:
https://solutions.qa.canonical.com/#/qa/testRun/7f09e785-ae54-4579-9767-e7b7b4658302

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Setting back to New based on genet022's comment.

Changed in charm-kubernetes-worker:
status: Fix Released → New
Revision history for this message
Joshua Genet (genet022) wrote :

Update: Simply setting the containerd config no_proxy to anything allows my pods to pull images.

Here's a successful run where I set 'no_proxy: dogs'
https://solutions.qa.canonical.com/#/qa/testRun/e1ecb386-8858-43b5-a51a-4faa295d4455
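
In juju terms that workaround is just the following (illustrative; "dogs" is a deliberately meaningless value):

  juju config containerd no_proxy=dogs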

Revision history for this message
Canonical Solutions QA Bot (oil-ci-bot) wrote :

This bug is fixed with commit fb3356e0 to cpe-foundation on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cpe-foundation/commit/?id=fb3356e0

Revision history for this message
Joshua Genet (genet022) wrote :

^ Don't listen to our bot. This is not actually fixed; that's just our workaround for now.

Revision history for this message
George Kraft (cynerva) wrote :

I can reproduce this. It occurs whenever there are large CIDRs in juju-no-proxy.

In failing runs, the model is configured with:

juju-no-proxy=10.0.0.0/8,192.168.0.0/16,172.16.0.0/12

Most applications (including containerd) do not support CIDRs in the NO_PROXY environment variable. To work around this, the charm expands the CIDRs into individual IPs: 10.0.0.0,10.0.0.1,10.0.0.2,10.0.0.3,... -- in this case, the charm expands them into a single NO_PROXY line that's roughly 235MB long. Yikes.
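
For a rough sense of scale, back-of-the-envelope arithmetic (not the charm's actual code) gives the same order of magnitude:

  # addresses covered by a /8, a /16 and a /12
  echo $(( 2**24 + 2**16 + 2**20 ))   # 17891328
  # at ~13 bytes per "a.b.c.d," entry: 17891328 * 13 ≈ 230 MB of NO_PROXY text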

This ridiculously long NO_PROXY line gets rendered to /etc/systemd/system/containerd.service.d/proxy.conf, which systemd seems to ignore, as it starts containerd with none of the proxy environment variables set. Presumably this happens because the file is too large, the line is too long, or the resulting environment variable would be too large. There don't seem to be any errors logged about it though.

Interestingly, if I use a smaller prefix like 10.0.0.0/16, then I get a ~778KB proxy.conf, and the containerd service fails to start with "Failed to execute command: Argument list too long". This seems to come from Linux's execve call. So even in a case where we can get systemd to pass the environment variable through, the variable is simply too large to work in Linux.
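
That is consistent with the per-exec limits Linux enforces, which can be checked on any worker with standard tools (the numbers below are common defaults and vary by system):

  getconf ARG_MAX     # commonly 2097152: total bytes allowed for argv + environment
  getconf PAGE_SIZE   # typically 4096: a single string is capped at 32 pages (~128 KiB)
  # see what systemd actually handed to containerd:
  systemctl show containerd --property=Environment

A ~778 KB NO_PROXY value already exceeds the single-string cap, and the ~235 MB one exceeds every limit by orders of magnitude.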

Revision history for this message
Joshua Genet (genet022) wrote :

Wow good stuff, thanks for reproducing that! We'll work on paring down our juju-no-proxy config then. We don't see a reason to be using the large CIDRs. I moved this back to fix-released.

Changed in charm-kubernetes-worker:
status: New → Fix Released