images are being pulled from k8s.gcr.io without going through the proxy

Bug #1841438 reported by Jason Hobbs
This bug affects 2 people
Affects: Kubernetes Worker Charm
Status: Fix Released
Importance: Undecided
Assigned to: Kevin W Monroe
Milestone: 1.16

Bug Description

A deployment in a firewalled environment is failing because the proxy isn't being used.

Aug 25 14:28:56 ralts kubelet.daemon[6913]: E0825 19:28:56.404395 6913 pod_workers.go:190] Error syncing pod b20dc1b5-b040-408f-9174-d4abad3f5b8e ("nginx-ingress-controller-kubernetes-worker-4vr6p_ingress-nginx-kubernetes-worker(b20dc1b5-b040-408f-9174-d4abad3f5b8e)"), skipping: failed to "CreatePodSandbox" for "nginx-ingress-controller-kubernetes-worker-4vr6p_ingress-nginx-kubernetes-worker(b20dc1b5-b040-408f-9174-d4abad3f5b8e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"nginx-ingress-controller-kubernetes-worker-4vr6p_ingress-nginx-kubernetes-worker(b20dc1b5-b040-408f-9174-d4abad3f5b8e)\" failed: rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.1\": failed to pull image \"k8s.gcr.io/pause:3.1\": failed to resolve image \"k8s.gcr.io/pause:3.1\": no available registry endpoint: failed to do request: Head https://k8s.gcr.io/v2/pause/manifests/3.1: dial tcp 74.125.68.82:443: i/o timeout"

model_config shows juju_http_proxy and juju_https_proxy are set, so we should not be going to k8s.gcr.io directly.

http://paste.ubuntu.com/p/wKnwJzQrPf/
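
For reference, the proxy-related model settings can be confirmed with a standard juju command (an illustrative check, not part of the original paste):

  juju model-config | grep -i proxy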

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Until the 1.16 release, containerd proxy settings have to be made directly on the charm; it does not respect the juju model-level ones.

https://github.com/charmed-kubernetes/charm-containerd/blob/master/config.yaml#L26-L39
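
For illustration, the workaround amounts to setting the proxy options on the containerd application directly (option names as in the linked config.yaml; the proxy URL below is a placeholder):

  juju config containerd \
      http_proxy=http://proxy.example:3128 \
      https_proxy=http://proxy.example:3128 \
      no_proxy=localhost,127.0.0.1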

Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

I don't think the lack of proxy support is causing this issue. We should only ever try to fetch k8s.gcr.io/pause if there is no image-registry configured on the k8s-master. I think what's happening here is that kubelet is starting before that registry info is available on the kube-control relation.

Fix that by always reconfiguring kubelet if/when the image registry changes:

https://github.com/charmed-kubernetes/charm-kubernetes-worker/pull/29

There may still be a proxy issue here, but the failure should at least manifest itself as a problem pulling the image from image-registry.canonical.com (not k8s.gcr.io).
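
On a containerd-backed worker such as this one, a quick way to see which sandbox ("pause") image actually got configured is to inspect containerd's CRI config (assuming the default path):

  grep sandbox_image /etc/containerd/config.toml

If the registry from the kube-control relation was picked up, this should point at image-registry.canonical.com rather than k8s.gcr.io.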

Changed in charm-kubernetes-worker:
assignee: nobody → Kevin W Monroe (kwmonroe)
status: New → In Progress
Changed in charm-kubernetes-worker:
milestone: none → 1.16
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

In the above PR, George pointed me at some good info about the use of the pause container in docker vs containerd. That PR addresses docker deployments, but the related fix for containerd (which is what jhobbs has) is actually bug 1841701.

Both bugs will be fixed to make sure we cover both container runtimes.

Changed in charm-kubernetes-worker:
status: In Progress → Fix Committed
Changed in charm-kubernetes-worker:
status: Fix Committed → Fix Released
Revision history for this message
Joshua Genet (genet022) wrote :

I'm not convinced this is fixed all the way. The deployment stands up just fine, but it fails when running the k8s team's integration tests (validation.py pytests). Using k8s 1.16.2, my pod fails to pull an image during test_audit_webhook when the proxy is set only in the juju model config, but the test passes when the proxy is set in the containerd config.

Here are 2 supporting runs.

Setting proxy in juju model config:
https://solutions.qa.canonical.com/#/qa/testRun/b52ec348-8296-4910-b1d3-0d8711f7f3b1

Setting proxy in containerd config:
https://solutions.qa.canonical.com/#/qa/testRun/7f09e785-ae54-4579-9767-e7b7b4658302

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Setting back to New based on genet022's comment.

Changed in charm-kubernetes-worker:
status: Fix Released → New
Revision history for this message
Joshua Genet (genet022) wrote :

Update: Simply setting the containerd config no_proxy to anything allows my pods to pull images.

Here's a successful run where I set 'no_proxy: dogs'
https://solutions.qa.canonical.com/#/qa/testRun/e1ecb386-8858-43b5-a51a-4faa295d4455
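
In juju terms that workaround is just the following (illustrative; "dogs" is a deliberately meaningless value):

  juju config containerd no_proxy=dogs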

Revision history for this message
Canonical Solutions QA Bot (oil-ci-bot) wrote :

This bug is fixed with commit fb3356e0 to cpe-foundation on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cpe-foundation/commit/?id=fb3356e0

Revision history for this message
Joshua Genet (genet022) wrote :

^ Don't listen to our bot. This is not actually fixed; that's just our workaround for now.

Revision history for this message
George Kraft (cynerva) wrote :

I can reproduce this. It occurs whenever there are large CIDRs in juju-no-proxy.

In failing runs, the model is configured with:

juju-no-proxy=10.0.0.0/8,192.168.0.0/16,172.16.0.0/12

Most applications (including containerd) do not support CIDRs in the NO_PROXY environment variable. To work around this, the charm expands the CIDRs into individual IPs: 10.0.0.0,10.0.0.1,10.0.0.2,10.0.0.3,... -- in this case, the charm expands them into a single NO_PROXY line that's roughly 235MB long. Yikes.
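
For a rough sense of scale, back-of-the-envelope arithmetic (not the charm's actual code) gives the same order of magnitude:

  # addresses covered by a /8, a /16 and a /12
  echo $(( 2**24 + 2**16 + 2**20 ))   # 17891328
  # at ~13 bytes per "a.b.c.d," entry: 17891328 * 13 ≈ 230 MB of NO_PROXY text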

This ridiculously long NO_PROXY line gets rendered to /etc/systemd/system/containerd.service.d/proxy.conf, which systemd seems to ignore, as it starts containerd with none of the proxy environment variables set. Presumably this happens because the file is too large, the line is too long, or the resulting environment variable would be too large. There don't seem to be any errors logged about it though.

Interestingly, if I use a smaller prefix like 10.0.0.0/16, then I get a ~778KB proxy.conf, and the containerd service fails to start with "Failed to execute command: Argument list too long". This seems to come from Linux's execve call. So even in a case where we can get systemd to pass the environment variable through, the variable is simply too large to work in Linux.
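
That is consistent with the per-exec limits Linux enforces, which can be checked on any worker with standard tools (the numbers below are common defaults and vary by system):

  getconf ARG_MAX     # commonly 2097152: total bytes allowed for argv + environment
  getconf PAGE_SIZE   # typically 4096: a single string is capped at 32 pages (~128 KiB)
  # see what systemd actually handed to containerd:
  systemctl show containerd --property=Environment

A ~778 KB NO_PROXY value already exceeds the single-string cap, and the ~235 MB one exceeds every limit by orders of magnitude.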

Revision history for this message
Joshua Genet (genet022) wrote :

Wow good stuff, thanks for reproducing that! We'll work on paring down our juju-no-proxy config then. We don't see a reason to be using the large CIDRs. I moved this back to fix-released.

Changed in charm-kubernetes-worker:
status: New → Fix Released