Cross-node traffic does not work with Cilium CNI

Bug #2016905 reported by Vladimir Grevtsev
Affects: Cilium Charm
Status: Fix Released
Importance: Medium
Assigned to: Mateo Florido

Bug Description

TL;DR

When Charmed K8s is deployed with Cilium, traffic does not flow between pods on different nodes. In the snippet below there are three pods, two of them residing on the same node; communication between those two works, but trying to reach the third pod, which resides on ANOTHER node, fails:

######## Get all pods

ubuntu@ip-172-31-44-89:~$ kubectl get pod -o wide
E0418 15:34:49.240814 176349 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0418 15:34:49.244867 176349 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0418 15:34:49.249119 176349 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0418 15:34:49.252581 176349 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-748c667d99-km8sb 1/1 Running 0 55s 10.1.1.135 ip-172-31-47-232.eu-central-1.compute.internal <none> <none>
nginx-748c667d99-tdxfc 1/1 Running 0 55s 10.1.2.19 ip-172-31-24-46.eu-central-1.compute.internal <none> <none>
nginx-748c667d99-zcnd8 1/1 Running 0 55s 10.1.2.115 ip-172-31-24-46.eu-central-1.compute.internal <none> <none>
ubuntu@ip-172-31-44-89:~$ kubectl exec -it nginx-748c667d99-tdxfc -- bash
E0418 15:34:52.247098 177214 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0418 15:34:52.256970 177214 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0418 15:34:52.262418 177214 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

######## Try to curl self

root@nginx-748c667d99-tdxfc:/# curl 10.1.2.19
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

######## curl-ing ANOTHER pod on the SAME node

root@nginx-748c667d99-tdxfc:/# curl 10.1.2.115
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
root@nginx-748c667d99-tdxfc:/#

######## curl-ing ANOTHER pod on ANOTHER node

root@nginx-748c667d99-tdxfc:/# curl 10.1.1.135
^C

######## additional info

juju status: https://paste.ubuntu.com/p/NJ9KdVv5DN/
bundle: http://paste.ubuntu.com/p/ddV2N8fHst/
env: Charmed K8s on top of AWS EC2 hosts; the same behaviour was also observed when the bundle was deployed on top of OpenStack.

ubuntu@ip-172-31-47-232:~$ cilium status
    /¯¯\
 /¯¯\__/¯¯\ Cilium: OK
 \__/¯¯\__/ Operator: OK
 /¯¯\__/¯¯\ Hubble Relay: OK
 \__/¯¯\__/ ClusterMesh: disabled
    \__/

Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 2, Ready: 2/2, Available: 2/2
DaemonSet cilium Desired: 3, Ready: 3/3, Available: 3/3
Containers: hubble-relay Running: 1
                  cilium-operator Running: 2
                  cilium Running: 3
Cluster Pods: 10/10 managed by Cilium
Image versions cilium rocks.canonical.com:443/cdk/cilium/cilium:v1.12.5@sha256:06ce2b0a0a472e73334a7504ee5c5d8b2e2d7b72ef728ad94e564740dd505be5: 3
                  hubble-relay rocks.canonical.com:443/cdk/cilium/hubble-relay:v1.12.5@sha256:22039a7a6cb1322badd6b0e5149ba7b11d35a54cf3ac93ce651bebe5a71ac91a: 1
                  cilium-operator rocks.canonical.com:443/cdk/cilium/operator-generic:v1.12.5@sha256:b296eb7f0f7656a5cc19724f40a8a7121b7fd725278b7d61dc91fe0b7ffd7c0e: 2

Tags: cdo-qa
Changed in charm-cilium:
assignee: nobody → Mateo Florido (mateoflorido)
Revision history for this message
Mateo Florido (mateoflorido) wrote (last edit ):

This issue is related to the bootstrap process of a new model with Juju on both AWS and O7k; vSphere is currently unaffected. By default, Juju creates new models with fan networking on AWS and O7k, and the interfaces created for this purpose conflict with the Cilium VXLAN interface, resulting in that interface failing to start. To resolve this issue, modify the container-networking-method and fan-config model settings before deploying CK+Cilium.
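
One way to confirm the conflict on an affected node is to list the VXLAN devices and check the Cilium agent logs. This is a diagnostic sketch: fan device names such as ftun0/fan-252 are typical for Juju fan networking but may differ, and the kube-system namespace for the cilium DaemonSet is an assumption.

  # A Juju fan VXLAN device (e.g. ftun0) alongside cilium_vxlan points to the conflict
  ip -d link show type vxlan

  # Look for the Cilium VXLAN interface failing to start (namespace assumed)
  kubectl -n kube-system logs ds/cilium --tail=200 | grep -i vxlan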

Here are the steps to overcome this issue; note that they should be performed before creating the cluster. A combined command sketch follows the list.

1. Add a new model.
2. Set the model configuration for container-networking-method to local and fan-config to an empty value. Example: `juju model-config container-networking-method=local fan-config=`
3. Deploy CK+Cilium.
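
Putting the steps together, a minimal sketch (the model name is a placeholder, and the overlay filename is hypothetical; adjust both to match your deployment):

  # 1. Add a new model (name is a placeholder)
  juju add-model cilium-k8s

  # 2. Disable fan networking so it cannot conflict with cilium_vxlan
  juju model-config container-networking-method=local fan-config=

  # 3. Deploy CK with Cilium (overlay name is illustrative)
  juju deploy charmed-kubernetes --channel 1.27/edge --overlay cilium-overlay.yaml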

Furthermore, I noticed that the bundle uses CK 1.26. Please switch to the 1.27/edge channel, as this is the K8s version we used to test Cilium. It's also important to note that AWS has switched to the out-of-tree cloud provider, so please refer to these two overlays to deploy it:
[AWS Overlay]
https://github.com/charmed-kubernetes/bundle/blob/main/overlays/aws-overlay.yaml
[AWS Storage]
https://github.com/charmed-kubernetes/bundle/blob/main/overlays/aws-storage-overlay.yaml
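
For reference, a deployment using those overlays might look like the following sketch (the raw URLs are derived from the links above; the commands assume the overlays are downloaded into the working directory):

  # Fetch the overlays
  curl -LO https://raw.githubusercontent.com/charmed-kubernetes/bundle/main/overlays/aws-overlay.yaml
  curl -LO https://raw.githubusercontent.com/charmed-kubernetes/bundle/main/overlays/aws-storage-overlay.yaml

  # Deploy CK 1.27 with the out-of-tree AWS provider and storage overlays
  juju deploy charmed-kubernetes --channel 1.27/edge \
      --overlay aws-overlay.yaml --overlay aws-storage-overlay.yaml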

Changed in charm-cilium:
importance: Undecided → Medium
status: New → Triaged
tags: added: cdo-qa
Changed in charm-cilium:
milestone: none → 1.28
Adam Dyess (addyess)
Changed in charm-cilium:
milestone: 1.28 → 1.28+ck1
Adam Dyess (addyess)
Changed in charm-cilium:
milestone: 1.28+ck1 → 1.29
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

Model config explanation from comment #1 is committed to docs:

https://ubuntu.com/kubernetes/docs/cni-cilium

Changed in charm-cilium:
status: Triaged → Fix Committed
Changed in charm-cilium:
status: Fix Committed → Fix Released