Load balancers created on K8s on top of OpenStack Octavia are not working

Bug #1884995 reported by Giuseppe Petralia
This bug affects 4 people
Affects: CDK Addons
Status: Fix Released
Importance: High
Assigned to: Kevin W Monroe

Bug Description

We have a k8s cluster created on top of OpenStack. K8s release 1.13/edge

charm-k8s-master rev. 724
charm-openstack-integrator rev. 22

When a new load balancer is created, it is not able to reach its backend pool members.

The load balancer created underneath is an amphora VM, and the pool members are kubernetes-workers on a given port.

If we try to reach the load balancer on its VIP, we get a failure:

grpcurl -connect-timeout 10 -plaintext <LB_IP>:80 describe
Failed to dial target host "REDACTED:80": context deadline exceeded

If we add a rule to the OpenStack default security group of the kubernetes-worker VMs, allowing traffic from the security group of the amphora VM on the specific port, with:

openstack security group rule create --ingress --protocol tcp --remote-group <GROUP_OF_THE_AMPHORA_VM> --dst-port REDACTED:REDACTED --ethertype ipv4 <DEFAULT_GROUP_OF_THE_JUJU_WORKERS>

then we are able to use the load balancer.

description: updated
Revision history for this message
George Kraft (cynerva) wrote :

field-critical is subscribed to this.

In the future, please comment on the issue when subscribing field SLA to issues, as defined in the field SLA process for escalating to product engineering. It's easy for us to miss it otherwise.

https://wiki.canonical.com/engineering/FieldSLA

Revision history for this message
George Kraft (cynerva) wrote :

What version of OpenStack are you using?

Revision history for this message
Chris Sanders (chris.sanders) wrote :

The OpenStack release is:

Ubuntu 18.04 (Bionic)
OpenStack Rocky

Revision history for this message
George Kraft (cynerva) wrote :

The kubernetes-master revision in play here is quite old, somewhere around Charmed Kubernetes 1.15. It uses the old Kubernetes built-in openstack cloud provider, not the external cloud provider that we started using in Charmed Kubernetes 1.16.

Changed in charm-kubernetes-master:
importance: Undecided → Critical
Changed in charm-openstack-integrator:
importance: Undecided → Critical
Changed in charm-kubernetes-master:
status: New → Triaged
Changed in charm-openstack-integrator:
status: New → Triaged
Revision history for this message
Giuseppe Petralia (peppepetra) wrote :

This is happening also in another environment where the underlay infrastructure is:

Ubuntu 18.04 (Bionic)
OpenStack Stein

and Kubernetes is:

k8s release: 1.17/stable
charm-k8s-master is rev. 808
openstack-integrator: 1.17 (unable to get revision as the charm was forked to have the fix for LP#1852974)

Revision history for this message
Tim Van Steenburgh (tvansteenburgh) wrote :

This should not be field-critical since a workaround exists and is documented in the bug description.

Revision history for this message
Cory Johns (johnsca) wrote :

This doesn't seem like a bug in Kubernetes, Charmed Kubernetes, or the integrator.

Per the description of the manage-security-groups option in https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/using-openstack-cloud-controller-manager.md#load-balancer, Kubernetes expects the Amphora VM to be within the same subnet as the instances and the security policy to allow connections within the subnet on the NodePort range (30000-32767). Additionally, Juju generally expects the instances to be inside something like a VPC or subnet where traffic between the instances is generally open, with Juju managing the port SG rules for public access as controlled via open-port and juju expose.

It sounds like these expectations are not met in this environment, and traffic between the Amphora VMs and the instances is blocked by default. It seems reasonable in that case to defer to the OpenStack admin to manually manage the SG rules in such an environment, since they've already expressed a desire to have more control over what access is allowed within the internal network and it would be unclear whether they would be ok with having Juju or the charms override their decisions.

Since Kubernetes typically chooses the NodePort automatically (although it can be explicitly specified in the Service definition, as long as it falls within the NodePort range), you would presumably want to set up the rules to allow that entire range (again, 30000-32767) from the Amphora VMs to the instances.
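
As a sketch, a rule along those lines would look something like the following, where <AMPHORA_SG> and <WORKER_SG> are placeholders for the amphora and worker security groups:

openstack security group rule create --ingress --protocol tcp --remote-group <AMPHORA_SG> --dst-port 30000:32767 --ethertype IPv4 <WORKER_SG>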

Revision history for this message
Cory Johns (johnsca) wrote :

I should also note that you could try setting the manage-security-groups config on the OpenStack integrator charm to force Kubernetes to try to manage the SGs for the LBs it creates for in-cluster services.
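
On a deployed cluster that would be something like:

juju config openstack-integrator manage-security-groups=true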

Revision history for this message
Cory Johns (johnsca) wrote :

Added https://github.com/charmed-kubernetes/kubernetes-docs/pull/424 to improve the documentation around this.

Revision history for this message
Cory Johns (johnsca) wrote :

Included in the above PR is a description of what enabling the manage-security-groups option will do. To wit, Kubernetes will automatically ensure the port security group for each node includes a rule allowing ingress from the Amphorae to the node on the ports in the NodePort range.

Changed in charm-kubernetes-master:
status: Triaged → Invalid
Changed in charm-openstack-integrator:
status: Triaged → Invalid
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Note bug 1892164 whereby changing manage-security-groups does not take effect post-deployment.

Revision history for this message
Cory Johns (johnsca) wrote :

This has come up again, and we have decided that it would be worth having the integrator charm optionally create an SG rule to open the NodePort range from within the subnet, ensuring that the amphorae can connect.

Changed in charm-openstack-integrator:
importance: Critical → High
status: Invalid → Triaged
assignee: nobody → Cory Johns (johnsca)
Revision history for this message
Cory Johns (johnsca) wrote :

Per discussion with Ed, it sounds like this is no longer an issue due to a better understanding of the interactions between the Kubernetes and OpenStack settings and a better network configuration.

Giuseppe, can you please confirm if this can be closed out (save for the referenced bug to ensure that configuration changes in the integrator get properly propagated to K8s)?

Changed in charm-openstack-integrator:
status: Triaged → Incomplete
Revision history for this message
Jake Hill (routergod) wrote :

I disagree.

I installed a degenerate charmed-kubernetes in OpenStack with an overlay to include openstack-integrator. Everything (the Juju controller, the Kubernetes units, and the Juju client) is on one internal network in OpenStack.

The provisioned load balancer does not have security group permissions to speak to its peers.

(bionic-ussuri FWIW)

Revision history for this message
Szymon Roczniak (szymonroczniakgamma) wrote :

I'm also affected by this.

Environment is Ussuri on 18.04. Everything deployed on a tenant network.

The SG shared between all worker nodes does not contain a rule to allow amphora instances to talk to the worker nodes.

A workaround is to add this to the shared SG:

openstack --os-cloud $cloud security group rule create --dst-port 30000:32767 --protocol tcp --description "access fix" --ingress --ethertype ipv4 $security_group

However, the problem is that sometimes this additional rule gets deleted. I have no clue yet what removes it, but it has happened a few times. It might be a coincidence, but the last time it happened was after all cluster machines were shut down.

Also, what is the expected behavior of the integrator with manage-security-groups=True? I can see the group created by the integrator, but it only contains an ingress rule for kubeapi (6443:6443).
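
For reference, I am checking the integrator-created group with something like the following, where $integrator_security_group is a placeholder for that group's name or ID:

openstack --os-cloud $cloud security group rule list $integrator_security_group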

Revision history for this message
Przemyslaw Hausman (phausman) wrote :
Download full text (5.1 KiB)

I have just hit this problem with OpenStack Yoga on Ubuntu Focal and Kubernetes 1.25.4. Troubleshooting led to the conclusion that the image for openstack-cloud-controller-manager is a bit outdated and probably either does not support the manage-security-groups config option or has a problem in its security group logic.

Here's how to reproduce the problem and later update the image for openstack-cloud-controller-manager to prove the assumption.

STEPS TO REPRODUCE

1. Deploy service with a Load Balancer:

```
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: cdk-cats
  name: cdk-cats
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cdk-cats
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cdk-cats
    spec:
      containers:
        - image: calvinhartwell/cdk-cats:latest
          imagePullPolicy: ""
          name: cdk-cats
          ports:
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            timeoutSeconds: 30
          resources: {}
      restartPolicy: Always
      serviceAccountName: ""
status: {}

---
apiVersion: v1
kind: Service
metadata:
  name: cdk-cats
spec:
  type: LoadBalancer
  selector:
    app: cdk-cats
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
EOF
```

2. Wait until EXTERNAL-IP is populated:
```
$ kubectl get svc cdk-cats
NAME       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
cdk-cats   LoadBalancer   10.152.183.148   172.27.82.77   80:30060/TCP   3m33s
```

3. Try to access the service using the Floating IP of the Load Balancer:
```
$ curl http://172.27.82.77/
curl: (52) Empty reply from server
```

This is incorrect; we should already be able to access the service at this point.

4. Verify that Load Balancer in OpenStack is OK:
```
$ openstack loadbalancer list -f yaml
- id: 467a4d7c-5f96-4084-bfd3-1da70068fa83
  name: kube_service_kubernetes-jlpmnz587dqhnvezivi9crnyt9rtk0cf_default_cdk-cats
  operating_status: ONLINE
  project_id: e54528bf42fd43df90d0990147e617c2
  provider: amphora
  provisioning_status: ACTIVE
  vip_address: 192.168.0.118
```

OK, looks good, it is active and online.

5. Check if the security group rule allowing access to kubernetes-worker nodes is present:
```
$ openstack security group rule list | grep 30060
```

This is incorrect; the security group rule should already have been created.

TROUBLESHOOTING

1. Check the cloud-config secret and make sure `manage-security-groups` is configured
```
$ kubectl get secret -o yaml -n kube-system cloud-config

apiVersion: v1
data:
  cloud.conf: W0dsb2JhbF... [REDACTED]
  endpoint-ca.cert: LS0tLS1CRUdJTi... [REDACTED]
kind: Secret
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"cloud.conf":"W0dsb2JhbF... [REDACTED]"},"...

Read more...
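
A quick way to check whether the option actually made it into the in-cluster config is to decode the secret and grep for it (a sketch; this assumes the option appears as manage-security-groups in cloud.conf):

kubectl get secret -n kube-system cloud-config -o jsonpath='{.data.cloud\.conf}' | base64 -d | grep -i manage-security-groups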

Changed in charm-kubernetes-master:
status: Invalid → New
Revision history for this message
George Kraft (cynerva) wrote :
Changed in charm-kubernetes-master:
importance: Critical → High
Changed in cdk-addons:
importance: Undecided → High
milestone: none → 1.26
Changed in charm-kubernetes-master:
milestone: none → 1.26
Changed in cdk-addons:
status: New → In Progress
Changed in charm-kubernetes-master:
status: New → In Progress
George Kraft (cynerva)
Changed in cdk-addons:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Adam Dyess (addyess)
Changed in cdk-addons:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
Revision history for this message
Przemyslaw Hausman (phausman) wrote :

I still see this issue with Kubernetes 1.26.2. The openstack-cloud-controller-manager image version in use is still 1.25.0:

```
$ kubectl get -o yaml ds openstack-cloud-controller-manager -n kube-system | grep image:
        image: rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.0
```

kubernetes-control-plane
- version: 1.26.2
- charm revision: 240

kubernetes-worker:
- version: 1.26.2
- charm revision: 92

I can't even apply the workaround from #17 because openstack-cloud-controller-manager:v1.25.3 is not available in ROCKS:

```
$ kubectl describe -n kube-system pod/openstack-cloud-controller-manager-zq5cb
[...]
  Warning Failed 3s (x2 over 16s) kubelet Failed to pull image "rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3": rpc error: code = NotFound desc = failed to pull and unpack image "rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3": failed to resolve reference "rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3": rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3: not found
```

Revision history for this message
George Kraft (cynerva) wrote :

Sigh. Sorry Przemyslaw, looks like I was confused. The above PR pulls in 1.25.0, not 1.25.3.

Actually, I still am confused because the commit that adds Security Group support for Octavia[1] first appears in OCCM 1.26.0. The code introduced there does not exist in 1.25.3. How did that version work?

[1]: https://github.com/kubernetes/cloud-provider-openstack/commit/42f4ede114638091b5f6ab851a0873c479eeea32

Revision history for this message
George Kraft (cynerva) wrote :

In case it's helpful to you, I've synced the following images to rocks:

rocks.canonical.com/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3
rocks.canonical.com/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.4
rocks.canonical.com/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.26.0
rocks.canonical.com/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.26.1

But I would not expect any manual edits of the DaemonSet to persist. cdk-addons will generally revert any changes you made within 5 minutes. To override the deployed version in a persistent way, I think you would have to remove the relation between kubernetes-control-plane and openstack-integrator, then deploy OCCM yourself.
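
As a sketch, removing that relation would be roughly the following (specific endpoint names omitted; adjust to your deployment):

juju remove-relation kubernetes-control-plane openstack-integrator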

no longer affects: charm-kubernetes-master
Changed in cdk-addons:
status: Fix Released → Triaged
milestone: 1.26 → 1.27
Revision history for this message
George Kraft (cynerva) wrote :

I've re-targeted this to cdk-addons 1.27 for now, which would go out with Charmed Kubernetes 1.27 by April 18th.

The feedback we've received is that updating image versions in a cdk-addons point release is very disruptive to offline deployments, so it's something that we would prefer not to do. That said, if you need this fixed in a 1.26 release, let us know and we will consider it.

no longer affects: charm-openstack-integrator
Revision history for this message
Przemyslaw Hausman (phausman) wrote :

Thank you for looking into it @cynerva! I do have a workaround for this issue, i.e. updating the default security group so that it allows ingress traffic on ports 30000-32767, which is good enough for me. So, no pressure to push it to 1.26.
Thanks again!

Changed in cdk-addons:
assignee: nobody → Kevin W Monroe (kwmonroe)
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

The o7k cloud provider in cdk-addons has been bumped to v1.26.2 with:

https://github.com/charmed-kubernetes/cdk-addons/pull/223

Note the upstream image URLs have changed due to the move to registry.k8s.io. The relevant images in rocks now look like this:

rocks.canonical.com/cdk/provider-os/cinder-csi-plugin:v1.26.2
rocks.canonical.com/cdk/provider-os/k8s-keystone-auth:v1.26.2
rocks.canonical.com/cdk/provider-os/openstack-cloud-controller-manager:v1.26.2
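
Once the updated addons are deployed, the image actually in use can be spot-checked with something like the following (assumes a single container in the DaemonSet):

kubectl -n kube-system get ds openstack-cloud-controller-manager -o jsonpath='{.spec.template.spec.containers[0].image}'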

Changed in cdk-addons:
status: Triaged → Fix Committed
Changed in cdk-addons:
status: Fix Committed → Fix Released