Load balancers created on K8s on top of OpenStack Octavia are not working

Bug #1884995 reported by Giuseppe Petralia
This bug affects 4 people
Affects: CDK Addons
Status: Fix Released
Importance: High
Assigned to: Kevin W Monroe

Bug Description

We have a k8s cluster created on top of OpenStack. K8s release 1.13/edge

charm-k8s-master rev. 724
charm-openstack-integrator rev. 22

When a new load balancer is created, it is not able to reach its backend pool members.

The load balancer created underneath is an amphora VM, and the pool members are kubernetes-workers on a given port.

If we try to reach the load balancer on its VIP, we get a failure:

grpcurl -connect-timeout 10 -plaintext <LB_IP>:80 describe
Failed to dial target host "REDACTED:80": context deadline exceeded

If we add a rule to the OpenStack default security group of the kubernetes-worker VMs, allowing traffic from the security group of the amphora VM on the specific port, with:

openstack security group rule create --ingress --protocol tcp --remote-group <GROUP_OF_THE_AMPHORA_VM> --dst-port REDACTED:REDACTED --ethertype ipv4 <DEFAULT_GROUP_OF_THE_JUJU_WORKERS>

then we are able to use the load balancer.

description: updated
Revision history for this message
George Kraft (cynerva) wrote :

field-critical is subscribed to this.

In the future, please comment on the issue when subscribing field SLA to issues, as defined in the field SLA process for escalating to product engineering. It's easy for us to miss it otherwise.

https://wiki.canonical.com/engineering/FieldSLA

Revision history for this message
George Kraft (cynerva) wrote :

What version of OpenStack are you using?

Revision history for this message
Chris Sanders (chris.sanders) wrote :

The OpenStack release is:

Ubuntu 18.04 (Bionic)
OpenStack Rocky

Revision history for this message
George Kraft (cynerva) wrote :

The kubernetes-master revision in play here is quite old, somewhere around Charmed Kubernetes 1.15. It uses the old Kubernetes built-in openstack cloud provider, not the external cloud provider that we started using in Charmed Kubernetes 1.16.

Changed in charm-kubernetes-master:
importance: Undecided → Critical
Changed in charm-openstack-integrator:
importance: Undecided → Critical
Changed in charm-kubernetes-master:
status: New → Triaged
Changed in charm-openstack-integrator:
status: New → Triaged
Revision history for this message
Giuseppe Petralia (peppepetra) wrote :

This is happening also in another environment where the underlay infrastructure is:

Ubuntu 18.04 (Bionic)
OpenStack Stein

and Kubernetes is:

k8s release: 1.17/stable
charm-k8s-master is rev. 808
openstack-integrator: 1.17 (unable to get revision as the charm was forked to have the fix for LP#1852974)

Revision history for this message
Tim Van Steenburgh (tvansteenburgh) wrote :

This should not be field-critical since a workaround exists and is documented in the bug description.

Revision history for this message
Cory Johns (johnsca) wrote :

This doesn't seem like a bug in Kubernetes, Charmed Kubernetes, or the integrator.

Per the description of the manage-security-groups option in https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/using-openstack-cloud-controller-manager.md#load-balancer, Kubernetes expects the Amphora VM to be within the same subnet as the instances and the security policy to allow connections within the subnet on the NodePort range (30000-32767). Additionally, Juju generally expects the instances to be inside something like a VPC or subnet where traffic between the instances is generally open, with Juju managing the port SG rules for public access as controlled via open-port and juju expose.

It sounds like these expectations are not met in this environment, and traffic between the Amphora VMs and the instances is blocked by default. It seems reasonable in that case to defer to the OpenStack admin to manually manage the SG rules in such an environment, since they've already expressed a desire to have more control over what access is allowed within the internal network and it would be unclear whether they would be ok with having Juju or the charms override their decisions.

Since Kubernetes typically chooses the NodePort automatically (although it can be explicitly specified in the Service definition, as long as it falls within the NodePort range), you would presumably want to set up the rules to allow that entire range (again, 30000-32767) from the Amphora VMs to the instances.
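
As a sketch, a rule along those lines would look something like the following, where <AMPHORA_SG> and <WORKER_SG> are placeholders for the amphora and worker security groups:

openstack security group rule create --ingress --protocol tcp --remote-group <AMPHORA_SG> --dst-port 30000:32767 --ethertype IPv4 <WORKER_SG>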

Revision history for this message
Cory Johns (johnsca) wrote :

I should also note that you could try setting the manage-security-groups config on the OpenStack integrator charm to force Kubernetes to try to manage the SGs for the LBs it creates for in-cluster services.
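
On a deployed cluster that would be something like:

juju config openstack-integrator manage-security-groups=true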

Revision history for this message
Cory Johns (johnsca) wrote :

Added https://github.com/charmed-kubernetes/kubernetes-docs/pull/424 to improve the documentation around this.

Revision history for this message
Cory Johns (johnsca) wrote :

Included in the above PR is a description of what enabling the manage-security-groups option will do. To wit, Kubernetes will automatically ensure the port security group for each node includes a rule allowing ingress from the Amphorae to the node on the ports in the NodePort range.

Changed in charm-kubernetes-master:
status: Triaged → Invalid
Changed in charm-openstack-integrator:
status: Triaged → Invalid
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Note bug 1892164 whereby changing manage-security-groups does not take effect post-deployment.

Revision history for this message
Cory Johns (johnsca) wrote :

This has come up again, and we have decided that it would be worth having the integrator charm optionally create an SG rule to open the NodePort range from within the subnet, ensuring that the amphorae can connect.

Changed in charm-openstack-integrator:
importance: Critical → High
status: Invalid → Triaged
assignee: nobody → Cory Johns (johnsca)
Revision history for this message
Cory Johns (johnsca) wrote :

Per discussion with Ed, it sounds like this is no longer an issue due to a better understanding of the interactions between the Kubernetes and OpenStack settings and a better network configuration.

Giuseppe, can you please confirm if this can be closed out (save for the referenced bug to ensure that configuration changes in the integrator get properly propagated to K8s)?

Changed in charm-openstack-integrator:
status: Triaged → Incomplete
Revision history for this message
Jake Hill (routergod) wrote :

I disagree.

I installed a degenerate charmed-kubernetes in OpenStack with an overlay to include openstack-integrator. Everything (the Juju controller, the Kubernetes units, and the Juju client) is on one internal network in OpenStack.

The provisioned load balancer does not have security group permissions to speak to its peers.

(bionic-ussuri FWIW)

Revision history for this message
Szymon Roczniak (szymonroczniakgamma) wrote :

I'm also affected by this.

Environment is Ussuri on 18.04. Everything deployed on a tenant network.

The SG shared between all worker nodes does not contain a rule to allow amphora instances to talk to the worker nodes.

A workaround is to add this to the shared SG:

openstack --os-cloud $cloud security group rule create --dst-port 30000:32767 --protocol tcp --description "access fix" --ingress --ethertype ipv4 $security_group

However, the problem is that sometimes this additional rule gets deleted. I have no clue yet what removes it, but it has happened a few times. It might be a coincidence, but the last time it happened was after all cluster machines were shut down.

Also, what is the expected behavior of the integrator with manage-security-groups=True? I can see the group created by the integrator, but it only contains an ingress rule for kubeapi (6443:6443).
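
For reference, I am checking the integrator-created group with something like the following, where $integrator_security_group is a placeholder for that group's name or ID:

openstack --os-cloud $cloud security group rule list $integrator_security_group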

Revision history for this message
Przemyslaw Hausman (phausman) wrote :
Download full text (5.1 KiB)

I have just hit this problem with OpenStack Yoga on Ubuntu Focal and Kubernetes 1.25.4. Troubleshooting led to the conclusion that the image for openstack-cloud-controller-manager is a bit outdated and probably either does not support the manage-security-groups config option or has a problem in its security group logic.

Here's how to reproduce the problem and later update the image for openstack-cloud-controller-manager to prove the assumption.

STEPS TO REPRODUCE

1. Deploy service with a Load Balancer:

```
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: cdk-cats
  name: cdk-cats
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cdk-cats
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cdk-cats
    spec:
      containers:
        - image: calvinhartwell/cdk-cats:latest
          imagePullPolicy: ""
          name: cdk-cats
          ports:
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            timeoutSeconds: 30
          resources: {}
      restartPolicy: Always
      serviceAccountName: ""
status: {}

---
apiVersion: v1
kind: Service
metadata:
  name: cdk-cats
spec:
  type: LoadBalancer
  selector:
    app: cdk-cats
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
EOF
```

2. Wait until EXTERNAL-IP is populated:
```
$ kubectl get svc cdk-cats
NAME       TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
cdk-cats   LoadBalancer   10.152.183.148   172.27.82.77   80:30060/TCP   3m33s
```

3. Try to access the service using the Floating IP of the Load Balancer:
```
$ curl http://172.27.82.77/
curl: (52) Empty reply from server
```

This is incorrect; we should already be able to access the service at this point.

4. Verify that Load Balancer in OpenStack is OK:
```
$ openstack loadbalancer list -f yaml
- id: 467a4d7c-5f96-4084-bfd3-1da70068fa83
  name: kube_service_kubernetes-jlpmnz587dqhnvezivi9crnyt9rtk0cf_default_cdk-cats
  operating_status: ONLINE
  project_id: e54528bf42fd43df90d0990147e617c2
  provider: amphora
  provisioning_status: ACTIVE
  vip_address: 192.168.0.118
```

OK, looks good, it is active and online.

5. Check if the security group rule allowing access to kubernetes-worker nodes is present:
```
$ openstack security group rule list | grep 30060
```

This is incorrect; the security group rule should already have been created.

TROUBLESHOOTING

1. Check the cloud-config secret and make sure `manage-security-groups` is configured
```
$ kubectl get secret -o yaml -n kube-system cloud-config

apiVersion: v1
data:
  cloud.conf: W0dsb2JhbF... [REDACTED]
  endpoint-ca.cert: LS0tLS1CRUdJTi... [REDACTED]
kind: Secret
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"cloud.conf":"W0dsb2JhbF... [REDACTED]"},"...

Read more...
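
A quick way to check whether the option actually made it into the in-cluster config is to decode the secret and grep for it (a sketch; this assumes the option appears as manage-security-groups in cloud.conf):

kubectl get secret -n kube-system cloud-config -o jsonpath='{.data.cloud\.conf}' | base64 -d | grep -i manage-security-groups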

Changed in charm-kubernetes-master:
status: Invalid → New
Revision history for this message
George Kraft (cynerva) wrote :
Changed in charm-kubernetes-master:
importance: Critical → High
Changed in cdk-addons:
importance: Undecided → High
milestone: none → 1.26
Changed in charm-kubernetes-master:
milestone: none → 1.26
Changed in cdk-addons:
status: New → In Progress
Changed in charm-kubernetes-master:
status: New → In Progress
George Kraft (cynerva)
Changed in cdk-addons:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Adam Dyess (addyess)
Changed in cdk-addons:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
Revision history for this message
Przemyslaw Hausman (phausman) wrote :

I still see this issue with Kubernetes 1.26.2. The openstack-cloud-controller-manager image version in use is still 1.25.0:

```
$ kubectl get -o yaml ds openstack-cloud-controller-manager -n kube-system | grep image:
        image: rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.0
```

kubernetes-control-plane
- version: 1.26.2
- charm revision: 240

kubernetes-worker:
- version: 1.26.2
- charm revision: 92

I can't even apply the workaround from #17 because openstack-cloud-controller-manager:v1.25.3 is not available in ROCKS:

```
$ kubectl describe -n kube-system pod/openstack-cloud-controller-manager-zq5cb
[...]
  Warning Failed 3s (x2 over 16s) kubelet Failed to pull image "rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3": rpc error: code = NotFound desc = failed to pull and unpack image "rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3": failed to resolve reference "rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3": rocks.canonical.com:443/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3: not found
```

Revision history for this message
George Kraft (cynerva) wrote :

Sigh. Sorry Przemyslaw, looks like I was confused. The above PR pulls in 1.25.0, not 1.25.3.

Actually, I still am confused because the commit that adds Security Group support for Octavia[1] first appears in OCCM 1.26.0. The code introduced there does not exist in 1.25.3. How did that version work?

[1]: https://github.com/kubernetes/cloud-provider-openstack/commit/42f4ede114638091b5f6ab851a0873c479eeea32

Revision history for this message
George Kraft (cynerva) wrote :

In case it's helpful to you, I've synced the following images to rocks:

rocks.canonical.com/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.3
rocks.canonical.com/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.25.4
rocks.canonical.com/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.26.0
rocks.canonical.com/cdk/k8scloudprovider/openstack-cloud-controller-manager:v1.26.1

But I would not expect any manual edits of the DaemonSet to persist. cdk-addons will generally revert any changes you made within 5 minutes. To override the deployed version in a persistent way, I think you would have to remove the relation between kubernetes-control-plane and openstack-integrator, then deploy OCCM yourself.
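
As a sketch, removing that relation would be roughly the following (specific endpoint names omitted; adjust to your deployment):

juju remove-relation kubernetes-control-plane openstack-integrator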

no longer affects: charm-kubernetes-master
Changed in cdk-addons:
status: Fix Released → Triaged
milestone: 1.26 → 1.27
Revision history for this message
George Kraft (cynerva) wrote :

I've re-targeted this to cdk-addons 1.27 for now, which would go out with Charmed Kubernetes 1.27 by April 18th.

The feedback we've received is that updating image versions in a cdk-addons point release is very disruptive to offline deployments, so it's something that we would prefer not to do. That said, if you need this fixed in a 1.26 release, let us know and we will consider it.

no longer affects: charm-openstack-integrator
Revision history for this message
Przemyslaw Hausman (phausman) wrote :

Thank you for looking into it @cynerva! I do have a workaround for this issue, i.e. updating the default security group so that it allows ingress traffic on ports 30000-32767, which is good enough for me. So, no pressure to push it to 1.26.
Thanks again!

Changed in cdk-addons:
assignee: nobody → Kevin W Monroe (kwmonroe)
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

The o7k cloud provider in cdk-addons has been bumped to v1.26.2 with:

https://github.com/charmed-kubernetes/cdk-addons/pull/223

Note the upstream image URLs have changed due to the move to registry.k8s.io. The relevant images in rocks now look like this:

rocks.canonical.com/cdk/provider-os/cinder-csi-plugin:v1.26.2
rocks.canonical.com/cdk/provider-os/k8s-keystone-auth:v1.26.2
rocks.canonical.com/cdk/provider-os/openstack-cloud-controller-manager:v1.26.2
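
Once the updated addons are deployed, the image actually in use can be spot-checked with something like the following (assumes a single container in the DaemonSet):

kubectl -n kube-system get ds openstack-cloud-controller-manager -o jsonpath='{.spec.template.spec.containers[0].image}'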

Changed in cdk-addons:
status: Triaged → Fix Committed
Changed in cdk-addons:
status: Fix Committed → Fix Released