OpenStack Magnum k8s autoscaling failed to join the Kubernetes cluster

Bug #2015870 reported by TCSECP
Affects: OpenStack Magnum Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi team,

Please check the error below.

kube-system k8s-keystone-auth-7sqhw 1/1 Running 0 7d5h
kube-system k8s-keystone-auth-857h9 1/1 Running 0 7d5h
kube-system k8s-keystone-auth-h5fkr 1/1 Running 1 7d5h
kube-system kube-dns-autoscaler-5b4c644874-flhxt 0/1 Pending 0 4d20h
kube-system kubernetes-dashboard-7f844d86d6-j54pn 1/1 Running 52 7d5h
kube-system magnum-auto-healer-6nwjq 1/1 Running 158 7d5h
kube-system magnum-auto-healer-lv8kj 1/1 Running 179 7d5h
kube-system magnum-auto-healer-p65cl 1/1 Running 177 7d5h
kube-system magnum-grafana-f5b889c6f-fdwbl 0/2 Pending 0 4d20h
kube-system magnum-kube-state-metrics-6c97c54fd5-l9kt4 0/1 Pending 0 4d20h
kube-system magnum-metrics-server-7cc4fc5c64-478gq 0/1 Pending 0 4d20h
kube-system magnum-prometheus-adapter-648584d96c-v9gcw 0/1 Pending 0 4d20h
kube-system magnum-prometheus-node-exporter-jh4jm 1/1 Running 0 7d5h
kube-system magnum-prometheus-node-exporter-rrcj2 1/1 Running 0 7d5h
kube-system magnum-prometheus-node-exporter-sb6qr 1/1 Running 0 7d5h
kube-system magnum-prometheus-operator-operator-7885b9c9d9-tpjcz 0/2 Pending 0 4d20h
kube-system openstack-cloud-controller-manager-nqmgq 0/1 CrashLoopBackOff 756 7d5h
kube-system openstack-cloud-controller-manager-q5hp8 0/1 CrashLoopBackOff 762 7d5h
kube-system openstack-cloud-controller-manager-vsp8x 0/1 CrashLoopBackOff 765 7d5h
kube-system prometheus-magnum-prometheus-operator-prometheus-0 0/3 Pending 0 4d8h
[root@k8snew-o7diy4pcuaou-master-0 core]# kubectl get nodes -o wide
Error from server: etcdserver: leader changed
[root@k8snew-o7diy4pcuaou-master-0 core]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8snew-o7diy4pcuaou-master-0 Ready master 7d5h v1.21.1 192.168.45.13 192.168.101.17 Fedora CoreOS 32.20201104.3.0 5.8.17-200.fc32.x86_64 docker://19.3.11
k8snew-o7diy4pcuaou-master-1 Ready master 7d5h v1.21.1 192.168.45.12 192.168.101.11 Fedora CoreOS 32.20201104.3.0 5.8.17-200.fc32.x86_64 docker://19.3.11
k8snew-o7diy4pcuaou-master-2 Ready master 7d5h v1.21.1 192.168.45.11 192.168.101.9 Fedora CoreOS 32.20201104.3.0 5.8.17-200.fc32.x86_64 docker://19.3.11
[root@k8snew-o7diy4pcuaou-master-0 core]# kubectl logs -n kube-system cluster-autoscaler-c6c4fc9fd-qbh48
I0411 12:31:25.450095 1 leaderelection.go:242] attempting to acquire leader lease kube-system/cluster-autoscaler...
I0411 12:31:25.474139 1 leaderelection.go:252] successfully acquired lease kube-system/cluster-autoscaler
I0411 12:31:25.527377 1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0411 12:31:25.527413 1 registry.go:150] Registering EvenPodsSpread predicate and priority function
F0411 12:31:26.253479 1 magnum_cloud_provider.go:162] Failed to create magnum manager: unable to access cluster (c8640bfb-563d-449a-b0d3-e8c55bfbe7f2): The service is currently unable to handle the request due to a temporary overloading or maintenance. This is a temporary condition. Try again later.
[root@k8snew-o7diy4pcuaou-master-0 core]#
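For completeness, the crashing openstack-cloud-controller-manager pods could be inspected along these lines; a minimal sketch, with the pod name taken from the listing above and --previous used to read the last crashed container:

    # recent events and restart reasons for one of the crashing pods
    kubectl -n kube-system describe pod openstack-cloud-controller-manager-nqmgq
    # logs of the previously crashed container instance
    kubectl -n kube-system logs openstack-cloud-controller-manager-nqmgq --previous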

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Team, below I have described the autoscaling issue in Magnum.
Issue Summary:
The nodes scale down, but scaling back up fails with the error below.

Template:

openstack coe cluster template create k8s--calico-cinder-auto-health-largef_min4_max5_lb_21_1 --image fedora-coreos-32 --keypair k8s --external-network Magnum-Test --master-lb-enabled --dns-nameserver 8.8.8.8 --master-flavor g1t1.large --flavor g1t1.large --network-driver calico --coe kubernetes --label container_infra_prefix="tcsmagnum.tcsecp.com/tcsmagnum/" --label 'docker_volume_type=az1-stable2' --label 'boot_volume_size=40' --label boot_volume_type=az1-stable2 --docker-volume-size 20 --docker-storage-driver overlay2 --label kube_tag=v1.21.1 --label calico_ipv4pool=10.100.0.0/24 --label flannel_network_subnetlen=28 --label flannel_backend=host-gw --fixed-network 532ebede-e9d0-4ec4-8bf1-abab1e8d786f --fixed-subnet eebe853c-70bb-48f6-8edc-6bb8f92b181e --label metrics_server_enabled=true --label monitoring_enabled=true --label prometheus_adapter_enabled=true --label cinder_csi_enabled=true --label grafana_admin_passwd=linux --volume-driver cinder --label 'auto_healing_enabled=True' --label 'auto_healing_controller=magnum-auto-healer' --label 'auto_scaling_enabled=True' --label 'min_node_count=1' --label 'max_node_count=6' --label 'health_status=True' --label 'health_status_reason=True'
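As a side note, the autoscaling-related labels on the created template can be double-checked with something like the following, assuming the standard -f/-c output options of the OpenStack client:

    # show only the labels of the template created above
    openstack coe cluster template show k8s--calico-cinder-auto-health-largef_min4_max5_lb_21_1 -f json -c labels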

Auto scaler pod logs

I0423 20:06:55.746258 1 scale_down.go:638] Can't retrieve node maynew-mln6rohb3yuf-node-3 from snapshot, removing from unremovable map, err: node not found
I0423 20:07:25.965144 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 20:07:25.965332 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 7.888µs
[root@maynew-mln6rohb3yuf-master-0 core]#

I0423 19:39:25.958460 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:39:25.958687 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 9.151µs
I0423 19:41:25.958877 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:41:25.958919 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 9.119µs
I0423 19:43:25.959125 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:43:25.959655 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 9.968µs
I0423 19:45:25.959969 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:45:25.960640 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 12.974µs
I0423 19:47:25.961019 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:47:25.961073 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.614µs
I0423 19:49:25.961326 1 node_instances_cache.go:156] Start refreshing cloud provider node insta...


Revision history for this message
TCSECP (tcsecp) wrote :

Hi Team,

Please check the COE template I have used for autoscaling; the logs are attached.

openstack coe cluster template create k8s--calico-cinder-auto-health-largef_min4_max5_lb --image fedora-coreos-32 --keypair k8s --external-network Magnum-Test --master-lb-enabled --dns-nameserver 8.8.8.8 --master-flavor g1t1.large --flavor g1t1.large --network-driver calico --coe kubernetes --label container_infra_prefix="tcsmagnum.tcsecp.com/tcsmagnum/" --label 'docker_volume_type=az1-stable2' --label 'boot_volume_size=40' --label boot_volume_type=az1-stable2 --docker-volume-size 20 --docker-storage-driver overlay2 --label kube_tag=v1.18.9 --label calico_ipv4pool=10.100.0.0/24 --label flannel_network_subnetlen=28 --label flannel_backend=host-gw --fixed-network 532ebede-e9d0-4ec4-8bf1-abab1e8d786f --fixed-subnet eebe853c-70bb-48f6-8edc-6bb8f92b181e --label metrics_server_enabled=true --label monitoring_enabled=true --label prometheus_adapter_enabled=true --label cinder_csi_enabled=true --label grafana_admin_passwd=linux --volume-driver cinder --label 'auto_healing_enabled=True' --label 'auto_healing_controller=magnum-auto-healer' --label 'auto_scaling_enabled=True' --label 'min_node_count=4' --label 'max_node_count=5' --label 'health_status=True' --label 'health_status_reason=True'

root@maniltest:~# openstack coe cluster show cca3d284-33d5-478d-ba1f-c967285bbce9 --fit-width
+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| status | CREATE_COMPLETE |
| health_status | None |
| cluster_template_id | 87b12a80-103a-46ea-8df8-b7ef0877abdc |
| node_addresses | ['192.168.101.51', '192.168.101.65', '192.168.101.64', '192.168.101.57'] |
| uuid | cca3d284-33d5-478d-ba1f-c967285bbce9 |
| stack_id | 87de61cb-b949-430b-9c17-8223b3b30244 |
| status_reason | None |
| created_at | 2023-05-04T07:53:07+00:00 |
| updated_at | 2023-05-04T08:07:59+00:00 ...

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Team,

Could you please provide an update on this?

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote (last edit ):

Hi, I'm taking a look into this bug, and I would like to get the steps you are following to reproduce the issue. The bug description includes the command to create the cluster template, but it would help to have the full list of steps: the command you use to create the cluster, and how you induce the scale up and down, so we can attempt to reproduce the problem under similar conditions.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Thanks for your update. I have listed the steps below.

Step 1: Launch the Magnum cluster:

openstack coe cluster create --cluster-template=k8s--calico-cinder-auto-health-largef_min4_max5_lb --master-count 3 --node-count 4 --flavor g1t1.large --master-flavor g1t1.large k8stest

Step 2: Wait until the cluster creation is complete.

Step 3: Log in to one of the worker nodes and stop the kubelet service.

Step 4: The magnum-auto-healer pod then automatically drains the worker node and shuts down the instance.

Step 5: The problem I am facing is that instead of scaling up a replacement node, the k8s cluster only scales down the worker node (a command-level sketch of these steps follows below).
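A rough, purely illustrative command-level sketch of the failing part of this flow, assuming the cluster name from step 1 and that kubelet is managed by systemd as on Fedora CoreOS:

    # step 3: on one worker node, stop the kubelet to make the node unhealthy
    sudo systemctl stop kubelet

    # steps 4-5: from a master, watch the auto-healer drain the node and check whether a replacement is added
    kubectl get nodes -w
    openstack coe cluster show k8stest -c status -c node_count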

Please let me know if you need any further information.

Regards,
Sriramu Desingh.

description: updated
Revision history for this message
TCSECP (tcsecp) wrote :

Hi Team,
Any update on this?

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 2015870] Re: OpenStack Magnum k8s autoscaling failed to join the Kubernetes cluster

> Step 4: The magnum-auto-healer pod then automatically drains the worker
> node and shuts down the instance.
>
> Step 5: The problem I am facing is that instead of scaling up a
> replacement node, the k8s cluster only scales down the worker node.

IIUC, the behavior you expect here is that in a 4-worker environment, the node where you stopped
kubelet gets killed and a new worker node is added to meet the expectation of having a 4-node
cluster running, is this correct?

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Thanks for your update. Yes, it should add a new worker node to meet the minimum worker node count of 4.

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote :

> Thanks for your update. Yes, it should add a new worker node to meet the
> expectation of minimum worker node count 4.

These 2 labels are being passed to the cluster template: --label 'min_node_count=4' --label
'max_node_count=5'

Please provide us the output of the following commands:

1. openstack coe nodegroup list k8stest

2. For each nodegroup, run: openstack coe nodegroup show k8stest <NODEGROUP_NAME>
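If it helps, a possible one-liner to dump every nodegroup in one go, assuming the standard -f value -c name output options of the OpenStack client:

    for ng in $(openstack coe nodegroup list k8stest -f value -c name); do
        openstack coe nodegroup show k8stest "$ng"
    done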

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

I have already provided this earlier in the bug thread; below is the same output again.

root@maniltest:~# openstack coe nodegroup list cca3d284-33d5-478d-ba1f-c967285bbce9
+--------------------------------------+----------------+------------+------------------+------------+-----------------+--------+
| uuid | name | flavor_id | image_id | node_count | status | role |
+--------------------------------------+----------------+------------+------------------+------------+-----------------+--------+
| 98491117-6ab3-463c-8d12-9937c6b5bdc4 | default-master | g1t1.large | fedora-coreos-32 | 1 | CREATE_COMPLETE | master |
| 4bb80d26-d1f6-44b8-994b-1dbd25290a63 | default-worker | g1t1.large | fedora-coreos-32 | 4 | CREATE_COMPLETE | worker |
+--------------------------------------+----------------+------------+------------------+------------+-----------------+--------+
root@maniltest:~# openstack coe nodegroup show cca3d284-33d5-478d-ba1f-c967285bbce9 4bb80d26-d1f6-44b8-994b-1dbd25290a63
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uuid | 4bb80d26-d1f6-44b8-994b-1dbd25290a63 ...

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Please provide an update on this.

Regards,
Sriramu Desingh.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Please provide an update on this.

Regards,
Sriramu Desingh.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

We are waiting for a positive update.

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote :

Hi Sriramu,

I haven't been able to reproduce the issue you are seeing in your environment. I would like to ask you to capture some extra logs so we can understand what is going on.

1. Enable magnum in debugging mode and wait until the config-changed hook completes:

    juju config magnum debug=True

2. Create a new cluster template with the settings you've tested so far.
3. Create a new cluster using the template created previously.
4. Wait until the cluster has fully deployed.
5. Capture the following logs (a collection sketch follows below):
    - on the magnum units, the /var/log/magnum and /var/log/apache2 directories.
    - juju run --unit magnum/leader date  # this allows us to get a reference for the time window we need to analyze.
    - "kubectl logs -n kube-system cluster-autoscaler-XXXX" (replacing XXXX with the suffix of the deployment's pod).

Thanks in advance.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

I have attached the requested logs.

Regards,
Sriramu D

Revision history for this message
TCSECP (tcsecp) wrote :
Revision history for this message
Felipe Reyes (freyes) wrote :

F0411 12:31:26.253479 1 magnum_cloud_provider.go:162] Failed to create magnum manager: unable
to access cluster (c8640bfb-563d-449a-b0d3-e8c55bfbe7f2): The service is currently unable to handle
the request due to a temporary overloading or maintenance. This is a temporary condition. Try again
later.

This autoscaler error references the id c8640bfb-563d-449a-b0d3-e8c55bfbe7f2, which is not
present in the logs shared. Can you please grep for it in your logs and share the relevant bits,
or the whole log file, and I can do it on my own if that works better for you.
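For reference, a minimal way to search for that id on the magnum units, assuming the /var/log/magnum path mentioned earlier and keeping a few lines of context around each hit:

    sudo grep -rn -C 5 'c8640bfb-563d-449a-b0d3-e8c55bfbe7f2' /var/log/magnum/ > autoscaler-id-hits.txt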

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

The Magnum log is around 10 to 29 GB. Kindly share an FTP link so we can upload the logs.

Regards,
Sriramu Desingh.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

I have attached the Magnum log containing the requested ref id.

Regards,
Sriramu Desingh.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Please provide an update.

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote :

Hi,

I've been going through the logs that were handed off internally[0], where I
found some unexpected failures that I believe could be affecting the behaviour
of Magnum. I will list them and explain what they could mean separately.

1. barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen
2. pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
3. oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED [...]
4. ConnectionResetError: [Errno 104] Connection reset by peer
5. keystoneauth1.exceptions.http.Unauthorized: The request you have made requires authentication.

About (1), this error is present in the logs ~14k times[1]; the first occurrence is
on January 27th and the last one in the log is on June 19th. The absence of a
healthy Barbican service prevents Magnum from establishing a connection to k8s,
since that's where the secrets (e.g. private keys) are stored and read from (a
quick check is sketched after the traceback below).

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/magnum/service/periodic.py", line 106, in _update_health_status
    monitor.poll_health_status()
  File "/usr/lib/python3/dist-packages/magnum/drivers/common/k8s_monitor.py", line 55, in poll_health_status
    k8s_api = k8s.create_k8s_api(self.context, self.cluster)
  File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 145, in create_k8s_api
    return K8sAPI(context, cluster)
  File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 114, in __init__
    self.cert_file) = create_client_files(cluster, context)
  File "/usr/lib/python3/dist-packages/magnum/conductor/handlers/common/cert_manager.py", line 159, in create_client_files
    magnum_cert.get_decrypted_private_key()))
  File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/cert_manager.py", line 46, in get_decrypted_private_key
    return operations.decrypt_key(self.get_private_key(),
  File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/barbican_cert_manager.py", line 52, in get_private_key
    return self._cert_container.private_key.payload
  File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 193, in payload
    self._fetch_payload()
  File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 271, in _fetch_payload
    payload = self._api._get_raw(payload_url, headers=headers)
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 83, in _get_raw
    return self.request(path, 'GET', *args, **kwargs).content
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 63, in request
    self._check_status_code(resp)
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 97, in _check_status_code
    raise exceptions.HTTPServerError(
barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen - please contact site administrator.
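As an aside, a quick way to check whether Barbican is serving secret payloads at all, independent of Magnum; a sketch assuming the barbican OpenStack client plugin is installed and that the conductor log lives under /var/log/magnum:

    # count how often the payload error shows up in the conductor log (file name assumed)
    sudo grep -c 'Secret payload retrieval failure' /var/log/magnum/magnum-conductor.log

    # list a few secrets directly from Barbican to confirm the service responds
    openstack secret list --limit 5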

About (2), this is a better understood error in general: the database dropped the
client connection (in this case, the magnum-conductor process). This can happen
for numerous reasons, and more data is needed to understan...

