OpenStack Magnum k8s autoscaling failed to join the Kubernetes cluster

Bug #2015870 reported by TCSECP
Affects: OpenStack Magnum Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi team,

Please check the error below.

kube-system k8s-keystone-auth-7sqhw 1/1 Running 0 7d5h
kube-system k8s-keystone-auth-857h9 1/1 Running 0 7d5h
kube-system k8s-keystone-auth-h5fkr 1/1 Running 1 7d5h
kube-system kube-dns-autoscaler-5b4c644874-flhxt 0/1 Pending 0 4d20h
kube-system kubernetes-dashboard-7f844d86d6-j54pn 1/1 Running 52 7d5h
kube-system magnum-auto-healer-6nwjq 1/1 Running 158 7d5h
kube-system magnum-auto-healer-lv8kj 1/1 Running 179 7d5h
kube-system magnum-auto-healer-p65cl 1/1 Running 177 7d5h
kube-system magnum-grafana-f5b889c6f-fdwbl 0/2 Pending 0 4d20h
kube-system magnum-kube-state-metrics-6c97c54fd5-l9kt4 0/1 Pending 0 4d20h
kube-system magnum-metrics-server-7cc4fc5c64-478gq 0/1 Pending 0 4d20h
kube-system magnum-prometheus-adapter-648584d96c-v9gcw 0/1 Pending 0 4d20h
kube-system magnum-prometheus-node-exporter-jh4jm 1/1 Running 0 7d5h
kube-system magnum-prometheus-node-exporter-rrcj2 1/1 Running 0 7d5h
kube-system magnum-prometheus-node-exporter-sb6qr 1/1 Running 0 7d5h
kube-system magnum-prometheus-operator-operator-7885b9c9d9-tpjcz 0/2 Pending 0 4d20h
kube-system openstack-cloud-controller-manager-nqmgq 0/1 CrashLoopBackOff 756 7d5h
kube-system openstack-cloud-controller-manager-q5hp8 0/1 CrashLoopBackOff 762 7d5h
kube-system openstack-cloud-controller-manager-vsp8x 0/1 CrashLoopBackOff 765 7d5h
kube-system prometheus-magnum-prometheus-operator-prometheus-0 0/3 Pending 0 4d8h
[root@k8snew-o7diy4pcuaou-master-0 core]# kubectl get nodes -o wide
Error from server: etcdserver: leader changed
[root@k8snew-o7diy4pcuaou-master-0 core]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8snew-o7diy4pcuaou-master-0 Ready master 7d5h v1.21.1 192.168.45.13 192.168.101.17 Fedora CoreOS 32.20201104.3.0 5.8.17-200.fc32.x86_64 docker://19.3.11
k8snew-o7diy4pcuaou-master-1 Ready master 7d5h v1.21.1 192.168.45.12 192.168.101.11 Fedora CoreOS 32.20201104.3.0 5.8.17-200.fc32.x86_64 docker://19.3.11
k8snew-o7diy4pcuaou-master-2 Ready master 7d5h v1.21.1 192.168.45.11 192.168.101.9 Fedora CoreOS 32.20201104.3.0 5.8.17-200.fc32.x86_64 docker://19.3.11
[root@k8snew-o7diy4pcuaou-master-0 core]# kubectl logs -n kube-system cluster-autoscaler-c6c4fc9fd-qbh48
I0411 12:31:25.450095 1 leaderelection.go:242] attempting to acquire leader lease kube-system/cluster-autoscaler...
I0411 12:31:25.474139 1 leaderelection.go:252] successfully acquired lease kube-system/cluster-autoscaler
I0411 12:31:25.527377 1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0411 12:31:25.527413 1 registry.go:150] Registering EvenPodsSpread predicate and priority function
F0411 12:31:26.253479 1 magnum_cloud_provider.go:162] Failed to create magnum manager: unable to access cluster (c8640bfb-563d-449a-b0d3-e8c55bfbe7f2): The service is currently unable to handle the request due to a temporary overloading or maintenance. This is a temporary condition. Try again later.
[root@k8snew-o7diy4pcuaou-master-0 core]#
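For completeness, the crashing openstack-cloud-controller-manager pods could be inspected along these lines; a minimal sketch, with the pod name taken from the listing above and --previous used to read the last crashed container:

    # recent events and restart reasons for one of the crashing pods
    kubectl -n kube-system describe pod openstack-cloud-controller-manager-nqmgq
    # logs of the previously crashed container instance
    kubectl -n kube-system logs openstack-cloud-controller-manager-nqmgq --previous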

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Team, below I have described the autoscaling issue in Magnum.
Issue Summary:
The nodes scale down, but scaling back up fails with the error below.

Template:

openstack coe cluster template create k8s--calico-cinder-auto-health-largef_min4_max5_lb_21_1 --image fedora-coreos-32 --keypair k8s --external-network Magnum-Test --master-lb-enabled --dns-nameserver 8.8.8.8 --master-flavor g1t1.large --flavor g1t1.large --network-driver calico --coe kubernetes --label container_infra_prefix="tcsmagnum.tcsecp.com/tcsmagnum/" --label 'docker_volume_type=az1-stable2' --label 'boot_volume_size=40' --label boot_volume_type=az1-stable2 --docker-volume-size 20 --docker-storage-driver overlay2 --label kube_tag=v1.21.1 --label calico_ipv4pool=10.100.0.0/24 --label flannel_network_subnetlen=28 --label flannel_backend=host-gw --fixed-network 532ebede-e9d0-4ec4-8bf1-abab1e8d786f --fixed-subnet eebe853c-70bb-48f6-8edc-6bb8f92b181e --label metrics_server_enabled=true --label monitoring_enabled=true --label prometheus_adapter_enabled=true --label cinder_csi_enabled=true --label grafana_admin_passwd=linux --volume-driver cinder --label 'auto_healing_enabled=True' --label 'auto_healing_controller=magnum-auto-healer' --label 'auto_scaling_enabled=True' --label 'min_node_count=1' --label 'max_node_count=6' --label 'health_status=True' --label 'health_status_reason=True'
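As a side note, the autoscaling-related labels on the created template can be double-checked with something like the following, assuming the standard -f/-c output options of the OpenStack client:

    # show only the labels of the template created above
    openstack coe cluster template show k8s--calico-cinder-auto-health-largef_min4_max5_lb_21_1 -f json -c labels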

Auto scaler pod logs

I0423 20:06:55.746258 1 scale_down.go:638] Can't retrieve node maynew-mln6rohb3yuf-node-3 from snapshot, removing from unremovable map, err: node not found
I0423 20:07:25.965144 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 20:07:25.965332 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 7.888µs
[root@maynew-mln6rohb3yuf-master-0 core]#

I0423 19:39:25.958460 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:39:25.958687 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 9.151µs
I0423 19:41:25.958877 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:41:25.958919 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 9.119µs
I0423 19:43:25.959125 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:43:25.959655 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 9.968µs
I0423 19:45:25.959969 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:45:25.960640 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 12.974µs
I0423 19:47:25.961019 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 19:47:25.961073 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.614µs
I0423 19:49:25.961326 1 node_instances_cache.go:156] Start refreshing cloud provider node insta...


Revision history for this message
TCSECP (tcsecp) wrote :

Hi Team,

Please check the COE template I have used for autoscaling; the logs are attached.

openstack coe cluster template create k8s--calico-cinder-auto-health-largef_min4_max5_lb --image fedora-coreos-32 --keypair k8s --external-network Magnum-Test --master-lb-enabled --dns-nameserver 8.8.8.8 --master-flavor g1t1.large --flavor g1t1.large --network-driver calico --coe kubernetes --label container_infra_prefix="tcsmagnum.tcsecp.com/tcsmagnum/" --label 'docker_volume_type=az1-stable2' --label 'boot_volume_size=40' --label boot_volume_type=az1-stable2 --docker-volume-size 20 --docker-storage-driver overlay2 --label kube_tag=v1.18.9 --label calico_ipv4pool=10.100.0.0/24 --label flannel_network_subnetlen=28 --label flannel_backend=host-gw --fixed-network 532ebede-e9d0-4ec4-8bf1-abab1e8d786f --fixed-subnet eebe853c-70bb-48f6-8edc-6bb8f92b181e --label metrics_server_enabled=true --label monitoring_enabled=true --label prometheus_adapter_enabled=true --label cinder_csi_enabled=true --label grafana_admin_passwd=linux --volume-driver cinder --label 'auto_healing_enabled=True' --label 'auto_healing_controller=magnum-auto-healer' --label 'auto_scaling_enabled=True' --label 'min_node_count=4' --label 'max_node_count=5' --label 'health_status=True' --label 'health_status_reason=True'

root@maniltest:~# openstack coe cluster show cca3d284-33d5-478d-ba1f-c967285bbce9 --fit-width
+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| status | CREATE_COMPLETE |
| health_status | None |
| cluster_template_id | 87b12a80-103a-46ea-8df8-b7ef0877abdc |
| node_addresses | ['192.168.101.51', '192.168.101.65', '192.168.101.64', '192.168.101.57'] |
| uuid | cca3d284-33d5-478d-ba1f-c967285bbce9 |
| stack_id | 87de61cb-b949-430b-9c17-8223b3b30244 |
| status_reason | None |
| created_at | 2023-05-04T07:53:07+00:00 |
| updated_at | 2023-05-04T08:07:59+00:00 ...

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Team,

Could you please provide an update on this?

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote (last edit ):

Hi, I'm taking a look into this bug, and I would like to get the steps you are following to reproduce the issue. The bug description includes the command to create the cluster template, but it would help to have the full list of steps: the command you use to create the cluster, and how you induce the scale up and down, so we can attempt to reproduce the problem under similar conditions.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Thanks for your update. I have listed the steps below.

Step 1: Launch the Magnum cluster:

openstack coe cluster create --cluster-template=k8s--calico-cinder-auto-health-largef_min4_max5_lb --master-count 3 --node-count 4 --flavor g1t1.large --master-flavor g1t1.large k8stest

Step 2: Wait until the cluster creation is complete.

Step 3: Log in to one of the worker nodes and stop the kubelet service.

Step 4: The magnum-auto-healer pod then automatically drains the worker node and shuts down the instance.

Step 5: The problem I am facing is that instead of scaling up a replacement node, the k8s cluster only scales down the worker node (a command-level sketch of these steps follows below).
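A rough, purely illustrative command-level sketch of the failing part of this flow, assuming the cluster name from step 1 and that kubelet is managed by systemd as on Fedora CoreOS:

    # step 3: on one worker node, stop the kubelet to make the node unhealthy
    sudo systemctl stop kubelet

    # steps 4-5: from a master, watch the auto-healer drain the node and check whether a replacement is added
    kubectl get nodes -w
    openstack coe cluster show k8stest -c status -c node_count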

Please let me know if you need any further information.

Regards,
Sriramu Desingh.

description: updated
Revision history for this message
TCSECP (tcsecp) wrote :

Hi Team,
Any update on this?

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 2015870] Re: OpenStack Magnum k8s autoscaling failed to join the Kubernetes cluster

> Step 4: The magnum-auto-healer pod then automatically drains the worker
> node and shuts down the instance.
>
> Step 5: The problem I am facing is that instead of scaling up a
> replacement node, the k8s cluster only scales down the worker node.

IIUC, the behavior you expect here is that in a 4-worker environment, the node where you stopped
kubelet gets killed and a new worker node is added to meet the expectation of having a 4-node
cluster running, is this correct?

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Thanks for your update. Yes, it should add a new worker node to meet the minimum worker node count of 4.

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote :

> Thanks for your update. Yes, it should add a new worker node to meet the
> expectation of minimum worker node count 4.

These 2 labels are being passed to the cluster template: --label 'min_node_count=4' --label
'max_node_count=5'

Please provide us the output of the following commands:

1. openstack coe nodegroup list k8stest

2. For each nodegroup, run: openstack coe nodegroup show k8stest <NODEGROUP_NAME>
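If it helps, a possible one-liner to dump every nodegroup in one go, assuming the standard -f value -c name output options of the OpenStack client:

    for ng in $(openstack coe nodegroup list k8stest -f value -c name); do
        openstack coe nodegroup show k8stest "$ng"
    done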

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

I have already provided this earlier in the bug thread; below is the same output again.

root@maniltest:~# openstack coe nodegroup list cca3d284-33d5-478d-ba1f-c967285bbce9
+--------------------------------------+----------------+------------+------------------+------------+-----------------+--------+
| uuid | name | flavor_id | image_id | node_count | status | role |
+--------------------------------------+----------------+------------+------------------+------------+-----------------+--------+
| 98491117-6ab3-463c-8d12-9937c6b5bdc4 | default-master | g1t1.large | fedora-coreos-32 | 1 | CREATE_COMPLETE | master |
| 4bb80d26-d1f6-44b8-994b-1dbd25290a63 | default-worker | g1t1.large | fedora-coreos-32 | 4 | CREATE_COMPLETE | worker |
+--------------------------------------+----------------+------------+------------------+------------+-----------------+--------+
root@maniltest:~# openstack coe nodegroup show cca3d284-33d5-478d-ba1f-c967285bbce9 4bb80d26-d1f6-44b8-994b-1dbd25290a63
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uuid | 4bb80d26-d1f6-44b8-994b-1dbd25290a63 ...

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Please provide an update on this.

Regards,
Sriramu Desingh.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Please provide an update on this.

Regards,
Sriramu Desingh.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

We are waiting for a positive update.

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote :

Hi Sriramu,

I haven't been able to reproduce the issue you are seeing in your environment. I would like to ask you to capture some extra logs so we can understand what is going on.

1. Enable magnum in debugging mode and wait until the config-changed hook completes:

    juju config magnum debug=True

2. Create a new cluster template with the settings you've tested so far.
3. Create a new cluster using the template created previously.
4. Wait until the cluster has fully deployed.
5. Capture the following logs (a collection sketch follows below):
    - on the magnum units, the /var/log/magnum and /var/log/apache2 directories.
    - juju run --unit magnum/leader date  # this allows us to get a reference for the time window we need to analyze.
    - "kubectl logs -n kube-system cluster-autoscaler-XXXX" (replacing XXXX with the suffix of the deployment's pod).

Thanks in advance.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

I have attached the requested logs.

Regards,
Sriramu D

Revision history for this message
TCSECP (tcsecp) wrote :
Revision history for this message
Felipe Reyes (freyes) wrote :

F0411 12:31:26.253479 1 magnum_cloud_provider.go:162] Failed to create magnum manager: unable
to access cluster (c8640bfb-563d-449a-b0d3-e8c55bfbe7f2): The service is currently unable to handle
the request due to a temporary overloading or maintenance. This is a temporary condition. Try again
later.

This autoscaler error references the id c8640bfb-563d-449a-b0d3-e8c55bfbe7f2, which is not
present in the logs shared. Can you please grep for it in your logs and share the relevant bits,
or the whole log file, and I can do it on my own if that works better for you.
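For reference, a minimal way to search for that id on the magnum units, assuming the /var/log/magnum path mentioned earlier and keeping a few lines of context around each hit:

    sudo grep -rn -C 5 'c8640bfb-563d-449a-b0d3-e8c55bfbe7f2' /var/log/magnum/ > autoscaler-id-hits.txt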

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

The Magnum log is around 10 to 29 GB. Kindly share an FTP link so we can upload the logs.

Regards,
Sriramu Desingh.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

I have attached the Magnum log containing the requested ref id.

Regards,
Sriramu Desingh.

Revision history for this message
TCSECP (tcsecp) wrote :

Hi Felipe,

Please provide an update.

Regards,
Sriramu Desingh.

Revision history for this message
Felipe Reyes (freyes) wrote :

Hi,

I've been going through the logs that were handed off internally[0], where I
found some unexpected failures that I believe could be affecting the behaviour
of Magnum. I will list them and explain what they could mean separately.

1. barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen
2. pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
3. oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED [...]
4. ConnectionResetError: [Errno 104] Connection reset by peer
5. keystoneauth1.exceptions.http.Unauthorized: The request you have made requires authentication.

About (1), this error is present in the logs ~14k times[1]; the first occurrence is
on January 27th and the last one in the log is on June 19th. The absence of a
healthy Barbican service prevents Magnum from establishing a connection to k8s,
since that's where the secrets (e.g. private keys) are stored and read from (a
quick check is sketched after the traceback below).

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/magnum/service/periodic.py", line 106, in _update_health_status
    monitor.poll_health_status()
  File "/usr/lib/python3/dist-packages/magnum/drivers/common/k8s_monitor.py", line 55, in poll_health_status
    k8s_api = k8s.create_k8s_api(self.context, self.cluster)
  File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 145, in create_k8s_api
    return K8sAPI(context, cluster)
  File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 114, in __init__
    self.cert_file) = create_client_files(cluster, context)
  File "/usr/lib/python3/dist-packages/magnum/conductor/handlers/common/cert_manager.py", line 159, in create_client_files
    magnum_cert.get_decrypted_private_key()))
  File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/cert_manager.py", line 46, in get_decrypted_private_key
    return operations.decrypt_key(self.get_private_key(),
  File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/barbican_cert_manager.py", line 52, in get_private_key
    return self._cert_container.private_key.payload
  File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 193, in payload
    self._fetch_payload()
  File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 271, in _fetch_payload
    payload = self._api._get_raw(payload_url, headers=headers)
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 83, in _get_raw
    return self.request(path, 'GET', *args, **kwargs).content
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 63, in request
    self._check_status_code(resp)
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 97, in _check_status_code
    raise exceptions.HTTPServerError(
barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen - please contact site administrator.
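As an aside, a quick way to check whether Barbican is serving secret payloads at all, independent of Magnum; a sketch assuming the barbican OpenStack client plugin is installed and that the conductor log lives under /var/log/magnum:

    # count how often the payload error shows up in the conductor log (file name assumed)
    sudo grep -c 'Secret payload retrieval failure' /var/log/magnum/magnum-conductor.log

    # list a few secrets directly from Barbican to confirm the service responds
    openstack secret list --limit 5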

About (2), this is a better understood error in general: the database dropped the
client connection (in this case, the magnum-conductor process). This can happen
for numerous reasons, and more data is needed to understan...

