magnum cluster create stays in "CREATE_IN_PROGRESS" and changes to "CREATE_FAILED" after timeout

Bug #1720816 reported by srujan
This bug affects 3 people
Affects: Magnum
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

The issue is observed with all COEs (Kubernetes, Swarm, and Mesos), and with both the fedora-atomic and ubuntu-mesos images.
magnum --version
2.5.0
Deployed with OpenStack-Ansible:
DISTRIB_ID="OSA"
DISTRIB_RELEASE="15.1.9"
DISTRIB_CODENAME="Ocata"

root@utility-container:~# magnum cluster-list
+--------------------------------------+------------------+------------+------------+--------------+---------------+
| uuid | name | keypair | node_count | master_count | status |
+--------------------------------------+------------------+------------+------------+--------------+---------------+
| b1170284-9ac3-4676-870c-af2597ea6d76 | test-cluster01 | mykey | 1 | 1 | CREATE_FAILED |
| c67e4610-2384-4ad0-bc03-072e6c10713f | test-cluster02 | mykey | 1 | 1 | CREATE_FAILED |
| 5bb46ecd-dd2e-425a-b548-78d6eaf750c8 | test03 | mykey | 1 | 1 | CREATE_FAILED |
| a70d819f-5e94-4e56-ad6f-9be07c4ee58e | mesos-cluster-02 | mykey | 1 | 1 | CREATE_FAILED |
| c8e23ae5-64b8-4ddc-9395-4698bc6f9337 | mesos-cluster-03 | mykey | 1 | 1 | CREATE_FAILED |
| 314e50be-ce9c-4615-8a4b-cccad1800bc7 | mesos-cluster-04 | mykey | 1 | 1 | CREATE_FAILED |
+--------------------------------------+------------------+------------+------------+--------------+---------------+
root@utility-container:~# magnum cluster-show test03
+---------------------+--------------------------------------------------------+
| Property | Value |
+---------------------+--------------------------------------------------------+
| status | CREATE_FAILED |
| cluster_template_id | 9ee0ff65-f81a-432c-972b-5de25ddd51ab |
| node_addresses | [] |
| uuid | 5bb46ecd-dd2e-425a-b548-78d6eaf750c8 |
| stack_id | 75345ad3-609c-4d2f-9bc9-bac20717531a |
| status_reason | Timed out |
| created_at | 2017-09-29T17:31:04+00:00 |
| updated_at | 2017-09-29T18:31:19+00:00 |
| coe_version | v1.5.3 |
| faults | {'0': 'resources[0]: Stack CREATE cancelled', 'kube_masters': 'CREATE aborted (Task create from ResourceGroup "kube_masters" Stack "test03-ikz26ajpwlu3" [75345ad3-609c-4d2f-9bc9-bac20717531a] Timed out)', 'master_wait_condition': 'CREATE aborted (Task create from HeatWaitCondition "master_wait_condition" Stack "test03-ikz26ajpwlu3-kube_masters-lzzz7mqxwgax-0-gnyim76waud2" [a17ebd32-65d3-4ee5-a9e0-17c04f9b07e4] Timed out)'} |
| keypair | mykey |
| api_address | https://:6443 |
| master_addresses | ['10.XX.XX.XX'] |
| create_timeout | 60 |
| node_count | 1 |
| discovery_url | https://discovery.etcd.io/1d542c0feb9ef2e5ead7891e1836f98a |
| master_count | 1 |
| container_version | 1.12.6 |
| name | test03 |
+---------------------+--------------------------------------------------------+
root@utility-container:~# heat stack-list
+--------------------------------------+-------------------------------+---------------+----------------------+--------------+
| id | stack_name | stack_status | creation_time | updated_time |
+--------------------------------------+-------------------------------+---------------+----------------------+--------------+
| cb2501bb-c396-4b6c-a3bb-caa45379068c | test-cluster01-wunzolev6q2n | CREATE_FAILED | 2017-09-29T16:09:15Z | None |
| 6ec23a16-51e8-4380-ba1b-64d0deb975ea | test-cluster02-dh3kpubrq5la | CREATE_FAILED | 2017-09-29T16:45:12Z | None |
| 75345ad3-609c-4d2f-9bc9-bac20717531a | test03-ikz26ajpwlu3 | CREATE_FAILED | 2017-09-29T17:31:17Z | None |
| 61b1c935-984f-4728-a92b-adab0f22ef77 | mesos-cluster-02-lvidk3vjukve | CREATE_FAILED | 2017-09-29T20:13:49Z | None |
| b90874f4-a4ad-4145-990f-96abf29a8534 | mesos-cluster-03-smqttfspyyml | CREATE_FAILED | 2017-09-29T20:35:08Z | None |
| 9ae3ab0a-f837-457f-9b94-5fc2ccec8a21 | mesos-cluster-04-u7l2lktypneb | CREATE_FAILED | 2017-09-29T20:39:21Z | None |
+--------------------------------------+-------------------------------+---------------+----------------------+--------------+
root@utility-container:~# heat stack-show 75345ad3-609c-4d2f-9bc9-bac20717531a
+-----------------------+----------------------------------------------------------+
| Property | Value |
+-----------------------+----------------------------------------------------------+
| capabilities | [] |
| creation_time | 2017-09-29T17:31:17Z |
| deletion_time | None |
| description | This template will boot a Kubernetes cluster with one |
| | or more minions (as specified by the number_of_minions |
| | parameter, which defaults to 1). |
| disable_rollback | True |
| id | 75345ad3-609c-4d2f-9bc9-bac20717531a |
| links | https://10.XX.XX.XX:8004/v1/0df01d5576b5451787a441be0aa819e0/stacks/test03-ikz26ajpwlu3/75345ad3-609c-4d2f-9bc9-bac20717531a (self) |
| notification_topics | [] |
| outputs | [ |
| | { |
| | "output_value": [ |
| | "10.20.30.22" |
| | ], |
| | "output_key": "kube_masters_private", |
| | "description": "This is a list of the \"private\" IP addresses of all the Kubernetes masters.\n" |
| | }, |
| | { |
| | "output_value": [ |
| | "10.XX.XX.XX" |
| | ], |
| | "output_key": "kube_masters", |
| | "description": "This is a list of the \"public\" IP addresses of all the Kubernetes masters. Use these IP addresses to log in to the Kubernetes masters via ssh.\n" |
| | }, |
| | { |
| | "output_value": "", |
| | "output_key": "api_address", |
| | "description": "This is the API endpoint of the Kubernetes cluster. Use this to access the Kubernetes API.\n" |
| | }, |
| | { |
| | "output_value": null, |
| | "output_key": "kube_minions_private", |
| | "description": "This is a list of the \"private\" IP addresses of all the Kubernetes minions.\n" |
| | }, |
| | { |
| | "output_value": null, |
| | "output_key": "kube_minions", |
| | "description": "This is a list of the \"public\" IP addresses of all the Kubernetes minions. Use these IP addresses to log in to the Kubernetes minions via ssh." |
| | }, |
| | { |
| | "output_value": "localhost:5000", |
| | "output_key": "registry_address", |
| | "description": "This is the url of docker registry server where you can store docker images." |
| | } |
| | ] |
| parameters | { |
| | "OS::project_id": "0df01d5576b5451787a441be0aa819e0", |
| | "fixed_network_cidr": "10.0.0.0/24", |
| | "magnum_url": "https://10.XX.XX.XX:9511", |
| | "number_of_masters": "1", |
| | "tenant_name": "0df01d5576b5451787a441be0aa819e0", |
| | "wait_condition_timeout": "6000", |
| | "minion_flavor": "m1.small", |
| | "portal_network_cidr": "10.254.0.0/16", |
| | "auth_url": "https://10.XX.XX.XX:5000/v3/", |
| | "admission_control_list": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota", |
| | "registry_container": "container", |
| | "cluster_uuid": "5bb46ecd-dd2e-425a-b548-78d6eaf750c8", |
| | "kubernetes_port": "6443", |
| | "external_network": "4717b1be-bcb3-4bb7-84eb-f95152b77822", |
| | "trustee_domain_id": "5eea91121d9f4dc6af9ce63310666039", |
| | "flannel_backend": "udp", |
| | "fixed_subnet": "mysubnet", |
| | "region_name": "RegionOne", |
| | "kube_dashboard_enabled": "True", |
| | "kube_dashboard_version": "v1.5.1", |
| | "no_proxy": "", |
| | "registry_port": "5000", |
| | "kube_version": "v1.5.3", |
| | "minions_to_remove": "[]", |
| | "https_proxy": "", |
| | "tls_disabled": "False", |
| | "trust_id": "******", |
| | "volume_driver": "cinder", |
| | "number_of_minions": "1", |
| | "swift_region": "", |
| | "username": "admin", |
| | "http_proxy": "", |
| | "docker_volume_size": "5", |
| | "OS::stack_name": "test03-ikz26ajpwlu3", |
| | "system_pods_timeout": "5", |
| | "insecure_registry_url": "", |
| | "system_pods_initial_delay": "30", |
| | "registry_enabled": "False", |
| | "kube_allow_priv": "true", |
| | "password": "******", |
| | "loadbalancing_protocol": "TCP", |
| | "trustee_password": "******", |
| | "docker_storage_driver": "devicemapper", |
| | "registry_insecure": "True", |
| | "OS::stack_id": "75345ad3-609c-4d2f-9bc9-bac20717531a", |
| | "registry_chunksize": "5242880", |
| | "trustee_user_id": "c563b732e1fd413eaeb6c185a94d48ee", |
| | "network_driver": "flannel", |
| | "fixed_network": "mynet", |
| | "master_flavor": "m1.small", |
| | "trustee_username": "5bb46ecd-dd2e-425a-b548-78d6eaf750c8_0df01d5576b5451787a441be0aa819e0", |
| | "ssh_key_name": "mykey", |
| | "flannel_network_subnetlen": "24", |
| | "flannel_network_cidr": "10.100.0.0/16", |
| | "discovery_url": "https://discovery.etcd.io/1d542c0feb9ef2e5ead7891e1836f98a", |
| | "dns_nameserver": "10.XX.XX.XX", |
| | "server_image": "fedora-atomic-ocata" |
| | } |
| parent | None |
| stack_name | test03-ikz26ajpwlu3 |
| stack_owner | None |
| stack_status | CREATE_FAILED |
| stack_status_reason | Timed out |
| stack_user_project_id | 77c5cd3ac15a422081ff3ed630e9712f |
| tags | null |
| template_description | This template will boot a Kubernetes cluster with one |
| | or more minions (as specified by the number_of_minions |
| | parameter, which defaults to 1). |
| timeout_mins | 60 |
| updated_time | None |
+-----------------------+----------------------------------------------------------+
root@utility-container:~# heat resource-list test03-ikz26ajpwlu3
+-----------------------------+--------------------------------------+-----------------------------------+-----------------+----------------------+
| resource_name | physical_resource_id | resource_type | resource_status | updated_time |
+-----------------------------+--------------------------------------+-----------------------------------+-----------------+----------------------+
| api_address_floating_switch | | Magnum::FloatingIPAddressSwitcher | INIT_COMPLETE | 2017-09-29T17:31:18Z |
| api_address_lb_switch | | Magnum::ApiGatewaySwitcher | INIT_COMPLETE | 2017-09-29T17:31:18Z |
| api_lb | 6bd8bd1e-83f8-4ad7-ad5f-66307b78bda6 | file:///openstack/venvs/magnum-15.1.9/lib/python2.7/site-packages/magnum/drivers/common/templates/lb.yaml | CREATE_COMPLETE | 2017-09-29T17:31:18Z |
| etcd_address_lb_switch | | Magnum::ApiGatewaySwitcher | INIT_COMPLETE | 2017-09-29T17:31:18Z |
| etcd_lb | 1a5d6e23-69c4-40ab-b069-65f51b686212 | file:///openstack/venvs/magnum-15.1.9/lib/python2.7/site-packages/magnum/drivers/common/templates/lb.yaml | CREATE_COMPLETE | 2017-09-29T17:31:18Z |
| kube_masters | ef36144f-646b-4471-bd1d-ee0d9503b82a | OS::Heat::ResourceGroup | CREATE_FAILED | 2017-09-29T17:31:18Z |
| kube_minions | | OS::Heat::ResourceGroup | INIT_COMPLETE | 2017-09-29T17:31:18Z |
| network | a500022c-49aa-417e-bc62-8f31eea9117c | file:///openstack/venvs/magnum-15.1.9/lib/python2.7/site-packages/magnum/drivers/common/templates/network.yaml | CREATE_COMPLETE | 2017-09-29T17:31:18Z |
| secgroup_kube_master | 47694be1-9774-460f-be3e-bd600377b786 | OS::Neutron::SecurityGroup | CREATE_COMPLETE | 2017-09-29T17:31:18Z |
| secgroup_kube_minion | 697e02dc-2b13-4a22-95ad-cca83da72620 | OS::Neutron::SecurityGroup | CREATE_COMPLETE | 2017-09-29T17:31:18Z |
+-----------------------------+--------------------------------------+-----------------------------------+-----------------+----------------------+
root@utility-container:~# heat resource-show test03-ikz26ajpwlu3 ef36144f-646b-4471-bd1d-ee0d9503b82a
+------------------------+------------------------------------------------------+
| Property | Value |
+------------------------+------------------------------------------------------+
| attributes | { |
| | "attributes": null, |
| | "refs": null, |
| | "refs_map": null, |
| | "removed_rsrc_list": [] |
| | } |
| creation_time | 2017-09-29T17:31:18Z |
| description | |
| links | https://10.XX.XX.XX:8004/v1/0df01d5576b5451787a441be0aa819e0/stacks/test03-ikz26ajpwlu3/75345ad3-609c-4d2f-9bc9-bac20717531a (self) |
| | https://10.XX.XX.XX:8004/v1/0df01d5576b5451787a441be0aa819e0/stacks/test03-ikz26ajpwlu3/75345ad3-609c-4d2f-9bc9-bac20717531a (stack) |
| | https://10.XX.XX.XX:8004/v1/0df01d5576b5451787a441be0aa819e0/stacks/test03-ikz26ajpwlu3-kube_masters-lzzz7mqxwgax/ef36144f-646b-4471-bd1d-ee0d9503b82a (nested) |
| logical_resource_id | kube_masters |
| physical_resource_id | ef36144f-646b-4471-bd1d-ee0d9503b82a |
| required_by | api_address_lb_switch |
| | etcd_address_lb_switch |
| resource_name | kube_masters |
| resource_status | CREATE_FAILED |
| resource_status_reason | CREATE aborted (Task create from ResourceGroup "kube_masters" Stack "test03-ikz26ajpwlu3" [75345ad3-609c-4d2f-9bc9-bac20717531a] Timed out) |
| resource_type | OS::Heat::ResourceGroup |
| updated_time | 2017-09-29T17:31:18Z |
+------------------------+------------------------------------------------------+

The Heat logs show that the stack creation failed after the timeout.
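For reference, the failed nested resources can be listed recursively with:

heat resource-list -n 5 test03-ikz26ajpwlu3 | grep -v COMPLETE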
If anyone has hit this issue and resolved it, or knows how it can be resolved, please share your thoughts.
If you need more data, let me know.
Thank you.

Revision history for this message
srujan (srujanreddy) wrote :

I'm able to log in to the created instances, and I see that some services on the master node, such as flanneld, etcd, and docker, are not running. I tried starting them manually, hoping that would help before the timeout, but nothing helped the Magnum cluster creation succeed.
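For example, on the master node (unit names as above; they may differ per image):

sudo systemctl status flanneld etcd docker
sudo systemctl start flanneld etcd docker
sudo journalctl -u etcd --no-pager | tail -n 50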

description: updated
Revision history for this message
srujan (srujanreddy) wrote :

heat.conf file attachment

Revision history for this message
srujan (srujanreddy) wrote :

magnum.conf file attachment

srujan (srujanreddy)
description: updated
srujan (srujanreddy)
description: updated
Revision history for this message
suibin zhang (suizh) wrote :

Same issue here. I am running Pike.

Revision history for this message
srujan (srujanreddy) wrote :

Yeah, I got the same issue on Ocata and Pike environments.

Revision history for this message
suibin zhang (suizh) wrote :

After some trials, I got past that and reached CREATE_COMPLETE (maybe it is not the same issue).

In my case, I run Packstack on an m4.xlarge EC2 instance. It does take a while (>40 min) to install and start ONE node from Magnum. Also make sure the OpenStack host has enough memory and disk for the Kube node's flavor (I ended up creating a flavor with 2 GB RAM and a 10 GB disk). Then extend the cluster-create timeout to 120 min. While it is creating, I monitor the OpenStack host's CPU and memory usage.
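For example, something along these lines (cluster and template names are placeholders):

magnum cluster-create --name test-cluster01 \
    --cluster-template k8s-template \
    --node-count 1 \
    --timeout 120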

Revision history for this message
srujan (srujanreddy) wrote :

Thanks for the reply. I'm going to try again and I'll keep you posted.

Revision history for this message
mos (mtsietsi) wrote :

Having the same issue here with OpenStack Ocata (from cloud-archive). The Ubuntu repos for xenial (16.04 LTS) still suffer from the bug reported at https://bugs.launchpad.net/magnum/+bug/1492695, since they are not up to Magnum 5.0.0 yet, so a git clone and source build is necessary. The Kubernetes and Swarm COEs both end in CREATE_FAILED.

Revision history for this message
MarkW (mwutzke) wrote :

I believe I am seeing the same issue, where Magnum (and Heat) have resources in the "CREATE_IN_PROGRESS" state for a long time that eventually time out. I'm using devstack (stable/pike) and the Mesos COE, but I suspect this is seen with other COEs.

After some debugging, I've identified an issue where OS::Heat::SoftwareDeploymentGroup and OS::Heat::SoftwareDeployment resources (at least) fail to signal their completion using heat-config-notify. Specifically, the POST fails with 403 error codes.

It appears (root cause still unknown) that the HTTP request parameters (req.params) passed to EC2Token._authorize() contain more data than the original Heat signature was generated with. As a result, the Keystone ec2tokens request fails.

The additional data in req.params is the data from /var/lib/heat-config/deployed/*.notify.json. I'm not yet sure how it ends up in req.params (more investigation is required).

Attached is a temporary patch that removes these additional keys from req.params (the JSON body of notify.json ends up as the key, and the associated value is '').

I'm keen to see if this patch addresses the concerns seen by others on this issue.

Revision history for this message
MarkW (mwutzke) wrote :

I've done some more digging today, and got a handle on what is happening.

WebOb exposes a synthesized req.params field containing both the query-string and request-body (POST) variables.

Any POSTed request-body variables (e.g. from heat-config-notify for OS::Heat::SoftwareDeployment resources) interfere with the signature calculation, producing an incorrect signature that Keystone rejects.

The attached patch (which replaces the previous patch) uses only the query-string parameters when calculating the signature; a minimal sketch of the idea is below.
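This is illustrative only, not the actual patch; the helper name is made up, and the real change lives in heat's EC2Token middleware:

from webob import Request

# Hypothetical helper showing the fix: take the signature-check parameters
# from the query string only (req.GET), never from req.params, which also
# merges form-encoded POST body variables into the same dict.
def auth_params_from_query_string(req):
    return dict(req.GET.items())

# A heat-config-notify style request: a pre-signed query string plus a JSON
# body that webob folds into req.params as a bogus extra key.
req = Request.blank('/signal?AWSAccessKeyId=abc&Signature=xyz',
                    POST={'{"deploy_stdout": ""}': ''})
assert 'Signature' in auth_params_from_query_string(req)
assert '{"deploy_stdout": ""}' not in auth_params_from_query_string(req)
assert '{"deploy_stdout": ""}' in req.params  # the bug: the body leaks in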

Revision history for this message
mos (mtsietsi) wrote :

Thanks for the patch! I have applied it, and the stack creation progresses further than in previous attempts, but still eventually fails. This occurs on all COEs. Progress is best with Mesos, but it fails at the last step, OS::Heat::WaitCondition. I will keep poking around.
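In case it helps anyone else, the signaling state on the nodes can be inspected roughly like this (paths follow the heat-config layout MarkW mentioned; the os-collect-config unit only exists on images that use it):

sudo tail -n 100 /var/log/cloud-init-output.log
ls /var/lib/heat-config/deployed/
sudo journalctl -u os-collect-config --no-pager | tail -n 50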

Revision history for this message
MarkW (mwutzke) wrote :

Updated patch

Revision history for this message
Amir DHAOUI (amirdhaoui) wrote :

I am facing the same issue with OpenStack Queens; the COE is Kubernetes, and the kube_masters creation fails:

CREATE aborted (Task create from ResourceGroup "kube_masters" Stack "kubernetes-cluster-zlpsjyl35frt" [1c1c0705-6b28-439c-93dd-2f983c9436f4] Timed out)

There are no errors in the Heat or Magnum logs!
Any help would be appreciated.

Revision history for this message
MarkW (mwutzke) wrote :

I do not believe the patch I proposed (#12) has been applied upstream yet.
Have you tried applying this patch locally? Does it help?

Revision history for this message
Sudheendra Harwalkar (sharwalkar) wrote :

We had a similar issue with OpenStack Queens and the Swarm COE. After tracing the code, we observed that magnum-conductor stalls on the connection to the public discovery endpoint while creating the cluster. I changed the discovery URL from https to http in magnum.conf and it worked for us; see the example below.
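Presumably the change looks something like this in magnum.conf (the option name below is the one from the [cluster] group; verify it against your release's documentation):

# magnum.conf
[cluster]
etcd_discovery_service_endpoint_format = http://discovery.etcd.io/new?size=%(size)d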

Revision history for this message
panticz.de (panticz.de) wrote :

Check that swap is disabled on the cluster VMs (flavor) and that Magnum and Heat are configured correctly:

# heat.conf
[DEFAULT]
region_name_for_services = ch-zh1

# magnum.conf
[cinder]
default_docker_volume_type = VT1

For further debugging, connect to the master and minion VMs and run:
sudo systemctl list-units --failed
sudo journalctl -f

Here is a Magnum deployment example with kolla-ansible:
http://www.panticz.de/magnum

Revision history for this message
Henro (henro001) wrote :

I had the same issue where the K8s cluster would be up and running, but the resource would go to `CREATE_FAILED` because of a missing notification.

This fixed it for me. Thanks:

# heat.conf
[DEFAULT]
region_name_for_services = ch-zh1
