I see this on the current master branch of magnum. It's a race condition: it does not happen on every cluster creation, but it does happen on most. I have a server with 48 cores (24 physical plus hyperthreading), and therefore 48 magnum conductor worker processes. When the issue occurs I actually see one of these tracebacks for each worker, and they all seem to be aligned temporally; the fact that every worker hits it at the same time is another issue, which I will raise separately.
I added some logging to the relevant section of code:
diff --git a/magnum/drivers/heat/driver.py b/magnum/drivers/heat/driver.py
index 5aa5a8e..96563b3 100644
--- a/magnum/drivers/heat/driver.py
+++ b/magnum/drivers/heat/driver.py
@@ -73,6 +73,9 @@ class HeatDriver(driver.Driver):

     def update_cluster_status(self, context, cluster):
         stack_ctx = mag_ctx.make_cluster_context(cluster)
+        LOG.info("MG: cluster %s", dict(cluster))
         poller = HeatPoller(clients.OpenStackClients(stack_ctx), context,
                             cluster, self)
         poller.poll_and_check()
When the error occurs, we see the following output:
2017-07-04 18:58:08.384 41 INFO magnum.drivers.heat.driver [req-c48a6b51-e7c9-4898-956f-512f01ab3d68 - - - - -] MG: cluster {'updated_at': None, 'trustee_user_id': None, 'keypair': u'controller', 'node_count': 1, 'id': 22, 'trust_id': None, 'magnum_cert_ref': None, 'user_id': u'ae1a02bad3a043549b77a64d946a3ec0', 'uuid': 'b78e4f5c-02cf-4208-bc09-ecea3e7c867b', 'cluster_template': ClusterTemplate(apiserver_port=None,cluster_distro='fedora',coe='kubernetes',created_at=2017-07-03T13:07:52Z,dns_nameserver='8.8.8.8',docker_storage_driver='devicemapper',docker_volume_size=3,external_network_id='ilab',fixed_network='p3-internal',fixed_subnet='p3-internal',flavor_id='compute-A-magnum',floating_ip_enabled=True,http_proxy=None,https_proxy=None,id=4,image_id='k8s-fedora-25',insecure_registry=None,keypair_id=None,labels={},master_flavor_id='compute-A-magnum',master_lb_enabled=False,name='k8s-fedora-25',network_driver='flannel',no_proxy=None,project_id='37a2b9a3299e44f8bc3f6dc60007ef76',public=False,registry_enabled=False,server_type='bm',tls_disabled=False,updated_at=None,user_id='ae1a02bad3a043549b77a64d946a3ec0',uuid='ea0ba754-a769-493a-ade7-d25762e4669e',volume_driver=None), 'ca_cert_ref': None, 'api_address': None, 'master_addresses': [], 'create_timeout': 60, 'project_id': u'37a2b9a3299e44f8bc3f6dc60007ef76', 'status': u'CREATE_IN_PROGRESS', 'docker_volume_size': 3, 'master_count': 1, 'node_addresses': [], 'status_reason': None, 'coe_version': None, 'cluster_template_id': u'ea0ba754-a769-493a-ade7-d25762e4669e', 'name': u'k8s-xxx', 'stack_id': None, 'created_at': datetime.datetime(2017, 7, 4, 17, 58, 3, tzinfo=<iso8601.Utc>), 'container_version': None, 'trustee_password': None, 'trustee_username': None, 'discovery_url': None}
When the error does not occur, we see the following:
2017-07-04 19:00:37.368 40 INFO magnum.drivers.heat.driver [req-4cd7235d-c7f4-4497-817c-c9c0a8d57eed - - - - -] MG: cluster {'updated_at': datetime.datetime(2017, 7, 4, 17, 58, 12, tzinfo=<iso8601.Utc>), 'trustee_user_id': u'5f1fbc8933e749df8500e521fb95e746', 'keypair': u'controller', 'node_count': 1, 'id': 22, 'trust_id': u'0c73f85318b94d3dbf8a9110fbc9f983', 'magnum_cert_ref': u'http://10.60.253.128:9311/v1/containers/16b00b96-74f3-4cb5-aac4-9ace755d90c7', 'user_id': u'ae1a02bad3a043549b77a64d946a3ec0', 'uuid': 'b78e4f5c-02cf-4208-bc09-ecea3e7c867b', 'cluster_template': ClusterTemplate(apiserver_port=None,cluster_distro='fedora',coe='kubernetes',created_at=2017-07-03T13:07:52Z,dns_nameserver='8.8.8.8',docker_storage_driver='devicemapper',docker_volume_size=3,external_network_id='ilab',fixed_network='p3-internal',fixed_subnet='p3-internal',flavor_id='compute-A-magnum',floating_ip_enabled=True,http_proxy=None,https_proxy=None,id=4,image_id='k8s-fedora-25',insecure_registry=None,keypair_id=None,labels={},master_flavor_id='compute-A-magnum',master_lb_enabled=False,name='k8s-fedora-25',network_driver='flannel',no_proxy=None,project_id='37a2b9a3299e44f8bc3f6dc60007ef76',public=False,registry_enabled=False,server_type='bm',tls_disabled=False,updated_at=None,user_id='ae1a02bad3a043549b77a64d946a3ec0',uuid='ea0ba754-a769-493a-ade7-d25762e4669e',volume_driver=None), 'ca_cert_ref': u'http://10.60.253.128:9311/v1/containers/a3e7060d-4b44-4cb7-9cf1-57a7b3b226ba', 'api_address': None, 'master_addresses': [], 'create_timeout': 60, 'project_id': u'37a2b9a3299e44f8bc3f6dc60007ef76', 'status': u'CREATE_IN_PROGRESS', 'docker_volume_size': 3, 'master_count': 1, 'node_addresses': [], 'status_reason': None, 'coe_version': None, 'cluster_template_id': u'ea0ba754-a769-493a-ade7-d25762e4669e', 'name': u'k8s-xxx', 'stack_id': u'dd020bdd-0708-413a-8413-820e1e97fa0a', 'created_at': datetime.datetime(2017, 7, 4, 17, 58, 3, tzinfo=<iso8601.Utc>), 'container_version': None, 'trustee_password': u'eTu8bWhJc9LUcrae2E', 'trustee_username': u'b78e4f5c-02cf-4208-bc09-ecea3e7c867b_37a2b9a3299e44f8bc3f6dc60007ef76', 'discovery_url': u'https://discovery.etcd.io/7c9aa5538cf697554feb86c30ee5a921'}
One key difference here is that the cluster's trust fields (trustee_username, trustee_password, trust_id) are all None when the error message occurs, and not None thereafter. This suggests to me that we should not be trying to poll heat if the cluster's trust fields are None.
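To make the idea concrete, below is only a rough sketch of the kind of guard I have in mind, not a tested patch: the early return, its exact condition and the debug message are my own illustration, while the surrounding names (mag_ctx, clients, HeatPoller, LOG) are the ones already used in magnum/drivers/heat/driver.py.

# Sketch of a possible guard in HeatDriver.update_cluster_status()
# (magnum/drivers/heat/driver.py). Illustrative only.
def update_cluster_status(self, context, cluster):
    # The trust fields appear to be populated only some time after the
    # cluster record is created, so a periodic poll can race with that.
    # Skip this cycle if the trustee credentials are not set yet; the
    # next poll will pick the cluster up once they are populated.
    if None in (cluster.trust_id, cluster.trustee_username,
                cluster.trustee_password):
        LOG.debug("Trust fields not yet set for cluster %s, "
                  "skipping heat poll", cluster.uuid)
        return

    stack_ctx = mag_ctx.make_cluster_context(cluster)
    poller = HeatPoller(clients.OpenStackClients(stack_ctx), context,
                        cluster, self)
    poller.poll_and_check()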