Sometimes when build_instance fails on n-cpu, the error that n-cond receives is mangled like this:
Nov 22 17:39:04 jh-devstack-03 nova-conductor[26556]: ERROR nova.scheduler.utils [None req-fd8acb29-8a2c-4603-8786-54f2580d0ea9 tempest-FloatingIpSameNetwork-1597192363 tempest-FloatingIpSameNetwork-1597192363]
[instance: 5ee9d527-0043-474e-bfb3-e6621426662e] Error from last host: jh-devstack-03 (node jh-devstack03): [u'Traceback (most recent call last):\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 1847, in
_do_build_and_run_instance\n filter_properties)\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 2086, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n',
u"RescheduledException: Build of instance 5ee9d527-0043-474e-bfb3-e6621426662e was re-scheduled: operation failed: domain 'instance-00000028' already exists with uuid
93974d36e3a7-4139bbd8-2d5b51195a5f\n"]
Nov 22 17:39:04 jh-devstack-03 nova-conductor[26556]: WARNING nova.scheduler.utils [None req-fd8acb29-8a2c-4603-8786-54f2580d0ea9 tempest-FloatingIpSameNetwork-1597192363 tempest-FloatingIpSameNetwork-1597192363]
Failed to compute_task_build_instances: No sql_connection parameter is established: CantStartEngineError: No sql_connection parameter is established
Nov 22 17:39:04 jh-devstack-03 nova-conductor[26556]: WARNING nova.scheduler.utils [None req-fd8acb29-8a2c-4603-8786-54f2580d0ea9 tempest-FloatingIpSameNetwork-1597192363 tempest-FloatingIpSameNetwork-1597192363]
[instance: 5ee9d527-0043-474e-bfb3-e6621426662e] Setting instance to ERROR state.: CantStartEngineError: No sql_connection parameter is established
This seems to occur quite often in the gate, too: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Setting%20instance%20to%20ERROR%20state.%3A%20CantStartEngineError%5C%22
The result is that the instance information shows "No sql_connection parameter is established" instead of the original error, making debugging the root cause quite difficult.
I just noticed that this also blocks rescheduling when launching an instance fails on one host, so the instance ends up in ERROR state immediately. I think this could make it a critical regression.
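To illustrate the masking effect described above, here is a minimal, self-contained sketch. It is not Nova code: the exception classes are local stand-ins, and `lookup_in_database` and `handle_build_failure` are hypothetical names. It assumes the failure handler needs a database call that raises, and only shows how the secondary error can end up as the recorded fault while the original build error is dropped and no reschedule happens.

```python
class RescheduledException(Exception):
    """Stand-in for the original build failure reported by the compute host."""


class CantStartEngineError(Exception):
    """Stand-in for the database error raised while handling the failure."""


def lookup_in_database():
    # Hypothetical helper: simulates a database call made without a
    # configured connection.
    raise CantStartEngineError("No sql_connection parameter is established")


def handle_build_failure(original_error):
    """Return the fault message that would be stored on the instance."""
    try:
        # Assumption: the reschedule path needs a database lookup first.
        lookup_in_database()
        return "rescheduled after: %s" % original_error
    except Exception as exc:
        # Only the secondary error is recorded; the original one is lost
        # and the reschedule never takes place.
        return "%s: %s" % (type(exc).__name__, exc)


original = RescheduledException(
    "operation failed: domain 'instance-00000028' already exists")
fault = handle_build_failure(original)
print(fault)
```

In this sketch the stored fault mentions only the database error, which matches the symptom seen in the instance information.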