OpenStack Compute (nova)

nova-conductor is masking error when rescheduling

Bug #1733933 reported by Dr. Jens Harbott on 2017-11-22

This bug report is a duplicate of: Bug #1736946: Conductor: fails to clean up networking resources due to _destroy_build_request CantStartEngineError.

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Triaged	Low	Unassigned

Bug Description

Sometimes when build_instance fails on n-cpu, the error that n-cond receives is mangled like this:

Nov 22 17:39:04 jh-devstack-03 nova-conductor[26556]: ERROR nova.scheduler.utils [None req-fd8acb29-8a2c-4603-8786-54f2580d0ea9 tempest-FloatingIpSameNetwork-1597192363 tempest-FloatingIpSameNetwork-1597192363]
[instance: 5ee9d527-0043-474e-bfb3-e6621426662e] Error from last host: jh-devstack-03 (node jh-devstack03): [u'Traceback (most recent call last):\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 1847, in
_do_build_and_run_instance\n filter_properties)\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 2086, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n',
u"RescheduledException: Build of instance 5ee9d527-0043-474e-bfb3-e6621426662e was re-scheduled: operation failed: domain 'instance-00000028' already exists with uuid
93974d36e3a7-4139bbd8-2d5b51195a5f\n"]
Nov 22 17:39:04 jh-devstack-03 nova-conductor[26556]: WARNING nova.scheduler.utils [None req-fd8acb29-8a2c-4603-8786-54f2580d0ea9 tempest-FloatingIpSameNetwork-1597192363 tempest-FloatingIpSameNetwork-1597192363]
Failed to compute_task_build_instances: No sql_connection parameter is established: CantStartEngineError: No sql_connection parameter is established
Nov 22 17:39:04 jh-devstack-03 nova-conductor[26556]: WARNING nova.scheduler.utils [None req-fd8acb29-8a2c-4603-8786-54f2580d0ea9 tempest-FloatingIpSameNetwork-1597192363 tempest-FloatingIpSameNetwork-1597192363]
[instance: 5ee9d527-0043-474e-bfb3-e6621426662e] Setting instance to ERROR state.: CantStartEngineError: No sql_connection parameter is established

This seems to occur quite often in the gate, too: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Setting%20instance%20to%20ERROR%20state.%3A%20CantStartEngineError%5C%22

The result is that the instance information shows "No sql_connection parameter is established" instead of the original error, making debugging the root cause quite difficult.

Surya Seetharaman (tssurya) on 2017-11-22

tags:	added: conductor

Jiang (jiangpf) on 2017-11-23

Changed in nova:
status:	New → Confirmed
assignee:	nobody → Jiang (jiangpf)

Dr. Jens Harbott (j-harbott) wrote on 2017-11-23:

I just noticed that this also blocks rescheduling when launching an instance fails on one host, so the instance ends up in the error state immediately. I guess this could make it a critical regression.

Jiang (jiangpf) wrote on 2017-11-24:

self._url_cfg['connection'] in oslo_db/sqlalchemy/enginefacade.py has been changed to None after build_instance fails on n-cpu.

Dr. Jens Harbott (j-harbott) wrote on 2017-12-05:

@Jiang: Any progress on this issue? Otherwise please unassign yourself and let someone else take over.

Jiang (jiangpf) wrote on 2017-12-05:

I am sorry, I cannot find the cause of this problem.

Changed in nova:
assignee:	Jiang (jiangpf) → nobody

Dr. Jens Harbott (j-harbott) wrote on 2017-12-06:

Note that this doesn't only affect rescheduling; I'm also getting this issue when trying to start an instance with a security group that doesn't exist.

Matt Riedemann (mriedem) wrote on 2017-12-13:

See: https://docs.openstack.org/nova/latest/user/cellsv2-layout.html

And: https://github.com/openstack-dev/devstack/blob/master/stackrc#L80

The failure on reschedule is the default behavior in devstack, controlled by the CELLSV2_SETUP flag. Set that to 'singleconductor' in devstack if you want / need reschedules.

We're working on the fixes in nova long-term in this blueprint in Queens:

https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/return-alternate-hosts.html

Matt Riedemann (mriedem) wrote on 2017-12-13:

Running nova in a split MQ "super conductor" mode outside of devstack is not required, so I wouldn't consider this a critical regression; it's something that deployers have to opt into based on how they set up their nova deployment.

Changed in nova:
status:	Confirmed → Won't Fix
status:	Won't Fix → Triaged
importance:	Undecided → Low

Matt Riedemann (mriedem) wrote on 2017-12-13:

Marking this low severity since it's a known issue, but won't invalidate it in case someone wants to push a change to try and handle the CantStartEngineError error so we log the original failure before we stop rescheduling.
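
A minimal sketch of the idea in the comment above, not nova's actual code: log the original build failure first, then treat a CantStartEngineError raised during cleanup as a secondary problem so it cannot replace the original message. The CantStartEngineError class here is a local stand-in for the oslo.db exception of the same name, and fault_message_after_cleanup and failing_cleanup are hypothetical names used only for illustration.

```python
import logging

LOG = logging.getLogger(__name__)


class CantStartEngineError(Exception):
    """Local stand-in for oslo.db's exception of the same name."""


def fault_message_after_cleanup(original_error, cleanup):
    """Run cleanup, but always report the original build failure.

    Hypothetical helper: original_error is the exception from the failed
    build, cleanup is a callable that may need database access.
    """
    # Log the real failure first so it survives any later error.
    LOG.error("Build failed: %s", original_error)
    try:
        cleanup()
    except CantStartEngineError as cleanup_error:
        # The cleanup failure is secondary: record it without letting it
        # replace the original error in what gets reported.
        LOG.warning("Cleanup failed, not rescheduling: %s", cleanup_error)
    return str(original_error)


def failing_cleanup():
    """Simulate a cleanup step that has no database connection."""
    raise CantStartEngineError("No sql_connection parameter is established")


message = fault_message_after_cleanup(
    RuntimeError("domain 'instance-00000028' already exists"),
    failing_cleanup,
)
print(message)
```

With this shape, the message reported for the instance is the original build error, while the CantStartEngineError only shows up as a warning in the log.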
