nova-conductor is masking error when rescheduling

Bug #1733933 reported by Dr. Jens Harbott
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Sometimes when build_instance fails on n-cpu, the error that n-cond receives is mangled like this:

Nov 22 17:39:04 jh-devstack-03 nova-conductor[26556]: ERROR nova.scheduler.utils [None req-fd8acb29-8a2c-4603-8786-54f2580d0ea9 tempest-FloatingIpSameNetwork-1597192363 tempest-FloatingIpSameNetwork-1597192363]
[instance: 5ee9d527-0043-474e-bfb3-e6621426662e] Error from last host: jh-devstack-03 (node jh-devstack03): [u'Traceback (most recent call last):\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 1847, in
 _do_build_and_run_instance\n filter_properties)\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 2086, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n',
u"RescheduledException: Build of instance 5ee9d527-0043-474e-bfb3-e6621426662e was re-scheduled: operation failed: domain 'instance-00000028' already exists with uuid
93974d36e3a7-4139bbd8-2d5b51195a5f\n"]
Nov 22 17:39:04 jh-devstack-03 nova-conductor[26556]: WARNING nova.scheduler.utils [None req-fd8acb29-8a2c-4603-8786-54f2580d0ea9 tempest-FloatingIpSameNetwork-1597192363 tempest-FloatingIpSameNetwork-1597192363]
Failed to compute_task_build_instances: No sql_connection parameter is established: CantStartEngineError: No sql_connection parameter is established
Nov 22 17:39:04 jh-devstack-03 nova-conductor[26556]: WARNING nova.scheduler.utils [None req-fd8acb29-8a2c-4603-8786-54f2580d0ea9 tempest-FloatingIpSameNetwork-1597192363 tempest-FloatingIpSameNetwork-1597192363]
[instance: 5ee9d527-0043-474e-bfb3-e6621426662e] Setting instance to ERROR state.: CantStartEngineError: No sql_connection parameter is established

This seems to occur quite often in the gate, too: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Setting%20instance%20to%20ERROR%20state.%3A%20CantStartEngineError%5C%22

The result is that the instance information shows "No sql_connection parameter is established" instead of the original error, making debugging the root cause quite difficult.

Tags: conductor
tags: added: conductor
Jiang (jiangpf)
Changed in nova:
status: New → Confirmed
assignee: nobody → Jiang (jiangpf)
Dr. Jens Harbott (j-harbott) wrote :

I just noticed that this also blocks rescheduling when launching an instance fails on one host, so the instance ends up in error state immediately. I guess this could make it a critical regression.

Jiang (jiangpf) wrote :

After build_instance fails on n-cpu, self._url_cfg['connection'] in oslo_db/sqlalchemy/enginefacade.py has been changed to None.

Dr. Jens Harbott (j-harbott) wrote :

@Jiang: Any progress on this issue? Otherwise please unassign yourself and let someone else take over.

Jiang (jiangpf) wrote :

I am sorry, I cannot find the cause of this problem.

Changed in nova:
assignee: Jiang (jiangpf) → nobody
Dr. Jens Harbott (j-harbott) wrote :

Note that this doesn't only affect rescheduling. I also get this issue when trying to start an instance with a security group that doesn't exist.

Matt Riedemann (mriedem) wrote :

See: https://docs.openstack.org/nova/latest/user/cellsv2-layout.html

And: https://github.com/openstack-dev/devstack/blob/master/stackrc#L80

The failure on reschedule is default behavior in devstack, controlled by the CELLSV2_SETUP flag. Set that to 'singleconductor' in devstack if you want or need reschedules.

We're working on the fixes in nova long-term in this blueprint in Queens:

https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/return-alternate-hosts.html

Matt Riedemann (mriedem) wrote :

Running nova in a split MQ "super conductor" mode outside of devstack is not required, so I wouldn't consider this a critical regression. It's something that deployers have to opt into based on how they set up their nova deployment.

Changed in nova:
status: Confirmed → Won't Fix
status: Won't Fix → Triaged
importance: Undecided → Low
Matt Riedemann (mriedem) wrote :

Marking this low severity since it's a known issue, but I won't invalidate it in case someone wants to push a change to try to handle the CantStartEngineError so we log the original failure before we stop rescheduling.
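
For anyone picking this up, here is a rough, self-contained sketch of the idea in the comment above: keep the original build failure visible when the reschedule attempt itself fails for lack of database access. This is not nova code. The CantStartEngineError class below is a local stand-in for the oslo.db exception of the same name, and fault_message_for and its reschedule argument are hypothetical names invented for illustration.

```python
import logging

LOG = logging.getLogger(__name__)


class CantStartEngineError(Exception):
    """Local stand-in for oslo.db's exception of the same name (assumption)."""


def fault_message_for(original_error, reschedule):
    """Return the message to record on the instance after a failed build.

    original_error: text of the failure reported by the compute host.
    reschedule: hypothetical callable that tries another host and may raise.
    Returns None when the reschedule attempt succeeds.
    """
    # Log the original failure first, before doing anything that may fail.
    LOG.error("Error from last host: %s", original_error)
    try:
        reschedule()
    except CantStartEngineError as exc:
        # No database access from here, so rescheduling is not possible.
        # Report the root cause rather than the secondary error.
        LOG.warning("Cannot reschedule (%s); reporting original failure", exc)
        return original_error
    return None


def _no_db():
    # Simulates the secondary failure seen in the logs of this report.
    raise CantStartEngineError("No sql_connection parameter is established")


message = fault_message_for("domain 'instance-00000028' already exists", _no_db)
```

With this shape, the instance fault would carry the libvirt "already exists" error instead of "No sql_connection parameter is established". Where such handling would actually belong in nova is left open here.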
