CantStartEngineError due to host aggregate up-call when boot from volume and [cinder]/cross_az_attach=False

Bug #1781421 reported by Matt Riedemann on 2018-07-12
This bug affects 1 person
Affects                   Importance  Assigned to
OpenStack Compute (nova)  Medium      Unassigned
Pike                      Medium      Unassigned
Queens                    Medium      Unassigned

Bug Description

This is semi-related to bug 1497253 but I found it while triaging that bug to see if it was still an issue since Pike (I don't think it is).

Run devstack with the default superconductor mode configuration, and configure nova-cpu.conf with:

[cinder]
cross_az_attach=False

Then try to boot from volume where nova-compute creates the volume; the boot fails with CantStartEngineError because the cell conductor (n-cond-cell1.service) is not configured to reach the API DB to get host aggregate information.
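
One way to check for this condition is to see whether a conductor's config sets a connection under [api_database]. The following is only a minimal sketch, assuming the standard nova option name [api_database]/connection; the sample config text, credentials and the helper name can_reach_api_db are hypothetical and are not taken from devstack:

```python
import configparser


def can_reach_api_db(conf_text):
    """Return True if the config text sets a non-empty [api_database]/connection."""
    # Disable interpolation so '%' characters in connection URLs are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # fallback covers both a missing section and a missing option.
    value = parser.get("api_database", "connection", fallback="")
    return bool(value.strip())


# Hypothetical config snippets, for illustration only.
SUPERCONDUCTOR_CONF = """
[database]
connection = mysql+pymysql://user:secret@127.0.0.1/nova_cell0
[api_database]
connection = mysql+pymysql://user:secret@127.0.0.1/nova_api
"""

CELL_CONDUCTOR_CONF = """
[database]
connection = mysql+pymysql://user:secret@127.0.0.1/nova_cell1
"""

if __name__ == "__main__":
    print("superconductor can reach API DB:", can_reach_api_db(SUPERCONDUCTOR_CONF))
    print("cell conductor can reach API DB:", can_reach_api_db(CELL_CONDUCTOR_CONF))
```

A conductor for which this returns False cannot serve any request that needs host aggregate data, which is what the up-call in this bug requires.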

Here is a nova boot command to recreate the failure:

$ nova boot --flavor cirros256 --block-device id=e642acfd-4283-458a-b7ea-6c316da3b2ce,source=image,dest=volume,shutdown=remove,size=1,bootindex=0 --poll test-bfv

Where the block device id is the uuid of the cirros image in the devstack env.

This is the failure in the nova-compute logs:

http://paste.openstack.org/show/725723/

972-4b14-93ad-e7b86edc3a26 service nova] [instance: 910509b9-e23a-4b40-bb42-0df7b65bb36e] Getting AZ for instance; instance.host: rocky; instance.availabilty_zone: nova
3-c972-4b14-93ad-e7b86edc3a26 service nova] [instance: 910509b9-e23a-4b40-bb42-0df7b65bb36e] Instance failed block device setup: RemoteError: Remote error: CantStartEngineEr
  File "/opt/stack/nova/nova/conductor/manager.py", line 124, in _object_dispatch\n return getattr(target, method)(*args, **kwargs)\n', u' File "/usr/local/lib/python2.7
b9-e23a-4b40-bb42-0df7b65bb36e] Traceback (most recent call last):
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/compute/manager.py", line 1564, in _prep_block_device
b9-e23a-4b40-bb42-0df7b65bb36e] wait_func=self._await_block_device_map_created)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/virt/block_device.py", line 854, in attach_block_devices
b9-e23a-4b40-bb42-0df7b65bb36e] _log_and_attach(device)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/virt/block_device.py", line 851, in _log_and_attach
b9-e23a-4b40-bb42-0df7b65bb36e] bdm.attach(*attach_args, **attach_kwargs)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/virt/block_device.py", line 747, in attach
b9-e23a-4b40-bb42-0df7b65bb36e] context, instance, volume_api, virt_driver)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/virt/block_device.py", line 46, in wrapped
b9-e23a-4b40-bb42-0df7b65bb36e] ret_val = method(obj, context, *args, **kwargs)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/virt/block_device.py", line 623, in attach
b9-e23a-4b40-bb42-0df7b65bb36e] instance=instance)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/volume/cinder.py", line 504, in check_availability_zone
b9-e23a-4b40-bb42-0df7b65bb36e] instance_az = az.get_instance_availability_zone(context, instance)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/availability_zones.py", line 194, in get_instance_availability_zone
b9-e23a-4b40-bb42-0df7b65bb36e] az = get_host_availability_zone(elevated, host)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/availability_zones.py", line 95, in get_host_availability_zone
b9-e23a-4b40-bb42-0df7b65bb36e] key='availability_zone')
b9-e23a-4b40-bb42-0df7b65bb36e] File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 177, in wrapper
b9-e23a-4b40-bb42-0df7b65bb36e] args, kwargs)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/opt/stack/nova/nova/conductor/rpcapi.py", line 241, in object_class_action_versions
b9-e23a-4b40-bb42-0df7b65bb36e] args=args, kwargs=kwargs)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
b9-e23a-4b40-bb42-0df7b65bb36e] retry=self.retry)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
b9-e23a-4b40-bb42-0df7b65bb36e] retry=retry)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
b9-e23a-4b40-bb42-0df7b65bb36e] call_monitor_timeout, retry=retry)
b9-e23a-4b40-bb42-0df7b65bb36e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 575, in _send
b9-e23a-4b40-bb42-0df7b65bb36e] raise result
b9-e23a-4b40-bb42-0df7b65bb36e] RemoteError: Remote error: CantStartEngineError No sql_connection parameter is established
b9-e23a-4b40-bb42-0df7b65bb36e] [u'Traceback (most recent call last):\n', u' File "/opt/stack/nova/nova/conductor/manager.py", line 124, in _object_dispatch\n return get
b9-e23a-4b40-bb42-0df7b65bb36e]

The log line at the start is my own, added for debugging:

972-4b14-93ad-e7b86edc3a26 service nova] [instance: 910509b9-e23a-4b40-bb42-0df7b65bb36e] Getting AZ for instance; instance.host: rocky; instance.availabilty_zone: nova

But it shows that instance.host and instance.availability_zone are both set. instance.host is set by the instance_claim in the resource tracker, and instance.availability_zone is set by conductor at the top of the schedule_and_build_instances method due to this change in Pike:

https://review.openstack.org/#/c/446053/

So all I have to do to avoid the up-call is this:

diff --git a/nova/availability_zones.py b/nova/availability_zones.py
index 7c8d948..f128d8e 100644
--- a/nova/availability_zones.py
+++ b/nova/availability_zones.py
@@ -165,7 +165,7 @@ def get_availability_zones(context, get_only_available=False,
 def get_instance_availability_zone(context, instance):
     """Return availability zone of specified instance."""
     host = instance.host if 'host' in instance else None
-    if not host:
+    if not host or (host and instance.availability_zone):
         # Likely hasn't reached a viable compute node yet so give back the
         # desired availability_zone in the instance record if the boot request
         # specified one.
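
The effect of that one-line change can be sketched outside of nova. This is a toy model under stated assumptions, not the real code: FakeInstance, lookup_host_az and the two get_instance_az_* functions are hypothetical stand-ins, the real function also takes a context, and the real lookup goes through conductor rather than raising directly. The host and AZ values are the ones from the debug log above.

```python
class CantStartEngineError(Exception):
    """Stand-in for the error raised when no API DB connection is configured."""


class FakeInstance:
    """Toy stand-in for a nova Instance, holding only the fields used here."""

    def __init__(self, host=None, availability_zone=None):
        self.host = host
        self.availability_zone = availability_zone


def lookup_host_az(host):
    """Stand-in for the host aggregate lookup (the up-call to the API DB)."""
    # Simulates a cell conductor that cannot reach the API DB.
    raise CantStartEngineError("No sql_connection parameter is established")


def get_instance_az_current(instance):
    # Current behaviour: once a host is set, always look up its aggregates.
    if not instance.host:
        return instance.availability_zone
    return lookup_host_az(instance.host)


def get_instance_az_patched(instance):
    # Proposed behaviour: "not host or (host and az)" reduces to
    # "not host or az", so a populated AZ short-circuits the lookup.
    if not instance.host or instance.availability_zone:
        return instance.availability_zone
    return lookup_host_az(instance.host)


if __name__ == "__main__":
    inst = FakeInstance(host="rocky", availability_zone="nova")
    print("patched:", get_instance_az_patched(inst))
    try:
        get_instance_az_current(inst)
    except CantStartEngineError as exc:
        print("current:", exc)
```

Note that the patched version still falls through to the lookup when the host is set but instance.availability_zone is empty, so the up-call is avoided only for instances whose AZ was already populated.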

This would also fix #5 in our up-call list:

https://docs.openstack.org/nova/latest/user/cellsv2-layout.html#operations-requiring-upcalls

Fix proposed to branch: master
Review: https://review.openstack.org/582342

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress

Change abandoned by Matt Riedemann on branch: master
Review: https://review.opendev.org/582342
Reason: My fix might be incomplete. It's hard to say without a functional test, written before changing how this code works, to make sure I don't regress something in the API behavior.

Matt Riedemann (mriedem) on 2019-06-19
Changed in nova:
status: In Progress → Confirmed
assignee: Matt Riedemann (mriedem) → nobody