Comment 2 for bug 1789484

Revision history for this message
Matt Riedemann (mriedem) wrote :

(2:38:34 PM) mriedem: 2018-08-27 14:30:31.159 | + functions-common:_run_under_systemd:1482 : sudo systemctl start <email address hidden>
(2:38:54 PM) mriedem: 2018-08-27 14:30:32.414 | + functions-common:_run_under_systemd:1482 : sudo systemctl start <email address hidden>

http://logs.openstack.org/93/596493/1/check/cinder-tempest-dsvm-lvm-lio-barbican/d716acf/logs/screen-n-cpu.txt.gz#_Aug_27_14_31_35_678991

Aug 27 14:31:35.678991 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: CRITICAL nova [None req-90f5253f-cd43-47e6-b247-24f3404706e8 None None] Unhandled error: MessagingTimeout: Timed out waiting for a reply to message ID e7cfc4318ed54ad2adad8b98ae5211c9
Aug 27 14:31:35.679306 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova Traceback (most recent call last):
Aug 27 14:31:35.679462 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/bin/nova-compute", line 10, in <module>
Aug 27 14:31:35.679584 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova sys.exit(main())
Aug 27 14:31:35.679704 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/cmd/compute.py", line 57, in main
Aug 27 14:31:35.679836 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova topic=compute_rpcapi.RPC_TOPIC)
Aug 27 14:31:35.679960 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/service.py", line 259, in create
Aug 27 14:31:35.680117 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova periodic_interval_max=periodic_interval_max)
Aug 27 14:31:35.680244 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/service.py", line 128, in __init__
Aug 27 14:31:35.680362 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova self.manager = manager_class(host=self.host, *args, **kwargs)
Aug 27 14:31:35.680475 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/compute/manager.py", line 505, in __init__
Aug 27 14:31:35.680600 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova self.compute_api = compute.API()
Aug 27 14:31:35.680724 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/compute/__init__.py", line 39, in API
Aug 27 14:31:35.680846 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova return importutils.import_object(class_name, *args, **kwargs)
Aug 27 14:31:35.680960 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/lib/python2.7/dist-packages/oslo_utils/importutils.py", line 44, in import_object
Aug 27 14:31:35.681076 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova return import_class(import_str)(*args, **kwargs)
Aug 27 14:31:35.681188 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/compute/api.py", line 266, in __init__
Aug 27 14:31:35.681304 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova self.compute_rpcapi = compute_rpcapi.ComputeAPI()
Aug 27 14:31:35.681439 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/compute/rpcapi.py", line 369, in __init__
Aug 27 14:31:35.681561 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova version_cap = self._determine_version_cap(target)
Aug 27 14:31:35.681671 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/compute/rpcapi.py", line 387, in _determine_version_cap
Aug 27 14:31:35.681784 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova context.get_admin_context(), 'nova-compute')
Aug 27 14:31:35.681923 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 177, in wrapper
Aug 27 14:31:35.682038 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova args, kwargs)
Aug 27 14:31:35.682169 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/opt/stack/new/nova/nova/conductor/rpcapi.py", line 241, in object_class_action_versions
Aug 27 14:31:35.682282 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova args=args, kwargs=kwargs)
Aug 27 14:31:35.682394 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
Aug 27 14:31:35.682506 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova retry=self.retry)
Aug 27 14:31:35.682619 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
Aug 27 14:31:35.682780 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova retry=retry)
Aug 27 14:31:35.682913 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
Aug 27 14:31:35.683042 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova call_monitor_timeout, retry=retry)
Aug 27 14:31:35.683161 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
Aug 27 14:31:35.683274 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova call_monitor_timeout)
Aug 27 14:31:35.683387 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
Aug 27 14:31:35.683495 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova message = self.waiters.get(msg_id, timeout=timeout)
Aug 27 14:31:35.683609 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
Aug 27 14:31:35.683723 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova 'to message ID %s' % msg_id)
Aug 27 14:31:35.683844 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova MessagingTimeout: Timed out waiting for a reply to message ID e7cfc4318ed54ad2adad8b98ae5211c9
Aug 27 14:31:35.683961 ubuntu-xenial-ovh-bhs1-0001572220 nova-compute[1625]: ERROR nova
Aug 27 14:31:35.861191 ubuntu-xenial-ovh-bhs1-0001572220 systemd[1]: <email address hidden>: Main process exited, code=exited, status=1/FAILURE
Aug 27 14:31:35.861466 ubuntu-xenial-ovh-bhs1-0001572220 systemd[1]: <email address hidden>: Unit entered failed state.
Aug 27 14:31:35.861503 ubuntu-xenial-ovh-bhs1-0001572220 systemd[1]: <email address hidden>: Failed with result 'exit-code'.

The failure is triggered when nova is configured with:

[upgrade_levels]
compute=auto

That configuration was enabled in devstack for rocky: https://review.openstack.org/#/c/550108/

In this case, nova-compute is asking the cell conductor for the minimum nova-compute service version in the cell it's running in, in order to determine the compute RPC API version pin used during rolling upgrades. But apparently n-cond-cell1 isn't ready yet, so the RPC call times out.
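To illustrate the failure mode, here is a simplified, hypothetical sketch of what the "compute=auto" path amounts to; the class and method names here are illustrative stand-ins, not nova's real internals:

```python
class FakeConductorAPI:
    """Stand-in for the cell conductor RPC API (hypothetical)."""
    def get_minimum_service_version(self, binary):
        # Pretend the oldest nova-compute service in the cell is at v30.
        return 30

class ComputeAPI:
    def __init__(self, conductor_api, upgrade_level='auto'):
        if upgrade_level == 'auto':
            # Synchronous RPC round-trip to the cell conductor during
            # __init__. If n-cond-cell1 isn't listening yet, this is where
            # MessagingTimeout is raised, killing the service at startup.
            self.version_cap = conductor_api.get_minimum_service_version(
                'nova-compute')
        else:
            # A pinned upgrade level skips the RPC call entirely.
            self.version_cap = upgrade_level
```

The key point is that with compute=auto the service can't finish constructing its compute API object without a live conductor in its cell.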

A couple of thoughts on fixing this:

1. We could move the compute RPC API init from ComputeManager.__init__ to ComputeManager.post_start_hook, which would buy some additional time while ComputeManager.init_host does its initialization with the hypervisor. That might not be enough time, though, so we could also add a retry loop that handles MessagingTimeout and only fails (killing the service) once we hit a retry timeout. One thing to be aware of: post_start_hook runs after we register the service with the servicegroup API, which means the compute service shows as "up" to the scheduler, so the scheduler (ComputeFilter) could pick it for new workloads while the compute RPC API isn't actually ready, and we could fail in weird ways. Also, it's possible we wouldn't even get to post_start_hook if servicegroup registration fails (because n-cond-cell1 isn't up yet in that cell and we default to the DB servicegroup driver).

2. We could use the (new in rocky) long_rpc_timeout config option, but it's probably not really appropriate here, and it could actually hang service startup if n-cond-cell1 is truly down.
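For reference, option 2 would only be a config change along these lines; the value shown is purely illustrative, and whether the version-cap RPC call would even honor this option depends on the call site using it:

```ini
[DEFAULT]
# New in rocky: a longer timeout applied to select long-running RPC calls.
long_rpc_timeout = 1800
```

As noted above, a long timeout here just trades a fast failure for a slow hang when the cell conductor is genuinely down.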