Comment 0 for bug 1804262

Bjoern (bjoern-t) wrote :

Description
===========

Under Pike we are operating /var/lib/nova/instances on an NFS share served by a clustered NetApp AFF A700. The share is mounted across the entire nova fleet, currently 29 hosts (10G networking) with ~720 instances.
We are mounting the share with standard NFS options and are considering actimeo as an improvement, unless issues around metadata consistency are to be expected:

host:/share /var/lib/nova/instances nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=xxxx,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=xxxx
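
To make the current attribute-caching setup explicit, here is a minimal Python sketch that splits the option string above and checks whether any attribute-cache option from nfs(5) (actimeo, acregmin, acregmax, acdirmin, acdirmax, noac) is set. The helper name parse_mount_options is made up for illustration and is not part of nova or any NFS tooling.

```python
# Attribute-cache options documented in nfs(5); actimeo sets all four
# acreg*/acdir* values to the same number of seconds.
ATTR_CACHE_OPTS = {"actimeo", "acregmin", "acregmax", "acdirmin", "acdirmax", "noac"}

# Option string copied from the mount line above (addresses redacted as xxxx).
MOUNT_OPTS = (
    "rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,"
    "timeo=600,retrans=2,sec=sys,mountaddr=xxxx,mountvers=3,mountport=635,"
    "mountproto=udp,local_lock=none,addr=xxxx"
)


def parse_mount_options(opts):
    """Split a comma-separated mount option string into a dict.

    Hypothetical helper: flags without a value (e.g. "hard") map to True.
    """
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


opts = parse_mount_options(MOUNT_OPTS)
explicit = sorted(ATTR_CACHE_OPTS.intersection(opts))
print("explicit attribute-cache options:", explicit or "none")
```

With the options as posted, no attribute-cache option is set explicitly, so the client falls back to its default attribute-cache timeouts.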

Recently, however, we noticed an increase in errors of the form "Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply to message ID ..." (full traceback under "Actual result"), which we mitigated by increasing rpc_response_timeout.
As a result of the increased errors, the nova-compute service started flapping, which caused other issues such as volume attachments being delayed or failing.
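
For reference, a rough Python sketch of how the effective rpc_response_timeout could be read back from a nova.conf-style file. The value 180 is purely illustrative and is not the value we deployed; the helper name effective_rpc_timeout is hypothetical. The fallback of 60 seconds is the oslo.messaging default for this option.

```python
import configparser

# oslo.messaging default for rpc_response_timeout, in seconds.
OSLO_DEFAULT_TIMEOUT = 60

# Hypothetical nova.conf fragment; 180 is an illustrative value only.
NOVA_CONF = """
[DEFAULT]
rpc_response_timeout = 180
"""


def effective_rpc_timeout(conf_text):
    """Return rpc_response_timeout from [DEFAULT], or the default if unset."""
    # interpolation=None so '%' characters in the file are not expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return parser.getint(
        "DEFAULT", "rpc_response_timeout", fallback=OSLO_DEFAULT_TIMEOUT
    )


print(effective_rpc_timeout(NOVA_CONF))
```

This only reads the setting. The service itself has to be restarted after the file is changed for a new value to take effect.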

Am I right in assuming that the resource tracker and service updates happen inside the same thread?
What else can we do to prevent these errors?

Actual result
=============
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task [req-73d6cf48-d94a-41e4-a59e-9965fec4972d - - - - -] Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply to message ID 29820aa832354e788c7d50a533823c2a
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task task(self, context)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/compute/manager.py", line 7118, in _run_image_cache_manager_pass
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.driver.manage_image_cache(context, filtered_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7563, in manage_image_cache
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.image_cache_manager.update(context, all_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/imagecache.py", line 414, in update
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task running = self._list_running_instances(context, all_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/imagecache.py", line 54, in _list_running_instances
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task context, [instance.uuid for instance in all_instances])
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/objects/block_device.py", line 333, in bdms_by_instance_uuid
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task bdms = cls.get_by_instance_uuids(context, instance_uuids)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 177, in wrapper
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args, kwargs)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args=args, kwargs=kwargs)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=self.retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task timeout=timeout, retry=retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 566, in send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 555, in _send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task result = self._waiter.wait(msg_id, timeout)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 447, in wait

Expected result
===============
rpc_response_timeout should remain constant regardless of the number of instances operated under /var/lib/nova/instances

Environment
===========
Ubuntu 16.04.4 LTS (amd64)

pips:
nova==16.1.5.dev57
nova-lxd==16.0.1.dev1
nova-powervm==5.0.4.dev3
python-novaclient==9.1.2

debs:
libvirt-bin 3.6.0-1ubuntu6.8~cloud0
libvirt-clients 3.6.0-1ubuntu6.8~cloud0
libvirt-daemon 3.6.0-1ubuntu6.8~cloud0
libvirt-daemon-system 3.6.0-1ubuntu6.8~cloud0
libvirt0 3.6.0-1ubuntu6.8~cloud0
python-libvirt 3.5.0-1build1~cloud0