Actual result
=============
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task [req-73d6cf48-d94a-41e4-a59e-9965fec4972d - - - - -] Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply to message ID 29820aa832354e788c7d50a533823c2a
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task task(self, context)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/compute/manager.py", line 7118, in _run_image_cache_manager_pass
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.driver.manage_image_cache(context, filtered_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7563, in manage_image_cache
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.image_cache_manager.update(context, all_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/imagecache.py", line 414, in update
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task running = self._list_running_instances(context, all_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/imagecache.py", line 54, in _list_running_instances
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task context, [instance.uuid for instance in all_instances])
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/objects/block_device.py", line 333, in bdms_by_instance_uuid
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task bdms = cls.get_by_instance_uuids(context, instance_uuids)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 177, in wrapper
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args, kwargs)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args=args, kwargs=kwargs)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=self.retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task timeout=timeout, retry=retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 566, in send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 555, in _send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task result = self._waiter.wait(msg_id, timeout)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 447, in wait
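To put a number on the "increase" in these errors, the first line of each traceback can be counted per hour and per periodic task. The following is only a minimal sketch: the regular expression assumes the log line layout shown above, and the function name count_timeouts_per_hour is made up for illustration.

```python
import re
from collections import Counter

# Matches only the first line of a periodic-task traceback, e.g.
# "... ERROR oslo_service.periodic_task [req-...] Error during <task>: MessagingTimeout: ..."
PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \d+ ERROR "
    r"oslo_service\.periodic_task \[[^\]]*\] Error during "
    r"(?P<task>[\w.]+): MessagingTimeout"
)


def count_timeouts_per_hour(lines):
    """Count MessagingTimeout errors per (hour, periodic task name)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            hour = match.group("ts")[:13]  # keep "YYYY-MM-DD HH"
            counts[(hour, match.group("task"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_timeouts.py < nova-compute.log
    for (hour, task), n in sorted(count_timeouts_per_hour(sys.stdin).items()):
        print(hour, task, n)
```

Comparing the hourly counts before and after a change (for example a different rpc_response_timeout) gives a rough indication of whether the change helped.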
Expected result
===============
rpc_response_timeout should remain constant regardless of the number of instances operated under /var/log
Description
===========
Under Pike we are operating /var/lib/nova/instances mounted on a clustered NetApp AFF A700. The share is mounted across the entire nova fleet of currently 29 hosts (10G networking) with ~720 instances.
We are mounting the share with standard NFS options and are considering actimeo as an improvement, unless issues around metadata consistency are to be expected:
host:/share /var/lib/nova/instances nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=xxxx,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=xxxx
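For reference, the actimeo change under consideration would look roughly as follows. This is only a sketch: the helper names and the value of 60 seconds are assumptions for illustration, not something we have tested. Per nfs(5), actimeo sets acregmin, acregmax, acdirmin and acdirmax to the same value.

```python
def parse_mount_options(options: str) -> dict:
    """Split a comma-separated NFS option string into a dict.

    Flags without a value (e.g. "rw", "hard") map to True.
    """
    parsed = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def with_actimeo(options: str, seconds: int) -> str:
    """Return the option string with actimeo set to `seconds`."""
    parsed = parse_mount_options(options)
    parsed["actimeo"] = str(seconds)
    return ",".join(k if v is True else f"{k}={v}" for k, v in parsed.items())


# Subset of the options from the mount line above (addresses omitted).
CURRENT = ("rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,"
           "proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none")

if __name__ == "__main__":
    # 60 is a hypothetical value chosen only for this example.
    print(with_actimeo(CURRENT, 60))
```

A longer attribute cache reduces GETATTR traffic to the filer, but other hosts may see stale file metadata for up to that many seconds, which is the consistency concern mentioned above.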
Recently, however, we noticed an increase in errors of the form "Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply t",
which we mitigated by increasing rpc_response_timeout.
As a result of the increased errors, we saw the nova-compute service flapping, which caused other issues such as volume attachments being delayed or erroring out.
Am I right in assuming that the resource tracker and service updates are happening inside the same thread?
What else can we do to prevent these errors?
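For anyone comparing settings, the effective rpc_response_timeout can be read from a nova.conf with a few lines of Python. This is a sketch under assumptions: the value 180 in the sample is hypothetical (not the value we actually set), the function name is made up, and the fallback of 60 seconds is the oslo.messaging default.

```python
import configparser

# oslo.messaging falls back to 60 seconds when the option is not set.
OSLO_DEFAULT_RPC_RESPONSE_TIMEOUT = 60

SAMPLE_NOVA_CONF = """
[DEFAULT]
# Hypothetical raised value, for illustration only.
rpc_response_timeout = 180
"""


def rpc_response_timeout(conf_text: str) -> int:
    """Return rpc_response_timeout from nova.conf text, or the default."""
    # interpolation=None: nova.conf values may contain literal '%' characters.
    # strict=False: tolerate repeated options, which oslo.config allows.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getint(
        "DEFAULT",
        "rpc_response_timeout",
        fallback=OSLO_DEFAULT_RPC_RESPONSE_TIMEOUT,
    )


if __name__ == "__main__":
    print(rpc_response_timeout(SAMPLE_NOVA_CONF))
```

Note that configparser only approximates oslo.config parsing; it is good enough for a quick check of a single integer option.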
Environment
===========
Ubuntu 16.04.4 LTS (amd64)
pips:
nova==16.1.5.dev57
nova-lxd==16.0.1.dev1
nova-powervm==5.0.4.dev3
python-novaclient==9.1.2
debs:
libvirt-bin 3.6.0-1ubuntu6.8~cloud0
libvirt-clients 3.6.0-1ubuntu6.8~cloud0
libvirt-daemon 3.6.0-1ubuntu6.8~cloud0
libvirt-daemon-system 3.6.0-1ubuntu6.8~cloud0
libvirt0 3.6.0-1ubuntu6.8~cloud0
python-libvirt 3.5.0-1build1~cloud0