Activity log for bug #1804262

Date Who What changed Old value New value Message
2018-11-20 16:02:08 Bjoern bug added bug
2018-11-20 16:02:39 Bjoern description Description =========== Under Pike we are operating a /var/lib/nova/instances mounted on a clustered Netapp A700 AFF. The share is mounted across the entire nova fleet of currently 29 hosts (10G networking) with ~ 720 instances. We are mounting the share with standard NFS options are considering actimeo as improvement, unless there are expected issues around metadata consistency issues: host:/share /var/lib/nova/instances nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=xxxx,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=xxxx But recently we noticed an increase of Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply t which we mitigated by increasing the rpc_response_timeout. As the result of the increased errors we saw nova-compute service flapping which caused other issues like volume attachments got delayed or erred out. Am I right with the assumption that the resource tracker and services updates are happening inside the same thread ? What else can we do to prevent these errors ? 
Actual result ============= 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task [req-73d6cf48-d94a-41e4-a59e-9965fec4972d - - - - -] Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply to message ID 29820aa832354e788c7d50a533823c2a 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task task(self, context) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/compute/manager.py", line 7118, in _run_image_cache_manager_pass 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.driver.manage_image_cache(context, filtered_instances) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7563, in manage_image_cache 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.image_cache_manager.update(context, all_instances) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/imagecache.py", line 414, in update 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task running = self._list_running_instances(context, all_instances) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/imagecache.py", line 54, in _list_running_instances 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task context, [instance.uuid for instance in all_instances]) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/objects/block_device.py", line 333, in bdms_by_instance_uuid 
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task bdms = cls.get_by_instance_uuids(context, instance_uuids) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 177, in wrapper 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args, kwargs) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args=args, kwargs=kwargs) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=self.retry) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task timeout=timeout, retry=retry) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 566, in send 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=retry) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 555, in _send 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task result = self._waiter.wait(msg_id, timeout) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 447, in wait Expected result =============== rpc_response_timeout should remain constant 
regardless of instances operated under /var/log Environment =========== Ubuntu 16.04.4 LTS (amd64) pips: nova==16.1.5.dev57 nova-lxd==16.0.1.dev1 nova-powervm==5.0.4.dev3 python-novaclient==9.1.2 debs: libvirt-bin 3.6.0-1ubuntu6.8~cloud0 libvirt-clients 3.6.0-1ubuntu6.8~cloud0 libvirt-daemon 3.6.0-1ubuntu6.8~cloud0 libvirt-daemon-system 3.6.0-1ubuntu6.8~cloud0 libvirt0 3.6.0-1ubuntu6.8~cloud0 python-libvirt 3.5.0-1build1~cloud0
Description =========== Under Pike we are operating a /var/lib/nova/instances mounted on a clustered Netapp A700 AFF. The share is mounted across the entire nova fleet of currently 29 hosts (10G networking) with ~ 720 instances. We are mounting the share with standard NFS options are considering actimeo as improvement, unless there are expected issues around metadata consistency issues: host:/share /var/lib/nova/instances nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=xxxx,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=xxxx But recently we noticed an increase of Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply t which we mitigated by increasing the rpc_response_timeout. As the result of the increased errors we saw nova-compute service flapping which caused other issues like volume attachments got delayed or erred out. Am I right with the assumption that the resource tracker and services updates are happening inside the same thread ? What else can we do to prevent these errors ? 
Actual result ============= 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task [req-73d6cf48-d94a-41e4-a59e-9965fec4972d - - - - -] Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply to message ID 29820aa832354e788c7d50a533823c2a 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task task(self, context) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/compute/manager.py", line 7118, in _run_image_cache_manager_pass 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.driver.manage_image_cache(context, filtered_instances) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7563, in manage_image_cache 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.image_cache_manager.update(context, all_instances) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/imagecache.py", line 414, in update 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task running = self._list_running_instances(context, all_instances) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/imagecache.py", line 54, in _list_running_instances 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task context, [instance.uuid for instance in all_instances]) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/objects/block_device.py", line 333, in bdms_by_instance_uuid 
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task bdms = cls.get_by_instance_uuids(context, instance_uuids) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 177, in wrapper 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args, kwargs) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args=args, kwargs=kwargs) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=self.retry) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task timeout=timeout, retry=retry) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 566, in send 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=retry) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 555, in _send 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task result = self._waiter.wait(msg_id, timeout) 2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 447, in wait Expected result =============== rpc_response_timeout should remain constant 
regardless of instances operated under /var/lib/nova/instances Environment =========== Ubuntu 16.04.4 LTS (amd64) pips: nova==16.1.5.dev57 nova-lxd==16.0.1.dev1 nova-powervm==5.0.4.dev3 python-novaclient==9.1.2 debs: libvirt-bin 3.6.0-1ubuntu6.8~cloud0 libvirt-clients 3.6.0-1ubuntu6.8~cloud0 libvirt-daemon 3.6.0-1ubuntu6.8~cloud0 libvirt-daemon-system 3.6.0-1ubuntu6.8~cloud0 libvirt0 3.6.0-1ubuntu6.8~cloud0 python-libvirt 3.5.0-1build1~cloud0
2019-01-24 13:48:56 Balazs Gibizer tags compute rpc
2019-01-24 15:46:46 Matt Riedemann bug added subscriber Matthew Booth
2019-01-24 16:11:20 Matt Riedemann nova: status New Triaged
2019-01-24 16:11:21 Matt Riedemann nova: importance Undecided Medium
2019-01-24 16:33:58 Mario Fedato bug added subscriber Mario Fedato
2019-01-24 17:32:29 OpenStack Infra nova: status Triaged In Progress
2019-01-24 17:32:29 OpenStack Infra nova: assignee Matt Riedemann (mriedem)
2019-01-24 17:37:24 Matt Riedemann nominated for series nova/queens
2019-01-24 17:37:24 Matt Riedemann bug task added nova/queens
2019-01-24 17:37:24 Matt Riedemann nominated for series nova/rocky
2019-01-24 17:37:24 Matt Riedemann bug task added nova/rocky
2019-01-24 17:37:24 Matt Riedemann nominated for series nova/ocata
2019-01-24 17:37:24 Matt Riedemann bug task added nova/ocata
2019-01-24 17:37:24 Matt Riedemann nominated for series nova/pike
2019-01-24 17:37:24 Matt Riedemann bug task added nova/pike
2019-01-24 17:58:54 Matt Riedemann nova/ocata: status New Triaged
2019-01-24 17:58:57 Matt Riedemann nova/pike: status New Triaged
2019-01-24 17:59:03 Matt Riedemann nova/rocky: status New Triaged
2019-01-24 17:59:12 Matt Riedemann nova/pike: importance Undecided Medium
2019-01-24 17:59:20 Matt Riedemann nova/rocky: importance Undecided Medium
2019-01-24 17:59:24 Matt Riedemann nova/ocata: importance Undecided Medium
2019-01-24 17:59:27 Matt Riedemann nova/queens: importance Undecided Medium
2019-01-24 17:59:34 Matt Riedemann nova/queens: status New Triaged
2019-01-25 14:05:55 OpenStack Infra nova: assignee Matt Riedemann (mriedem) Matthew Booth (mbooth-9)
2020-05-19 21:01:00 OpenStack Infra nova: assignee Matthew Booth (mbooth-9) Lee Yarwood (lyarwood)