ComputeManager._run_image_cache_manager_pass times out when running on NFS

Bug #1804262 reported by Bjoern
This bug affects 4 people
Affects                    Status       Importance  Assigned to    Milestone
OpenStack Compute (nova)   In Progress  Medium      Lee Yarwood
  Ocata                    Triaged      Medium      Unassigned
  Pike                     Triaged      Medium      Unassigned
  Queens                   Triaged      Medium      Unassigned
  Rocky                    Triaged      Medium      Unassigned

Bug Description

Description
===========

Under Pike we are operating /var/lib/nova/instances mounted from a clustered NetApp A700 AFF. The share is mounted across the entire nova fleet of currently 29 hosts (10G networking) with ~720 instances.
We are mounting the share with standard NFS options and are considering actimeo as an improvement, unless issues around metadata consistency are to be expected:

host:/share /var/lib/nova/instances nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=xxxx,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=xxxx
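
For illustration, the actimeo change being considered would just add an attribute-cache timeout to the existing option string; the value below is an example only, not something we have tested:

host:/share /var/lib/nova/instances nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,actimeo=30,mountaddr=xxxx,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=xxxx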

But recently we noticed an increase in errors of the form "Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply to message ID ...", which we mitigated by increasing the rpc_response_timeout.
As a result of the increased errors we saw the nova-compute service flapping, which caused other issues, such as volume attachments being delayed or erroring out.
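
For reference, that mitigation is just a nova.conf change on the compute hosts; the value below is illustrative rather than the exact value we settled on:

    [DEFAULT]
    # oslo.messaging default is 60 seconds; raised so slow periodic tasks
    # do not time out waiting for conductor replies
    rpc_response_timeout = 300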

Am I right in assuming that the resource tracker and service updates happen inside the same thread?
What else can we do to prevent these errors?

Actual result
=============
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task [req-73d6cf48-d94a-41e4-a59e-9965fec4972d - - - - -] Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply to message ID 29820aa832354e788c7d50a533823c2a
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task task(self, context)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/compute/manager.py", line 7118, in _run_image_cache_manager_pass
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.driver.manage_image_cache(context, filtered_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7563, in manage_image_cache
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task self.image_cache_manager.update(context, all_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/imagecache.py", line 414, in update
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task running = self._list_running_instances(context, all_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/imagecache.py", line 54, in _list_running_instances
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task context, [instance.uuid for instance in all_instances])
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/objects/block_device.py", line 333, in bdms_by_instance_uuid
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task bdms = cls.get_by_instance_uuids(context, instance_uuids)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 177, in wrapper
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args, kwargs)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task args=args, kwargs=kwargs)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=self.retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task timeout=timeout, retry=retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 566, in send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task retry=retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 555, in _send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task result = self._waiter.wait(msg_id, timeout)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 447, in wait

Expected result
===============
rpc_response_timeout should not need to grow with the number of instances operated under /var/lib/nova/instances

Environment
===========
Ubuntu 16.04.4 LTS (amd64)

pips:
nova==16.1.5.dev57
nova-lxd==16.0.1.dev1
nova-powervm==5.0.4.dev3
python-novaclient==9.1.2

debs:
libvirt-bin 3.6.0-1ubuntu6.8~cloud0
libvirt-clients 3.6.0-1ubuntu6.8~cloud0
libvirt-daemon 3.6.0-1ubuntu6.8~cloud0
libvirt-daemon-system 3.6.0-1ubuntu6.8~cloud0
libvirt0 3.6.0-1ubuntu6.8~cloud0
python-libvirt 3.5.0-1build1~cloud0

Tags: compute rpc
Bjoern (bjoern-t)
description: updated
Revision history for this message
Bjoern (bjoern-t) wrote :

Part of the problem seems to be that the image cache manager runs on all nodes and checks all instances:

2018-11-20 18:02:13.810 34617 INFO nova.compute.manager [req-b2f865a4-1d07-49d9-aefa-d1e6d0331ff3 - - - - -] Running image cache manager for 705 instances

That's from a debug line I added to the _run_image_cache_manager_pass method in manager.py:

        LOG.info('Running image cache manager for %d instances', len(filtered_instances))
        self.driver.manage_image_cache(context, filtered_instances)
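
For context, a rough sketch of where that added line sits in ComputeManager._run_image_cache_manager_pass (paraphrased from the Pike-era manager.py, not an exact copy):

    def _run_image_cache_manager_pass(self, context, all_instances):
        """Run a single pass of the image cache manager."""
        if not self.driver.capabilities.get('has_imagecache', False):
            return

        # Record that this host uses instances_path and find every other
        # host that has registered against the same (shared) location.
        storage_users.register_storage_use(CONF.instances_path, CONF.host)
        nodes = storage_users.get_storage_users(CONF.instances_path)

        # On shared NFS storage this keeps instances from all hosts, which
        # is how a single compute ends up processing ~705 instances.
        filtered_instances = [i for i in all_instances if i.host in nodes]

        # The debug line quoted above was added here
        LOG.info('Running image cache manager for %d instances',
                 len(filtered_instances))
        self.driver.manage_image_cache(context, filtered_instances)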

tags: added: compute rpc
Revision history for this message
Matt Riedemann (mriedem) wrote :

Hi Bjoern, I saw you ask about this in the #openstack-nova IRC channel. Matthew Booth (mdbooth in IRC) is probably the person who is most familiar with the image cache manager code at this time, so I would suggest talking with him about this issue.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Is there something holding a lock that makes this slower when on NFS? You could try getting a GMR to see what is going on in the threads when this timeout occurs:

https://docs.openstack.org/nova/latest/admin/support-compute.html#guru-meditation-reports

Otherwise is it just a matter of pulling in and processing all instances rather than just the ones on a given compute service host?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Ah, I see from that "get_storage_users" code that each compute service writes, under the "instances_path" config option location, a dict keyed by hostname mapping to the last time the check ran for that host. Then _run_image_cache_manager_pass runs over all of the instances across all of those nodes.
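
For illustration, a simplified sketch of that registry mechanism (the real code lives in nova/virt/storage_users.py; file name, locking and error handling are paraphrased here, not exact):

    import json
    import os
    import time

    TWENTY_FOUR_HOURS = 24 * 60 * 60

    def register_storage_use(storage_path, hostname):
        # Record "this host used this shared storage now" in a JSON file
        # kept inside instances_path itself.
        registry = os.path.join(storage_path, 'compute_nodes')
        data = {}
        if os.path.exists(registry):
            with open(registry) as f:
                data = json.load(f)
        data[hostname] = time.time()
        with open(registry, 'w') as f:
            json.dump(data, f)

    def get_storage_users(storage_path):
        # Any host that touched the shared storage recently counts as a
        # user, so on NFS the cache pass sees instances from every host.
        registry = os.path.join(storage_path, 'compute_nodes')
        with open(registry) as f:
            data = json.load(f)
        cutoff = time.time() - TWENTY_FOUR_HOURS
        return [host for host, ts in data.items() if ts > cutoff]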

From the MessagingTimeout traceback, it looks like this is what's killing you since it's pulling BDMs for 705 instances from the database:

https://github.com/openstack/nova/blob/56811efa3583dfa3c03f9e43a9802bbe21e45bbd/nova/virt/imagecache.py#L53

That should probably be chunked for paging somehow, maybe process in groups of 50 or something?
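
Purely as an illustration, such chunking could look roughly like this (the batch size and helper name are hypothetical, not an actual nova patch):

    from nova import objects

    BATCH_SIZE = 50  # illustrative value

    def _bdms_by_instance_uuid_chunked(context, instance_uuids):
        # Issue several smaller conductor/DB calls instead of one huge one,
        # so no single RPC call has to finish within rpc_response_timeout.
        bdms = {}
        for start in range(0, len(instance_uuids), BATCH_SIZE):
            chunk = instance_uuids[start:start + BATCH_SIZE]
            bdms.update(
                objects.BlockDeviceMappingList.bdms_by_instance_uuid(
                    context, chunk))
        return bdms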

Alternatively, just get the BDMs per instance here:

https://github.com/openstack/nova/blob/56811efa3583dfa3c03f9e43a9802bbe21e45bbd/nova/virt/imagecache.py#L80

Also note that code only cares about swap BDMs which are defined with this filter:

https://github.com/openstack/nova/blob/56811efa3583dfa3c03f9e43a9802bbe21e45bbd/nova/block_device.py#L439

So that BDM query could also be optimized to only return swap BDMs.
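
For reference, the swap filter referenced above boils down to roughly the following check (function name and fields paraphrased from nova/block_device.py, not an exact copy):

    def new_format_is_swap(bdm):
        # A swap BDM in the new format: a blank, locally stored disk whose
        # guest format is swap.
        return (bdm.get('source_type') == 'blank' and
                bdm.get('destination_type') == 'local' and
                bdm.get('guest_format') == 'swap')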

Revision history for this message
Matt Riedemann (mriedem) wrote :

I guess https://github.com/openstack/nova/blob/56811efa3583dfa3c03f9e43a9802bbe21e45bbd/nova/virt/imagecache.py#L53 was done specifically to avoid a message storm to conductor and the database:

https://review.openstack.org/#/c/298023/

So you'd either need to limit that and do paging (batches of 50 instances), or pass something down from the compute manager to driver.manage_image_cache to indicate that instances from more than one host are involved (shared storage), in which case the behavior could be different.

The other optimization about only querying for swap BDMs is also something that could be done regardless of chunking.

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/633042

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
Matthew Booth (mbooth-9) wrote :

There's a slightly wider context here, which is that swap files and ephemerals shouldn't be using the image cache anyway: these should always be flat. My original imagebackend series addressed this, but of course to remove this particular code we'd also need an upgrade task, which would be fun to write :/ This is another consequence of that which I hadn't noticed before.

Revision history for this message
Matthew Booth (mbooth-9) wrote :

On second thoughts... I think we should just delete this code right now, and stop aging backing files for swap disks. A slightly more conservative option would be to guard the whole lot in a workaround config option which defaults to off.

My reasoning is that the number of different swap backing files is unlikely to be prolific, and is unlikely to grow over time beyond an initial set. We create one of these things for each swap size specified in a flavor. Firstly, I suspect that this feature isn't widely used at all. But even if it is, the total number of swap sizes on offer isn't going to be large: probably fewer than 10. It's simply not worth this colossal effort to reap them.

Secondly, I may have missed something but I think we're not reaping ephemeral disks. The variety of ephemeral disks is also unlikely to be prolific, but it's going to be more prolific than swap disks, and nobody's screaming about that.

And as I said above, neither of these should have backing files anyway, but that's a harder problem to retrofix.
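
For illustration, the "guard it behind a workaround config option" idea mentioned above could look roughly like this with oslo.config (the option name is invented for this sketch, not an actual nova option):

    from oslo_config import cfg

    workarounds_opts = [
        cfg.BoolOpt('reap_unused_swap_and_ephemeral_images',
                    default=False,
                    help='When disabled (the default), the image cache '
                         'manager skips aging and reaping of swap and '
                         'ephemeral backing files entirely.'),
    ]

    def register_opts(conf):
        conf.register_opts(workarounds_opts, group='workarounds')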

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/633213

Changed in nova:
assignee: Matt Riedemann (mriedem) → Matthew Booth (mbooth-9)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/641093

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/678954

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/678954
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=708c11e96d01157704ba63b9b15a6e695c77ad07
Submitter: Zuul
Branch: master

commit 708c11e96d01157704ba63b9b15a6e695c77ad07
Author: Matt Riedemann <email address hidden>
Date: Tue Aug 27 16:04:57 2019 -0400

    Update help for image_cache_manager_interval option

    The image_cache_manager_interval configuration option help is
    updated to:

    1. Fix formatting on possible values.
    2. Describe the "large number of instances" scenarios to
       watch out for in bug 1804262.
    3. Add related options.
    4. Clarify the "0" default interval value so it is not confused
       with the 2400 default.

    Change-Id: Ia182e758d975e76a6e6f14f29591cb2891d7c83a
    Related-Bug: #1804262

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/633042
Reason: This is in merge conflict now, and given mdbooth's alternative https://review.opendev.org/#/c/641093/ and the lack of urgency in resolving this (or even getting the bug reporter to comment on it), I'm just going to abandon.

Changed in nova:
assignee: Matthew Booth (mbooth-9) → Lee Yarwood (lyarwood)