Ah I see from that "get_storage_users" code that each compute service writes, to the location given by the "instances_path" config option, a dict keyed by hostname mapping to the last time the check ran for that host. Then _run_image_cache_manager_pass runs over all of those instances across all nodes.
From the MessagingTimeout traceback, it looks like this is what's killing you since it's pulling BDMs for 705 instances from the database:
https://github.com/openstack/nova/blob/56811efa3583dfa3c03f9e43a9802bbe21e45bbd/nova/virt/imagecache.py#L53
That should probably be chunked for paging somehow, maybe process in groups of 50 or something?
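A minimal sketch of what chunking could look like, assuming a fetch_bdms callable that stands in for whatever per-chunk DB/RPC lookup Nova would actually use (the helper names here are hypothetical, not Nova's API):

```python
def chunked(seq, size):
    """Yield successive slices of `seq` of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def get_bdms_in_chunks(instance_uuids, fetch_bdms, chunk_size=50):
    """Collect BDMs for all instances, `chunk_size` instances per query.

    `fetch_bdms` is a hypothetical stand-in for the real lookup; it takes
    a list of instance UUIDs and returns a dict of uuid -> BDM list.
    """
    bdms = {}
    for chunk in chunked(instance_uuids, chunk_size):
        # Several small queries instead of one query for 700+ instances,
        # so no single call risks a MessagingTimeout.
        bdms.update(fetch_bdms(chunk))
    return bdms
```

That keeps each individual query small while still building the same uuid-to-BDMs mapping the cache manager pass consumes.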
Alternatively, just get the BDMs per instance here:
https://github.com/openstack/nova/blob/56811efa3583dfa3c03f9e43a9802bbe21e45bbd/nova/virt/imagecache.py#L80
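The per-instance alternative could be sketched roughly like this, with fetch_bdms_for_instance as a hypothetical stand-in for a single-instance BDM query:

```python
def iter_instances_with_bdms(instances, fetch_bdms_for_instance):
    """Pair each instance with its BDMs, fetched lazily one at a time.

    `fetch_bdms_for_instance` is a hypothetical stand-in: it takes one
    instance UUID and returns that instance's BDM list.
    """
    for instance in instances:
        # One small query per instance inside the loop, rather than
        # pre-fetching BDMs for the entire instance list up front.
        yield instance, fetch_bdms_for_instance(instance["uuid"])
```

The trade-off is more round trips overall, but each one is tiny and the work is spread across the pass instead of front-loaded into one big call.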
Also note that code only cares about swap BDMs which are defined with this filter:
https://github.com/openstack/nova/blob/56811efa3583dfa3c03f9e43a9802bbe21e45bbd/nova/block_device.py#L439
So that BDM query could also be optimized to only return swap BDMs.
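To illustrate, assuming the usual shape of a swap BDM (blank source with guest_format "swap" — check the linked filter for the authoritative definition), the post-query filtering side would look something like:

```python
def is_swap_bdm(bdm):
    """True for a swap BDM, assuming the blank-source/'swap' convention."""
    return (bdm.get("source_type") == "blank"
            and bdm.get("guest_format") == "swap")


def only_swap_bdms(bdms):
    """Keep only swap BDMs from a list of BDM dicts."""
    return [b for b in bdms if is_swap_bdm(b)]
```

Pushing that same predicate down into the DB query (a WHERE clause on those columns) would shrink the result set before it ever crosses RPC, which is the optimization suggested above.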