image cache manager removes used backing files on NFS shared storage

Bug #1906798 reported by Clement G
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
After a site electrical maintenance (powered off for two days), most of the instances using ephemeral storage fail to start with the error "Image <id> could not be found".
The backing files for these instances in the "/var/lib/nova/instances/_base" folder are missing.
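
To confirm the symptom, one way (a hypothetical helper, not part of Nova, assuming qemu-img is installed and the default instance path) is to list instance disks whose qcow2 backing file no longer exists:

    # Hypothetical check: report instance disks whose qcow2 backing file
    # (normally under /var/lib/nova/instances/_base) is missing.
    # Assumes qemu-img is available on the compute host.
    import glob
    import json
    import os
    import subprocess

    for disk in glob.glob('/var/lib/nova/instances/*/disk'):
        info = json.loads(subprocess.check_output(
            ['qemu-img', 'info', '--output=json', disk]))
        backing = info.get('full-backing-filename') or info.get('backing-filename')
        if backing and not os.path.exists(backing):
            print('%s: missing backing file %s' % (disk, backing))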

We are using ephemeral storage shared on NFS. Glance images are rebuilt every day, so most instances do not share a common image.

Cause: after power-on, the first compute to come back up runs _run_image_cache_manager_pass.
First (storage_users.register_storage_use), the compute registers itself in the /var/lib/nova/instances/compute_nodes file.
Then (storage_users.get_storage_users), it reads the compute_nodes file back. Since it is the first host to register in that file in more than 24 hours, it concludes it is the only one using the storage and removes all backing files not attached to the instances it runs.
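
For context, a simplified sketch of the logic involved (approximating, not quoting, nova.virt.storage_users; the file layout and the 24-hour cutoff follow the description above):

    # Simplified sketch, not the actual Nova code.
    import json
    import os
    import time

    TWENTY_FOUR_HOURS = 24 * 3600  # entries older than this are treated as stale

    def register_storage_use(storage_path, hostname):
        # Record (or refresh) this host's entry in the shared compute_nodes file.
        nodes_file = os.path.join(storage_path, 'compute_nodes')
        contents = {}
        if os.path.exists(nodes_file):
            with open(nodes_file) as f:
                contents = json.load(f)
        contents[hostname] = time.time()
        with open(nodes_file, 'w') as f:
            json.dump(contents, f)

    def get_storage_users(storage_path):
        # Return only hosts whose entry is fresher than 24 hours. After a
        # two-day outage every peer's timestamp is stale, so the first compute
        # to restart sees only itself and deletes base files the others need.
        nodes_file = os.path.join(storage_path, 'compute_nodes')
        with open(nodes_file) as f:
            contents = json.load(f)
        now = time.time()
        return [host for host, ts in contents.items()
                if now - ts < TWENTY_FOUR_HOURS]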

I think we should wait for at least "image_cache_manager_interval" to allow time for the other hosts to register in compute_nodes before actually removing base files.
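
One possible shape of that guard (a hypothetical sketch, not an actual Nova patch): remember when this host last (re)registered and skip the removal pass until at least one image_cache_manager_interval has elapsed, so peers have had a chance to refresh their compute_nodes entries.

    # Hypothetical helper illustrating the proposal above.
    import time

    class CacheRemovalGuard:
        def __init__(self, image_cache_manager_interval):
            self.interval = image_cache_manager_interval
            self.registered_at = None

        def mark_registered(self):
            # Call right after register_storage_use() on service start.
            if self.registered_at is None:
                self.registered_at = time.time()

        def safe_to_remove_base_files(self):
            # Only trust compute_nodes once every healthy peer has had at
            # least one full interval to re-register after our own restart.
            return (self.registered_at is not None and
                    time.time() - self.registered_at >= self.interval)

With something like this, the first pass after a restart would only register the host and defer deletion to a later pass.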

Steps to reproduce
==================
Use shared NFS ephemeral storage on all computes.
1. create an instance on each compute, each time from a different glance image
2. remove all these images from glance
3. stop all instances
4. stop all nova-compute services
5. wait for 24 hours
   (alternatively, echo '{}' > /var/lib/nova/instances/compute_nodes && touch -d "$(date '+%Y-%m-%d %H:%M:%S' -d '1 day ago')" /var/lib/nova/instances/_base/* )
6. start all nova-compute services
7. wait for the image cache manager to trigger (~ image_cache_manager_interval, default 40 min)
8. start all instances

Expected result
===============
All instances start

Actual result
=============
All the instances fail to start, except those on one compute

Environment
===========
Tested with nova versions 10.1.0 and 13.0.2,
on libvirt/KVM with shared NFS NetApp storage.

Revision history for this message
Jake Hill (routergod) wrote:

I think I just experienced the same thing with a GFS2 filesystem configuration.

g-s-s is not keeping old images, so this is going to happen.

2021-07-07 19:19:09.370 76758 INFO nova.virt.libvirt.driver [-] [instance: 0578149e-3cf5-4461-813a-d4c19c8ea219] Instance destroyed successfully.
2021-07-07 19:19:09.396 76758 INFO os_vif [req-df6690bd-fe7f-4fe1-9b63-70b0bb16f323 14e990be9c6d4ee7bc164053cd199b03 4ab34c2f0e2849fabd0f6bc1df763d32 - e09db03a5c734790884ce76e1ccb84e3 e09db03a5c734790884ce76e1ccb84e3] Successfully unplugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:b1:f3:33,bridge_name='br-int',has_traffic_filtering=True,id=a307f72b-bcc1-48d0-bdd8-bdbfa115f06e,network=Network(4450b7ec-63d9-4753-9964-dba9e7b10ae4),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapa307f72b-bc')
2021-07-07 19:19:10.078 76758 INFO nova.compute.manager [req-df6690bd-fe7f-4fe1-9b63-70b0bb16f323 14e990be9c6d4ee7bc164053cd199b03 4ab34c2f0e2849fabd0f6bc1df763d32 - e09db03a5c734790884ce76e1ccb84e3 e09db03a5c734790884ce76e1ccb84e3] [instance: 0578149e-3cf5-4461-813a-d4c19c8ea219] Successfully reverted task state from powering-on on failure for instance.
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server [req-df6690bd-fe7f-4fe1-9b63-70b0bb16f323 14e990be9c6d4ee7bc164053cd199b03 4ab34c2f0e2849fabd0f6bc1df763d32 - e09db03a5c734790884ce76e1ccb84e3 e09db03a5c734790884ce76e1ccb84e3] Exception during message handling: nova.exception.ImageNotFound: Image c6396d68-72a5-4897-9eb5-d33484d3f925 could not be found.
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/image/glance.py", line 375, in download
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server image_chunks = self._client.call(
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/image/glance.py", line 190, in call
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server result = getattr(controller, method)(*args, **kwargs)
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/glanceclient/common/utils.py", line 628, in inner
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server return RequestIdProxy(wrapped(*args, **kwargs))
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/glanceclient/v2/images.py", line 249, in data
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server resp, image_meta = self.http_client.get(url)
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/keystoneauth1/adapter.py", line 395, in get
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server return self.request(url, 'GET', **kwargs)
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/glanceclient/common/http.py", line 380, in request
2021-07-07 19:19:10.086 76758 ERROR oslo_messaging.rpc.server return self._handle_response(resp)
2021-07...


Revision history for this message
Uggla (rene-ribaud) wrote:

This bug has been open for quite a long time without activity, so I apologize if you could not get the support you needed.
However, the bug was reported against Pike, which is now out of support.
I propose closing this bug unless you can still reproduce it on a supported release. If so, please provide the new information and set the bug back to New.

Changed in nova:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote:

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired