Cached images incorrectly removed when instance storage comes back online after a prolonged (>= 24 hour) outage

Bug #1554093 reported by Lee Yarwood
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Low
Assigned to: Unassigned

Bug Description

After a prolonged outage of >= 24 hours, any cached images stored on shared instance storage are prone to removal as compute nodes race to complete a pass of the cache manager once the storage returns.

This pass of the cache manager first registers the current node as an active user of the instance store before compiling a list of instances on hosts registered to the instance store. This list is then used to determine which of the cached images can be safely removed.

After a prolonged outage of >= 24 hours, the first compute node to run a cache manager pass will find only itself listed as an active user of the instance store. It can, and likely will, remove cached images for instances hosted on other compute nodes.

IMHO additional care should be taken before calling for the removal of cached images for instances on registered but seemingly inactive compute nodes.
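
For reference, the pass in question amounts to roughly the following (a paraphrase of _run_image_cache_manager_pass in nova/compute/manager.py; simplified, not verbatim):

    def _run_image_cache_manager_pass(self, context):
        # Register this host as an active user of the shared instance store.
        storage_users.register_storage_use(CONF.instances_path, CONF.host)

        # Ask the registry which hosts have checked in recently. After a
        # >= 24 hour outage this may contain only the current host.
        nodes = storage_users.get_storage_users(CONF.instances_path)

        # Only instances on those hosts are protected; cached images
        # belonging to anything else become removal candidates.
        filters = {'deleted': False, 'soft_deleted': True, 'host': nodes}
        filtered_instances = objects.InstanceList.get_by_filters(
            context, filters, expected_attrs=[], use_slave=True)

        self.driver.manage_image_cache(context, filtered_instances)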

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

* Adding tag "compute" as it affects the compute manager at [1].

> After a prolonged outage of >= 24 hours [...]

This is due to the config option:
    DEFAULT.remove_unused_original_minimum_age_seconds = 86400

References:
[1] https://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=56a8fe0cc7339ea08e3044406d67341a616eb843#n6683
[2] http://docs.openstack.org/developer/nova/sample_config.html
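
To illustrate, the age check this option drives amounts to something like the following (a minimal standalone sketch, not the actual nova code):

    import os
    import time

    # Default from nova.conf; base images younger than this are kept.
    MAX_AGE = 86400  # remove_unused_original_minimum_age_seconds

    def old_enough(base_file):
        # A base image only becomes a removal candidate once its mtime
        # is at least MAX_AGE seconds in the past.
        return time.time() - os.path.getmtime(base_file) > MAX_AGE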

tags: added: compute
Changed in nova:
status: New → Incomplete
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Lee Yarwood:
We have different virt drivers which handle the cleanup differently.
Which driver (libvirt, vmware, ...) did you use? Please set the status
back to "new" when you provide this information.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Apologies for not highlighting this originally, but I've yet to reproduce this against master; this report is currently based on an issue reported downstream (RHBZ#1310143).

> We have different virt drivers which handle the cleanup differently.
> Which driver (libvirt, vmware, ...) did you use? Please set the status
> back to "new" when you provide this information.

I'm using the libvirt driver, but IMHO this race starts within the compute manager.

When building the list of filtered_instances we first fetch a list of active hosts. At present this uses a hard-coded value of 24 hours [1] to determine whether a registered node that previously wrote to the compute_nodes file is still active.
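
The check in storage_users.py [1] boils down to something like this (paraphrased; the real code also takes a lock and warns about stale nodes):

    import json
    import os
    import time

    TWENTY_FOUR_HOURS = 60 * 60 * 24  # the hard-coded activity window

    def get_storage_users(storage_path):
        # compute_nodes maps hostname -> timestamp of its last check-in.
        with open(os.path.join(storage_path, 'compute_nodes')) as f:
            nodes = json.load(f)

        # Nodes that have not checked in within the window are silently
        # dropped from the list of "active" storage users.
        now = time.time()
        return [node for node, last_seen in nodes.items()
                if now - last_seen <= TWENTY_FOUR_HOURS]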

This becomes a problem after a prolonged outage, as the first host to run a cache pass will potentially find that it is the only active host using the instance store. All instances on the remaining hosts are then deemed inactive, allowing the removal of any associated cached images, swap files, etc.

>> This is due to the config option:
>> DEFAULT.remove_unused_original_minimum_age_seconds = 86400

This config option ultimately allows these `unused` images to be removed, but IMHO the actual issue still lies between the compute manager and storage_users.

[1] https://git.openstack.org/cgit/openstack/nova/tree/nova/virt/storage_users.py?id=f4dd117a81c68d9fb600dff456e1171841758b43#n76
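
Purely as an illustration of the kind of additional care suggested above (a hypothetical guard, not a proposed patch), the cleanup pass could bail out whenever any registered node looks stale rather than treating it as gone:

    import json
    import os

    def get_all_registered_nodes(storage_path):
        # Hypothetical helper: every node that ever wrote to the
        # compute_nodes file, whether or not it checked in recently.
        with open(os.path.join(storage_path, 'compute_nodes')) as f:
            return list(json.load(f))

    def safe_to_run_cache_pass(storage_path):
        # Skip image cache cleanup if some registered nodes are stale;
        # after an outage they may still own instances on this store.
        registered = set(get_all_registered_nodes(storage_path))
        active = set(get_storage_users(storage_path))  # sketched earlier
        return registered == active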

Changed in nova:
status: Incomplete → New
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

The mentioned downstream bug report is documented at [1]. That report
also describes that the issue occurs when:
* The original image was deleted in glance
* The cached image on the shared storage was deleted
* The instance is rebooted
Commit [2] introduced this logic in the "Grizzly" release to address
bug 1078594 [3]. I haven't reproduced it, but this report sounds valid,
albeit with a very low chance of occurring.

References:
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1310143
[2] https://github.com/openstack/nova/commit/c2de33a0a2132774dc295861cef138ec24bb0cf9
[3] https://bugs.launchpad.net/nova/+bug/1078594

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Anseela M M (anseela-m00) wrote :

Can you please mention the version in which you found this error?

Changed in nova:
assignee: nobody → Anseela M M (anseela-m00)
Revision history for this message
Lee Yarwood (lyarwood) wrote :

The downstream issue was reported against Kilo; however, Newton still appears susceptible to this.

tags: added: image-cache
Changed in nova:
assignee: Anseela M M (anseela-m00) → nobody
Lee Yarwood (lyarwood)
Changed in nova:
status: Confirmed → Invalid