Cached images incorrectly removed when instance storage comes back online after a prolonged (>= 24 hour) outage

Bug #1554093 reported by Lee Yarwood
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Low
Assigned to: Unassigned

Bug Description

After a prolonged outage of >= 24 hours, any cached images stored on shared instance storage are prone to removal as compute nodes race to complete a pass of the cache manager once the storage returns.

This pass of the cache manager first registers the current node as an active user of the instance store before compiling a list of instances on hosts registered to the instance store. This list is then used to determine which of the cached images can be safely removed.

After a prolonged outage of >= 24 hours, the first compute node to run a cache manager pass will find only itself listed as an active user of the instance store. It can, and likely will, remove cached images for instances hosted on other compute nodes.

IMHO additional care should be taken before calling for the removal of cached images for instances on registered but seemingly inactive compute nodes.
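
For reference, the pass in question amounts to roughly the following (a paraphrase of _run_image_cache_manager_pass in nova/compute/manager.py; simplified, not verbatim):

    def _run_image_cache_manager_pass(self, context):
        # Register this host as an active user of the shared instance store.
        storage_users.register_storage_use(CONF.instances_path, CONF.host)

        # Ask the registry which hosts have checked in recently. After a
        # >= 24 hour outage this may contain only the current host.
        nodes = storage_users.get_storage_users(CONF.instances_path)

        # Only instances on those hosts are protected; cached images
        # belonging to anything else become removal candidates.
        filters = {'deleted': False, 'soft_deleted': True, 'host': nodes}
        filtered_instances = objects.InstanceList.get_by_filters(
            context, filters, expected_attrs=[], use_slave=True)

        self.driver.manage_image_cache(context, filtered_instances)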

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

* Adding tag "compute" as it affects the compute manager at [1].

> After a prolonged outage of >= 24 hours [...]

This is due to the config option:
    DEFAULT.remove_unused_original_minimum_age_seconds = 86400

References:
[1] https://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=56a8fe0cc7339ea08e3044406d67341a616eb843#n6683
[2] http://docs.openstack.org/developer/nova/sample_config.html
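
To illustrate, the age check this option drives amounts to something like the following (a minimal standalone sketch, not the actual nova code):

    import os
    import time

    # Default from nova.conf; base images younger than this are kept.
    MAX_AGE = 86400  # remove_unused_original_minimum_age_seconds

    def old_enough(base_file):
        # A base image only becomes a removal candidate once its mtime
        # is at least MAX_AGE seconds in the past.
        return time.time() - os.path.getmtime(base_file) > MAX_AGE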

tags: added: compute
Changed in nova:
status: New → Incomplete
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Lee Yarwood:
We have different virt drivers which handle the cleanup differently.
Which driver (libvirt, vmware, ...) did you use? Please set the status
back to "new" when you provide this information.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Apologies for not highlighting this originally, but I've yet to reproduce this against master; this report is currently based on an issue reported downstream (RHBZ#1310143).

> We have different virt drivers which handle the cleanup differently.
> Which driver (libvirt, vmware, ...) did you use? Please set the status
> back to "new" when you provide this information.

I'm using the libvirt driver, but IMHO this race starts within the compute manager.

When building the list of filtered_instances we first fetch a list of active hosts. At present this uses a hard-coded value of 24 hours [1] to determine whether a registered node that previously wrote to the compute_nodes file is still active.
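
The check in storage_users.py [1] boils down to something like this (paraphrased; the real code also takes a lock and warns about stale nodes):

    import json
    import os
    import time

    TWENTY_FOUR_HOURS = 60 * 60 * 24  # the hard-coded activity window

    def get_storage_users(storage_path):
        # compute_nodes maps hostname -> timestamp of its last check-in.
        with open(os.path.join(storage_path, 'compute_nodes')) as f:
            nodes = json.load(f)

        # Nodes that have not checked in within the window are silently
        # dropped from the list of "active" storage users.
        now = time.time()
        return [node for node, last_seen in nodes.items()
                if now - last_seen <= TWENTY_FOUR_HOURS]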

This becomes a problem after a prolonged outage, as the first host to run a cache pass will potentially find that it is the only active host using the instance store. All instances on the remaining hosts are then deemed inactive, allowing the removal of any associated cached images, swap files, etc.

>> This is due to the config option:
>> DEFAULT.remove_unused_original_minimum_age_seconds = 86400

This config option ultimately allows these `unused` images to be removed, but IMHO the actual issue still lies between the compute manager and storage_users.

[1] https://git.openstack.org/cgit/openstack/nova/tree/nova/virt/storage_users.py?id=f4dd117a81c68d9fb600dff456e1171841758b43#n76
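
Purely as an illustration of the kind of additional care suggested above (a hypothetical guard, not a proposed patch), the cleanup pass could bail out whenever any registered node looks stale rather than treating it as gone:

    import json
    import os

    def get_all_registered_nodes(storage_path):
        # Hypothetical helper: every node that ever wrote to the
        # compute_nodes file, whether or not it checked in recently.
        with open(os.path.join(storage_path, 'compute_nodes')) as f:
            return list(json.load(f))

    def safe_to_run_cache_pass(storage_path):
        # Skip image cache cleanup if some registered nodes are stale;
        # after an outage they may still own instances on this store.
        registered = set(get_all_registered_nodes(storage_path))
        active = set(get_storage_users(storage_path))  # sketched earlier
        return registered == active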

Changed in nova:
status: Incomplete → New
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

The mentioned downstream bug report is documented at [1]. That report
also describes that the issue occurs when:
* The original image was deleted in glance
* The cached image on the shared storage was deleted
* The instance is rebooted
Commit [2] introduced this logic in the "Grizzly" release to address
bug 1078594 [3]. I haven't reproduced it, but this report sounds valid,
albeit with a very low chance of occurring.

References:
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1310143
[2] https://github.com/openstack/nova/commit/c2de33a0a2132774dc295861cef138ec24bb0cf9
[3] https://bugs.launchpad.net/nova/+bug/1078594

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Anseela M M (anseela-m00) wrote :

Can you please mention the version in which you found this error?

Changed in nova:
assignee: nobody → Anseela M M (anseela-m00)
Revision history for this message
Lee Yarwood (lyarwood) wrote :

The downstream issue was reported against Kilo; however, Newton still appears susceptible to this.

tags: added: image-cache
Changed in nova:
assignee: Anseela M M (anseela-m00) → nobody
Lee Yarwood (lyarwood)
Changed in nova:
status: Confirmed → Invalid