Image-cache deleting active swap backing images

Bug #1457527 reported by Favyen Bastani
28
This bug affects 5 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Unassigned
Kilo
Fix Released
High
Unassigned

Bug Description

Version: 1:2015.1.0-0ubuntu1~cloud0 (Kilo)

I am having issues with backing images of disk.swap being deleted from the image cache.

Here is part of the nova-compute log, although multiple instances have disk.swap with swap_256 as the base image, the image cache repeatedly tries to delete it:

2015-05-14 08:08:15.080 4319 INFO nova.virt.libvirt.imagecache [req-967c82af-329e-42ff-aade-f4af2b4ba732 - - - - -] Base or swap file too young to remove: /var/lib/nova/instances/_base/swap_256
2015-05-14 08:48:46.209 4319 INFO nova.virt.libvirt.imagecache [req-967c82af-329e-42ff-aade-f4af2b4ba732 - - - - -] Base or swap file too young to remove: /var/lib/nova/instances/_base/swap_256
2015-05-14 09:29:00.814 4319 INFO nova.virt.libvirt.imagecache [req-967c82af-329e-42ff-aade-f4af2b4ba732 - - - - -] Base or swap file too young to remove: /var/lib/nova/instances/_base/swap_256
2015-05-14 10:09:14.351 4319 INFO nova.virt.libvirt.imagecache [req-967c82af-329e-42ff-aade-f4af2b4ba732 - - - - -] Removing base or swap file: /var/lib/nova/instances/_base/swap_256
2015-05-14 16:14:21.340 6479 INFO nova.virt.libvirt.imagecache [req-1d61428f-0b8b-4bae-9293-42ac99dc3f58 - - - - -] Base or swap file too young to remove: /var/lib/nova/instances/_base/swap_256
2015-05-14 16:55:21.195 6479 INFO nova.virt.libvirt.imagecache [req-1d61428f-0b8b-4bae-9293-42ac99dc3f58 - - - - -] Base or swap file too young to remove: /var/lib/nova/instances/_base/swap_256
2015-05-14 17:36:12.260 6479 INFO nova.virt.libvirt.imagecache [req-1d61428f-0b8b-4bae-9293-42ac99dc3f58 - - - - -] Base or swap file too young to remove: /var/lib/nova/instances/_base/swap_256

I am NOT using shared storage for the instances directory, it is sitting on the local filesystem, and instances on the same host node are using the swap base file that the image cache is deleting.

As far as I can tell, it is attempting to delete it immediately after the swap image is created. Swap backing images of the form swap_512_512 are not deleted though (as opposed to just swap_512; I couldn't figure out what the difference is).

Reproduce: Create volume with swap disk
Temporary solution: image_cache_manager_interval = -1

Revision history for this message
Michael Still (mikal) wrote :

The message says the file is too young to be removed (i.e. hasn't been deleted). Is it actually being removed from disk?

Changed in nova:
status: New → Incomplete
Revision history for this message
Michael Still (mikal) wrote :

Also, I think this message should be debug. The code definitely doesn't intend to delete in this case, and it seems to me that this message spamming info is confusing.

Revision history for this message
Michael Still (mikal) wrote :

Oh, then I noticed this slightly different line in the log dump above:

2015-05-14 10:09:14.351 4319 INFO nova.virt.libvirt.imagecache [req-967c82af-329e-42ff-aade-f4af2b4ba732 - - - - -] Removing base or swap file: /var/lib/nova/instances/_base/swap_256

So this is a thing. I am confused why it isn't more widely reported though.

Revision history for this message
Favyen Bastani (fbastani) wrote :

I included the "too young" since AFAIK that indicates the image cache thinks that the image should be deleted, but isn't deleting it yet due to the modify time constraint. But it eventually says "Removing base or swap file", at which point the swap backing image does get deleted. It is recreated when the instance is shutoff and booted back up, but this caused problems for snapshots and such.

Revision history for this message
Favyen Bastani (fbastani) wrote :

Also, the specific code that was added in Kilo and appears to be causing this issue is this (virt/imagecache.py):

            gb = objects.BlockDeviceMappingList.get_by_instance_uuid
            bdms = gb(context, instance.uuid)
            if bdms:
  swap = driver_block_device.convert_swap(bdms)
                if swap:
                    swap_image = 'swap_' + str(swap[0]['swap_size'])
                    used_swap_images.add(swap_image)

I believe the "bdms" was empty for all of the non-volume VMs (using the ephemeral/local root disk storage) and so the swap backing image never gets marked as used, so the image cache thinks it should be deleted.

Revision history for this message
Michael Still (mikal) wrote :

Yeah, thanks for this. I am running a debug OSAD now to try and recreate this problem. I agree that this code looks like it might be the problem.

Changed in nova:
status: Incomplete → In Progress
assignee: nobody → Michael Still (mikalstill)
Revision history for this message
Michael Still (mikal) wrote :

So, I booted a test VM, and it has libvirt XML like this:

    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2" cache="none"/>
      <source file="/var/lib/nova/instances/509e5a0e-96ff-4fa1-9592-2a6840ce785
9/disk"/>
      <target bus="virtio" dev="vda"/>
    </disk>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2" cache="none"/>
      <source file="/var/lib/nova/instances/509e5a0e-96ff-4fa1-9592-2a6840ce7859/disk.swap"/>
      <target bus="virtio" dev="vdb"/>
    </disk>
    <disk type="file" device="cdrom">
      <driver name="qemu" type="raw" cache="none"/>
      <source file="/var/lib/nova/instances/509e5a0e-96ff-4fa1-9592-2a6840ce7859/disk.config"/>
      <target bus="ide" dev="hdd"/>
    </disk>

However, the BDM fetched by nova.virt.imagecache shows a BDM like this:

BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2015-05-21T18:30:41Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=22,image_id='66f2f5e1-b955-4098-a755-fd9d6c9ea388',instance=<?>,instance_uuid=509e5a0e-96ff-4fa1-9592-2a6840ce7859,no_device=False,snapshot_id=None,source_type='image',updated_at=2015-05-21T18:30:42Z,volume_id=None,volume_size=None)

So, the code thinks that the instance isn't using a swap disk and therefore doesn't count that swap file as in use.

I'm still digging into this.

Changed in nova:
importance: Undecided → Critical
Revision history for this message
Michael Still (mikal) wrote :

More digging. The swap disk also doesn't appear in the block device mapping table in nova. I am starting to think that swap is a special case in the driver spawn code and not persisted to the database? Still digging.

Revision history for this message
Michael Still (mikal) wrote :

Sorry for the delay here, I didn't get much chasing of this done on the long haul flight home post summit. I'm taking another look today.

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

The following patch will make sure that the BDM entry for a swap disk is created at all times (right now it would only be created if --swap was used iow if passed in as part of block_device_mapping_v2).

https://review.openstack.org/#/c/83516/

Most of the libvirt code knows how to handle both cases - but this is obviously not ideal. I will respin the patch and tag this bug - but we may need some code to handle instances booted befor the patch.

Michael Still (mikal)
Changed in nova:
importance: Critical → High
assignee: Michael Still (mikalstill) → Nikola Đipanov (ndipanov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/83516
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7f8128f87f5a2fa93c857295fb7e4163986eda25
Submitter: Jenkins
Branch: master

commit 7f8128f87f5a2fa93c857295fb7e4163986eda25
Author: Xavier Queralt <email address hidden>
Date: Thu Apr 10 13:59:47 2014 +0200

    Add the swap and ephemeral BDMs if needed

    After preparing the block device mapping for creating new instance(s),
    check if any ephemeral and/or a swap disks have been already defined. If
    that is not the case and the instance type requires them, create the
    needed ephemeral and/or swap BDMs.

    Closes-Bug: #1297325
    Related-Bug: #1457527
    Change-Id: I44b30625cf1023d20ebec5e38f46b7b8dab325f6

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

So the abovementioned fix has landed (LP won't pick up the tag unless it's in the commit message on the first posting of the patch)

so it would be good to know if it solves this at least for the newly created instances.

tags: added: imagecache libvirt volumes
Revision history for this message
Matthew Thode (prometheanfire) wrote :

can this be backported to liberty/kilo?

Revision history for this message
Bjoern (bjoern-t) wrote :

Hello can we get this back ported to kilo. This is a rather critical bug ?

Revision history for this message
Matt Riedemann (mriedem) wrote :

There was some controversy around this fix, as it broke the triple-o CI:

https://review.openstack.org/#/c/190622/

That was abandoned since triple-o worked around the issue themselves, but it does make me nervous about backporting a potentially breaking change into a stable branch.

Revision history for this message
Kevin Carter (kevin-carter) wrote :

This fix is in Liberty however it'd be great if we can see it in Kilo. If there's a way to get this through as part of stable maintenance in the kilo release cycle it'd be appreciated however if the risk is too great then I understand.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/kilo)

Related fix proposed to branch: stable/kilo
Review: https://review.openstack.org/269421

Revision history for this message
Kevin Carter (kevin-carter) wrote :

I've gone ahead and submitted the cherry-pick to see if we can get this fix though into stable/kilo If we can move some of this convo into the review it'd be great.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/269421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dd1c3c7561676e7479d4ed60609286ed40f32964
Submitter: Jenkins
Branch: stable/kilo

commit dd1c3c7561676e7479d4ed60609286ed40f32964
Author: Xavier Queralt <email address hidden>
Date: Thu Apr 10 13:59:47 2014 +0200

    Add the swap and ephemeral BDMs if needed

    After preparing the block device mapping for creating new instance(s),
    check if any ephemeral and/or a swap disks have been already defined. If
    that is not the case and the instance type requires them, create the
    needed ephemeral and/or swap BDMs.

    Closes-Bug: #1297325
    Related-Bug: #1457527
    Change-Id: I44b30625cf1023d20ebec5e38f46b7b8dab325f6
    (cherry picked from commit 7f8128f87f5a2fa93c857295fb7e4163986eda25)

tags: added: in-stable-kilo
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Nikola:
What needs to be done to resolve this bug completely?

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

mdbooth has this on his radar but isn't there yet. The imagecache implementation seems to have basic design issues. The different backends (like LVM and RBD) don't make this task easier.

Changed in nova:
status: In Progress → Confirmed
assignee: Nikola Đipanov (ndipanov) → nobody
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :
tags: added: needs-attention
tags: added: image-cache
removed: imagecache
Revision history for this message
gadLinux (gad-aguilardelgado) wrote :
Download full text (6.3 KiB)

I'm having this problem on Mitaka, Ubuntu 16.04.

2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher [req-e0f2a780-1749-4ac6-83a4-8690d152eebf 03cc38dac979425abb2b76763b922036 c010bd7c02174062bb1df9c9cc9e6c15 - - -] Exception during message handling: [Errno 2] No such file or directory: '/var/lib/nova/instances/60640ffe-6d33-4fd4-bc0b-1dc768378e68/disk.swap'
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher incoming.message))
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 185, in _dispatch
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 110, in wrapped
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher payload)
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher self.force_reraise()
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 89, in wrapped
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher return f(self, context, *args, **kw)
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 359, in decorated_function
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher LOG.warning(msg, e, instance=instance)
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher self.force_reraise()
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-05-17 16:10:15.882 18166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-package...

Read more...

Revision history for this message
Jay Pipes (jaypipes) wrote :

@mdbooth: Could you please update the status of this bug? I know you've gotten a number of things refactored about the image layer in libvirt, but what percent of this bug has been addressed, if any?

I'm going to set this to Incomplete until we get an answer from Matt.

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Sean Dague (sdague) wrote :

Automatically discovered version kilo in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.kilo
Revision history for this message
XiaojueGuan (xiaojuegaun) wrote :

is this still a bug?

Lee Yarwood (lyarwood)
Changed in nova:
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.