Unable to use qcow2 disk image larger than system memory

Bug #1661328 reported by Félix Bouliane
This bug affects 1 person
Affects: ironic-python-agent
Status: Won't Fix
Importance: Wishlist
Assigned to: Jonathan Provost
Milestone: (none)

Bug Description

Background:
 * Observed when using the agent driver
 * Node with 4GB RAM
 * 10GB qcow2 image
 * CoreOS ramdisk

Steps to reproduce:
    * Set up a node in Ironic
    * Deploy the node with a large qcow2 image

Actual behavior:
    The machine becomes unresponsive and crashes. Deployment fails with little indication of which step failed.

Expected behavior:
    Large qcow2 images are correctly imaged into the machine
    *OR* I quickly get an error message that this image cannot be installed on the machine because it is too big.

Analysis:
Any image to be deployed (except RAW images when using streaming=True) is downloaded to a temporary location, /tmp. The image is then checksummed and passed to write_image.sh, which calls qemu-img to write it to disk.

Because /tmp is backed by the root tmpfs, an image larger than the available space fills up the memory and the server crashes.
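
The "quickly get an error message" behavior requested above could be approximated with a free-space check before the download starts. The following is only a minimal sketch, not the agent's actual code: the function name, the exception class, and the reserve_bytes parameter are hypothetical, and on a tmpfs the reported free space reflects the tmpfs limit rather than the RAM that is truly available.

```python
import shutil
import tempfile


class ImageTooLargeError(Exception):
    """Hypothetical error: the image cannot fit in the download directory."""


def ensure_image_fits(image_size_bytes, download_dir=None, reserve_bytes=0):
    """Fail fast when download_dir lacks room for the image.

    Returns the number of bytes that would remain free after the download.
    All names here are illustrative assumptions, not part of the agent.
    """
    download_dir = download_dir or tempfile.gettempdir()
    free_bytes = shutil.disk_usage(download_dir).free
    needed_bytes = image_size_bytes + reserve_bytes
    if needed_bytes > free_bytes:
        raise ImageTooLargeError(
            "image needs %d bytes but only %d are free in %s"
            % (needed_bytes, free_bytes, download_dir)
        )
    return free_bytes - needed_bytes
```

Such a check only works if the image size is known before the download begins, for example from image metadata.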

Tags: rfe needs-spec
information type: Public → Public Security
information type: Public Security → Public
affects: ironic → ironic-python-agent
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-python-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/428410

Changed in ironic-python-agent:
assignee: nobody → Félix Bouliane (fbouliane)
status: New → In Progress
Jay Faulkner (jason-oldos) wrote :

This is more of a known limitation than a bug; however, you can use raw images, which are streamed to disk.

What I don't understand is how #428410 relates to this bug. Is there some larger plan in mind?

Mathieu Mitchell (mat128) wrote :

Jay: Yes, we do have a workaround we are trying to implement. The first step is detecting that there is not enough space for the file.

Félix's first commit is a workaround that will allow us to find out the size of the image. Our next step is doing a fallocate with the file size when writing to the file. This should raise an error when the disk is not big enough.
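
The fallocate step described above can be sketched as follows. This is an illustration under assumptions, not the proposed patch: the preallocate helper is hypothetical, and it relies on os.posix_fallocate, which is available on Unix and raises OSError (typically ENOSPC) when the space cannot be reserved.

```python
import os


def preallocate(path, size_bytes):
    """Reserve size_bytes for the file at path before any data is written.

    Hypothetical helper: raises OSError if the filesystem cannot provide
    the requested space, so the failure happens before the download.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        # posix_fallocate rejects a zero length, so skip empty requests.
        if size_bytes > 0:
            os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)
```

Calling preallocate with the known image size before opening the download stream would turn a mid-download crash into an immediate, catchable error.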

We thought about the different solutions and found the following:
* Feed the image URL directly to qemu-img. The tool will issue "Range" requests to the HTTP server. This would not allow checksum verification, which we consider a regression.

* Use RAW images with streaming (we have a single set of qcow2 images for both our virtual and metal clouds, which we would rather not duplicate)

* Use Glare to store multiple image types and integrate Ironic to request the most appropriate image type

* Use iSCSI drivers to stream any kind of image to disk instead of using node memory (our conductors have restricted access to the nodes, both in terms of network access and bandwidth; pushing image data through the management network would be a step backward for us)

We settled on the following solution:
* Download the image to a loop device backed by the actual disk drive (with an offset). Keep the existing code and benefit from the checksumming already present.

We wrote a POC of that solution and it works wonderfully. Despite an additional write+read on disk, it takes about the same time as normal deployment. This is probably due to hard disk caching.
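
The loop-device approach above might look roughly like the sketch below. The thread does not say how the POC chose its offset, so everything here is an assumption: the helper names are hypothetical, the scratch area is placed at the end of the disk, and the offset is aligned down to 1 MiB. The only external tool referenced is losetup with its --find, --show and --offset options; the command is built but not executed.

```python
MIB = 1024 * 1024


def scratch_offset(disk_size_bytes, image_file_bytes, virtual_size_bytes):
    """Pick a byte offset that places the downloaded file at the end of the disk.

    Assumption: the converted image is written from the start of the disk,
    so the scratch area must begin after virtual_size_bytes to avoid being
    overwritten while qemu-img is still reading from it.
    """
    offset = (disk_size_bytes - image_file_bytes) // MIB * MIB
    if offset < virtual_size_bytes:
        raise ValueError("disk too small to hold image and scratch copy")
    return offset


def losetup_command(disk, offset_bytes):
    """Build (but do not run) the losetup call that exposes the scratch area."""
    return ["losetup", "--find", "--show", "--offset", str(offset_bytes), disk]
```

For a 20 GiB disk, a 3 GiB qcow2 file and a 10 GiB virtual size, scratch_offset returns 17 GiB, and the resulting loop device could be used as the download target in place of /tmp.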

Changed in ironic-python-agent:
assignee: Félix Bouliane (fbouliane) → Jonathan Provost (jprovost-sh)
Changed in ironic-python-agent:
importance: Undecided → Wishlist
tags: added: rfe
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic-python-agent (master)

Change abandoned by Félix Bouliane on branch: master
Review: https://review.openstack.org/428410
Reason: this will be addressed in change Ibf742198e83ae13f90767b28cc1858f0a17c3a95

Jonathan Provost (jprovost-sh) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/430442

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-python-agent (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/444525

Jay Faulkner (jason-oldos) wrote :

Can we get a spec up for this? Thanks!

tags: added: needs-spec
Hao Li (lihaosz) wrote :

Can we enhance the node-validate function in Ironic to check whether the node's RAM is larger than the user image size?

Félix Bouliane (fbouliane) wrote :

There seems to be a check in Ironic [1] that validates that the memory is larger than the image_size + reserved space. For our Ironic installation, we used patch [2] and additionally needed a patch to disable the image size check.

I will try to write a spec to propose the improvement properly.

[1] https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/agent.py#L98
[2] https://review.openstack.org/#/c/430442/
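
The kind of validation described in this comment can be illustrated with a small sketch. It is not the code behind [1]: the function name, the parameters, and the strict comparison are assumptions that simply follow the wording "memory is larger than the image_size + reserved space".

```python
def image_fits_in_memory(memory_mb, image_mb, reserved_mb):
    """Return True when node memory exceeds the image size plus a reserve.

    Hypothetical helper; reserved_mb stands in for whatever space the
    deployment ramdisk itself needs.
    """
    return memory_mb > image_mb + reserved_mb
```

With the node from this report (4 GB of RAM, a 10 GB image) the check fails for any reserve, which is why the loop-device workaround also required disabling the image size check.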

OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic-python-agent (stable/newton)

Change abandoned by Dmitry Tantsur on branch: stable/newton
Review: https://review.openstack.org/444525
Reason: Abandoning per comment above (and because stable/newton goes EOL soon)

Julia Kreger (juliaashleykreger) wrote :

It appears that this is no longer being worked on; resetting the bug status and abandoning the change, which has been sitting outstanding without reply for nearly a year. :(

Changed in ironic-python-agent:
status: In Progress → Triaged
Jay Faulkner (jason-oldos) wrote :

This feature would be difficult to implement under modern security limitations, especially given you can just use a raw image. We're going to say wontfix as it's not a project priority and it's very complex for very little benefit.

Changed in ironic-python-agent:
status: Triaged → Won't Fix