Reaching disk space capacity for containers job

Bug #1694709 reported by Martin André
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Steve Baker

Bug Description

We're reaching the disk space limit for the containers job. When new images are added, we run out of disk space and the job fails:

InstanceDeployFailure: Failed to provision instance fe3ec1c7-64af-45be-976d-b8638eb684a4: Failed to deploy. Error: Disk volume where '/var/lib/ironic/master_images/tmpYx3Ya4' is located doesn't have enough disk space. Required 5603 MiB, only 5261 MiB available space present.
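
For illustration, the failure above comes down to a free-space comparison on the volume that holds the image cache. A minimal Python sketch of such a check follows. It is not Ironic's actual code, and the function names are hypothetical.

```python
import shutil

MIB = 1024 * 1024


def shortfall_mib(required_mib, available_mib):
    """Return how many MiB are missing (0 if there is enough space)."""
    return max(0, required_mib - available_mib)


def has_enough_space(path, required_mib):
    """Check whether the volume holding `path` has `required_mib` free."""
    available_mib = shutil.disk_usage(path).free // MIB
    return shortfall_mib(required_mib, available_mib) == 0


# Numbers taken from the error message above.
print(shortfall_mib(5603, 5261))  # 342
```

With the figures from the log, the job was 342 MiB short, so each newly added image can push it over the limit.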

This was seen in https://review.openstack.org/#/c/469401/

http://logs.openstack.org/01/469401/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/6a7fcfd/logs/oooq/undercloud/var/log/nova/nova-compute.log.txt.gz#_2017-05-31_12_20_17_498

Martin André (mandre)
Changed in tripleo:
status: New → Triaged
milestone: none → pike-2
Dan Prince (dan-prince) wrote :

If we created an overcloud-containers.qcow too and stopped installing all the openstack-* packages I think it would save us a good chunk of space. This would supplant the use of overcloud-full.qcow...

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/469807

Dan Prince (dan-prince) wrote :

This should now be addressed by:

https://review.openstack.org/469917 Bump undercloud flavor to disk 41

Changed in tripleo:
assignee: nobody → Dan Prince (dan-prince)
Bogdan Dobrelya (bogdando) wrote :

I noticed that 'openstack undercloud deploy' pulls the required docker images to the undercloud host, then 'openstack overcloud deploy' *doubles* the space consumed by the images by also importing them into a private registry hosted on the undercloud.

Bogdan Dobrelya (bogdando) wrote :

So let's make the undercloud deploy first set up a registry, then pull images directly from their upstream source or a CI mirrored proxy, and then consume them from that registry, as the overcloud deploy does.

Martin André (mandre) wrote :

That's a very good point Bogdan! The images are present both in the docker cache and in the local docker registry. We should be able to cut the size they take in half by removing the images from the docker cache on successful upload. The only drawback is that they won't appear on the undercloud when doing a 'docker images' which can be a bit confusing.

Bogdan Dobrelya (bogdando) wrote :

Another thing worth researching is a de-duplication enabled file system, like btrfs?, for the undercloud containers' storage. I believe an experimental/not-very-stable state of things would suit this case perfectly well, without causing issues for production use cases. Those are just images, and if the FS breaks, they can simply be re-fetched from scratch.

Bogdan Dobrelya (bogdando) wrote :

I'm testing an undercloud with a loopback btrfs device used only for docker images, and overlay2 for the running images. Contrary to using the docker btrfs driver https://docs.docker.com/engine/userguide/storagedriver/btrfs-driver/ for everything, this setup should provide decent runtime performance for undercloud containers, at the cost of somewhat slower image fetching / pushing.

Bogdan Dobrelya (bogdando) wrote :

Btrfs looks like a no-go. I tested the space consumed by a docker load from tar followed by a re-tag and push to a local registry, with -g overlay2 (over the host's xfs) vs -g btrfs, and here are the results:

overlay2 backed by xfs, loaded docker images, custom path:
 # du -sh /var/lib/docker-images
 8.1G /var/lib/docker-images

the same, re-tagged and pushed to local registry:
 # du -sh /var/lib/docker-registry
 2.6G /var/lib/docker-registry
task time: 0:17:24.613

The measured task time stands for the quickstart-extras overcloud-prep-containers 'Prepare for the containerized deployment' task run against an undercloud VM, with 54 tripleoupstream docker images stored in a tarball (docker saved).

COW btrfs on a loopback device file created over xfs, loaded docker images, custom path:
(du -sh showed wrong stats, so bringing in only total stats)
 # btrfs fi usage /mnt/docker_images_btrfs | grep Used
  Used: 16.71GiB
 Data,single: Size:13.01GiB, Used:7.61GiB
 Metadata,DUP: Size:5.12GiB, Used:4.55GiB
 System,DUP: Size:8.00MiB, Used:16.00KiB

the same, re-tagged and pushed to local registry:
 # du -sh /mnt/docker_images_btrfs/registry
 2.6G /mnt/docker_images_btrfs/registry
(and total stats)
 # btrfs fi usage /mnt/docker_images_btrfs | grep Used
  Used: 19.47GiB
 Data,single: Size:13.01GiB, Used:10.15GiB
 Metadata,DUP: Size:5.12GiB, Used:4.66GiB
 System,DUP: Size:8.00MiB, Used:16.00KiB
task time: 0:30:43.406

So btrfs COW seemingly saves 0 space for the docker registry populated from local images on the same data volume, and doubles the space consumed by the docker images themselves :/
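
To make the comparison explicit, here is a short Python sketch that only redoes the arithmetic on the measurements above. The variable names are made up for illustration, and the du "G" figures are treated as GiB.

```python
from datetime import timedelta

# Figures copied from the measurements above (GiB).
overlay2_images = 8.1
overlay2_registry = 2.6
btrfs_used_after_load = 16.71
btrfs_used_after_push = 19.47
btrfs_data_used = 7.61
btrfs_metadata_used = 4.55  # Metadata,DUP: stored twice on disk

overlay2_total = overlay2_images + overlay2_registry
btrfs_registry_growth = btrfs_used_after_push - btrfs_used_after_load
btrfs_vs_overlay2 = btrfs_used_after_load / overlay2_images

# The btrfs "Used" total is data plus two copies of the metadata.
btrfs_used_check = btrfs_data_used + 2 * btrfs_metadata_used

overlay2_time = timedelta(minutes=17, seconds=24.613)
btrfs_time = timedelta(minutes=30, seconds=43.406)
slowdown = btrfs_time / overlay2_time

print(f"overlay2 images + registry: {overlay2_total:.2f} GiB")
print(f"btrfs growth after push:    {btrfs_registry_growth:.2f} GiB")
print(f"btrfs vs overlay2 (images): {btrfs_vs_overlay2:.2f}x")
print(f"btrfs data + 2x metadata:   {btrfs_used_check:.2f} GiB")
print(f"task time slowdown:         {slowdown:.2f}x")
```

The growth after the push (about 2.76 GiB) is in line with the 2.6G registry size, which is why COW appears to save nothing here. The numbers also suggest that most of the extra space after loading the images is duplicated metadata rather than image data.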

Bogdan Dobrelya (bogdando) wrote :

@Martin, my tests also showed that the overhead of a local registry is only 2.6G, consumed in addition to the 8.1G of docker images stored in the cache. So removing the registry doesn't save much space, I'd say.

Bogdan Dobrelya (bogdando) wrote :

Jistr has raised a good point: the way we rebuild and push images, reusing a common centos7 base (and a service-specific common base for derived service images, like nova->nova-api), may be the issue. We really should double-check that all of the pushed images always use these common base images and are properly layered.
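
One way to double-check the layering could look like the Python sketch below. It assumes the ordered layer digests of each image have already been collected (for example from the RootFS.Layers field reported by 'docker image inspect'). The function names and the digests are hypothetical.

```python
def shares_base_layers(base_layers, image_layers):
    """True if image_layers starts with every layer of base_layers, in order."""
    return image_layers[:len(base_layers)] == base_layers


def images_not_on_base(base_layers, images):
    """Return names of images whose layers do not start with the base layers."""
    return sorted(
        name for name, layers in images.items()
        if not shares_base_layers(base_layers, layers)
    )


# Hypothetical layer digests, shortened for readability.
centos_base = ["sha256:aaa"]
images = {
    "nova-base": ["sha256:aaa", "sha256:bbb"],
    "nova-api": ["sha256:aaa", "sha256:bbb", "sha256:ccc"],
    "rebuilt-elsewhere": ["sha256:zzz", "sha256:bbb"],
}
print(images_not_on_base(centos_base, images))  # ['rebuilt-elsewhere']
```

Any image reported by such a check would store its own copy of the base layers instead of sharing them, which inflates both the docker cache and the registry.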

Martin André (mandre) wrote :

@Bogdan: the point is to delete the docker cache (so roughly 8GB according to your measurements) and keep only the registry that is going to be used for deploying the overcloud.

Also, it's possible that the image that Jirka pushed takes more space than necessary; we can fix that with a mass rebuild. I expect the automated image building in the RDO pipeline to help there.

Bogdan Dobrelya (bogdando) wrote :

@Martin, the undercloud runs containers from those locally stored images. AFAIK, images can't be deleted while containers are running from them. I'm not sure we're on the same page wrt the images cache for docker. Could you clarify?

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/471832

Martin André (mandre) wrote :

Bogdan, sorry if I wasn't clear in my previous comment. This workaround applies only to the overcloud container job (where we're seeing the disk space issue) and not to the containerized undercloud, where we obviously do not want to delete docker data.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/471832
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=70b1203110d9d8739fb16766e506d729ebb9fa14
Submitter: Jenkins
Branch: master

commit 70b1203110d9d8739fb16766e506d729ebb9fa14
Author: Martin André <email address hidden>
Date: Wed Jun 7 17:49:33 2017 +0200

    Allow deleting all of docker cache

    This is mostly a workaround for limited disk space in CI and should
    never be used otherwise as it wipes all of the docker data.

    Change-Id: I0b69087b78dc974e6ffcb3674c88c435ec569988
    Related-Bug: #1694709

Changed in tripleo:
milestone: pike-2 → pike-3
Dan Prince (dan-prince) wrote :

Bogdan: regarding your previous comment about 'undercloud deploy' consuming extra space: I don't think the CI job for containers actually uses this type of undercloud yet, does it? It was my understanding that we still use instack-undercloud for the overcloud CI job.

Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
milestone: pike-3 → pike-rc1
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by Martin André (<email address hidden>) on branch: master
Review: https://review.openstack.org/469807

Changed in tripleo:
milestone: pike-rc1 → pike-rc2
Changed in tripleo:
milestone: pike-rc2 → queens-1
Changed in tripleo:
milestone: queens-1 → queens-2
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/515252

Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
assignee: Dan Prince (dan-prince) → Steve Baker (steve-stevebaker)
tags: added: pike-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/519201
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=7e0564106e7b01f6516c22b95082b04bc0c8fc63
Submitter: Zuul
Branch: master

commit 7e0564106e7b01f6516c22b95082b04bc0c8fc63
Author: Steve Baker <email address hidden>
Date: Mon Nov 13 10:06:58 2017 +1300

    Implement post-upload cleanup of docker images

    The docker uploader will leave a copy of all uploaded images in the
    local docker storage. This change will track those images and delete
    them from the local docker after all uploads are complete.

    If a local image is in use (for example, an already deployed
    containerised undercloud) then the delete will fail. In this case,
    only a warning is logged.

    Change-Id: Ic0424638b9ddbf77e10cfe936d0b96ff2da1a59e
    Closes-Bug: #1708965
    Closes-Bug: #1694709
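
The behaviour described in the commit message above can be sketched roughly as follows. This is a minimal illustration and not the tripleo-common implementation: the function name, the injected remove_image callable and the image names are all hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)


def cleanup_uploaded_images(uploaded_images, remove_image):
    """Try to delete each uploaded image from local storage.

    remove_image is a caller-supplied callable (hypothetical) that raises
    an exception when an image cannot be removed, e.g. because it is in use.
    Returns the list of images that could not be removed.
    """
    kept = []
    for image in sorted(set(uploaded_images)):
        try:
            remove_image(image)
        except Exception as exc:
            # An image in use is not fatal: log a warning and move on.
            LOG.warning("Could not remove local image %s: %s", image, exc)
            kept.append(image)
    return kept


def fake_remove(image):
    # Stand-in for a real removal call; pretends one image is in use.
    if image == "example/in-use-image":
        raise RuntimeError("image is being used by a running container")


kept = cleanup_uploaded_images(
    ["example/nova-api", "example/in-use-image", "example/nova-api"],
    fake_remove,
)
print(kept)  # ['example/in-use-image']
```

Running the cleanup only after all uploads are complete, and treating a failed delete as a warning, keeps the overcloud job within its disk budget without breaking a containerised undercloud that still runs from the local images.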

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.3.0

This issue was fixed in the openstack/tripleo-common 8.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Steve Baker (<email address hidden>) on branch: master
Review: https://review.openstack.org/515252
