tripleo

Reaching disk space capacity for containers job

Bug #1694709 reported by Martin André on 2017-05-31

This bug affects 1 person

Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   High         Steve Baker   tripleo queens-3

Bug Description

We're reaching the limits of the disk space capacity for the containers job. When adding new images, we run out of disk space and the job fails:

InstanceDeployFailure: Failed to provision instance fe3ec1c7-64af-45be-976d-b8638eb684a4: Failed to deploy. Error: Disk volume where '/var/lib/ironic/master_images/tmpYx3Ya4' is located doesn't have enough disk space. Required 5603 MiB, only 5261 MiB available space present.
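
The failure above comes down to a free-space comparison: 5603 MiB required against 5261 MiB available, a shortfall of 342 MiB. As a minimal sketch (not Ironic's actual implementation; the helper names `shortfall_mib` and `free_space_mib` are hypothetical), the same check can be reproduced with the Python standard library:

```python
import shutil

MIB = 1024 * 1024


def shortfall_mib(required_mib, available_mib):
    """Return how many MiB are missing, or 0 if there is enough space."""
    return max(0, required_mib - available_mib)


def free_space_mib(path):
    """Free space, in whole MiB, on the volume that holds `path`."""
    return shutil.disk_usage(path).free // MIB


# Figures taken from the error message in this report.
print(shortfall_mib(5603, 5261))  # 342

# Hypothetical pre-flight check against the current directory.
print(shortfall_mib(5603, free_space_mib(".")))
```

Running a check like this before copying an image would report the gap early instead of failing in the middle of a deployment.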

This was seen in https://review.openstack.org/#/c/469401/

http://logs.openstack.org/01/469401/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/6a7fcfd/logs/oooq/undercloud/var/log/nova/nova-compute.log.txt.gz#_2017-05-31_12_20_17_498

Tags:

Martin André (mandre) on 2017-05-31

Changed in tripleo:
status:	New → Triaged
milestone:	none → pike-2

Dan Prince (dan-prince) wrote on 2017-05-31:

If we created an overcloud-containers.qcow too and stopped installing all the openstack-* packages I think it would save us a good chunk of space. This would supplant the use of overcloud-full.qcow...

OpenStack Infra (hudson-openstack) wrote on 2017-06-01: Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/469807

Dan Prince (dan-prince) wrote on 2017-06-01:

This should now be addressed by:

https://review.openstack.org/469917 Bump undercloud flavor to disk 41

Changed in tripleo:
assignee:	nobody → Dan Prince (dan-prince)

Bogdan Dobrelya (bogdando) wrote on 2017-06-02:

I noticed that 'openstack undercloud deploy' pulls the required docker images to the undercloud host, then 'openstack overcloud deploy' *doubles* the space consumed by the images by importing them into a private registry hosted on the undercloud as well.

Bogdan Dobrelya (bogdando) wrote on 2017-06-02:

So let's make the undercloud deploy first set up a registry, then pull images directly from their upstream source or a CI mirrored proxy, then consume them from that registry, as the overcloud deploy does.

Martin André (mandre) wrote on 2017-06-02:

That's a very good point, Bogdan! The images are present both in the docker cache and in the local docker registry. We should be able to cut the space they take in half by removing the images from the docker cache on successful upload. The only drawback is that they won't appear on the undercloud when doing a 'docker images', which can be a bit confusing.

Bogdan Dobrelya (bogdando) wrote on 2017-06-06:

Another thing worth researching is a de-duplication-enabled file system, like btrfs, for the undercloud containers' storage. I believe its experimental/not-very-stable state would suit this case fine, without the issues it would raise for production use cases. Those are just images, and if the FS breaks, they can simply be re-fetched from scratch.

Bogdan Dobrelya (bogdando) wrote on 2017-06-06:

I'm testing an undercloud with a loopback btrfs device used only for docker images, and overlay2 for the running images. In contrast to using the docker btrfs driver https://docs.docker.com/engine/userguide/storagedriver/btrfs-driver/ for everything, this setup should provide decent runtime performance for undercloud containers, at the cost of somewhat slower image fetching / pushing.

Bogdan Dobrelya (bogdando) wrote on 2017-06-07:

Btrfs looks like a no-go. I measured the space consumed by a docker load from tar followed by a re-tag and push to a local registry, with -g overlay2 (over the host's xfs) vs -g btrfs, and here are the results:

overlay2 backed by xfs, loaded docker images, custom path:
# du -sh /var/lib/docker-images
8.1G /var/lib/docker-images

the same, re-tagged and pushed to local registry:
# du -sh /var/lib/docker-registry
2.6G /var/lib/docker-registry
task time: 0:17:24.613

The measured task time stands for the quickstart-extras overcloud-prep-containers task 'Prepare for the containerized deployment', run against an undercloud VM with 54 tripleoupstream docker images stored in a tarball (docker saved).

COW btrfs on a loopback device file created over xfs, loaded docker images, custom path:
(du -sh showed wrong stats, so only the total stats are reported)
# btrfs fi usage /mnt/docker_images_btrfs | grep Used
Used: 16.71GiB
Data,single: Size:13.01GiB, Used:7.61GiB
Metadata,DUP: Size:5.12GiB, Used:4.55GiB
System,DUP: Size:8.00MiB, Used:16.00KiB

the same, re-tagged and pushed to local registry:
# du -sh /mnt/docker_images_btrfs/registry
2.6G /mnt/docker_images_btrfs/registry
(and total stats)
# btrfs fi usage /mnt/docker_images_btrfs | grep Used
Used: 19.47GiB
Data,single: Size:13.01GiB, Used:10.15GiB
Metadata,DUP: Size:5.12GiB, Used:4.66GiB
System,DUP: Size:8.00MiB, Used:16.00KiB
task time: 0:30:43.406

So btrfs COW seemingly saves no space for the docker registry populated from local images on the same data volume, and doubles the space consumed by the docker images themselves :/
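
For reference, the figures above can be cross-checked with a little arithmetic. The sketch below only re-uses the numbers reported in this comment; the variable and function names are made up for illustration. It assumes that the overall btrfs "Used" figure counts DUP metadata twice, which is an assumption about how the tool reports usage, although the reported numbers are consistent with it. The 16 KiB of System,DUP is ignored as negligible.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS.mmm' task time to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Figures copied from the measurements above (as printed by the tools).
overlay2_images = 8.1
overlay2_registry = 2.6

btrfs_before = {"data": 7.61, "metadata_dup": 4.55, "overall": 16.71}
btrfs_after = {"data": 10.15, "metadata_dup": 4.66, "overall": 19.47}


def raw_used(stats):
    """Data counted once plus DUP metadata counted twice (assumption)."""
    return stats["data"] + 2 * stats["metadata_dup"]


# Total space on overlay2: image cache plus registry.
print(round(overlay2_images + overlay2_registry, 2))
# Reconstructed overall btrfs usage before and after the push.
print(round(raw_used(btrfs_before), 2), round(raw_used(btrfs_after), 2))
# Data added on btrfs by pushing to the registry.
print(round(btrfs_after["data"] - btrfs_before["data"], 2))
# Slowdown of the btrfs run relative to the overlay2 run.
print(round(to_seconds("0:30:43.406") / to_seconds("0:17:24.613"), 2))
```

Under that assumption, the btrfs data chunks alone (7.61GiB) are no larger than the 8.1G seen on overlay2, and most of the extra usage comes from metadata. The push added about 2.54GiB of data, close to the 2.6G registry size, which matches the observation that no space was shared between the image store and the registry.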

Bogdan Dobrelya (bogdando) wrote on 2017-06-07:

#10

@Martin, my tests also showed that the overhead of a local registry is only 2.6G on top of the 8.1G of docker images stored in the cache. So removing the registry doesn't save much space, I'd say.

Bogdan Dobrelya (bogdando) wrote on 2017-06-07:

#11

Jistr has raised a good point: the way we rebuild and push images, reusing a common centos7 base (and a service-specific common base for derived service images, like nova->nova-api), may be the issue. We really should double-check that all of the pushed images always use these common base images and are properly layered.

Martin André (mandre) wrote on 2017-06-07:

#12

@Bogdan: the point is to delete the docker cache (so roughly 8GB according to your measurements) and keep only the registry that is going to be used for deploying the overcloud.

Also, it's possible that the image that Jirka pushed takes more space than necessary; we can fix that with a mass rebuild. I expect the automated image building in the RDO pipeline to help there.

Bogdan Dobrelya (bogdando) wrote on 2017-06-07:

#13

@Martin, the undercloud runs containers from those locally stored images, and AFAIK images can't be deleted while containers are running from them. I'm not sure we're on the same page wrt the docker image cache. Could you clarify?

OpenStack Infra (hudson-openstack) wrote on 2017-06-07:

#14

Related fix proposed to branch: master
Review: https://review.openstack.org/471832

Martin André (mandre) wrote on 2017-06-08:

#15

Bogdan, sorry if I wasn't clear in my previous comment. This workaround applies only to the overcloud container job (where we're seeing the disk space issue) and not to the containerized undercloud, where we obviously do not want to delete docker data.

OpenStack Infra (hudson-openstack) wrote on 2017-06-08: Related fix merged to tripleo-quickstart-extras (master)

#16

Reviewed: https://review.openstack.org/471832
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=70b1203110d9d8739fb16766e506d729ebb9fa14
Submitter: Jenkins
Branch: master

commit 70b1203110d9d8739fb16766e506d729ebb9fa14
Author: Martin André <email address hidden>
Date: Wed Jun 7 17:49:33 2017 +0200

Allow deleting all of docker cache

    This is mostly a workaround for limited disk space in CI and should
    never be used otherwise as it wipes all of the docker data.

    Change-Id: I0b69087b78dc974e6ffcb3674c88c435ec569988
    Related-Bug: #1694709

Emilien Macchi (emilienm) on 2017-06-08

Changed in tripleo:
milestone:	pike-2 → pike-3

Dan Prince (dan-prince) wrote on 2017-06-12:

#17

Bogdan: regarding your previous comment about 'undercloud deploy' consuming extra space: I don't think the CI job for containers actually uses this type of undercloud yet, does it? It was my understanding that we still use instack-undercloud for the overcloud CI job.

Emilien Macchi (emilienm) on 2017-07-05

Changed in tripleo:
status:	Triaged → In Progress

Emilien Macchi (emilienm) on 2017-07-30

Changed in tripleo:
milestone:	pike-3 → pike-rc1

OpenStack Infra (hudson-openstack) wrote on 2017-08-17: Change abandoned on tripleo-quickstart-extras (master)

#18

Change abandoned by Martin André (<email address hidden>) on branch: master
Review: https://review.openstack.org/469807

Emilien Macchi (emilienm) on 2017-08-25

Changed in tripleo:
milestone:	pike-rc1 → pike-rc2

Emilien Macchi (emilienm) on 2017-09-05

Changed in tripleo:
milestone:	pike-rc2 → queens-1

Emilien Macchi (emilienm) on 2017-10-23

Changed in tripleo:
milestone:	queens-1 → queens-2

OpenStack Infra (hudson-openstack) wrote on 2017-10-26: Related fix proposed to tripleo-common (master)

#19

Related fix proposed to branch: master
Review: https://review.openstack.org/515252

Alex Schultz (alex-schultz) on 2017-12-05

Changed in tripleo:
milestone:	queens-2 → queens-3

OpenStack Infra (hudson-openstack) on 2017-12-06

Changed in tripleo:
assignee:	Dan Prince (dan-prince) → Steve Baker (steve-stevebaker)

Bogdan Dobrelya (bogdando) on 2017-12-06

tags:	added: pike-backport-potential

OpenStack Infra (hudson-openstack) wrote on 2017-12-06: Fix merged to tripleo-common (master)

#20

Reviewed: https://review.openstack.org/519201
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=7e0564106e7b01f6516c22b95082b04bc0c8fc63
Submitter: Zuul
Branch: master

commit 7e0564106e7b01f6516c22b95082b04bc0c8fc63
Author: Steve Baker <email address hidden>
Date: Mon Nov 13 10:06:58 2017 +1300

Implement post-upload cleanup of docker images

    The docker uploader will leave a copy of all uploaded images in the
    local docker storage. This change will track those images and delete
    them from the local docker after all uploads are complete.

    If a local image is in use (for example, an already deployed
    containerised undercloud) then the delete will fail. In this case,
    only a warning is logged.

    Change-Id: Ic0424638b9ddbf77e10cfe936d0b96ff2da1a59e
    Closes-Bug: #1708965
    Closes-Bug: #1694709
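
The behaviour described in this commit message (track the uploaded images, delete them from local docker storage once all uploads are complete, and only warn when a delete fails) can be sketched roughly as follows. This is not the tripleo-common code; `cleanup_uploaded` and `docker_rmi` are hypothetical names, and the sketch assumes the `docker` CLI is available and simply shells out to `docker rmi`.

```python
import logging
import subprocess

LOG = logging.getLogger(__name__)


def docker_rmi(image):
    """Remove one image from local docker storage via the docker CLI."""
    subprocess.run(["docker", "rmi", image], check=True,
                   stdout=subprocess.PIPE, stderr=subprocess.PIPE)


def cleanup_uploaded(images, remove=docker_rmi):
    """Delete each uploaded image locally; warn instead of failing.

    Returns the list of images that could not be removed.
    """
    kept = []
    for image in sorted(set(images)):
        try:
            remove(image)
        except (subprocess.CalledProcessError, OSError) as exc:
            # For example, the image is still used by a running container.
            LOG.warning("Unable to remove local image %s: %s", image, exc)
            kept.append(image)
    return kept
```

Passing the removal function as a parameter keeps the sketch testable without a docker daemon, and returning the images that were kept makes it easy to see which ones were still in use.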

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2017-12-29: Fix included in openstack/tripleo-common 8.3.0

#21

This issue was fixed in the openstack/tripleo-common 8.3.0 release.

OpenStack Infra (hudson-openstack) wrote on 2018-03-02: Change abandoned on tripleo-common (master)

#22

Change abandoned by Steve Baker (<email address hidden>) on branch: master
Review: https://review.openstack.org/515252
