Download images fails when there are many images

Bug #2051005 reported by Boovan Rajendran
This bug affects 1 person
Affects:      StarlingX
Status:       Fix Released
Importance:   Medium
Assigned to:  Boovan Rajendran

Bug Description

Brief Description:

This can happen any time the download step is used, but it is most likely during a restore, when a large number of images are downloaded.

While performing an optimized restore, all images are re-downloaded.

During this download phase the containerd cache is not cleared, so the cache can fill up before all the images have been downloaded, which causes the download task to fail.
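
A quick way to watch this happen during the download step (a sketch only; it assumes the same mount layout shown in the df/du output below, and the watch interval is arbitrary):

# observe docker-lv usage and the containerd content/snapshotter caches growing
sudo watch -n 30 'df -h /var/lib/docker; du -sh /var/lib/docker/io.containerd.content.v1.content /var/lib/docker/io.containerd.snapshotter.v1.overlayfs'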

Severity:

Critical

Steps to Reproduce:

Deploy a system
Increase the size of the docker-distribution lv so it is larger than the docker lv
e.g. system controllerfs-modify docker-distribution=40
Push images to registry.local until it holds much more than the docker lv size (a pull/tag/push sketch follows this list)
Backup
Optimized restore
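
One way to grow registry.local beyond the docker lv size is to repeatedly pull, tag and push public images, mirroring the example in the fix's commit message further down; busybox is only an illustrative image name:

docker login registry.local:9001 -u admin
docker image pull busybox
docker tag busybox:latest registry.local:9001/docker.io/busybox:latest
docker push registry.local:9001/docker.io/busybox:latest
# repeat with enough (larger) images until docker-distribution holds well more than the docker lv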

Expected Behavior:

Restore works

Actual Behavior:

Restore fails because there is not enough space to download the images

Reproducibility:
100%

System Configuration

AIO-SX and system controllers

Last Pass

N/A

Ansible:

TASK [common/push-docker-images : Download images and push to local registry] **************************************************
Tuesday 12 December 2023  21:05:49 +0000 (0:00:00.064)       0:12:38.552 ******
FAILED - RETRYING: Download images and push to local registry (10 retries left).
FAILED - RETRYING: Download images and push to local registry (9 retries left).
FAILED - RETRYING: Download images and push to local registry (8 retries left).
FAILED - RETRYING: Download images and push to local registry (7 retries left).

Containerd.log:

2023-12-12T21:18:02.294 localhost containerd[14351]: info time="2023-12-12T21:18:02.294432122Z" level=error msg="PullImage \"registry.local:9001/quay.io/calico/pod2daemon-flexvol:v3.22.2\" failed" error="failed to pull and unpack image \"registry.local:9001/quay.io/calico/pod2daemon-flexvol:v3.22.2\": mkdir /var/lib/docker/io.containerd.content.v1.content/ingest/c69b7b05ef3eff9307747b128f647dbc9c2a6c9fb0e97ec94d11b2a2ae3e9679: no space left on device"
2023-12-12T21:18:02.384 localhost containerd[14351]: info time="2023-12-12T21:18:02.383925925Z" level=info msg="PullImage \"registry.local:9001/k8s.gcr.io/sig-storage/csi-attacher:v3.4.0\""
2023-12-12T21:18:02.401 localhost containerd[14351]: info time="2023-12-12T21:18:02.400809952Z" level=error msg="PullImage \"registry.local:9001/k8s.gcr.io/sig-storage/csi-attacher:v3.4.0\" failed" error="failed to pull and unpack image \"registry.local:9001/k8s.gcr.io/sig-storage/csi-attacher:v3.4.0\": mkdir /var/lib/docker/io.containerd.content.v1.content/ingest/3c877f43151daa56b7426eb413edd5ed002b7f094d4616b0061458137c45b94a: no space left on device"
2023-12-12T21:18:02.483 localhost containerd[14351]: info time="2023-12-12T21:18:02.483030414Z" level=info msg="PullImage \"registry.local:9001/docker.io/wind-river/cloud-platform-deployment-manager:WRCP_21.12-wrs.4\""
2023-12-12T21:18:02.500 localhost containerd[14351]: info time="2023-12-12T21:18:02.499997930Z" level=error msg="PullImage \"registry.local:9001/docker.io/wind-river/cloud-platform-deployment-manager:WRCP_21.12-wrs.4\" failed" error="failed to pull and unpack image \"registry.local:9001/docker.io/wind-river/cloud-platform-deployment-manager:WRCP_21.12-wrs.4\": mkdir /var/lib/docker/io.containerd.content.v1.content/ingest/021797cd63eabca90739f10da97adbfd0472bc8562bbe2e166bc55664a6f6848: no space left on device"
2023-12-12T21:18:02.594 localhost containerd[14351]: info time="2023-12-12T21:18:02.593296317Z" level=info msg="PullImage \"registry.local:9001/quay.io/jetstack/cert-manager-acmesolver:v1.13.1\""
2023-12-12T21:18:02.611 localhost containerd[14351]: info time="2023-12-12T21:18:02.611366921Z" level=error msg="PullImage \"registry.local:9001/quay.io/jetstack/cert-manager-acmesolver:v1.13.1\" failed" error="failed to pull and unpack image \"registry.local:9001/quay.io/jetstack/cert-manager-acmesolver:v1.13.1\": mkdir /var/lib/docker/io.containerd.content.v1.content/ingest/5927bebbc37bbc112a47c1e1904f4d6b01462998cc0e3a6032f143742f500128: no space left on device"

df -h

sysadmin@controller-0:~$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
none                              7.6G     0  7.6G   0% /dev
tmpfs                             7.7G  3.9M  7.7G   1% /run
/dev/mapper/cgts--vg-root--lv      20G  6.3G   13G  34% /sysroot
/dev/sda4                         2.0G  205M  1.6G  12% /boot
tmpfs                             7.7G  312K  7.7G   1% /dev/shm
tmpfs                             5.0M     0  5.0M   0% /run/lock
tmpfs                             4.0M     0  4.0M   0% /sys/fs/cgroup
tmpfs                             1.0G  196K  1.0G   1% /tmp
/dev/mapper/cgts--vg-var--lv       20G  5.8G   13G  32% /var
/dev/sda3                         300M   14M  287M   5% /boot/efi
/dev/mapper/cgts--vg-log--lv      7.6G  3.4M  7.2G   1% /var/log
/dev/sda2                          29G   26G  2.2G  93% /var/rootdirs/opt/platform-backup
/dev/mapper/cgts--vg-docker--lv    30G   30G  176K 100% /var/lib/docker
/dev/mapper/cgts--vg-scratch--lv   32G   28K   30G   1% /var/rootdirs/scratch
/dev/mapper/cgts--vg-backup--lv    25G   24K   24G   1% /var/rootdirs/opt/backups
/dev/mapper/cgts--vg-kubelet--lv  9.8G   24K  9.3G   1% /var/lib/kubelet
/dev/drbd0                         20G  126M   19G   1% /var/lib/postgresql
/dev/drbd1                        2.0G  384M  1.5G  21% /var/lib/rabbitmq
/dev/drbd2                        9.8G  2.0M  9.3G   1% /var/rootdirs/opt/platform
/dev/drbd5                        990M   24K  923M   1% /var/rootdirs/opt/extension
/dev/drbd7                        4.9G   28K  4.6G   1% /var/rootdirs/opt/etcd
/dev/drbd8                         40G   17G   21G  45% /var/lib/docker-distribution

sudo du -hd1 /var/lib/docker/

sysadmin@controller-0:~$ sudo du -hd1 /var/lib/docker/
24K     /var/lib/docker/containerd
0       /var/lib/docker/containers
0       /var/lib/docker/plugins
0       /var/lib/docker/overlay2
4.0K    /var/lib/docker/image
24K     /var/lib/docker/volumes
0       /var/lib/docker/trust
28K     /var/lib/docker/network
0       /var/lib/docker/swarm
72K     /var/lib/docker/buildkit
0       /var/lib/docker/tmp
0       /var/lib/docker/runtimes
0       /var/lib/docker/tmpmounts
7.7G    /var/lib/docker/io.containerd.content.v1.content
0       /var/lib/docker/io.containerd.snapshotter.v1.btrfs
0       /var/lib/docker/io.containerd.snapshotter.v1.native
22G     /var/lib/docker/io.containerd.snapshotter.v1.overlayfs
7.3M    /var/lib/docker/io.containerd.metadata.v1.bolt
0       /var/lib/docker/io.containerd.runtime.v1.linux
0       /var/lib/docker/io.containerd.runtime.v2.task
30G     /var/lib/docker/

Alarms

N/A

Test Activity:

Developer Testing

Workaround:

While Ansible is running the "Download images and push to local registry" step, execute the following so that the docker-lv does not become full:

sudo bash -c -- 'while true; do sleep 60 && crictl rmi --prune; done'
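
To confirm the loop is keeping the cache bounded, the docker-lv can be checked from a second shell while the task runs (same filesystem as in the df output above):

df -h /var/lib/docker
sudo crictl images | wc -l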

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/906304
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/d81436d34eb867a16788b66393bb3783b478e581
Submitter: "Zuul (22348)"
Branch: master

commit d81436d34eb867a16788b66393bb3783b478e581
Author: Boovan Rajendran <email address hidden>
Date: Mon Jan 22 11:43:03 2024 -0500

    Exclude unwanted cached images to download during optimized restore

    During the optimized restore operation, the download images step
    fails because there is not enough space in the crictl image cache
    to store all of the images.

    During the backup phase of an optimized B&R, a list of the crictl
    cached images is saved as part of the backup. During restore, only
    the images present in that cached image list are downloaded; all
    other images are excluded.

    We still need the list of k8s control plane images to cover the
    scenarios below:
     - crictl_image_cache_list is not present in the backup file during restore.
     - The crictl image cache was cleared before backup.

    * Push an image to registry.local before backup in a way that does not
      add it to cache.
      ```
      docker login registry.local:9001 -u admin
      docker image pull busybox
      docker tag busybox:latest registry.local:9001/docker.io/busybox:latest
      docker push registry.local:9001/docker.io/busybox:latest
      ```
    * Check the pushed image is not present in crictl image cache
      after optimized restore.
      ```
      crictl images
      ```
    * Check the pushed image is present in registry.local
      after optimized restore.
      ```
      source /etc/platform/openrc
      system registry-image-list
      ```
    Test plan:

    PASS: Perform optimized B&R on AIO-SX, verify unwanted cached images
    deleted successfully after restore.
    PASS: Perform optimized B&R on AIO-SX, verify that custom images are
    in registry.local after restore.
    PASS: Tested by creating and installing an iso as AIO-SX.
    PASS: Tested by performing multiple k8s upgrade from 1.24 to 1.27.
    PASS: Tested by performing unoptimized B&R on AIO-SX.
    PASS: Tested by performing platform upgrade.
    PASS: Tested by installing DC system.
    PASS: Tested by performing Subcloud prestage.

    Closes-Bug: 2051005

    Change-Id: Iece7229a6c0089c99be6905d7d3b9e053c45d385
    Signed-off-by: Boovan Rajendran <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.config
Changed in starlingx:
assignee: nobody → Boovan Rajendran (brajendr)