Download images fails when there are many images

Bug #2051005 reported by Boovan Rajendran
This bug affects 1 person
Affects:      StarlingX
Status:       Fix Released
Importance:   Medium
Assigned to:  Boovan Rajendran

Bug Description

Brief Description:

This can happen any time the download step is used, but it is most likely during a restore, when a large number of images are downloaded.

While performing an optimized restore, all images are re-downloaded.

During this download phase the containerd cache is not cleared, so the cache can fill up before all the images have been downloaded, which causes the download task to fail.
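
A quick way to watch this happen during the download step (a sketch only; it assumes the same mount layout shown in the df/du output below, and the watch interval is arbitrary):

# observe docker-lv usage and the containerd content/snapshotter caches growing
sudo watch -n 30 'df -h /var/lib/docker; du -sh /var/lib/docker/io.containerd.content.v1.content /var/lib/docker/io.containerd.snapshotter.v1.overlayfs'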

Severity:

Critical

Steps to Reproduce:

Deploy a system
Increase the size of the docker-distribution lv so it is larger than the docker lv
e.g. system controllerfs-modify docker-distribution=40
Push images to registry.local until it holds much more than the docker lv size (a pull/tag/push sketch follows this list)
Backup
Optimized restore
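
One way to grow registry.local beyond the docker lv size is to repeatedly pull, tag and push public images, mirroring the example in the fix's commit message further down; busybox is only an illustrative image name:

docker login registry.local:9001 -u admin
docker image pull busybox
docker tag busybox:latest registry.local:9001/docker.io/busybox:latest
docker push registry.local:9001/docker.io/busybox:latest
# repeat with enough (larger) images until docker-distribution holds well more than the docker lv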

Expected Behavior:

Restore works

Actual Behavior:

Restore fails because there is not enough space to download the images

Reproducibility:
100%

System Configuration

AIO-SX and system controllers

Last Pass

N/A

Ansible:

TASK [common/push-docker-images : Download images and push to local registry] **************************************************
Tuesday 12 December 2023  21:05:49 +0000 (0:00:00.064)       0:12:38.552 ******
FAILED - RETRYING: Download images and push to local registry (10 retries left).
FAILED - RETRYING: Download images and push to local registry (9 retries left).
FAILED - RETRYING: Download images and push to local registry (8 retries left).
FAILED - RETRYING: Download images and push to local registry (7 retries left).

Containerd.log:

2023-12-12T21:18:02.294 localhost containerd[14351]: info time="2023-12-12T21:18:02.294432122Z" level=error msg="PullImage \"registry.local:9001/quay.io/calico/pod2daemon-flexvol:v3.22.2\" failed" error="failed to pull and unpack image \"registry.local:9001/quay.io/calico/pod2daemon-flexvol:v3.22.2\": mkdir /var/lib/docker/io.containerd.content.v1.content/ingest/c69b7b05ef3eff9307747b128f647dbc9c2a6c9fb0e97ec94d11b2a2ae3e9679: no space left on device"
2023-12-12T21:18:02.384 localhost containerd[14351]: info time="2023-12-12T21:18:02.383925925Z" level=info msg="PullImage \"registry.local:9001/k8s.gcr.io/sig-storage/csi-attacher:v3.4.0\""
2023-12-12T21:18:02.401 localhost containerd[14351]: info time="2023-12-12T21:18:02.400809952Z" level=error msg="PullImage \"registry.local:9001/k8s.gcr.io/sig-storage/csi-attacher:v3.4.0\" failed" error="failed to pull and unpack image \"registry.local:9001/k8s.gcr.io/sig-storage/csi-attacher:v3.4.0\": mkdir /var/lib/docker/io.containerd.content.v1.content/ingest/3c877f43151daa56b7426eb413edd5ed002b7f094d4616b0061458137c45b94a: no space left on device"
2023-12-12T21:18:02.483 localhost containerd[14351]: info time="2023-12-12T21:18:02.483030414Z" level=info msg="PullImage \"registry.local:9001/docker.io/wind-river/cloud-platform-deployment-manager:WRCP_21.12-wrs.4\""
2023-12-12T21:18:02.500 localhost containerd[14351]: info time="2023-12-12T21:18:02.499997930Z" level=error msg="PullImage \"registry.local:9001/docker.io/wind-river/cloud-platform-deployment-manager:WRCP_21.12-wrs.4\" failed" error="failed to pull and unpack image \"registry.local:9001/docker.io/wind-river/cloud-platform-deployment-manager:WRCP_21.12-wrs.4\": mkdir /var/lib/docker/io.containerd.content.v1.content/ingest/021797cd63eabca90739f10da97adbfd0472bc8562bbe2e166bc55664a6f6848: no space left on device"
2023-12-12T21:18:02.594 localhost containerd[14351]: info time="2023-12-12T21:18:02.593296317Z" level=info msg="PullImage \"registry.local:9001/quay.io/jetstack/cert-manager-acmesolver:v1.13.1\""
2023-12-12T21:18:02.611 localhost containerd[14351]: info time="2023-12-12T21:18:02.611366921Z" level=error msg="PullImage \"registry.local:9001/quay.io/jetstack/cert-manager-acmesolver:v1.13.1\" failed" error="failed to pull and unpack image \"registry.local:9001/quay.io/jetstack/cert-manager-acmesolver:v1.13.1\": mkdir /var/lib/docker/io.containerd.content.v1.content/ingest/5927bebbc37bbc112a47c1e1904f4d6b01462998cc0e3a6032f143742f500128: no space left on device"

df -h

sysadmin@controller-0:~$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
none                              7.6G     0  7.6G   0% /dev
tmpfs                             7.7G  3.9M  7.7G   1% /run
/dev/mapper/cgts--vg-root--lv      20G  6.3G   13G  34% /sysroot
/dev/sda4                         2.0G  205M  1.6G  12% /boot
tmpfs                             7.7G  312K  7.7G   1% /dev/shm
tmpfs                             5.0M     0  5.0M   0% /run/lock
tmpfs                             4.0M     0  4.0M   0% /sys/fs/cgroup
tmpfs                             1.0G  196K  1.0G   1% /tmp
/dev/mapper/cgts--vg-var--lv       20G  5.8G   13G  32% /var
/dev/sda3                         300M   14M  287M   5% /boot/efi
/dev/mapper/cgts--vg-log--lv      7.6G  3.4M  7.2G   1% /var/log
/dev/sda2                          29G   26G  2.2G  93% /var/rootdirs/opt/platform-backup
/dev/mapper/cgts--vg-docker--lv    30G   30G  176K 100% /var/lib/docker
/dev/mapper/cgts--vg-scratch--lv   32G   28K   30G   1% /var/rootdirs/scratch
/dev/mapper/cgts--vg-backup--lv    25G   24K   24G   1% /var/rootdirs/opt/backups
/dev/mapper/cgts--vg-kubelet--lv  9.8G   24K  9.3G   1% /var/lib/kubelet
/dev/drbd0                         20G  126M   19G   1% /var/lib/postgresql
/dev/drbd1                        2.0G  384M  1.5G  21% /var/lib/rabbitmq
/dev/drbd2                        9.8G  2.0M  9.3G   1% /var/rootdirs/opt/platform
/dev/drbd5                        990M   24K  923M   1% /var/rootdirs/opt/extension
/dev/drbd7                        4.9G   28K  4.6G   1% /var/rootdirs/opt/etcd
/dev/drbd8                         40G   17G   21G  45% /var/lib/docker-distribution

sudo du -hd1 /var/lib/docker/

sysadmin@controller-0:~$ sudo du -hd1 /var/lib/docker/
24K     /var/lib/docker/containerd
0       /var/lib/docker/containers
0       /var/lib/docker/plugins
0       /var/lib/docker/overlay2
4.0K    /var/lib/docker/image
24K     /var/lib/docker/volumes
0       /var/lib/docker/trust
28K     /var/lib/docker/network
0       /var/lib/docker/swarm
72K     /var/lib/docker/buildkit
0       /var/lib/docker/tmp
0       /var/lib/docker/runtimes
0       /var/lib/docker/tmpmounts
7.7G    /var/lib/docker/io.containerd.content.v1.content
0       /var/lib/docker/io.containerd.snapshotter.v1.btrfs
0       /var/lib/docker/io.containerd.snapshotter.v1.native
22G     /var/lib/docker/io.containerd.snapshotter.v1.overlayfs
7.3M    /var/lib/docker/io.containerd.metadata.v1.bolt
0       /var/lib/docker/io.containerd.runtime.v1.linux
0       /var/lib/docker/io.containerd.runtime.v2.task
30G     /var/lib/docker/

Alarms

N/A

Test Activity:

Developer Testing

Workaround:

While Ansible is running the "Download images and push to local registry" step, execute the following so that the docker-lv does not become full:

sudo bash -c -- 'while true; do sleep 60 && crictl rmi --prune; done'
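
To confirm the loop is keeping the cache bounded, the docker-lv can be checked from a second shell while the task runs (same filesystem as in the df output above):

df -h /var/lib/docker
sudo crictl images | wc -l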

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/906304
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/d81436d34eb867a16788b66393bb3783b478e581
Submitter: "Zuul (22348)"
Branch: master

commit d81436d34eb867a16788b66393bb3783b478e581
Author: Boovan Rajendran <email address hidden>
Date: Mon Jan 22 11:43:03 2024 -0500

    Exclude unwanted cached images to download during optimized restore

    During the optimized restore operation, the download images step
    fails because there is not enough space in the crictl image cache
    to store all of the images.

    During the backup phase of an optimized B&R, a list of the crictl
    cached images is saved as part of the backup. During restore, only
    the images present in that cached image list are downloaded; all
    other images are excluded.

    We still need the list of k8s control plane images to cover the
    scenarios below:
     - crictl_image_cache_list is not present in the backup file during restore.
     - The crictl image cache was cleared before backup.

    * Push an image to registry.local before backup in a way that does not
      add it to cache.
      ```
      docker login registry.local:9001 -u admin
      docker image pull busybox
      docker tag busybox:latest registry.local:9001/docker.io/busybox:latest
      docker push registry.local:9001/docker.io/busybox:latest
      ```
    * Check the pushed image is not present in crictl image cache
      after optimized restore.
      ```
      crictl images
      ```
    * Check the pushed image is present in registry.local
      after optimized restore.
      ```
      source /etc/platform/openrc
      system registry-image-list
      ```
    Test plan:

    PASS: Perform optimized B&R on AIO-SX, verify unwanted cached images
    deleted successfully after restore.
    PASS: Perform optimized B&R on AIO-SX, verify that custom images are
    in registry.local after restore.
    PASS: Tested by creating and installing an iso as AIO-SX.
    PASS: Tested by performing multiple k8s upgrade from 1.24 to 1.27.
    PASS: Tested by performing unoptimized B&R on AIO-SX.
    PASS: Tested by performing platform upgrade.
    PASS: Tested by installing DC system.
    PASS: Tested by performing Subcloud prestage.

    Closes-Bug: 2051005

    Change-Id: Iece7229a6c0089c99be6905d7d3b9e053c45d385
    Signed-off-by: Boovan Rajendran <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.config
Changed in starlingx:
assignee: nobody → Boovan Rajendran (brajendr)