User populated images not present in local registry after B&R

Bug #1886152 reported by Dan Voiculeasa
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      David Sullivan

Bug Description

Brief Description
-----------------
After a backup and restore (B&R), user-created pods that use images from the local registry (registry.local) remained in the ImagePullBackOff state.

Severity
--------
Major

Steps to Reproduce
------------------

Method A)
During the first bootstrap, use `additional_local_registry_images` to specify a list of images to be downloaded and pushed to the local registry.

additional_local_registry_images:
- my-domain/my-repo/my-app:my-tag

Method B)
After unlock, manually download a custom image `my-domain/my-repo/my-app:my-tag` and push it to the local registry.

After either method A or B:

Deploy a pod container using registry.local:9001/my-domain/my-repo/my-app:my-tag.
Perform the backup and restore.

Expected Behavior
------------------
All pods recover after the restore

Actual Behavior
----------------
A few pods were stuck in the ImagePullBackOff state, e.g.:
test test-pod-676fb84f84-j4r84 0/1 ImagePullBackOff 0 89m
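
To spot the affected pods after a restore, the pod listing can be filtered by status. This is a minimal Python sketch, assuming the column order of the line above (namespace, name, ready, status, ...) as printed by `kubectl get pods --all-namespaces --no-headers`. The function name is hypothetical and the second sample line is made up for contrast.

```python
def pods_in_image_pull_backoff(listing):
    """Return (namespace, pod) pairs whose STATUS column is ImagePullBackOff."""
    stuck = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        namespace, name, _ready, status = fields[:4]
        if status == "ImagePullBackOff":
            stuck.append((namespace, name))
    return stuck


# First line is taken from this report; the second is an invented healthy pod.
sample = """\
test test-pod-676fb84f84-j4r84 0/1 ImagePullBackOff 0 89m
kube-system example-pod-0 1/1 Running 0 89m
"""

for namespace, name in pods_in_image_pull_backoff(sample):
    print(f"{namespace}/{name} is stuck in ImagePullBackOff")
```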

Reproducibility
---------------
Reproducible

System Configuration
--------------------
ALL

Branch/Pull Time/Commit
-----------------------
ISO built on July 1

Last Pass
---------

Timestamp/Logs
--------------

Test Activity
-------------

Workaround
----------
The user has to manually populate the local registry with their custom pod container images.

Changed in starlingx:
assignee: nobody → Dan Voiculeasa (dvoicule)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/739195
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=e2cbd22b3fe79fc2e0e69a6de0bd0f6c14417b0c
Submitter: Zuul
Branch: master

commit e2cbd22b3fe79fc2e0e69a6de0bd0f6c14417b0c
Author: Dan Voiculeasa <email address hidden>
Date: Fri Jul 3 12:01:10 2020 +0300

    Optimize download image list

    Use `unique` filter to remove duplicates in the list.

    Partial-Bug: 1886152
    Change-Id: Ia9b2276e1654b9e413ec5402a9c2e0d804f66f9b
    Signed-off-by: Dan Voiculeasa <email address hidden>
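
As an illustration of what de-duplicating the download list achieves, here is a small Python stand-in. It is not the Ansible `unique` filter itself; it only assumes the goal is to keep the first occurrence of each image and drop later repeats. The second image name is a hypothetical placeholder.

```python
def unique_images(images):
    """Drop repeated image references, keeping first-seen order."""
    # dict preserves insertion order, so this keeps the first occurrence.
    return list(dict.fromkeys(images))


download_list = [
    "my-domain/my-repo/my-app:my-tag",
    "example.test/other/app:1.0",        # hypothetical second image
    "my-domain/my-repo/my-app:my-tag",   # duplicate, should be pulled once
]

print(unique_images(download_list))
```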

Ghada Khalil (gkhalil)
tags: added: stx.update
Ghada Khalil (gkhalil) wrote :

Marking as stx.5.0 / medium priority - workaround exists

tags: added: stx.5.0
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/739196
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=eee995ced9e0b06dc7ba3982732dace12de45313
Submitter: Zuul
Branch: master

commit eee995ced9e0b06dc7ba3982732dace12de45313
Author: Dan Voiculeasa <email address hidden>
Date: Thu Jul 2 22:22:25 2020 +0300

    B&R: Save image name and tags present in local registry

    During backup, save all images in the form 'name:tag' to a list in
    the ansible overrides file that the restore procedure picks up.

    This way the images in the local registry are repopulated during
    restore, specifically when the restore procedure imports the
    bootstrap playbook.

    Closes-Bug: 1886152
    Depends-On: Ia9b2276e1654b9e413ec5402a9c2e0d804f66f9b
    Change-Id: I189428d10c83dae54e2121e6c8f6363fbb14f53a
    Signed-off-by: Dan Voiculeasa <email address hidden>
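
A rough Python sketch of building such a 'name:tag' list follows. It assumes the repository-to-tags mapping has already been collected (for example from the registry's tags listing, which may report no tags for a repository); the function name and sample data are hypothetical and not taken from the commit.

```python
def image_list(tags_by_repository):
    """Flatten {repository: [tags]} into a sorted list of 'name:tag' strings."""
    images = []
    for name in sorted(tags_by_repository):
        # A repository can report no tags at all (None); treat it as empty.
        for tag in sorted(tags_by_repository[name] or []):
            images.append(f"{name}:{tag}")
    return images


catalog = {
    "my-domain/my-repo/my-app": ["my-tag"],
    "example/empty-repo": None,  # hypothetical repository without tags
}

print(image_list(catalog))
```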

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Re-opening. This fix failed verification.

From Frank Miller:
A B&R test with subclouds failed because the system uses a private registry that is specified with an explicit port number. When an image is pulled from it, the local registry stores the image without the port number.

Therefore, when the software backup is done, the local registry has the image tag without the port number, and at restore (or upgrade) time the image cannot be pulled because the port number is missing and not known. This works for users whose private registry has no port number, but not for a private registry that uses one.

The way the local registry keeps images will need to be updated to preserve the port number for this to work.
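
The failure mode can be shown with a short Python sketch. It assumes the local registry name is derived by dropping the port from the source registry host, as described above; the helper names and the `private.example` host are hypothetical, and IPv6 literals are not handled.

```python
def split_registry(reference):
    """Split 'host[:port]/path:tag' into (registry, remainder).

    Follows the usual Docker convention: the first path component is a
    registry only if it contains '.' or ':' or equals 'localhost'.
    """
    first, sep, rest = reference.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return "", reference


def stored_name(reference):
    """Name kept in the local registry, with the source port dropped."""
    registry, remainder = split_registry(reference)
    host = registry.split(":", 1)[0]
    return f"{host}/{remainder}" if host else remainder


with_port = "private.example:5000/my-repo/my-app:my-tag"
without_port = "private.example/my-repo/my-app:my-tag"

# Both sources collapse to the same stored name, so the port cannot be
# recovered from the backup and the later pull goes to the wrong endpoint.
print(stored_name(with_port))
print(stored_name(with_port) == stored_name(without_port))
```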

Changed in starlingx:
status: Fix Released → Confirmed
Ghada Khalil (gkhalil) wrote :

https://review.opendev.org/739196 was reverted as it resulted in B&R testing failures. The commit used to revert is: https://review.opendev.org/#/c/741182/

Ghada Khalil (gkhalil) wrote :

The other commit tied to this LP (https://review.opendev.org/739195) was also reverted from stx master & r/stx.4.0
revert in stx master: https://review.opendev.org/#/c/741253/
revert in r/stx.4.0: https://review.opendev.org/#/c/741262/

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/741976

Changed in starlingx:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/742118

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/742119

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/743281

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/742118
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=81e4fce4f01c54be5e2b22a6082d98f0a842c25e
Submitter: Zuul
Branch: master

commit 81e4fce4f01c54be5e2b22a6082d98f0a842c25e
Author: Dan Voiculeasa <email address hidden>
Date: Thu Jul 16 20:01:48 2020 +0300

    Refactor docker images information

    Extract from role `push-docker-images` a common role
    `load-images-information`.

    The role purpose is to set facts about platform images.

    This new role will be used by the backup procedure to filter out
    platform images.

    Partial-Bug: 1886152
    Change-Id: Ic3f7a82171dccaa15a76b2578135163dc78201f3
    Signed-off-by: Dan Voiculeasa <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/741976
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=ad114fba58a04fe8a0803624df83238822292bae
Submitter: Zuul
Branch: master

commit ad114fba58a04fe8a0803624df83238822292bae
Author: Dan Voiculeasa <email address hidden>
Date: Thu Jul 16 15:39:41 2020 +0300

    Utils: get local registry image list

    The backup procedure will use the functionality described here to
    save images of unknown origin from the local registry to an
    archive.

    This commit adds `local-registry-list` option to /usr/bin/sysinv-utils.
    sudo /usr/bin/sysinv-utils local-registry-list <file> [--no-apps]
    Example:
    sudo /usr/bin/sysinv-utils local-registry-list /tmp/123
    sudo /usr/bin/sysinv-utils local-registry-list /tmp/123 --no-apps

    Save to a file the list of images present in local registry.
    If `--no-apps` is specified then retrieve the list of all images used
    by all apps so that they will be removed from list of present images.
    From a backup procedure perspective: the final goal is to not save
    apps images to images backup file.

    The implication is that if the user pushes an image to the local
    registry with the same name used by an uploaded app image this won't be
    backed up [scenario 1]. A similar scenario for apply-failed apps
    images. [scenario 2]
    Uploaded and apply-failed apps won't have their images auto-downloaded
    during restore. After the restore user needs to manually push those
    images to local registry.

    If retrieving only images for applied apps the archive can be filled
    with apply-failed apps images.
    Focus on minimizing disk usage: retrieve all images for all applications
    so that they can be excluded from the archive.

    Improvement note: we could retrieve here only apps in
    applied/apply-failed state if during restore the images for apply-failed
    apps would also be downloaded. This avoids [scenario 1].

    Partial-Bug: 1886152
    Change-Id: I0852702be7d47cd5173a7815f33812e75332c63a
    Signed-off-by: Dan Voiculeasa <email address hidden>
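
The effect of `--no-apps`, including [scenario 1], can be sketched in Python as a set difference. This is only an illustration of the described behaviour, not the sysinv-utils code; the function name and all image names are hypothetical.

```python
def local_registry_list(present, app_images, no_apps=False):
    """Images to report; with no_apps, drop every image used by any app."""
    if not no_apps:
        return sorted(present)
    return sorted(set(present) - set(app_images))


present = [
    "my-domain/my-repo/my-app:my-tag",  # pushed by the user
    "example/app-image:1.0",            # also pushed by the user (scenario 1)
]
# Images referenced by all apps, regardless of app state (uploaded,
# applied, apply-failed), as the commit message describes.
app_images = ["example/app-image:1.0"]

# The image that shares its name with an app image is excluded and
# therefore would not end up in the backup archive.
print(local_registry_list(present, app_images, no_apps=True))
```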

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/742119
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=8345f30dfae16f71bb798a4d5c8c7b17400bd757
Submitter: Zuul
Branch: master

commit 8345f30dfae16f71bb798a4d5c8c7b17400bd757
Author: Dan Voiculeasa <email address hidden>
Date: Thu Jul 16 17:31:31 2020 +0300

    B&R: Backup user images from local registry.

    The procedure is optional and is activated by setting
    backup_user_local_registry=true for the backup playbook.

    A user image is one that is not a platform or an app image.
    Information about platform images is kept in the playbooks.
    Information about apps images is retrieved through sysinv-utils.
    In fact apps images are filtered out in sysinv-utils by passing
    `--no-apps` arg.

    Save user images to local filesystem using docker pull.
    Check if there is enough disk space for the archive to be saved.
    Export to a tar archive using docker save.

    For remote plays, copying the generated user images archive is not
    implemented yet.

    Partial-Bug: 1886152
    Depends-On: Ic3f7a82171dccaa15a76b2578135163dc78201f3
    Depends-On: I0852702be7d47cd5173a7815f33812e75332c63a
    Change-Id: I4644784ea4164134f163d218e69dc4ceb148985a
    Signed-off-by: Dan Voiculeasa <email address hidden>
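
The backup steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: how the playbook estimates the required size is not described here, so `required_bytes` is a caller-supplied guess, and the helper names, archive path and platform image name are hypothetical. The commands are only built, not executed.

```python
import shutil


def user_images(present, platform_images, app_images):
    """A user image is one that is neither a platform nor an app image."""
    return sorted(set(present) - set(platform_images) - set(app_images))


def has_room_for_archive(directory, required_bytes):
    """True if the filesystem holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= required_bytes


def backup_commands(images, archive_path):
    """argv lists: pull each image locally, then export all to one tar."""
    commands = [["docker", "pull", image] for image in images]
    commands.append(["docker", "save", "-o", archive_path, *images])
    return commands


images = user_images(
    present=["registry.local:9001/my-domain/my-repo/my-app:my-tag",
             "registry.local:9001/example/platform-image:1.0"],
    platform_images=["registry.local:9001/example/platform-image:1.0"],
    app_images=[],
)
for argv in backup_commands(images, "/tmp/user_images.tar"):
    print(" ".join(argv))
```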

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/743281
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=b8e84bb5d05db95436da35c5558c0402f9838337
Submitter: Zuul
Branch: master

commit b8e84bb5d05db95436da35c5558c0402f9838337
Author: Dan Voiculeasa <email address hidden>
Date: Thu Jul 16 13:43:13 2020 +0300

    B&R: Add playbook to restore user images

    Add role for restoring registry images from docker archive.
    Import user images from docker archive to docker.
    Push the images referencing local registry. After the push, the image
    tags are deleted. If no other tags point to the same image ID, the
    image is deleted from the docker filesystem, thus recovering the
    space.

    The partition where the archive is located, the docker partition and
    the local registry partition must all be big enough.

    The archive can be bigger than the space available before the
    bootstrap playbook is played. The restore platform procedure needs to
    be called before restoring docker images, so that the bootstrap
    playbook is called and the space is allocated.

    For local plays there are 2 key variables, while for remote play there
    are 4.
    Var `initial_backup_dir` points to the directory where the backup file is
    located.
    Var `backup_filename` points to the file in the directory.
    Var `backup_dir` is used for remote plays and is the destination folder
    on the remote where the file will be copied.
    Var `ansible_remote_tmp` points to a temporary directory on the remote
    that ansible `copy` module will use to transfer the file.

    Directories pointed to by `initial_backup_dir`, `backup_dir` and
    `ansible_remote_tmp` must be mounted where free space exceeds backup
    size.

    Example call:
    restore_user_images.yml \
        -e "initial_backup_dir=/host1/sufficient/space1/" \
        -e "backup_filename=example.tar" \
        -e "ansible_remote_tmp=/host2/sufficient/space1/" \
        -e "backup_dir=/host2/sufficient/space1/"

    Closes-Bug: 1886152
    Depends-On: I4644784ea4164134f163d218e69dc4ceb148985a
    Change-Id: I0a7dcf9d174fbc07af85ef51bb4068d0dc16c560
    Signed-off-by: Dan Voiculeasa <email address hidden>
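
The restore sequence described above (import, push, delete tag) can be sketched as a list of docker commands. This Python sketch only builds the argument lists and does not run them. It assumes the images in the archive are already tagged with their local registry names; the function name and the archive path are hypothetical.

```python
def restore_commands(archive_path, images):
    """argv lists: load the archive, then push and untag each image."""
    commands = [["docker", "load", "-i", archive_path]]
    for image in images:
        commands.append(["docker", "push", image])
        # Removing the tag frees the layers once no other tag refers to
        # the same image ID.
        commands.append(["docker", "rmi", image])
    return commands


archived = ["registry.local:9001/my-domain/my-repo/my-app:my-tag"]
for argv in restore_commands("/host1/sufficient/space1/example.tar", archived):
    print(" ".join(argv))
```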

Frank Miller (sensfan22) wrote :

In addition to a B&R this issue also occurs on an upgrade for AIO-SX configs. Assigning to David to add the code change required for Dan's solution to also work on an upgrade.

Changed in starlingx:
assignee: Dan Voiculeasa (dvoicule) → David Sullivan (dsullivanwr)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744798

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744802

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/744802
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=5a8180a637ce87b539561a84a07178049ce34140
Submitter: Zuul
Branch: master

commit 5a8180a637ce87b539561a84a07178049ce34140
Author: David Sullivan <email address hidden>
Date: Tue Aug 4 19:18:48 2020 -0400

    User populated images not present after SX upgrade

    Ensure the backup created during upgrade-start includes the images data.
    Also ensure the images data files are removed during upgrade-complete.

    Change-Id: I81541f7eb3a591e5008e3859e48cde07c9f2ba01
    Partial-Bug: 1886152
    Signed-off-by: David Sullivan <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/744798
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=5d802971b98c98546fbd902738fb907352628f0f
Submitter: Zuul
Branch: master

commit 5d802971b98c98546fbd902738fb907352628f0f
Author: David Sullivan <email address hidden>
Date: Tue Aug 4 14:45:17 2020 -0400

    User populated images not present after SX upgrade

    After a SX upgrade any custom images pushed to registry.local are lost.
    Pods using those images will remain in ImagePullBackOff until the images
    are pushed manually. This change will restore those images during the SX
    upgrade.

    Change-Id: I3f902f8c0095cb6da014895f6aee55a3a057e616
    Partial-Bug: 1886152
    Signed-off-by: David Sullivan <email address hidden>
