Kubernetes point release upgrades cause ImagePullBackOff when using custom registry

Bug #1891530 reported by Chris Johnston
This bug affects 1 person
Affects                     Status        Importance  Assigned to     Milestone
CDK Addons                  Fix Released  High        Kevin W Monroe
Charmed Kubernetes Testing  Fix Released  High        Kevin W Monroe

Bug Description

When a point release of Kubernetes is published that requires new (updated) images, an environment using a custom registry ends up with pods in ImagePullBackOff because the new images aren't available in the custom registry.

Tags: sts
description: updated
Tim Van Steenburgh (tvansteenburgh) wrote :

Talked to Chris on MM and gathered some info.

In this case, the customer cluster had just upgraded to 1.17.10:

kubernetes-master 1.17.10 active 3 kubernetes-master local 0 ubuntu
kubernetes-worker 1.17.10 active 3 kubernetes-worker local 0 ubuntu exposed

The customer (who is using their own registry) now has pods in ImagePullBackOff because their registry doesn't have nvidia/k8s-device-plugin:v0.7.0-rc.5.

Looking at https://github.com/charmed-kubernetes/bundle/blob/master/container-images.txt at the time of this writing, we can see that the latest version of cdk-addons is 1.17.9, and that this image did indeed change since 1.17.8 (and in fact, it has changed 6 times in the 1.17.x series).

Tim Van Steenburgh (tvansteenburgh) wrote :

Strangely, there are no nvidia images in rocks.canonical.com:

tvansteenburgh@slipgate:~$ curl -s https://rocks.canonical.com/v2/_catalog | python3 -mjson.tool | grep nvidia
tvansteenburgh@slipgate:~$
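
The same check can be scripted. The following is a minimal sketch, assuming the registry exposes the standard Docker Registry HTTP API v2 `_catalog` endpoint without authentication; the function names are illustrative, and only the first page of a paginated catalog is read.

```python
import json
import urllib.request
from typing import List


def filter_repositories(catalog_json: str, needle: str) -> List[str]:
    """Return repository names from a /v2/_catalog response that contain needle."""
    catalog = json.loads(catalog_json)
    # The catalog response has the shape {"repositories": ["name", ...]}.
    return [name for name in catalog.get("repositories", []) if needle in name]


def fetch_catalog(registry: str) -> str:
    """Fetch the raw catalog JSON (needs network access to the registry)."""
    with urllib.request.urlopen(f"https://{registry}/v2/_catalog") as resp:
        return resp.read().decode("utf-8")
```

Calling `filter_repositories(fetch_catalog("rocks.canonical.com"), "nvidia")` mirrors the curl and grep pipeline above; an empty list corresponds to the empty grep output.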

Tim Van Steenburgh (tvansteenburgh) wrote :

The 1.17 Components page lists the version that accompanied the 1.17.0 release: https://ubuntu.com/kubernetes/docs/1.17/components

George Kraft (cynerva)
no longer affects: charmed-kubernetes-bundles
Changed in cdk-addons:
importance: Undecided → Medium
status: New → Triaged
Felipe Reyes (freyes)
tags: added: sts
George Kraft (cynerva)
Changed in cdk-addons:
importance: Medium → High
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Based on internal discussions and investigation, so far there have been 4 ideas brought up:

1) Prevent snaps from automatically updating themselves

a) Use Enterprise Snap Proxy
b) Snap devmode

2) Delay the snap update by up to 90 days

3) Have better notifications

a) Release notes indicating image updates on each new point release
b) Integrate with the LMA stack for alerts on snap/image list/point releases

4) Automatically detect the problem and do not do the upgrade

Approach #1 lets the operator apply updates during a scheduled maintenance window.
Approach #2 and/or #3 lets the operator create a cron job or script that monitors upstream release notes or the image list [1] for changes and pulls the new images into the custom registry.
Approach #4 seems to be the ideal one. By using snap hooks [2], the snap could validate whether the required images are accessible and, if they are not, prevent the daemonset YAMLs from being updated to reference the new images, thereby avoiding the ImagePullBackOff state. As part of this, a message could also inform the user that the new required images were not found.

[1] https://github.com/charmed-kubernetes/bundle/blob/master/container-images.txt
[2] https://forum.snapcraft.io/t/supported-snap-hooks/3795
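
To illustrate the validation step in approach #4, here is a rough sketch of a pre-upgrade check that reports which required images are missing. It is not taken from the snap or any hook; `split_image`, `missing_images`, and the `exists` callback are hypothetical names. In a real check, `exists` might issue a request against the registry's `/v2/<name>/manifests/<tag>` endpoint; here it is injected so the logic can be exercised offline. Digest references (`name@sha256:...`) are not handled.

```python
from typing import Callable, Iterable, List, Tuple


def split_image(image: str) -> Tuple[str, str]:
    """Split 'name:tag' into (name, tag); the tag defaults to 'latest'."""
    name, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        # No tag present; any colon belonged to a registry host:port prefix.
        return image, "latest"
    return name, tag


def missing_images(
    required: Iterable[str], exists: Callable[[str, str], bool]
) -> List[str]:
    """Return the required images for which exists(name, tag) is False."""
    return [img for img in required if not exists(*split_image(img))]


# Hypothetical registry contents, used only for this example.
available = {("example/app", "1.0")}
required = ["example/app:1.0", "nvidia/k8s-device-plugin:v0.7.0-rc.5"]
missing = missing_images(required, lambda name, tag: (name, tag) in available)
```

If `missing` is non-empty, a hook could refuse to apply the new manifests and report the list to the operator instead.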

Changed in cdk-addons:
assignee: nobody → Kevin W Monroe (kwmonroe)
Changed in cdk-addons:
status: Triaged → In Progress
Changed in cdk-addons:
milestone: none → 1.21
Kevin W Monroe (kwmonroe) wrote :

For CK 1.21, we're recommending customers opt for 1a + 3a from comment #4. That is, use the snap store proxy to hold snap versions (1a) until they can review/act on release notes for image changes in new snap versions (3a).

PR for the cdk-addons piece of this is up for review:

https://github.com/charmed-kubernetes/cdk-addons/pull/203

Changed in charmed-kubernetes-testing:
assignee: nobody → Kevin W Monroe (kwmonroe)
importance: Undecided → High
milestone: none → 1.21
Kevin W Monroe (kwmonroe) wrote :

The other piece of this will be to update the build-cdk-addons jenkins job to utilize the new "compare-images" target so that info can feed into the release notes for new releases.
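
As a rough illustration of the kind of comparison that could feed release notes, the sketch below diffs two image lists. This is not the code behind the "compare-images" target; the function name and the assumption that each release is represented as a plain list of image references are hypothetical.

```python
from typing import Dict, Iterable, List


def compare_images(old: Iterable[str], new: Iterable[str]) -> Dict[str, List[str]]:
    """Report image references added or removed between two releases."""
    old_set, new_set = set(old), set(new)
    return {
        "added": sorted(new_set - old_set),      # must be mirrored before upgrading
        "removed": sorted(old_set - new_set),    # no longer referenced
    }


# Hypothetical image lists for two consecutive point releases.
previous = ["example/app:1.0", "example/plugin:0.6"]
current = ["example/app:1.0", "example/plugin:0.7"]
diff = compare_images(previous, current)
```

An image whose tag changes shows up once under "removed" (old tag) and once under "added" (new tag); the "added" entries are the ones an operator with a custom registry would need to pull before upgrading.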

Changed in charmed-kubernetes-testing:
assignee: Kevin W Monroe (kwmonroe) → nobody
status: New → In Progress
assignee: nobody → Kevin W Monroe (kwmonroe)
Kevin W Monroe (kwmonroe) wrote :
Changed in cdk-addons:
status: In Progress → Fix Committed
Changed in charmed-kubernetes-testing:
status: In Progress → Fix Committed
Changed in cdk-addons:
status: Fix Committed → Fix Released
Changed in charmed-kubernetes-testing:
status: Fix Committed → Fix Released