Containers: Worker nodes are pulling from the external registry instead of from the internal registry

Bug #1817958 reported by Frank Miller
| Affects | Status | Importance | Assigned to | Milestone |
| StarlingX | Fix Released | Medium | Angie Wang | |

Bug Description

Brief Description
-----------------
After the worker nodes were installed and unlocked, they were stuck at ContainerCreating for an extended period (> 40m). Investigation determined that the worker nodes were trying to pull from the external registry instead of from the internal registry.

Severity
--------
Minor

Steps to Reproduce
------------------
- Install and configure controller-0
- Install controller-1 and worker nodes from controller-0, and unlock them

Expected Behavior
------------------
- Worker nodes should pull images from internal registry

Actual Behavior
----------------
- Worker nodes tried to pull images from the external registry and were stuck at NotReady - ContainerCreating for more than 40 minutes

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
f/stein as of 2019-02-25

Timestamp/Logs
--------------
# All nodes unlocked and available:
[2019-02-26 03:12:48,748] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-list'
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-0 ~(keystone_admin)]$

# worker nodes stuck for 45 minutes
NAME STATUS ROLES AGE VERSION
compute-0 NotReady <none> 45m v1.12.3
compute-1 NotReady <none> 44m v1.12.3
controller-0 Ready master 117m v1.12.3
controller-1 Ready master 65m v1.12.3

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep -v -e Completed -e Running
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
kube-system calico-node-m6znx 0/2 ContainerCreating 0 41m 192.168.204.91 compute-0 <none>
kube-system calico-node-w9nlk 0/2 ContainerCreating 0 40m 192.168.204.185 compute-1 <none>
kube-system kube-proxy-66j88 0/1 ContainerCreating 0 40m 192.168.204.185 compute-1 <none>
kube-system kube-proxy-86jn4 0/1 ContainerCreating 0 41m 192.168.204.91 compute-0 <none>

# seems to be pulling images from external repo
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 43m default-scheduler Successfully assigned kube-system/calico-node-m6znx to compute-0
  Warning FailedCreatePodSandBox 2m30s (x87 over 42m) kubelet, compute-0 Failed create pod sandbox: rpc error: code = Unknown desc = failed pulling image "k8s.gcr.io/pause:3.1": Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

# Note that the workers eventually recovered automatically and reached Ready status. However, to minimize time spent accessing the external registry, worker nodes should pull from the internal registry.
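
To make the symptom in the event log easier to spot, here is a minimal sketch that extracts the registry host from an image reference such as the "k8s.gcr.io/pause:3.1" seen above. The function names and the default internal host name are hypothetical and not taken from the StarlingX code. The host detection follows Docker's usual convention that the first path component counts as a registry only if it contains a dot or a colon, or is "localhost".

```python
def registry_host(image_ref: str) -> str:
    """Return the registry host that an image reference points to."""
    first, sep, _rest = image_ref.partition("/")
    # The first component is a registry host only if it looks like one;
    # otherwise the reference is resolved against the default registry.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def is_internal(image_ref: str, internal_host: str = "registry.local") -> bool:
    """True if the reference points at the internal registry (hypothetical helper)."""
    # Ignore an optional port when comparing host names.
    return registry_host(image_ref).split(":")[0] == internal_host


print(registry_host("k8s.gcr.io/pause:3.1"))  # prints k8s.gcr.io
print(is_internal("k8s.gcr.io/pause:3.1"))    # prints False
```

Run against the image name from the FailedCreatePodSandBox event, this reports an external host, which matches the behaviour described in this report.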

Frank Miller (sensfan22)
tags: added: stx.containers
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Angie Wang (angiewang)
Ghada Khalil (gkhalil) wrote :

Marking as release gating. The worker nodes should pull images from an internal registry on the controller instead of going to the external registry every time. Medium priority as this is a performance optimization.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Bill Zvonar (billzvonar) wrote :

Assigned to Bruce for re-assignment.

Changed in starlingx:
assignee: Angie Wang (angiewang) → Bruce Jones (brucej)
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → Abraham Arce (xe1gyq)
Abraham Arce (xe1gyq)
Changed in starlingx:
assignee: Abraham Arce (xe1gyq) → Erich Cordoba (ericho)
Erich Cordoba (ericho) wrote :

I was able to reproduce this issue.

It seems that the images downloaded in the ansible stage are not pushed to the internal registry. Also, they might need to be retagged so the additional workers can get them from the internal registry.

As a first attempt, I'll try to retag the images with the registry.local prefix and then push them into the internal registry, regardless of whether the image comes from the Internet or from a private registry.

----
Probably not related, but in a first experiment, a "docker pull" from registry.local was extremely
slow (it seems to be DNS); adding the host to /etc/hosts solved the issue.
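
One way to check whether name resolution is the slow part is to time the lookup by itself. The following is only a sketch: the helper name `time_lookup` is hypothetical, and it measures a single resolver call rather than the full "docker pull".

```python
import socket
import time


def time_lookup(host, resolver=socket.gethostbyname):
    """Resolve `host` and return (address or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        address = resolver(host)
    except OSError:
        # socket.gaierror is a subclass of OSError; treat failure as "no address".
        address = None
    return address, time.monotonic() - start
```

Calling `time_lookup("registry.local")` before and after adding the host to /etc/hosts would show whether the delay comes from the lookup.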

Erich Cordoba (ericho) wrote :

I sent the following fix proposal to the mailing list:

So, to fix this I would like to propose the following:

1. Ensure that all images have the registry.local prefix,
regardless of whether a private registry is defined. I'm not sure yet how to do
this and would appreciate some help pointing me in the right direction.
2. Change the logic in download_an_image to do this:

Try to download image:
If not found then:
    Remove registry.local prefix
    Try to download from public/private registry
    Retag and push to local registry.

This way, all nodes will try registry.local first and then fall back to the
public/private registry if the image is not found.
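
The proposed flow can be sketched in Python as follows. This is an illustration of the steps listed above, not the actual download_an_image implementation: the `pull`, `tag`, and `push` callables, the `ImageNotFound` exception, and the exact form of the prefix are all assumptions made for the example (the real registry address may include a port).

```python
LOCAL_PREFIX = "registry.local/"  # assumed form of the registry.local prefix


class ImageNotFound(Exception):
    """Hypothetical error raised by `pull` when a registry lacks the image."""


def download_an_image(image, pull, tag, push):
    """Try the local registry first, then fall back to the upstream registry.

    `pull(ref)`, `tag(src, dst)` and `push(ref)` are injected callables that
    stand in for the real registry operations. Returns "local" or "upstream"
    to indicate where the image came from.
    """
    # Step 1 of the proposal: every image carries the registry.local prefix.
    if not image.startswith(LOCAL_PREFIX):
        image = LOCAL_PREFIX + image
    try:
        pull(image)
        return "local"
    except ImageNotFound:
        # Remove the prefix and download from the public/private registry.
        upstream = image[len(LOCAL_PREFIX):]
        pull(upstream)
        # Retag and push so later pulls are served by the local registry.
        tag(upstream, image)
        push(image)
        return "upstream"
```

With this shape, only the first request for an image reaches the external registry; once the image has been pushed, the initial pull from registry.local succeeds.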

Frank Miller (sensfan22)
tags: added: stx.3.0
removed: stx.2.0
Ghada Khalil (gkhalil) wrote :

This has been addressed as part of the authenticated registry story: https://storyboard.openstack.org/#!/story/2006274

Gerrit Review: https://review.opendev.org/#/c/686057/

Merged on 2019-10-04

Changed in starlingx:
assignee: Erich Cordoba (ericho) → Angie Wang (angiewang)
status: Triaged → Fix Released