nginx-ingress-controller apply-failed during setup of the StarlingX

Bug #1971981 reported by Alexandru Dimofte
This bug affects 1 person
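+-----------+--------------+------------+--------------------+-----------+
| Affects   | Status       | Importance | Assigned to        | Milestone |
+-----------+--------------+------------+--------------------+-----------+
| StarlingX | Fix Released | Critical   | Reinildes Oliveira |           |
+-----------+--------------+------------+--------------------+-----------+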

Bug Description

Brief Description
-----------------
nginx-ingress-controller apply-failed during setup of the StarlingX

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
Try to install the latest StarlingX image.

Expected Behavior
------------------
The StarlingX installation should complete successfully.

Actual Behavior
----------------
Installation failed at the setup stage:
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+---------------------------------------------------------------------+------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text                                                         | Entity ID                                | Severity | Time Stamp                 |
+----------+---------------------------------------------------------------------+------------------------------------------+----------+----------------------------+
| 750.002  | Application Apply Failure                                           | k8s_application=nginx-ingress-controller | major    | 2022-05-06T11:54:19.284945 |
| 200.001  | controller-0 was administratively locked to take it out-of-service. | host=controller-0                        | warning  | 2022-05-06T11:21:02.657011 |
+----------+---------------------------------------------------------------------+------------------------------------------+----------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+------------------+------------------+--------------+------------------------------------------+
| application              | version | manifest name    | manifest file    | status       | progress                                 |
+--------------------------+---------+------------------+------------------+--------------+------------------------------------------+
| nginx-ingress-controller | 1.1-25  | fluxcd-manifests | fluxcd-manifests | apply-failed | operation aborted, check logs for detail |
+--------------------------+---------+------------------+------------------+--------------+------------------------------------------+
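A sanity run could flag this state automatically instead of relying on a manual check. The following is a minimal sketch, assuming the output of `system application-list` has been captured as text with one line per table row; the function names `parse_cli_table` and `failed_applications` are hypothetical and not part of any StarlingX tool.

```python
def parse_cli_table(text):
    """Parse a '+---+' bordered table with '| a | b |' rows into dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, shell prompts and blank lines
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def failed_applications(text):
    """Return names of applications whose status column is 'apply-failed'."""
    return [
        row["application"]
        for row in parse_cli_table(text)
        if row.get("status") == "apply-failed"
    ]


# Rows copied from the report; the border lines are shortened for brevity.
SAMPLE = """
+-----+
| application | version | manifest name | manifest file | status | progress |
+-----+
| nginx-ingress-controller | 1.1-25 | fluxcd-manifests | fluxcd-manifests | apply-failed | operation aborted, check logs for detail |
+-----+
"""

print(failed_applications(SAMPLE))
```

The parser assumes each row fits on a single line, so it would not handle tables whose cells wrap across several lines.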

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
all configurations are affected

Branch/Pull Time/Commit
-----------------------
master-20220506T032159Z

Last Pass
---------
sorry, I don't remember exactly

Timestamp/Logs
--------------
will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Alexandru Dimofte (adimofte) wrote :

Ghada Khalil (gkhalil) wrote :

@Alex, I believe the sanity systems use a private registry. Did you pull the new images as per this stx email thread? http://lists.starlingx.io/pipermail/starlingx-discuss/2022-May/012955.html

tags: added: stx.7.0 stx.apps
Changed in starlingx:
importance: Undecided → Critical

Ghada Khalil (gkhalil) wrote :

screening: stx.7.0 / critical - issue is causing a red sanity

Ghada Khalil (gkhalil) wrote :

There's also another application apply issue that is impacting multiple applications (https://bugs.launchpad.net/starlingx/+bug/1972019). A fix was merged on May 9, so it would be great to retry with a newer load once it's confirmed that the new images are accessible on the sanity systems.

Changed in starlingx:
status: New → In Progress
assignee: nobody → Ghada Khalil (gkhalil)

Alexandru Dimofte (adimofte) wrote :

I will check again and I will respond here ASAP.

Alexandru Dimofte (adimofte) wrote :

OK, I checked again. We have these images pulled in our registry:
quay.io/jetstack/cert-manager-cainjector:v1.7.1
quay.io/jetstack/cert-manager-controller:v1.7.1
quay.io/jetstack/cert-manager-webhook:v1.7.1
quay.io/jetstack/cert-manager-ctl:v1.7.1
quay.io/jetstack/cert-manager-acmesolver:v1.7.1
k8s.gcr.io/defaultbackend:1.4
k8s.gcr.io/ingress-nginx/controller:v1.1.1
k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1

However, the stx setup is still failing.
The ansible.log file shows that old versions of the above images are still being requested:
controller-0:~$ cat ansible.log | grep cert-manager-cainjector
  - quay.io/jetstack/cert-manager-cainjector:v0.15.0
  - Image kasstxj.ka.intel.com:5000/jetstack/cert-manager-cainjector:v0.15.0 not found on local registry, attempt to download...
  - 'Image download succeeded: kasstxj.ka.intel.com:5000/jetstack/cert-manager-cainjector:v0.15.0'
  - 'Image push succeeded: registry.local:9001/quay.io/jetstack/cert-manager-cainjector:v0.15.0'
  - Image kasstxj.ka.intel.com:5000/jetstack/cert-manager-cainjector:v0.15.0 download succeeded by containerd
controller-0:~$
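The log lines above suggest how one image reference is rewritten at each hop: the private mirror path drops the upstream registry host, while the registry.local path keeps the full upstream reference. The following is a minimal sketch of that mapping, inferred only from these log lines and not taken from the playbook code; the function names are hypothetical, and it assumes every reference starts with an explicit registry host.

```python
def mirror_ref(image, mirror_host):
    """Replace the upstream registry host with the private mirror host."""
    _upstream_host, _, path_and_tag = image.partition("/")
    return f"{mirror_host}/{path_and_tag}"


def local_ref(image, local_host="registry.local:9001"):
    """Prefix the full upstream reference with the local registry host."""
    return f"{local_host}/{image}"


upstream = "quay.io/jetstack/cert-manager-cainjector:v0.15.0"
print(mirror_ref(upstream, "kasstxj.ka.intel.com:5000"))
print(local_ref(upstream))
```

Under this mapping, a mirror that only holds the v1.7.1 tags can still serve a v0.15.0 request if the old tag is also present, which would explain why the download succeeds for the old version here.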

I guess some Ansible playbooks need to be updated?
controller-0:~$ cat /usr/share/ansible/stx-ansible/playbooks/roles/common/load-images-information/vars/k8s-v1.21.8/system-images.yml
---
# System images that are pre-pulled and pushed to local registry
n3000_opae_img: docker.io/starlingx/n3000-opae:stx.6.0-v1.0.1
tiller_img: ghcr.io/helm/tiller:v2.16.9
armada_img: quay.io/airshipit/armada:ddbdd7256c20f138737f6cbd772312f7a19f58b8-ubuntu_bionic
kubernetes_entrypoint_img: quay.io/stackanetes/kubernetes-entrypoint:v0.3.1
calico_cni_img: quay.io/calico/cni:v3.19.1
calico_node_img: quay.io/calico/node:v3.19.1
calico_kube_controllers_img: quay.io/calico/kube-controllers:v3.19.1
calico_flexvol_img: quay.io/calico/pod2daemon-flexvol:v3.19.1
multus_img: ghcr.io/k8snetworkplumbingwg/multus-cni:v3.7.1
sriov_cni_img: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.6.1
sriov_network_device_img: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.3.2
nginx_ingress_controller_img: k8s.gcr.io/ingress-nginx/controller:v0.41.2
default_backend_img: k8s.gcr.io/defaultbackend:1.4
cert_manager_acmesolver_img: quay.io/jetstack/cert-manager-acmesolver:v0.15.0
cert_manager_cainjector_img: quay.io/jetstack/cert-manager-cainjector:v0.15.0
cert_manager_controller_img: quay.io/jetstack/cert-manager-controller:v0.15.0
cert_manager_webhook_img: quay.io/jetstack/cert-manager-webhook:v0.15.0
# Keep the snapshot-controller image in sync with the one provided at:
# cluster/addons/volumesnapshots/volume-snapshot-controller/volume-snapshot-controller-deployment.yaml
# in the kubernetes github repo
snapshot_controller_img: quay.io/k8scsi/snapshot-controller:v2.0.0-rc2
rvmc_img: docker.io/starlingx/rvmc:stx.5.0-v1.0.0
pause_img: k8s.gcr.io/pause:3.4.1
flux_helm_controller_img: docker.io/fluxcd/helm-controller:v0.15.0
flux_source_controller_img: docker.io/fluxcd/source-controller:v0.20.1
controller-0:~$
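The mismatch can be checked mechanically by comparing this file against the list of images pulled into the registry. The following is a minimal sketch, assuming the file only contains flat `key: image` lines as shown; the function names are hypothetical, and the sample data is copied from this comment.

```python
def parse_images(text):
    """Parse flat 'key: image' lines, ignoring comments, blanks and '---'."""
    images = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "---":
            continue
        key, _, value = line.partition(": ")
        images[key] = value
    return images


def split_tag(image):
    """Split 'repo:tag' at the last colon."""
    repo, _, tag = image.rpartition(":")
    return repo, tag


def compare(playbook_text, mirrored):
    """Return (stale, missing): tag mismatches and repos absent from the file."""
    wanted = dict(split_tag(m) for m in mirrored)
    listed = {k: split_tag(v) for k, v in parse_images(playbook_text).items()}
    stale = {
        key: (tag, wanted[repo])
        for key, (repo, tag) in listed.items()
        if repo in wanted and wanted[repo] != tag
    }
    listed_repos = {repo for repo, _ in listed.values()}
    missing = sorted(repo for repo in wanted if repo not in listed_repos)
    return stale, missing


PLAYBOOK = """
nginx_ingress_controller_img: k8s.gcr.io/ingress-nginx/controller:v0.41.2
default_backend_img: k8s.gcr.io/defaultbackend:1.4
cert_manager_acmesolver_img: quay.io/jetstack/cert-manager-acmesolver:v0.15.0
cert_manager_cainjector_img: quay.io/jetstack/cert-manager-cainjector:v0.15.0
cert_manager_controller_img: quay.io/jetstack/cert-manager-controller:v0.15.0
cert_manager_webhook_img: quay.io/jetstack/cert-manager-webhook:v0.15.0
"""
MIRRORED = [
    "quay.io/jetstack/cert-manager-cainjector:v1.7.1",
    "quay.io/jetstack/cert-manager-controller:v1.7.1",
    "quay.io/jetstack/cert-manager-webhook:v1.7.1",
    "quay.io/jetstack/cert-manager-ctl:v1.7.1",
    "quay.io/jetstack/cert-manager-acmesolver:v1.7.1",
    "k8s.gcr.io/defaultbackend:1.4",
    "k8s.gcr.io/ingress-nginx/controller:v1.1.1",
    "k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1",
]

stale, missing = compare(PLAYBOOK, MIRRORED)
for key, (old, new) in sorted(stale.items()):
    print(f"{key}: file has {old}, registry has {new}")
print("no entry in file:", missing)
```

For this data the comparison reports the nginx controller and the four cert-manager entries as stale, and finds no entry at all for cert-manager-ctl and kube-webhook-certgen.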

Ghada Khalil (gkhalil) wrote :

@Alex, which build is this output from? Ansible was updated to use the new version of nginx & cert-manager on May 3rd. Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/838591

Alexandru Dimofte (adimofte) wrote :

Hi Ghada, the build is the latest. The review doesn't change the file:
./playbookconfig/src/playbooks/roles/common/load-images-information/vars/k8s-v1.21.8/system-images.yml
so I am not sure it is enough.

If you clone the repository and check the content of that file, you'll see some old versions there, which I guess need to be updated.

For example:
cert_manager_acmesolver_img: quay.io/jetstack/cert-manager-acmesolver:v0.15.0
cert_manager_cainjector_img: quay.io/jetstack/cert-manager-cainjector:v0.15.0
cert_manager_controller_img: quay.io/jetstack/cert-manager-controller:v0.15.0
cert_manager_webhook_img: quay.io/jetstack/cert-manager-webhook:v0.15.0

Jerry Sun (jerry-sun-u) wrote :

@Alex, what does "kubectl get pods --all-namespaces" show? Are the fluxcd pods running? If pods with nginx in their name show up in the output, can you describe them? For example: "kubectl describe pod ic-nginx-ingress-ingress-nginx-controller-jxgsf -n kube-system"

Alexandru Dimofte (adimofte) wrote :

controller-0:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
armada armada-api-5c5884945-b42q4 2/2 Running 0 11h
flux-helm helm-controller-64cdcd69c8-g4sqk 1/1 Running 0 11h
flux-helm source-controller-6d7db457f4-fmtdf 1/1 Running 0 11h
kube-system calico-kube-controllers-7675fdd9d9-6hd7k 1/1 Running 0 11h
kube-system calico-node-7hzpw 1/1 Running 0 11h
kube-system coredns-5f4fcc5c76-mtkp6 1/1 Running 0 11h
kube-system coredns-5f4fcc5c76-zzmct 0/1 Pending 0 11h
kube-system ic-nginx-ingress-ingress-nginx-admission-create-8qkvw 0/1 ImagePullBackOff 0 11h
kube-system kube-apiserver-controller-0 1/1 Running 0 11h
kube-system kube-controller-manager-controller-0 1/1 Running 0 11h
kube-system kube-multus-ds-amd64-rqjqp 1/1 Running 0 11h
kube-system kube-proxy-nkrgk 1/1 Running 0 11h
kube-system kube-scheduler-controller-0 1/1 Running 0 11h
kube-system kube-sriov-cni-ds-amd64-rwvwn 1/1 Running 0 11h
controller-0:~$ kubectl describe pod ic-nginx-ingress-ingress-nginx-admission-create-8qkvw -n kube-system
Name: ic-nginx-ingress-ingress-nginx-admission-create-8qkvw
Namespace: kube-system
Priority: 0
Node: controller-0/192.168.206.2
Start Time: Wed, 11 May 2022 06:01:26 +0000
Labels: app.kubernetes.io/component=admission-webhook
              app.kubernetes.io/instance=ic-ingress-nginx
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/version=1.1.1
              controller-uid=59df6e2d-5d67-40a5-8951-b82a11fb7678
              helm.sh/chart=ingress-nginx-4.0.15
              job-name=ic-nginx-ingress-ingress-nginx-admission-create
Annotations: cni.projectcalico.org/podIP: 172.16.192.71/32
              cni.projectcalico.org/podIPs: 172.16.192.71/32
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "ips": [
                        "172.16.192.71"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "ips": [
                        "172.16.192.71"
                    ],
                    "default": true,
                    "dns": {}
                }]
Status: Pending
IP: 172....
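The pod listing above can be filtered for pods that are not healthy, which points straight at the admission-create job. The following is a minimal sketch, assuming the six-column layout shown (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE) with whitespace-separated fields; the function name `unhealthy_pods` is hypothetical.

```python
def unhealthy_pods(text):
    """Return (namespace, name, status) for pods not Running or Completed."""
    bad = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == "NAMESPACE":
            continue  # skip the header, prompts and short lines
        namespace, name, _ready, status = fields[:4]
        if status not in ("Running", "Completed"):
            bad.append((namespace, name, status))
    return bad


# A subset of the rows from the listing above.
SAMPLE = """
NAMESPACE NAME READY STATUS RESTARTS AGE
flux-helm helm-controller-64cdcd69c8-g4sqk 1/1 Running 0 11h
kube-system coredns-5f4fcc5c76-mtkp6 1/1 Running 0 11h
kube-system coredns-5f4fcc5c76-zzmct 0/1 Pending 0 11h
kube-system ic-nginx-ingress-ingress-nginx-admission-create-8qkvw 0/1 ImagePullBackOff 0 11h
kube-system kube-proxy-nkrgk 1/1 Running 0 11h
"""

for namespace, name, status in unhealthy_pods(SAMPLE):
    print(f"{namespace}/{name}: {status}")
```

The Pending coredns replica is also reported, so the result still needs a human to decide which entries matter for the failure being investigated.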

Jerry Sun (jerry-sun-u) wrote :

@Alex, your test environment is completely self-contained, right? It has no access to the outside world to download images from?

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Ghada Khalil (gkhalil) → nobody
Changed in starlingx:
assignee: nobody → Reinildes Oliveira (rjosemat)

Alexandru Dimofte (adimofte) wrote :

@Jerry, we are using a local registry, and that docker image is present:

sys_stxval@kasstxj:/localdisk/starlingx/watcher/mirror_scripts$ docker images | grep kube-webhook-certgen
k8s.gcr.io/ingress-nginx/kube-webhook-certgen v1.1.1 c41e9fcadf5a 7 months ago 47.7MB
kasstxj.ka.intel.com:5000/ingress-nginx/kube-webhook-certgen v1.1.1 c41e9fcadf5a 7 months ago 47.7MB
sys_stxval@kasstxj:/localdisk/starlingx/watcher/mirror_scripts$
sys_stxval@kasstxj:/localdisk/starlingx/watcher/mirror_scripts$
sys_stxval@kasstxj:/localdisk/starlingx/watcher/mirror_scripts$
sys_stxval@kasstxj:/localdisk/starlingx/watcher/mirror_scripts$ docker image inspect c41e9fcadf5a
[
    {
        "Id": "sha256:c41e9fcadf5a291120de706b7dfa1af598b9f2ed5138b6dcb9f79a68aad0ef4c",
        "RepoTags": [
            "k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1",
            "kasstxj.ka.intel.com:5000/ingress-nginx/kube-webhook-certgen:v1.1.1"
        ],
        "RepoDigests": [
            "k8s.gcr.io/ingress-nginx/kube-webhook-certgen@sha256:64d8c73dca984af206adf9d6d7e46aa550362b1d7a01f3a0a91b20cc67868660",
            "kasstxj.ka.intel.com:5000/ingress-nginx/kube-webhook-certgen@sha256:78351fc9d9b5f835e0809921c029208faeb7fbb6dc2d3b0d1db0a6584195cfed"
        ],
        "Parent": "",
        "Comment": "buildkit.dockerfile.v0",
        "Created": "2021-10-12T17:20:42.65070441Z",
        "Container": "",
        "ContainerConfig": {
            "Hostname": "",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": null,
            "Cmd": null,
            "Image": "",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": null
        },
        "DockerVersion": "",
        "Author": "Bazel",
        "Config": {
            "Hostname": "",
            "Domainname": "",
            "User": "65532:65532",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt"
            ],
            "Cmd": null,
            "Image": "",
            "Volumes": null,
            "WorkingDir": "/",
            "Entrypoint": [
                "/kube-webhook-certgen"
            ],
            "OnBuild": null,
            "Labels": null
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 47736388,
        "VirtualSize": 47736388,
        "GraphDriver": {
            "Data": {
                "LowerDir": "/localdisk/.docker-images/overlay2/18a1fd0dcc4ce94fec78fec270d180b6f63e0ed382445f67b5e8d...

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nginx-ingress-controller-armada-app (master)

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/841789
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/979fbe6f497cca8b7a41cbee983a9a558444d0ab
Submitter: "Zuul (22348)"
Branch: master

commit 979fbe6f497cca8b7a41cbee983a9a558444d0ab
Author: Rei Oliveira <email address hidden>
Date: Fri May 13 15:39:52 2022 -0300

    Add system images for cert-manager and nginx

    Add updated images to load-images-information task so that they can be
    pre downloaded and pushed to registry.local.

    Test cases:

    PASS: Run the ansible bootstrap playbook with success
    PASS: After ansible bootstrap run 'crictl images' and check that images
          are prefixed with registry.local
    PASS: Check that nginx pod is Running after bootstrap completes
    PASS: Check that bootstrap works with kubernetes_version as 1.22.5
          and 1.23.1 as well

    Closes-Bug: 1971981
    Change-Id: I02f4de30eb8d33b3eb32cb49815b7ee77beae89c
    Signed-off-by: Rei Oliveira <email address hidden>
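The second test case in this commit message checks that images are prefixed with registry.local. The following is a minimal sketch of that check, assuming the image references have already been extracted from the `crictl images` output into a list of strings; the function name is hypothetical, and the sample references are taken from earlier in this report.

```python
def not_from_local_registry(image_refs, local_host="registry.local"):
    """Return references whose registry host (port ignored) is not local_host."""
    offenders = []
    for ref in image_refs:
        host_and_port = ref.split("/", 1)[0]
        host = host_and_port.split(":", 1)[0]
        if host != local_host:
            offenders.append(ref)
    return offenders


refs = [
    "registry.local:9001/quay.io/jetstack/cert-manager-cainjector:v0.15.0",
    "k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1",
]
print(not_from_local_registry(refs))
```

An empty result would correspond to the PASS condition described in the test case.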

Changed in starlingx:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix merged to nginx-ingress-controller-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/nginx-ingress-controller-armada-app/+/841919
Committed: https://opendev.org/starlingx/nginx-ingress-controller-armada-app/commit/942a4a2647fbd9c760f8eb57e2232fb0ecb975a0
Submitter: "Zuul (22348)"
Branch: master

commit 942a4a2647fbd9c760f8eb57e2232fb0ecb975a0
Author: Rei Oliveira <email address hidden>
Date: Mon May 16 14:07:25 2022 -0300

    Add overrides locking versions for nginx images

    This commit adds the images and tags for the images used by nginx
    in order for the application framework to download them with sysinv
    during 'system application-apply'

    Test Cases:

    PASS: Built application successfully
    PASS: Application install successful and pods are Running
    PASS: Check that sysinv logs show images being downloaded from
          registry.local

    Closes-Bug: 1971981
    Depends-on: https://review.opendev.org/c/starlingx/ansible-playbooks/+/841789
    Change-Id: I74b7c49ccb4ad87862831cbefcd5a66178b7521a
    Signed-off-by: Rei Oliveira <email address hidden>

Reinildes Oliveira (rjosemat) wrote :

Code fixes are merged. Please retest and confirm whether it fixes your issue.

Alexandru Dimofte (adimofte) wrote :

The StarlingX installation doesn't fail anymore at the Setup stage.
However, I observed a new issue at the Provision stage, so I opened a new bug:
https://bugs.launchpad.net/starlingx/+bug/1973888
