Ansible reports no issues when critical image is missing

Bug #1831664 reported by Ghada Khalil
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Tee Ngo

Bug Description

Brief Description
-----------------
This is a follow-up on https://bugs.launchpad.net/starlingx/+bug/1831485
In this report, a private registry was used for the install. The registry was missing the calico images (e.g. calico/cni:v3.6.2). As a result, the system did not come up and rebooted due to a kernel panic (this is being investigated separately).

2019-06-03T08:35:09.774 localhost kubelet[103166]: info E0603 08:35:09.774413 103166 kuberuntime_manager.go:719] init container start failed: ImagePullBackOff: Back-off pulling image "192.168.100.60/calico/cni:v3.6.2"

However, no errors were reported from Ansible.

This new bug is opened to investigate if the Ansible apply success criteria can be updated to detect this kind of issue.

From Christopher Lemus - https://bugs.launchpad.net/starlingx/+bug/1831485/comments/18
- With config_controller, if an image was missing, the execution of config_controller failed. The failure was relatively easy to identify because the list of missing images was in puppet.log and looked like a pre-check. Is it possible to have that pre-check in Ansible? In all cases, Ansible reported that the playbook completed without errors. (A sketch of such a pre-check follows below.)
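A hedged sketch of the kind of pre-check suggested above, assuming an Ansible task, a Docker Registry v2 endpoint, and anonymous pull access; the variables private_registry and required_images are illustrative names, not part of the actual bootstrap playbook:

    # Illustrative only - not the merged fix. Query the registry's v2 API
    # for each required image manifest and fail early if any is missing.
    # A registry that requires authentication would also need auth headers.
    - name: Verify required images exist in the private registry
      uri:
        url: "https://{{ private_registry }}/v2/{{ item.name }}/manifests/{{ item.tag }}"
        method: HEAD
        headers:
          Accept: "application/vnd.docker.distribution.manifest.v2+json"
        status_code: 200          # any other status fails the task
        validate_certs: no        # typical for a lab-local registry
      loop: "{{ required_images }}"
      # e.g. required_images:
      #   - { name: "calico/cni",  tag: "v3.6.2" }
      #   - { name: "calico/node", tag: "v3.6.2" }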

Severity
--------
Medium - Ansible reports success even though critical images fail to download

Steps to Reproduce
------------------
- Install controller-0
- Update the calico manifest to point to a non-existent image (see the sketch after this list)
- Run Ansible
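
For the manifest edit above, one hypothetical way to do it as an Ansible task (the replace module is standard Ansible; the manifest path is an assumption and varies by load):

    # Illustrative only: retag calico/cni to an image that does not exist
    # in the registry. Adjust the path to wherever the load keeps the
    # calico manifest.
    - name: Point calico/cni at a non-existent image tag
      replace:
        path: /etc/kubernetes/calico.yaml    # hypothetical path
        regexp: 'calico/cni:v3\.6\.2'
        replace: 'calico/cni:v0.0.0-missing'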

Expected Behavior
------------------
Ansible reports an error

Actual Behavior
----------------
Ansible reported success, but the system rebooted shortly after due to the missing image

Reproducibility
---------------
100%

System Configuration
--------------------
All

Branch/Pull Time/Commit
-----------------------
Reported on 20190602T233000Z, but likely reproducible on any recent load

Last Pass
---------
N/A - this particular scenario was not attempted previously

Timestamp/Logs
--------------
See https://bugs.launchpad.net/starlingx/+bug/1831485 for logs/timestamps

Test Activity
-------------
Follow-up on sanity issue

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
importance: Undecided → Medium
tags: added: stx.config
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating since critical deployment issues should be flagged by Ansible whenever possible. Medium priority as this deals with failure conditions.

tags: added: stx.2.0
Changed in starlingx:
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/664764

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/664764
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=97181aa756854800d40db3f6099ec31541b47a88
Submitter: Zuul
Branch: master

commit 97181aa756854800d40db3f6099ec31541b47a88
Author: Tee Ngo <email address hidden>
Date: Tue Jun 11 22:31:11 2019 -0400

    Check kube-system pods health before exiting

    Aside from kubeadm init, all kubectl apply commands to deploy
    k8s networking and Helm services are carried out asynchronously.
    Therefore, it is necessary to wait for kube-system pods to reach
    ready state and perform a final check for pods health before
    exiting, as a success response from a kubectl apply task is not
    an indication of a successful deployment. One or more pods could
    fail to come up due to a bad image, an image download error,
    a configuration issue, etc. during the deployment of these
    services.

    Additionally, commit ab595415aa02e1010a710c4d4b9170f1c7a04ab2
    to address LP https://bugs.launchpad.net/bugs/1822880 - Two coredns
    pods in one node system - is also ported to the playbook in this
    commit.

    Tests:
       - Locally bootstrap and bring up a standard system.
       - Remotely bootstrap, replay the bootstrap with new config, and
         bring up a simplex system.

    Closes-Bug: 1831664
    Change-Id: I542ec530eaec684436b26e614a24f78f1f2c36a6
    Signed-off-by: Tee Ngo <email address hidden>
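
The general shape of such a final health check, sketched as an Ansible task; the task name, retry counts, and KUBECONFIG path are assumptions, not the exact task from the merged review:

    # Sketch only: poll kube-system until no pod remains in a state other
    # than Running or Completed, and fail the play if that never happens.
    - name: Fail if kube-system pods do not come up healthy
      shell: >-
        kubectl get pods --namespace kube-system --no-headers |
        awk '$3 != "Running" && $3 != "Completed" {print $1}'
      environment:
        KUBECONFIG: /etc/kubernetes/admin.conf
      register: unready_pods
      retries: 10          # roughly 60 seconds total with the delay below
      delay: 6
      until: unready_pods.stdout == ""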

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/665353

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/665353
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=9568d970f174349aab9b17dcd1673b13aa9ebce2
Submitter: Zuul
Branch: master

commit 9568d970f174349aab9b17dcd1673b13aa9ebce2
Author: Tee Ngo <email address hidden>
Date: Thu Jun 13 22:46:03 2019 -0400

    Add pods wait time to initial bootstrap play

    In the latest loads, which include a kernel update among other
    code changes to various StarlingX repos, it is observed that
    not all kube-system pods are started before the host becomes
    online, whereas they consistently were in the same slow lab
    on an older load. As a result, the bootstrap playbook often
    fails in this slow lab toward the end, where it verifies
    kube-system pods readiness.

    This commit is a follow-up of commit
    97181aa756854800d40db3f6099ec31541b47a88. In this commit, a
    30-second pause is applied to the initial play to ensure all
    pods have been started before executing the task that waits
    for them to become ready. The total wait time for replay
    remains unchanged at 60 seconds.

    Tests:
      Play and replay the bootstrap playbook locally on slow
      hardware.

    Closes-Bug: 1831664
    Change-Id: I525c7771eafad2b9e79dd89e985696fb16bb5b24
    Signed-off-by: Tee Ngo <email address hidden>
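
The pause described above, expressed as a hedged sketch; the replayed flag used to distinguish an initial play from a replay is a hypothetical name:

    # Illustrative sketch of the 30-second initial-play delay; the real
    # playbook's variable and flag names may differ.
    - name: Give kube-system pods time to be spawned before the readiness check
      pause:
        seconds: 30
      when: not replayed    # hypothetical flag: skip the extra pause on replay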

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/665756

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/665756
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=9f549af9ec0774771bffe4d5b3e681e92bf15112
Submitter: Zuul
Branch: master

commit 9f549af9ec0774771bffe4d5b3e681e92bf15112
Author: Tee Ngo <email address hidden>
Date: Mon Jun 17 15:56:56 2019 -0400

    Increase pods wait time

    In this commit, the default wait time for all kube-system
    pods to be started is set to 120 seconds to account for
    reasonable slowness across hardware types.

    Closes-Bug: 1831664
    Change-Id: I6a53cd7cb4e7c1db344bf0c475a084d2844ceca2
    Signed-off-by: Tee Ngo <email address hidden>
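
One way such a configurable wait might look in an Ansible role, assuming a hypothetical pods_wait_time default; the actual variable name and task in the playbook may differ:

    # Hypothetical default (e.g. a role defaults file)
    pods_wait_time: 120     # seconds allowed for kube-system pods to start

    # Readiness-wait task reusing the default: poll every 10 seconds until
    # the budget above is exhausted.
    - name: Wait for kube-system pods to be started
      command: kubectl get pods --namespace kube-system --no-headers
      environment:
        KUBECONFIG: /etc/kubernetes/admin.conf
      register: pods
      retries: "{{ (pods_wait_time | int) // 10 }}"
      delay: 10
      until: pods.stdout_lines | length > 0   # at least the first pods are listed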
