Ansible-playbook failed by nginx-ingress-controller / cert-mgr application in uploading status

Bug #1878684 reported by Peng Peng
Affects: StarlingX
Status: Invalid
Importance: Low
Assigned to: Iago Regiani

Bug Description

Brief Description

-----------------

On a regular system, running ansible-playbook to configure the system failed during bootstrap: the nginx-ingress-controller application remained in the uploading status.

Severity

--------

Major

Steps to Reproduce

------------------

 run ansible-playbook

Expected Behavior
------------------

ansible-playbook runs through successfully

Actual Behavior

----------------

ansible-playbook failed

Reproducibility

---------------

Intermittent; passed on replay

System Configuration

--------------------

Multi-node system

Lab-name: WCP_71-75

Branch/Pull Time/Commit

-----------------------

Load: 2020-05-13_20-00-00

Last Pass

---------

 Load: 2020-05-12_20-00-00

Timestamp/Logs

--------------

[2020-05-14 06:23:47,507] 140 INFO MainThread telnet.send :: Send: ansible-playbook lab-install-playbook.yaml -e "@local-install-overrides.yaml"
[2020-05-14 06:37:59,534] 3837 ERROR MainThread install_helper.controller_system_config:: ansible-playbook lab-install-playbook.yaml e "@local

E TASK [bootstrap/bringup-bootstrap-applications : Wait until application is in the uploaded state] ***
E FAILED - RETRYING: Wait until application is in the uploaded state (3 retries left).
E FAILED - RETRYING: Wait until application is in the uploaded state (2 retries left).
E FAILED - RETRYING: Wait until application is in the uploaded state (1 retries left).
E fatal: [localhost]: FAILED! => {"attempts": 3, "changed": true, "cmd": "source /etc/platform/openrc; system application-show nginx-ingress-controller --column status --format value", "delta": "0:00:01.877368", "end": "2020-05-14 06:37:57.656087", "rc": 0, "start": "2020-05-14 06:37:55.778719", "stderr": "", "stderr_lines": [], "stdout": "uploading", "stdout_lines": ["uploading"]}
E
E PLAY RECAP *********************************************************************
E localhost : ok=312 changed=175 unreachable=0 failed=1
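
For readers unfamiliar with this failure mode, the log above shows a poll-with-retries pattern: the status command itself returns rc 0 each time, but the reported status never becomes "uploaded" before the retries run out. The following is a minimal sketch of that pattern, not the actual bootstrap task. The names wait_for_status, get_status and delay_s are hypothetical, and the status source is simulated rather than calling the system CLI.

```python
import time
from typing import Callable, List, Tuple


def wait_for_status(get_status: Callable[[], str],
                    wanted: str = "uploaded",
                    retries: int = 3,
                    delay_s: float = 0.0,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[bool, List[str]]:
    """Poll get_status up to `retries` times.

    Returns (succeeded, statuses observed on each attempt).
    """
    seen: List[str] = []
    for attempt in range(retries):
        status = get_status().strip()
        seen.append(status)
        if status == wanted:
            return True, seen
        # Only sleep when another attempt will follow.
        if attempt < retries - 1:
            sleep(delay_s)
    return False, seen


# Simulated status source (hypothetical): the application never leaves
# "uploading", which mirrors the stdout value reported in the log.
ok, seen = wait_for_status(lambda: "uploading", retries=3)
print(ok, seen)
```

With a status source that stays in "uploading", all attempts are used up and the call reports failure even though every individual status query succeeded.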

Test Activity

-------------

config controller

Peng Peng (ppeng) wrote :
tags: added: stx.retestneeded
description: updated
Ghada Khalil (gkhalil) wrote :

@Peng, can you monitor for a re-occurrence and leave the system in this state when it happens for further investigation? This type of issue is hard to investigate after the fact.

Please contact <email address hidden> to investigate live when the issue is reproduced.

Changed in starlingx:
status: New → Incomplete
assignee: nobody → Peng Peng (ppeng)
Sabeel Ansari (sansariwr) wrote :

Observed some AMQP connection errors.

Interestingly, the log reports that the upload was successful, reports AMQP errors at the same timestamp, and then moves on. Later in the day, around 14:02:10, it goes through the process of uploading and applying nginx & cert-manager again (no AMQP error messages at this point), but later on sysinv.kube_app purges it from the system (at 15:54:28).

Also noticed a Kubernetes API error earlier on (see second line below); not sure if that is causing issues.

Timeline:
- 06:33:41 – Kubernetes is not configured log statement
- 06:37:36 – nginx chart upload (successfully)
- 06:37:41 – AMQP connection errors begin
- 06:38:46 – nginx manifest gets deleted & uploaded again (successfully)

- 14:02:10 – nginx & cert-manager upload/apply again (successfully)
- (no AMQP error message noted this time)
- 15:54:28 – sysinv purges nginx app from system
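
To make the spacing between these events easier to see, here is a small sketch that computes the gap in seconds between consecutive timeline entries. It assumes all entries fall on the same day (the comment says "later in the day"). The labels are shortened copies of the entries above, and the names EVENTS and gaps are hypothetical.

```python
from datetime import datetime

# Times copied from the timeline above; the entries carry no date,
# so all events are assumed to be on the same day.
EVENTS = [
    ("06:33:41", "Kubernetes is not configured log statement"),
    ("06:37:36", "nginx chart upload"),
    ("06:37:41", "AMQP connection errors begin"),
    ("06:38:46", "nginx manifest deleted and uploaded again"),
    ("14:02:10", "nginx and cert-manager upload/apply again"),
    ("15:54:28", "sysinv purges nginx app"),
]


def gaps(events):
    """Return (label, seconds since the previous event) for each event after the first."""
    times = [datetime.strptime(t, "%H:%M:%S") for t, _ in events]
    return [
        (events[i][1], int((times[i] - times[i - 1]).total_seconds()))
        for i in range(1, len(events))
    ]


for label, secs in gaps(EVENTS):
    print(f"+{secs:>6}s  {label}")
```

The AMQP errors start 5 seconds after the first upload, while the purge comes almost two hours after the second upload/apply.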

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
assignee: Peng Peng (ppeng) → Sabeel Ansari (sansariwr)
tags: added: stx.4.0 stx.apps stx.containers
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Incomplete → In Progress
Ghada Khalil (gkhalil) wrote :

The same issue was also seen with the cert-mgr application which is started together with nginx as part of the ansible bootstrap. I've updated the title accordingly.

summary: - Ansible-playbook failed by nginx-ingress-controller application in
- uploading status
+ Ansible-playbook failed by nginx-ingress-controller / cert-mgr
+ application in uploading status
Ghada Khalil (gkhalil)
tags: added: stx.security
Ghada Khalil (gkhalil) wrote :

Moving to stx.5.0 as this is an intermittent issue and is not seen often. A retry usually works. If this becomes a bigger problem, we can re-consider porting the fix to stx.4.0 at a later date.

tags: added: stx.5.0
removed: stx.4.0
Difu Hu (difuhu) wrote :

The issue was reproduced on 2020-08-25_20-00-00.

fatal: [localhost]: FAILED! => {"attempts": 3, "changed": true, "cmd": "source /etc/platform/openrc; system application-show nginx-ingress-controller --column status --format value", "delta": "0:00:02.761150", "end": "2020-08-26 03:13:53.962515", "rc": 0, "start": "2020-08-26 03:13:51.201365", "stderr": "", "stderr_lines": [], "stdout": "uploading", "stdout_lines": ["uploading"]}
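
When comparing reproductions, it can help to pull the fields out of the fatal line rather than reading it by eye. The sketch below assumes the text after "FAILED! =>" is valid JSON, as it is in the line quoted above; that may not hold for every Ansible output format. The function name parse_fatal is hypothetical.

```python
import json


def parse_fatal(line: str) -> dict:
    """Extract the JSON result that follows 'FAILED! =>' in an Ansible fatal line."""
    marker = "FAILED! =>"
    _, sep, payload = line.partition(marker)
    if not sep:
        raise ValueError("not an Ansible 'fatal: ... FAILED! =>' line")
    return json.loads(payload)


# The fatal line from the 2020-08-25_20-00-00 reproduction, copied from above.
LINE = (
    'fatal: [localhost]: FAILED! => {"attempts": 3, "changed": true, '
    '"cmd": "source /etc/platform/openrc; system application-show '
    'nginx-ingress-controller --column status --format value", '
    '"delta": "0:00:02.761150", "end": "2020-08-26 03:13:53.962515", "rc": 0, '
    '"start": "2020-08-26 03:13:51.201365", "stderr": "", "stderr_lines": [], '
    '"stdout": "uploading", "stdout_lines": ["uploading"]}'
)

result = parse_fatal(LINE)
# rc is 0, so the status query itself worked; the reported status is the problem.
print(result["attempts"], result["rc"], result["stdout"])
```

The same helper works on the fatal line from the original report, which makes it straightforward to confirm that both occurrences ended with rc 0 and a status of "uploading".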

Ghada Khalil (gkhalil) wrote :

Lowering the priority to Medium as the issue is fairly intermittent

Changed in starlingx:
importance: High → Medium
Ghada Khalil (gkhalil) wrote :

Lowering the priority further given there are no recent reports of this issue

Changed in starlingx:
assignee: Sabeel Ansari (sansariwr) → Iago Regiani (iregiani)
importance: Medium → Low
tags: removed: stx.5.0 stx.retestneeded
Ghada Khalil (gkhalil) wrote :

From Iago:
Tested this using an ISO built this week (2021-03-25_10-37-56) and was unable to reproduce it after around 15 attempts.

Note: However, I was able to hit another issue on bootstrap, where it times out waiting for the application to reach the applied state. This seems to occur under low-bandwidth conditions, since the wait is not long enough to download all the docker images; even after the failure, if enough time is given, the application reaches the applied state. This happened using VirtualBox, but it shouldn't happen in practice since there are minimum bandwidth guidelines to be followed.

Changed in starlingx:
status: In Progress → Invalid
Ghada Khalil (gkhalil) wrote :

Closing as the issue was reported a long time ago (almost 1 year ago) and is no longer reproducible with recent loads.
