DC subcloud bootstrap failed - nginx-ingress-controller apply failed

Bug #1883791 reported by Nimalini Rasa
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Gerry Kopec

Bug Description

Brief Description
-----------------
Subcloud bootstrap failed with a 50ms delay on the network, with the following error:
fatal: [subcloud41]: FAILED! => {"attempts": 30, "changed": true, "cmd": "source /etc/platform/openrc; system application-show nginx-ingress-controller --column status --format value", "delta": "0:00:01.877380", "end": "2020-06-16 19:42:25.617316", "rc": 0, "start": "2020-06-16 19:42:23.739936", "stderr": "", "stderr_lines": [], "stdout": "apply-failed", "stdout_lines": ["apply-failed"]}
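
For reference, the failing check is a retry loop around `system application-show nginx-ingress-controller --column status --format value`: the output above shows 30 attempts ending with stdout "apply-failed". A minimal Python sketch of that kind of poll follows. It assumes a hypothetical `get_status` callable standing in for the CLI call, and the delay between attempts is a placeholder, since the actual playbook settings are not shown in this report.

```python
import time
from typing import Callable


def wait_for_applied(get_status: Callable[[], str],
                     attempts: int = 30,
                     delay_s: float = 0.0) -> str:
    """Poll get_status() until a terminal status is seen or attempts run out.

    Hypothetical helper for illustration only. Stopping early on
    'apply-failed' is a choice made in this sketch; the task in the
    report used all 30 attempts before failing.
    """
    status = ""
    for _ in range(attempts):
        # The CLI prints the bare status value, possibly with a newline.
        status = get_status().strip()
        if status in ("applied", "apply-failed"):
            break
        time.sleep(delay_s)
    return status
```

Stopping as soon as the status is "apply-failed" would surface this failure without waiting out the remaining retries.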

Severity
--------
Major

Steps to Reproduce
------------------
Add 25 subclouds at the same time with a 50ms delay on the interface used for bootstrap

Expected Behavior
------------------
Bootstrap should be successful

Actual Behavior
----------------
Bootstrap failed for subcloud41 (1 out of 25 subclouds tried at the same time)

Reproducibility
---------------
intermittent

System Configuration
--------------------
Duplex with worker for the system controller; one-node system for the subcloud

Branch/Pull Time/Commit
-----------------------
2020-06-11_20-00-00

Last Pass
---------
N/A

Timestamp/Logs
--------------
2020-06-16 19:01:37.453 3659026 Subcloud41 add started

Test Activity
-------------
System Test

Nimalini Rasa (nrasa) wrote :

From subcloud41:
system application-list
+--------------------------+---------+-----------------------------------+----------------------------------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+--------------+------------------------------------------+
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | apply-failed | operation aborted, check logs for detail |
+--------------------------+---------+-----------------------------------+----------------------------------------+--------------+------------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-5cd4695574-gq9wb 1/1 Running 1 25m
kube-system calico-node-kc7n9 1/1 Running 0 16m
kube-system coredns-7fc965fbd7-dv8gp 1/1 Running 0 25m
kube-system ic-nginx-ingress-controller-xcwfn 1/1 Running 0 17m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-hgdbj 1/1 Running 0 17m
kube-system kube-apiserver-controller-0 1/1 Running 0 25m
kube-system kube-controller-manager-controller-0 1/1 Running 1 25m
kube-system kube-multus-ds-amd64-x5gwt 1/1 Running 0 15m
kube-system kube-proxy-67kbc 1/1 Running 0 25m
kube-system kube-scheduler-controller-0 1/1 Running 1 25m
kube-system kube-sriov-cni-ds-amd64-p7p9n 1/1 Running 0 15m
kube-system tiller-deploy-5c8dd9fb56-zpnpq 1/1 Running 0 24m

Error msg from /var/log/armada/nginx-ingress-controller-apply_2020-06-16-19-32-12.log
get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:547
2020-06-16 19:33:07.861 46 ERROR armada.handlers.armada [-] Chart deploy [nginx-ingress] failed: armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: ic-nginx-ingress - Tiller Message: b'Release "ic-nginx-ingress" failed: etcdserver: request timed out'
2020-06-16 19:33:07.861 46 ERROR armada.handlers.armada Traceback (most recent call last):
2020-06-16 19:33:07.861 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 473, in install_release
2020-06-16 19:33:07.861 46 ERROR armada.handlers.armada metadata=self.metadata)
2020-06-16 19:33:07.861 46 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call...

Nimalini Rasa (nrasa) wrote :
  • Logs (19.2 MiB, application/x-tar)
Nimalini Rasa (nrasa) wrote :
  • Logs (219.3 MiB, application/x-tar)
Yang Liu (yliu12) wrote :

Subcloud24 logs are here: https://files.starlingx.kube.cengn.ca/launchpad/1883791

Here's analysis from Brent Rowsell for subcloud24 bootstrap failure:

Started the nginx apply here

2020-06-16 21:44:08.006 46 DEBUG armada.handlers.tiller [-] Tiller ListReleases() with timeout=300, request=limit: 32
2020-06-16 21:44:08.059 46 INFO armada.handlers.chart_deploy [-] [chart=nginx-ingress]: Installing release ic-nginx-ingress in namespace kube-system, wait=True, timeout=1800s
2020-06-16 21:44:08.063 46 INFO armada.handlers.tiller [-] [chart=nginx-ingress]: Helm install release: wait=True, timeout=1800

A 30 sec timeout here, not sure what this one is.

2020-06-16 21:44:41.643 46 ERROR armada.handlers.tiller [-] [chart=nginx-ingress]: Error while installing release ic-nginx-ingress: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "release ic-nginx-ingress failed: etcdserver: request timed out"
        debug_error_string = "{"created":"@1592343881.643161437","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release ic-nginx-ingress failed: etcdserver: request timed out","grpc_status"
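
The gap between the install request and the error can be read off the two armada timestamps quoted above (21:44:08.063 and 21:44:41.643). A small Python sketch of that arithmetic; the timestamp format string is inferred from the log lines and `elapsed_seconds` is a hypothetical helper name.

```python
from datetime import datetime

# Format inferred from the armada log lines, e.g. "2020-06-16 21:44:08.063".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two armada log timestamps."""
    started = datetime.strptime(start, LOG_TS_FORMAT)
    ended = datetime.strptime(end, LOG_TS_FORMAT)
    return (ended - started).total_seconds()


# Helm install request vs. the "request timed out" error.
gap = elapsed_seconds("2020-06-16 21:44:08.063", "2020-06-16 21:44:41.643")
print(f"{gap:.2f}s")
```

For these two timestamps the gap is about 33.6 seconds, far short of the 1800s install timeout, so the failure came from the etcd request timing out rather than from the install wait expiring.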

helm ls
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
ic-nginx-ingress 1 Tue Jun 16 21:44:08 2020 FAILED nginx-ingress-1.4.0 kube-system

The pods did eventually start

ic-nginx-ingress-controller-8khh8
I0616 21:44:48.238363 6 status.go:148] new leader elected: ic-nginx-ingress-controller-8khh8
I0616 21:44:48.341623 6 controller.go:190] Backend successfully reloaded.
I0616 21:44:48.341652 6 controller.go:200] Initial sync, sleeping for 1 second.
[16/Jun/2020:21:44:49 +0000]TCP200000.000
W0616 21:44:51.809291 6 controller.go:371] Service "kube-system/ic-nginx-ingress-default-backend" does not have any active Endpoint
[16/Jun/2020:21:45:29 +0000]TCP200000.000

There seem to be some extremely slow etcd access times in this timeframe

2020-06-16T21:44:35.770 localhost forward-journal[80404]: info 2020-06-16 21:44:35.770210 W | etcdserver: read-only range request "key:\"/registry/services/specs/kube-system/ic-nginx-ingress-controller\" " with result "error:etcdserver: request timed out" took too long (7.001181456s) to execute
2020-06-16T21:44:35.875 localhost forward-journal[80404]: info 2020-06-16 21:44:35.875665 W | wal: sync duration of 7.500530332s, expected less than 1s
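
When scanning a large log bundle for the same symptom, the two warning shapes above can be matched mechanically. A minimal Python sketch, assuming only the two message formats quoted in this comment and durations reported in seconds; the helper name `slow_etcd_seconds` is hypothetical, and other etcd versions may word or scale these warnings differently.

```python
import re

# Matches the two warning shapes quoted above:
#   "... took too long (7.001181456s) to execute"
#   "wal: sync duration of 7.500530332s, expected less than 1s"
# Durations in other units (for example "ms") are deliberately not matched.
_PATTERNS = (
    re.compile(r"took too long \((?P<secs>\d+(?:\.\d+)?)s\)"),
    re.compile(r"sync duration of (?P<secs>\d+(?:\.\d+)?)s"),
)


def slow_etcd_seconds(line: str):
    """Return the duration in seconds reported by a warning line, or None."""
    for pattern in _PATTERNS:
        match = pattern.search(line)
        if match:
            return float(match.group("secs"))
    return None


def slow_lines(lines, threshold_s: float = 1.0):
    """Yield (seconds, line) for warnings at or above threshold_s."""
    for line in lines:
        secs = slow_etcd_seconds(line)
        if secs is not None and secs >= threshold_s:
            yield secs, line
```

The default threshold of 1 second mirrors the "expected less than 1s" wording in the WAL sync warning; it is a starting point for triage, not a tuned value.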

A manual apply afterwards completed properly.

system application-list
+--------------------------+---------+-----------------------------------+----------------------------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+---------+-----------+
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
+------...

Ghada Khalil (gkhalil) wrote :

Marked for further investigation in stx.5.0 as this appears to be a resource/cpu constraint issue in this setup resulting in etcd timeouts.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0 stx.containers
Changed in starlingx:
assignee: nobody → Gerry Kopec (gerry-kopec)
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Frank Miller (sensfan22) wrote :

Marking this as Fix Released, as this issue was not reproduced in recent tests and a number of commits related to nginx and system performance were made throughout the stx.5.0 timeline.

Changed in starlingx:
status: Triaged → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded