StarlingX

system application-apply fails to reach 192.168.204.2:9001 when a proxy is set

Bug #1813635 reported by Erich Cordoba on 2019-01-28

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	Erich Cordoba

Bug Description

Brief Description
-----------------
When a proxy is configured, system application-apply fails due to timeout trying to reach http://192.168.204.2:9001/v2

Severity
--------
Major

Steps to Reproduce
------------------
Use the steps in the wiki https://wiki.openstack.org/wiki/StarlingX/Containers/Installation
to perform the installation using a proxy.

Run this command as detailed in the wiki:
system application-apply stx-openstack

Expected Behavior
------------------
The stx-openstack application should be applied successfully.

Actual Behavior
----------------
The application failed:

[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------+-----------------+------------------------+--------------+------------------------------------------+
| application | manifest name | manifest file | status | progress |
+---------------+-----------------+------------------------+--------------+------------------------------------------+
| stx-openstack | armada-manifest | manifest-no-tests.yaml | apply-failed | operation aborted, check logs for detail |
+---------------+-----------------+------------------------+--------------+------------------------------------------+

The /var/log/sysinv.log file says:

2019-01-28 17:35:18.272 11851 ERROR sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/gcr.io/google_containers/defaultbackend:1.0 download failed from local registry: 500 Server Error: Internal Server Error ("Get http://192.168.204.2:9001/v2/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)")
2019-01-28 17:35:18.272 11851 ERROR sysinv.conductor.kube_app [-] Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 898, in perform_app_apply
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app self._download_images(app)
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 471, in _download_images
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app reason="failed to download one or more image(s).")
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app KubeAppApplyFailure: Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app
2019-01-28 17:35:18.276 11851 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.

Reproducibility
---------------
100% under proxy configuration.

System Configuration
--------------------
Simplex containerized

Branch/Pull Time/Commit
-----------------------
Used this image : http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190128T060000Z/outputs/iso/

Timestamp/Logs
--------------

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-lwr56 2/2 Running 2 53m
kube-system coredns-666d7dfb44-cdfkj 1/1 Running 1 53m
kube-system coredns-666d7dfb44-zkcdx 1/1 Running 1 53m
kube-system kube-apiserver-controller-0 1/1 Running 6 52m
kube-system kube-controller-manager-controller-0 1/1 Running 2 52m
kube-system kube-proxy-hvdtz 1/1 Running 1 53m
kube-system kube-scheduler-controller-0 1/1 Running 2 52m
kube-system tiller-deploy-67c9768568-8fkps 1/1 Running 1 50m

Tags:

Revision history for this message

Erich Cordoba (ericho) wrote on 2019-01-28:

sysinv.log Edit (248.0 KiB, text/plain)

Attach sysinv log

Revision history for this message

Erich Cordoba (ericho) wrote on 2019-01-28:

I changed the http-proxy file to look like this:

[Service]
Environment="HTTP_PROXY=http://myproxy" "HTTPS_PROXY=http://myproxy" "NO_PROXY=127.0.0.1,192.168.204.2"

I basically just added 192.168.204.2 to the NO_PROXY variable.

After this change the log shows:

2019-01-28 17:56:21.998 11851 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/starlingx/stx-panko:dev-centos-pike-latest is not available in local registry, download started from public registry

and start to download from external registry.

Revision history for this message

Erich Cordoba (ericho) wrote on 2019-01-28:

Just wondering, is kube_app.py fails due to timeout to the local registry, it should continue to the external ones or raise an exception?

Ghada Khalil (gkhalil) on 2019-01-28

tags:

added: stx.containers

Revision history for this message

Erich Cordoba (ericho) wrote on 2019-01-28:

I could, finally, get a success application-apply, however the behaviour is odd.

After adding the 192.168.204.2 IP into the NO_PROXY variable, the download of images started, however, the apply process failed as some images weren't downloaded:

2019-01-28 18:07:56.558 11851 ERROR sysinv.conductor.kube_app [-] Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 898, in perform_app_apply
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app self._download_images(app)
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 471, in _download_images
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app reason="failed to download one or more image(s).")
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app KubeAppApplyFailure: Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app
2019-01-28 18:07:56.562 11851 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.
2019-01-28 18:11:21.426 12613 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-01-28-18-11-21 patch
2019-01-28 18:11:21.427 12613 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_end. No changes from mtce/1.0.
2019-01-28 18:14:05.437 11851 INFO sysinv.conductor.kube_app [-] Application (stx-openstack) apply started.

Running again 'system applycation-apply stx-openstack' and without any change, the images were downloaded and the progress reached the 42%:

2019-01-28 18:18:00.475 11851 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-rabbitmq, overall completion: 32.0%
2019-01-28 18:18:35.551 11851 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-keystone, overall completion: 37.0%
2019-01-28 18:19:49.370 11851 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-glance, overall completion: 42.0%

But it failed later:

2019-01-28 18:49:52.400 11851 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.
(There's a lot of log here, so i'm not sure what to paste, the complete file is attached anyway)

Then, running the command again caused a successful application-apply.

Questions:

- It is normal to not get an applied application at the first attempt?
- If the apply fails, which log would be more helpful to debug the issue?
- Are the pod logs useful?

I could, finally, get a success application-apply, however the behaviour is odd.

After adding the 192.168.204.2 IP into the NO_PROXY variable, the download of images started, however, the apply process failed as some images weren't downloaded:

2019-01-28 18:07:56.558 11851 ERROR sysinv.conductor.kube_app [-] Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 898, in perform_app_apply
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app     self._download_images(app)
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 471, in _download_images
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app     reason="failed to download one or more image(s).")
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app KubeAppApplyFailure: Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app
2019-01-28 18:07:56.562 11851 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.
2019-01-28 18:11:21.426 12613 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-01-28-18-11-21 patch
2019-01-28 18:11:21.427 12613 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_end.  No changes from mtce/1.0.
2019-01-28 18:14:05.437 11851 INFO sysinv.conductor.kube_app [-] Application (stx-openstack) apply started.

Running again 'system applycation-apply stx-openstack' and without any change, the images were downloaded and the progress reached the 42%:

But it failed later:

2019-01-28 18:49:52.400 11851 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.
(There's a lot of log here, so i'm not sure what to paste, the complete file is attached anyway)

Then, running the command again caused a successful application-apply.

Questions:

- It is normal to not get an applied application at the first attempt?
- If the apply fails, which log would be more helpful to debug the issue?
- Are the pod logs useful?

Revision history for this message

Erich Cordoba (ericho) wrote on 2019-01-29:

timeout-fix.diff Edit (780 bytes, text/plain)

This patch could help to fix this problem. So, on any API error from the local registry the download will continue with the external registry.

Revision history for this message

Erich Cordoba (ericho) wrote on 2019-01-30:

While this patch worked to continue downloading, in a next step is again timing out trying to reach 191.168.204.2. Seems that helm, or other component is trying to use that endpoint. So, the solution for now is to set "NO_PROXY=127.0.0.1,192.168.204.2"

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-01-30:

Marking as release gating since this affects deployments with proxy.

tags:	added: stx.2019.05
Changed in starlingx:
status:	New → Triaged
importance:	Undecided → High
assignee:	nobody → Erich Cordoba (ericho)