system application-apply fails to reach 192.168.204.2:9001 when a proxy is set

Bug #1813635 reported by Erich Cordoba
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
High
Erich Cordoba

Bug Description

Brief Description
-----------------
When a proxy is configured, system application-apply fails due to timeout trying to reach http://192.168.204.2:9001/v2

Severity
--------
Major

Steps to Reproduce
------------------
Use the steps in the wiki https://wiki.openstack.org/wiki/StarlingX/Containers/Installation
to perform the installation using a proxy.

Run this command as detailed in the wiki:
system application-apply stx-openstack

Expected Behavior
------------------
The stx-openstack application should be applied successfully.

Actual Behavior
----------------
The application failed:

[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------+-----------------+------------------------+--------------+------------------------------------------+
| application | manifest name | manifest file | status | progress |
+---------------+-----------------+------------------------+--------------+------------------------------------------+
| stx-openstack | armada-manifest | manifest-no-tests.yaml | apply-failed | operation aborted, check logs for detail |
+---------------+-----------------+------------------------+--------------+------------------------------------------+

The /var/log/sysinv.log file says:

2019-01-28 17:35:18.272 11851 ERROR sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/gcr.io/google_containers/defaultbackend:1.0 download failed from local registry: 500 Server Error: Internal Server Error ("Get http://192.168.204.2:9001/v2/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)")
2019-01-28 17:35:18.272 11851 ERROR sysinv.conductor.kube_app [-] Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 898, in perform_app_apply
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app self._download_images(app)
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 471, in _download_images
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app reason="failed to download one or more image(s).")
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app KubeAppApplyFailure: Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 17:35:18.272 11851 TRACE sysinv.conductor.kube_app
2019-01-28 17:35:18.276 11851 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.

Reproducibility
---------------
100% under proxy configuration.

System Configuration
--------------------
Simplex containerized

Branch/Pull Time/Commit
-----------------------
Used this image : http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190128T060000Z/outputs/iso/

Timestamp/Logs
--------------

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-lwr56 2/2 Running 2 53m
kube-system coredns-666d7dfb44-cdfkj 1/1 Running 1 53m
kube-system coredns-666d7dfb44-zkcdx 1/1 Running 1 53m
kube-system kube-apiserver-controller-0 1/1 Running 6 52m
kube-system kube-controller-manager-controller-0 1/1 Running 2 52m
kube-system kube-proxy-hvdtz 1/1 Running 1 53m
kube-system kube-scheduler-controller-0 1/1 Running 2 52m
kube-system tiller-deploy-67c9768568-8fkps 1/1 Running 1 50m

Revision history for this message
Erich Cordoba (ericho) wrote :

Attach sysinv log

Revision history for this message
Erich Cordoba (ericho) wrote :

I changed the http-proxy file to look like this:

[Service]
Environment="HTTP_PROXY=http://myproxy" "HTTPS_PROXY=http://myproxy" "NO_PROXY=127.0.0.1,192.168.204.2"

I basically just added 192.168.204.2 to the NO_PROXY variable.

After this change the log shows:

2019-01-28 17:56:21.998 11851 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/starlingx/stx-panko:dev-centos-pike-latest is not available in local registry, download started from public registry

and start to download from external registry.

Revision history for this message
Erich Cordoba (ericho) wrote :

Just wondering, is kube_app.py fails due to timeout to the local registry, it should continue to the external ones or raise an exception?

Ghada Khalil (gkhalil)
tags: added: stx.containers
Revision history for this message
Erich Cordoba (ericho) wrote :

I could, finally, get a success application-apply, however the behaviour is odd.

After adding the 192.168.204.2 IP into the NO_PROXY variable, the download of images started, however, the apply process failed as some images weren't downloaded:

2019-01-28 18:07:56.558 11851 ERROR sysinv.conductor.kube_app [-] Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 898, in perform_app_apply
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app self._download_images(app)
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 471, in _download_images
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app reason="failed to download one or more image(s).")
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app KubeAppApplyFailure: Deployment of application stx-openstack failed: failed to download one or more image(s).
2019-01-28 18:07:56.558 11851 TRACE sysinv.conductor.kube_app
2019-01-28 18:07:56.562 11851 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.
2019-01-28 18:11:21.426 12613 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-01-28-18-11-21 patch
2019-01-28 18:11:21.427 12613 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_end. No changes from mtce/1.0.
2019-01-28 18:14:05.437 11851 INFO sysinv.conductor.kube_app [-] Application (stx-openstack) apply started.

Running again 'system applycation-apply stx-openstack' and without any change, the images were downloaded and the progress reached the 42%:

2019-01-28 18:18:00.475 11851 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-rabbitmq, overall completion: 32.0%
2019-01-28 18:18:35.551 11851 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-keystone, overall completion: 37.0%
2019-01-28 18:19:49.370 11851 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-glance, overall completion: 42.0%

But it failed later:

2019-01-28 18:49:52.400 11851 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.
(There's a lot of log here, so i'm not sure what to paste, the complete file is attached anyway)

Then, running the command again caused a successful application-apply.

+---------------+-----------------+------------------------+---------+-----------+
| application | manifest name | manifest file | status | progress |
+---------------+-----------------+------------------------+---------+-----------+
| stx-openstack | armada-manifest | manifest-no-tests.yaml | applied | completed |
+---------------+-----------------+------------------------+---------+-----------+

Questions:

- It is normal to not get an applied application at the first attempt?
- If the apply fails, which log would be more helpful to debug the issue?
- Are the pod logs useful?

Revision history for this message
Erich Cordoba (ericho) wrote :

This patch could help to fix this problem. So, on any API error from the local registry the download will continue with the external registry.

Revision history for this message
Erich Cordoba (ericho) wrote :

While this patch worked to continue downloading, in a next step is again timing out trying to reach 191.168.204.2. Seems that helm, or other component is trying to use that endpoint. So, the solution for now is to set "NO_PROXY=127.0.0.1,192.168.204.2"

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating since this affects deployments with proxy.

tags: added: stx.2019.05
Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Erich Cordoba (ericho)
Revision history for this message
Frank Miller (sensfan22) wrote :

We believe this issue should be solved by Mingyuan's proxy commit:
https://review.openstack.org/#/c/633146/

Erich please confirm and close this bug.

Revision history for this message
Erich Cordoba (ericho) wrote :

I tried a deployment yesterday and I confirm this is solved.

Changed in starlingx:
status: Triaged → Fix Released
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.