Multiple Local registry: 500 Server Error cause application-apply errors

Bug #1839696 reported by Cristopher Lemus
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Abraham Arce
Milestone: none listed

Bug Description

Brief Description
-----------------
During application-apply of platform-integ-apps or stx-openstack, the local registry fails and returns 500 errors, causing application-apply to fail.

Severity
--------
Minor/Major: The system is usable, but the install takes considerably longer (from ~30 minutes to ~90 minutes) because each failure restarts the application-apply process.

Steps to Reproduce
------------------
Follow the install procedure from the wiki/docs. During application-apply of either platform-integ-apps or stx-openstack, this error can be observed.

Expected Behavior
------------------
Docker images are successfully downloaded from Local registry.

Actual Behavior
----------------
Local registry fails to fulfill requests and returns 500 errors. System is not able to download the images required to complete the apply.

Reproducibility
---------------
100% reproducible. Eventually, after two or three applies, all images are downloaded and application-apply completes.

System Configuration
--------------------
Observed on all configs: Simplex, Duplex, Standard, Standard-External, Baremetal.

Branch/Pull Time/Commit
-----------------------
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190809T053000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="207"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-09 05:30:00 +0000"

Last Pass
---------
These errors started with the build from 08/06. Logs from the 08/05 build confirm that no 500 errors were detected at that point:

http://paste.openstack.org/show/755716/

Timestamp/Logs
--------------
This was observed on all 4 configs. Here are the errors from the attached collect of a Standard (2+2) system:

http://paste.openstack.org/show/755717/

A collect is attached. This is what we have on our sanity logs from robot framework:

20190809 14:38:07.782 - INFO - +------ START KW: SSHLibrary.Write [ ${cmd} ]
system application-list|grep platform-integ-apps|awk '{print $10}'
20190809 14:38:07.793 - INFO - +------ END KW: SSHLibrary.Write (11)
20190809 14:38:07.793 - INFO - +------ START KW: SSHLibrary.Read Until Prompt [ ]
20190809 14:38:08.879 - INFO - apply-failed
--
20190809 15:24:08.618 - INFO - +------- START KW: SSHLibrary.Write [ ${cmd} ]
system application-list|grep stx-openstack|awk '{print $10}'
20190809 15:24:08.629 - INFO - +------- END KW: SSHLibrary.Write (11)
20190809 15:24:08.629 - INFO - +------- START KW: SSHLibrary.Read Until Prompt [ ]
20190809 15:24:09.757 - INFO - apply-failed

As you can see, both platform-integ-apps and stx-openstack failed to apply. This was observed on all 4 configurations. Our sanity suite automatically retries the application-apply, and eventually the apply succeeds.
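
The retry behavior described here can be sketched roughly as follows. This is a minimal illustration, not the sanity suite's actual code: `apply_with_retries`, `start_apply`, and `get_status` are hypothetical names, and the `applied` status string is an assumption (only `apply-failed` appears in the logs above).

```python
import time


def apply_with_retries(start_apply, get_status, max_attempts=3,
                       poll_interval=0.0, max_polls=100):
    """Re-run an apply until it succeeds or the attempts run out.

    start_apply: hypothetical callable that kicks off application-apply.
    get_status:  hypothetical callable returning the current status string.
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        start_apply()
        for _ in range(max_polls):
            status = get_status()
            if status == "applied":        # assumed success status
                return attempt
            if status == "apply-failed":   # status seen in the sanity logs
                break                      # give up on this attempt and retry
            time.sleep(poll_interval)
    return None
```

With two or three attempts needed per application, as reported, the return value would be 2 or 3 rather than 1.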

Test Activity
-------------
Sanity.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :
description: updated
Frank Miller (sensfan22) wrote :

Marking as high priority as application apply should pass and the apply should not take 60-90 minutes.
Assigning to Yong to identify a prime to lead the investigation.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
status: New → Triaged
assignee: nobody → yong hu (yhu6)
importance: Undecided → High
yong hu (yhu6) wrote :

@Cristopher, just to confirm you were using our internal registry: http://edgehost01.sh.intel.com:5000, weren't you?

Cristopher Lemus (cjlemusc) wrote :

Hi Yong,

No, we are using a mirror registry on GDC. However, the errors are NOT related to the mirror registry. They come from registry.local:9001, for example:

2019-08-12 12:58:54.569 109261 ERROR sysinv.conductor.kube_app [-] Image registry.local:9001/quay.io/external_storage/rbd-provisioner:v2.1.1-k8s1.11 download failed from local registry: 500 Server Error: Internal Server Error ("Get https://registry.local:9001/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)")
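
To count how often this error occurs in a sysinv log, lines like the one above can be filtered with a small parser. This is a sketch under assumptions: the pattern is modeled only on the single line quoted here, and `failed_pulls` is a hypothetical helper name.

```python
import re

# Pattern modeled on the single log line quoted above; other sysinv
# log formats may differ.
FAILED_PULL = re.compile(
    r"Image (?P<image>\S+) download failed from local registry: "
    r"(?P<status>\d{3}) Server Error"
)


def failed_pulls(lines):
    """Return (image, status_code) pairs for failed local-registry pulls."""
    results = []
    for line in lines:
        match = FAILED_PULL.search(line)
        if match:
            results.append((match.group("image"), int(match.group("status"))))
    return results
```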

If you script some other action besides a docker pull, e.g. `docker login registry.local:9001`, you will notice that registry.local (1) is slow and (2) sometimes fails.
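
One way to script such a check is sketched below. Note that it substitutes a plain TCP connect for `docker login`, which exercises name resolution and connection setup but not authentication. `probe` and `connect_to_registry` are hypothetical names, and the timeout value is an assumption.

```python
import socket
import time


def probe(action, attempts=10):
    """Call `action` repeatedly, recording duration and outcome of each call.

    `action` is any zero-argument callable that raises on failure.
    Returns a list of (seconds, ok) tuples.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            action()
            ok = True
        except Exception:
            ok = False
        results.append((time.monotonic() - start, ok))
    return results


def connect_to_registry(host="registry.local", port=9001, timeout=10):
    """Open and immediately close a TCP connection to the local registry."""
    with socket.create_connection((host, port), timeout=timeout):
        pass
```

Running `probe(connect_to_registry)` on the controller would then show both symptoms as large durations and `False` outcomes.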

yong hu (yhu6) wrote :

@Abraham, please have someone debug this issue with @Cristopher in the same GDC lab.

I checked the attached controller-0_20190809.183544.tar; it was using an internal registry in GDC:

    "insecure-registries" : [ "192.168.100.60" ]

so when the issue is seen, you may want to check the status of this local registry server.

Changed in starlingx:
assignee: yong hu (yhu6) → nobody
assignee: nobody → Abraham Arce (xe1gyq)
Erich Cordoba (ericho) wrote :

I noticed a similar behavior when debugging https://bugs.launchpad.net/starlingx/+bug/1817958.

If you try a docker pull from registry.local, the connection is very slow. In that case the DNS requests were taking a long time to complete, and adding the registry.local hostname to /etc/hosts solved the issue.

Anyway, we need to see why the DNS is taking that long (if this is the issue here).
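
To check whether name resolution is the slow step, the lookup can be timed on its own. This is a minimal sketch: `time_lookup` is a hypothetical helper, and it goes through the system resolver via `socket.getaddrinfo`, which on a typical Linux setup consults /etc/hosts before DNS.

```python
import socket
import time


def time_lookup(hostname, port=None):
    """Return (seconds, addresses) for one name resolution of `hostname`."""
    start = time.monotonic()
    infos = socket.getaddrinfo(hostname, port)
    elapsed = time.monotonic() - start
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses
```

Comparing `time_lookup("registry.local")` before and after adding the /etc/hosts entry should show whether the delay comes from DNS.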

Cristopher Lemus (cjlemusc) wrote :

The issue was caused by our local DNS. In summary, the main controller sent requests to our local DNS, which was taking too long to respond, causing the 500 errors.

This DNS is on our infrastructure; it is an external DNS from StarlingX's point of view. Once we restarted it and it returned to normal operation, requests to registry.local responded faster.

Today, during sanity, the application-apply time was back to normal: less than 35 minutes.

Abraham Arce (xe1gyq)
Changed in starlingx:
status: Triaged → Invalid