various jobs failing in the gate during container image prepare 404 not found

Bug #1907657 reported by Marios Andreou
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

At [1][2][3][4], various jobs are failing during container image prepare with a trace that looks like:

 2020-12-10 04:24:01,027 75958 DEBUG urllib3.connectionpool [ ] Starting new HTTPS connection (1): 10.4.70.228:5001
...
 2020-12-10 04:24:31,077 76040 DEBUG urllib3.util.retry [ ] Converted retries value: 8 -> Retry(total=8, connect=None, read=None, redirect=None, status=None)
 2020-12-10 04:24:31,077 76044 DEBUG urllib3.connectionpool [ ] http://192.168.24.1:8787 "POST /v2/tripleomaster/openstack-ironic-conductor/blobs/uploads/ HTTP/1.1" 404 196
 2020-12-10 04:24:31,078 76042 DEBUG urllib3.connectionpool [ ] http://192.168.24.1:8787 "POST /v2/tripleomaster/openstack-rabbitmq/blobs/uploads/ HTTP/1.1" 404 196
 2020-12-10 04:24:31,078 76048 DEBUG urllib3.connectionpool [ ] http://192.168.24.1:8787 "POST /v2/tripleomaster/openstack-keystone/blobs/uploads/ HTTP/1.1" 404 196
 2020-12-10 04:24:31,078 76044 INFO tripleo_common.image.image_uploader [ ] Non-2xx: id af3844a13e7f1d36b26a2b703834e730b3938d92, status 404, reason Not Found, text <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
 <html><head>
 <title>404 Not Found</title>
 </head><body>
 <h1>Not Found</h1>
 <p>The requested URL was not found on this server.</p>
 </body></html>

Not sure if this is because the consumer jobs cannot talk to the provider job's container registry, or if it is some other issue. I have seen [5], for example, where container image prepare completes but then the job cannot reach the registry; I don't know if these are related or yet another issue.

[1] https://7a89c7d607922873abcd-a2a9d29289e3c443a1047aebafc6ba1b.ssl.cf2.rackcdn.com/765964/4/gate/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates/f02eecd/logs/undercloud/var/log/tripleo-container-image-prepare.log
[2] http://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_177/764255/4/gate/tripleo-ci-centos-8-containers-multinode/1773ef8/logs/undercloud/var/log/tripleo-container-image-prepare.log
[3] http://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_628/765238/4/check/tripleo-ci-centos-8-containers-multinode/628a2ae/logs/undercloud/var/log/tripleo-container-image-prepare.log
[4] https://c38d0c9156ee6cc9fd3b-d97b0a3b599d6de6d0673faefd2f08b5.ssl.cf1.rackcdn.com/764255/4/gate/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates/1caf2a7/logs/undercloud/var/log/tripleo-container-image-prepare.log
[5] https://5137dc96534522704db9-2ea29c920450eb9e9df454d6b35c9277.ssl.cf1.rackcdn.com/764255/4/gate/tripleo-ci-centos-8-standalone/726aa3e/logs/undercloud/home/zuul/standalone_deploy.log

Revision history for this message
Marios Andreou (marios-b) wrote :

14:05 < chkumar|ruck> marios|rover: it is not related to the cloud
14:06 < marios|rover> chkumar|ruck: no sorry i should have said that on the bug... i initially thought it was vexx only but no i saw at least 2 different clouds
14:06 < chkumar|ruck> marios|rover: I checked all the logs to confirm whether it is linked to a cloud with specific reason but it is not

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

This registry is on the undercloud, which serves containers from httpd.
Logs:
http://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_177/764255/4/gate/tripleo-ci-centos-8-containers-multinode/1773ef8/logs/undercloud/var/log/httpd/image_serve_access.log

192.168.24.1 - - [10/Dec/2020:09:11:11 +0000] "GET /v2/ HTTP/1.1" 200 2 "-" "python-requests/2.22.0"
192.168.24.1 - - [10/Dec/2020:09:11:11 +0000] "POST /v2/tripleomaster/openstack-heat-api/blobs/uploads/ HTTP/1.1" 404 196 "-" "python-requests/2.22.0"
192.168.24.1 - - [10/Dec/2020:09:11:11 +0000] "GET /v2/ HTTP/1.1" 200 2 "-" "python-requests/2.22.0"
192.168.24.1 - - [10/Dec/2020:09:11:11 +0000] "POST /v2/tripleomaster/openstack-keystone/blobs/uploads/ HTTP/1.1" 404 196 "-" "python-requests/2.22.0"
192.168.24.1 - - [10/Dec/2020:09:11:11 +0000] "GET /v2/ HTTP/1.1" 200 2 "-" "python-requests/2.22.0"
192.168.24.1 - - [10/Dec/2020:09:11:11 +0000] "GET /v2/ HTTP/1.1" 200 2 "-" "python-requests/2.22.0"
192.168.24.1 - - [10/Dec/2020:09:11:11 +0000] "POST /v2/tripleomaster/openstack-ironic-pxe/blobs/uploads/ HTTP/1.1" 404 196 "-" "python-requests/2.22.0"
192.168.24.1 - - [10/Dec/2020:09:11:11 +0000] "POST /v2/tripleomaster/openstack-memcached/blobs/uploads/ HTTP/1.1" 404 196 "-" "python-requests/2.22.0"

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The log of setting up this registry:

http://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_177/764255/4/gate/tripleo-ci-centos-8-containers-multinode/1773ef8/logs/undercloud/home/zuul/undercloud_install.log

tripleo_image_serve : ensure apache is installed | undercloud | 0:01:32.023494 | 4.73s
TASK | create image data directory
||CHANGED| create image data directory | undercloud
||TIMING | tripleo_image_serve : create image data directory | undercloud | 0:01:32.254103 | 0.18s
||TASK | create /v2/ response file
||CHANGED| create /v2/ response file | undercloud
||TIMING | tripleo_image_serve : create /v2/ response file | undercloud | 0:01:32.684918 | 0.37s
||TASK | Add listen line
||CHANGED| Add listen line | undercloud
||TIMING | tripleo_image_serve : Add listen line | undercloud | 0:01:33.059195 | 0.32s
||TASK | manage /etc/httpd/conf.d/image-serve.conf
||CHANGED| manage /etc/httpd/conf.d/image-serve.conf | undercloud
||TIMING | tripleo_image_serve : manage /etc/httpd/conf.d/image-serve.conf | undercloud | 0:01:33.483871 | 0.37s
||TASK | Image-Serve | restart httpd
||CHANGED| Image-Serve | restart httpd | undercloud
||TIMING | tripleo_image_serve : Image-Serve | restart httpd | undercloud | 0:01:34.127368 | 0.59s
||TASK | create persistent directories
||CHANGED| create persistent directories | undercloud | item={'mode': '0750', 'path': '/var/log/containers/heat', 'setype': 'container_file_t'}
||TIMING | create persistent directories | undercloud | 0:01:34.372919 | 0.19s
||CHANGED| create persistent directories | undercloud | item={'mode': '0750', 'path': '/var/log/containers/httpd/heat-api', 'setype': 'container_file_t'}
||TIMING | create persistent directories | undercloud | 0:01:34.516726 | 0.33s
||TIMING | create persistent directories | undercloud | 0:01:34.521188 | 0.34s
||TASK | create persistent directories
|| OK | create persistent directories | undercloud | item={'mode': '0750', 'path': '/var/log/containers/heat', 'setype': 'container_file_t'}
||TIMING | create persistent directories | undercloud | 0:01:34.752632 | 0.18s
||TIMING | create persistent directories | undercloud | 0:01:34.755934 | 0.18s
||TASK | enable virt_sandbox_use_netlink for healthcheck
||CHANGED| enable virt_sandbox_use_netlink for healthcheck | undercloud
||TIMING | enable virt_sandbox_use_netlink for healthcheck | undercloud | 0:01:35.919199 | 1.11s
||TASK | create persistent directories
||CHANGED| create persistent directories | undercloud | item={'mode': '0750', 'path': '/var/log/containers/ironic', 'setype': 'container_file_t'}
||TIMING | create persistent directories | undercloud | 0:01:36.161604 | 0.19s
||CHANGED| create persistent directories | undercloud | item={'mode': '0750', 'path': '/var/log/containers/httpd/ironic-api', 'setype': 'container_file_t'}
||TIMING | create persistent directories | undercloud | 0:01:36.316281 | 0.35s
||TIMING | create persistent directories | undercloud | 0:01:36.322200 | 0.35s
||TASK | Ensure /etc/modules-load.d exists
|| OK | Ensure /etc/modules-load.d exists | undercloud
||TIMING | tripleo_module_load : Ensure /etc/modules-load.d exists | undercloud | 0:01:36.556454 | 0.18s
||TASK ...


Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The tripleo_image_serve role in tripleo-ansible prepares the apache config as:
http://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_177/764255/4/gate/tripleo-ci-centos-8-containers-multinode/1773ef8/logs/undercloud/etc/httpd/conf.d/image-serve.conf

 <LocationMatch "^/v2/.*/.*/blobs/sha256:.*$">
     SetEnvIf Request_URI "sha256:(.*)$" digest=sha256:$1
     Header set Docker-Content-Digest "%{digest}e"
     Header set ETag "%{digest}e"
     Header set Cache-Control "max-age=31536000"
     Header set Content-Type "application/octet-stream"
 </LocationMatch>

but the URL in the logs doesn't match the pattern:

- LocationMatch "^/v2/.*/.*/blobs/sha256:.*$"
- POST /v2/tripleomaster/openstack-keystone/blobs/uploads/

That's why we get the 404 errors.
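For illustration, the same regex can be checked against both URLs with Python's re module (a minimal sketch; the pattern and the upload URL are taken from the image-serve.conf and the access log above, the digest is a made-up example):

 import re

 # Pattern from the <LocationMatch> block in image-serve.conf
 blob_pattern = re.compile(r"^/v2/.*/.*/blobs/sha256:.*$")

 # A blob GET with a digest matches the pattern (hypothetical digest)
 print(bool(blob_pattern.match(
     "/v2/tripleomaster/openstack-keystone/blobs/sha256:0123abcd")))   # True

 # The blob upload URL from the access log does not match, hence the 404
 print(bool(blob_pattern.match(
     "/v2/tripleomaster/openstack-keystone/blobs/uploads/")))          # False
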
But later the connection just fails:

 Retrying (Retry(total=7, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa1643f5b38>: Failed to establish a new connection: [Errno 113] No route to host',)': /v2/

Revision history for this message
wes hayutin (weshayutin) wrote :

Fails the gate, so marking as promotion-blocker.

tags: added: promotion-blocker
Revision history for this message
Alex Schultz (alex-schultz) wrote :

It couldn't fetch from the configured content provider.

https://7a89c7d607922873abcd-a2a9d29289e3c443a1047aebafc6ba1b.ssl.cf2.rackcdn.com/765964/4/gate/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates/f02eecd/logs/undercloud/var/log/tripleo-container-image-prepare.log

2020-12-10 04:24:31,079 76048 DEBUG urllib3.util.retry [ ] Converted retries value: 8 -> Retry(total=8, connect=None, read=None, redirect=None, status=None)
2020-12-10 04:24:31,079 76042 DEBUG urllib3.util.retry [ ] Converted retries value: 8 -> Retry(total=8, connect=None, read=None, redirect=None, status=None)
2020-12-10 04:24:31,081 76040 DEBUG urllib3.connectionpool [ ] Starting new HTTPS connection (1): 10.4.70.228:5001
2020-12-10 04:24:31,082 76048 DEBUG urllib3.connectionpool [ ] Starting new HTTPS connection (1): 10.4.70.228:5001
2020-12-10 04:24:31,083 76044 DEBUG urllib3.connectionpool [ ] Starting new HTTPS connection (1): 10.4.70.228:5001
2020-12-10 04:24:31,084 76042 DEBUG urllib3.connectionpool [ ] Starting new HTTPS connection (1): 10.4.70.228:5001
2020-12-10 04:25:01,085 76042 DEBUG urllib3.util.retry [ ] Incremented Retry for (url='/v2/'): Retry(total=7, connect=None, read=None, redirect=None, status=None)
2020-12-10 04:25:01,097 76042 WARNING urllib3.connectionpool [ ] Retrying (Retry(total=7, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7ffbe7c68390>, 'Connection to 10.4.70.228 timed out. (connect timeout=30)')': /v2/
2020-12-10 04:25:01,098 76042 DEBUG urllib3.connectionpool [ ] Starting new HTTPS connection (2): 10.4.70.228:5001

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

OK, so the child job got the wrong IP for the container registry.

Content provider is running on 217.182.141.2:
https://6891a4bea265420bbf27-75977e1da0623c0b07e405a8a7f1a45b.ssl.cf5.rackcdn.com/765964/4/gate/tripleo-ci-centos-8-content-provider/9527caf/job-output.txt

Running podman registry and repository on 217.182.141.21 for branch master and DLRN tag no tag

Child job tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates gets from zuul:
    provider_job_branch: master
    registry_ip_address_branch:
      master: 10.4.70.228
https://7a89c7d607922873abcd-a2a9d29289e3c443a1047aebafc6ba1b.ssl.cf2.rackcdn.com/765964/4/gate/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates/f02eecd/zuul-info/inventory.yaml

So zuul passed the wrong IP address to the child job. The same issue appears in the other jobs in this bug: the wrong IP is passed to the child job.
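For illustration, the consumer job effectively builds its registry endpoint from the stale inventory value, roughly like this (a hypothetical sketch, not the actual TripleO CI code; the variable names come from the inventory linked above, the port from the prepare log):

 # Values zuul passed to the consumer job (from inventory.yaml above)
 provider_job_branch = "master"
 registry_ip_address_branch = {"master": "10.4.70.228"}   # stale IP

 # The consumer points container image prepare at this endpoint...
 registry = "{}:5001".format(registry_ip_address_branch[provider_job_branch])
 print(registry)   # 10.4.70.228:5001

 # ...but the retried content provider is actually listening on a different
 # address, so every HTTPS connection to this registry times out.
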

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

After talks with the Zuul and Infra folks:
this is caused by job retries, which in this case were triggered by frequent zuul executor restarts.
When a job is retried, zuul still passes the old data to consumers, while the provider has a new IP.
Zuul has a patch to solve this: https://review.opendev.org/c/zuul/zuul/+/711002

wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released