Intermittent failures at tripleo-container-image-prepare at gates with MANIFEST_BLOB_UNKNOWN

Bug #1815576 reported by Quique Llorente on 2019-02-12
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Steve Baker

Bug Description

Gate standalone jobs are starting to fail more often at container preparation, but looking at the logs there is not much information:

standalone scn002

http://logs.openstack.org/78/633278/14/gate/tripleo-ci-centos-7-scenario002-standalone/dca82b9/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz#_2019-02-12_06_27_57

http://logs.openstack.org/78/633278/14/gate/tripleo-ci-centos-7-scenario002-standalone/dca82b9/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz

In the middle of these logs we can find:

{"errors":[{"code":"MANIFEST_BLOB_UNKNOWN","message":"blob unknown to registry","detail":"sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f"}]}

Also from journald http://logs.openstack.org/78/633278/14/gate/tripleo-ci-centos-7-scenario002-standalone/dca82b9/logs/undercloud/var/log/journal.txt.gz#_Feb_12_06_32_06

Feb 12 06:32:06 standalone.localdomain registry[20249]: time="2019-02-12T06:32:06Z" level=error msg="response completed with error" err.code="blob unknown" err.detail=sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f err.message="blob unknown to registry" go.version=go1.9.4 http.request.host="192.168.24.1:8787" http.request.id=8cfa411d-904b-4c94-9025-9f6dbade0a64 http.request.method=HEAD http.request.remoteaddr="192.168.24.1:34328" http.request.uri="/v2/tripleomaster/centos-binary-aodh-notifier/blobs/sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f" http.request.useragent="python-requests/2.19.1" http.response.contenttype="application/json; charset=utf-8" http.response.duration=3.428097ms http.response.status=404 http.response.written=157 instance.id=4071e315-a1b9-4d87-9b9b-11fc81da918b vars.digest="sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f" vars.name="tripleomaster/centos-binary-aodh-notifier" version="v2.6.2+unknown"

standalone scn003

http://logs.openstack.org/17/633617/10/gate/tripleo-ci-centos-7-scenario003-standalone/fae4151/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz#_2019-02-12_07_37_03

http://logs.openstack.org/17/633617/10/gate/tripleo-ci-centos-7-scenario003-standalone/fae4151/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz

Same here

{"errors":[{"code":"MANIFEST_BLOB_UNKNOWN","message":"blob unknown to registry","detail":"sha256:f1d2e3b7d076ab08269367d348980fda70468b45dfd2b82f4a713686f7582563"}]}

description: updated
summary: - Intermittent failures at tripleo-container-image-prepare at gates
+ Intermittent failures at tripleo-container-image-prepare at gates with
+ MANIFEST_BLOB_UNKNOWN
Changed in tripleo:
milestone: none → stein-3
Emilien Macchi (emilienm) wrote :

I think there is a bug in the python uploader, where a blob would be missing:

Feb 12 06:32:06 standalone.localdomain registry[20249]: time="2019-02-12T06:32:06Z" level=error msg="response completed with error" err.code="blob unknown" err.detail=sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f err.message="blob unknown to registry" go.version=go1.9.4 http.request.host="192.168.24.1:8787" http.request.id=8cfa411d-904b-4c94-9025-9f6dbade0a64 http.request.method=HEAD http.request.remoteaddr="192.168.24.1:34328" http.request.uri="/v2/tripleomaster/centos-binary-aodh-notifier/blobs/sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f" http.request.useragent="python-requests/2.19.1" http.response.contenttype="application/json; charset=utf-8" http.response.duration=3.428097ms http.response.status=404 http.response.written=157 instance.id=4071e315-a1b9-4d87-9b9b-11fc81da918b vars.digest="sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f" vars.name="tripleomaster/centos-binary-aodh-notifier" version="v2.6.2+unknown"
Feb 12 06:32:06 standalone.localdomain registry[20249]: 192.168.24.1 - - [12/Feb/2019:06:32:06 +0000] "HEAD /v2/tripleomaster/centos-binary-aodh-notifier/blobs/sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f HTTP/1.1" 404 157 "" "python-requests/2.19.1"

It returns a 404, which I don't think is treated as a requests.exceptions.RequestException.
If that's the case, _cross_repo_mount won't retry:
https://git.openstack.org/cgit/openstack/tripleo-common/tree/tripleo_common/image/image_uploader.py#n579

It would just move on to the next layer and, at the end, report a failure.

Change abandoned by Quique Llorente (<email address hidden>) on branch: master
Review: https://review.openstack.org/636317
Reason: Red herring, alright

Emilien Macchi (emilienm) wrote :

Error in my last comment: the 404 raises an HTTPError, which inherits from requests.exceptions.RequestException, so the retry would happen.
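[Editor's note] A minimal stand-in for the retry behaviour being discussed (pure stdlib; the decorator and function names here are invented for illustration and are not tripleo-common's actual API). Because HTTPError inherits from RequestException, which in turn inherits from IOError, retrying on the parent class also covers a 404:

```python
import time

def retry(exc_types, attempts=4, delay=0.0):
    """Re-run the wrapped call when it raises one of exc_types,
    re-raising only after the final attempt fails."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except exc_types:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"count": 0}

@retry(IOError, attempts=4)
def fetch_layer():
    # Simulated flaky fetch: the first two attempts drop the
    # connection, the third succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("connection reset")
    return "ok"
```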

Emilien Macchi (emilienm) wrote :

I've been trying to reproduce this in my local env, and so far I have failed. However, when I increase the number of workers uploading the layers, it fails to pull the images.
It makes me wonder whether our multi-threading for layer uploads is safely configured (the value is hardcoded to 4). A potential scenario: a layer fails to upload (flaky network), so later, when cross-repo mounting, it fails to download the blobs and build images after the modify-image role.

Emilien Macchi (emilienm) wrote :

With more debug:

Uploading layer: sha256:4c13f4fa9e39ad08a9478b7e4890e95ec8006e051f0f592e2f619964317187e3
Fetching layer: sha256:4c13f4fa9e39ad08a9478b7e4890e95ec8006e051f0f592e2f619964317187e3
Resetting dropped connection: mirror.bhs1.ovh.openstack.org

then later:

Uploading manifest of type application/vnd.docker.distribution.manifest.v2+json to: http://192.168.24.1:8787/v2/tripleomaster/centos-binary-glance-api/manifests/current-tripleo
http://192.168.24.1:8787 "PUT /v2/tripleomaster/centos-binary-glance-api/manifests/current-tripleo HTTP/1.1" 400 166
{"errors":[{"code":"MANIFEST_BLOB_UNKNOWN","message":"blob unknown to registry","detail":"sha256:4c13f4fa9e39ad08a9478b7e4890e95ec8006e051f0f592e2f619964317187e3"}]}

So it's really a network issue with infra, but our code should retry the fetch and make sure the layer is really uploaded.
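[Editor's note] A hedged sketch of that idea (hypothetical helper names, not the actual image_uploader code): upload the blob, then confirm the registry actually has it, the same check the registry serves as HEAD /v2/&lt;name&gt;/blobs/&lt;digest&gt;, and retry the upload while it is still reported missing:

```python
def ensure_blob_uploaded(upload, blob_exists, digest, attempts=3):
    """Upload a blob, then verify its existence (the equivalent of a
    HEAD blob check); retry the upload when the registry still
    reports the blob missing."""
    for _ in range(attempts):
        upload(digest)
        if blob_exists(digest):
            return True
    raise IOError("blob %s still missing after %d attempts"
                  % (digest, attempts))
```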

Fix proposed to branch: master
Review: https://review.openstack.org/636405

Changed in tripleo:
assignee: Quique Llorente (quiquell) → Emilien Macchi (emilienm)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/636405
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=c3c1f61c9a952fc7806ab8aec379bafd5bd87a94
Submitter: Zuul
Branch: master

commit c3c1f61c9a952fc7806ab8aec379bafd5bd87a94
Author: Emilien Macchi <email address hidden>
Date: Tue Feb 12 13:08:31 2019 -0500

    image_uplaoder: add retry to _layer_stream_registry

    We have seen that uploading a layer can be problematic if the fetch is
    dropped by network issues (like it happens in OpenStack Infra).

    This patch adds a retry to the function that fetch the layer and retry
    if an HTTP exception is raised.

    Change-Id: Ia9716447a7d037aaf32c3b01ab511c08e6598e07
    Closes-Bug: #1815576

Changed in tripleo:
status: In Progress → Fix Released
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
status: Fix Released → In Progress
tags: removed: alert

This issue was fixed in the openstack/tripleo-common 10.4.0 release.

Quique Llorente (quiquell) wrote :

I am starting to see it more often :-( it has been massive at the gates here https://review.openstack.org/#/c/636232/5

tox-py35 SUCCESS in 3m 49s
tripleo-ci-centos-7-scenario001-standalone FAILURE in 1h 00m 55s
tripleo-ci-centos-7-scenario002-standalone FAILURE in 59m 38s
tripleo-ci-centos-7-scenario003-standalone FAILURE in 1h 02m 54s
tripleo-ci-centos-7-scenario004-standalone FAILURE in 1h 01m 07s
openstack-tox-lower-constraints SUCCESS in 9m 52s
openstack-tox-pep8

Emilien Macchi (emilienm) wrote :

It also seems to happen with image-serve (without Docker Registry v2):

http://logs.openstack.org/61/637861/2/check/tripleo-ci-centos-7-scenario001-standalone/d8575f5/logs/undercloud/var/log/httpd/image_serve_access.log.txt.gz

192.168.24.1 - - [20/Feb/2019:00:36:48 +0000] "HEAD /v2/tripleomaster/centos-binary-heat-api/blobs/sha256:b84e07faf72d1581aff50907b7ae59a6a98fefcfacf69714ab83e9248c999251 HTTP/1.1" 404 - "-" "python-requests/2.21.0"
192.168.24.1 - - [20/Feb/2019:00:36:48 +0000] "HEAD /v2/tripleomaster/centos-binary-heat-api/blobs/sha256:36198514f16bc5c8bac0b6c89be92bab49e3999b536ac06ba26f7c823cb5f5c8 HTTP/1.1" 200 - "-" "python-requests/2.21.0"
192.168.24.1 - - [20/Feb/2019:00:36:48 +0000] "HEAD /v2/tripleomaster/centos-binary-heat-api/blobs/sha256:b84e07faf72d1581aff50907b7ae59a6a98fefcfacf69714ab83e9248c999251 HTTP/1.1" 404 - "-" "python-requests/2.21.0"

The blob "b84e07faf72d1581aff50907b7ae59a6a98fefcfacf69714ab83e9248c999251" isn't found.
What is weird is that, in this case, it doesn't show up in the error logs (only in the access log), nor in /var/log/tripleo-container-image-prepare.log.

Cédric Jeanneret (cjeanner) wrote :

So apparently the image centos-binary-heat-api could be pulled, even with the 2 missing hashes you pointed out in the access_log:
Pulling 192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo
Running podman pull --tls-verify=false 192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo
13be152e017149dfb12cec9732b2a09752f03c6d1759786bcc3e460bc9c8c416

Not sure the 404s we get in the access_log are really relevant (especially since the image is fetched without error) - still, it's not nice.

Fix proposed to branch: master
Review: https://review.openstack.org/638266

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Steve Baker (steve-stevebaker)

Reviewed: https://review.openstack.org/638266
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=67a3c4d8b31e6cac256012df65c36dcf3542e129
Submitter: Zuul
Branch: master

commit 67a3c4d8b31e6cac256012df65c36dcf3542e129
Author: Steve Baker <email address hidden>
Date: Thu Feb 21 09:24:05 2019 +1300

    Improve retry behaviour of _copy_layer_registry_to_registry

    There are failures occuring in CI which result in missing layers after
    a registry->registry transfer.

    A job.exception() check is added to ensure the whole image transfer
    fails if a single layer fails.

    _copy_layer_registry_to_registry now retries on IOError. This is
    the parent class of RequestException, and could plausably be raised
    during a layer transfer.

    Change-Id: Ibe9c046ed9172dd0bc276c6400496392b2c69dda
    Partial-Bug: #1815576
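[Editor's note] The job.exception() part of that change can be illustrated with a small concurrent.futures sketch (illustrative names only; this is not the tripleo-common code itself): each layer copy runs as a submitted job, and any job that raised makes the whole image transfer fail instead of leaving a layer silently missing.

```python
import concurrent.futures

def copy_image_layers(layers, copy_one_layer, workers=4):
    # Copy each layer in a worker thread; check job.exception() so one
    # failed layer fails the whole image transfer rather than being
    # silently skipped.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [pool.submit(copy_one_layer, layer) for layer in layers]
        for job in concurrent.futures.as_completed(jobs):
            if job.exception() is not None:
                raise job.exception()
    return len(layers)
```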

Steve Baker (steve-stevebaker) wrote :

Has this issue been witnessed again since the above change landed?

wes hayutin (weshayutin) wrote :

We have been seeing a lot of jobs failing in container image prep. The log is in debug mode, which makes it very large and, to be honest, has made it difficult for me to figure out what's going on.
I have not yet put enough time into rechecking this issue and parsing the logs. I am asking for a few more days; we've been busy as usual :(

tags: added: alert
wes hayutin (weshayutin) wrote :

adding alert just to make sure the ruck/rovers are aware as well

Steve Baker (steve-stevebaker) wrote :

OK, thanks. I'm going to propose a change which does the following:

For registry->registry layer transfers, validate that the calculated digest matches the expected digest and trigger a retry if it doesn't.

Add a final validation check that there is a layer file for every layer entry in the manifest (this would cause that image transfer to fail with a detailed error, with no retries).
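[Editor's note] The digest-validation step can be sketched roughly like this (hypothetical function; tripleo-common's real implementation differs): hash the layer as it streams through and raise IOError on a mismatch, so the existing retry machinery re-fetches the layer:

```python
import hashlib

def validate_layer_stream(chunks, expected_digest):
    # Hash the layer chunks as they are transferred and compare the
    # result with the digest the manifest promises; a mismatch raises
    # IOError to trigger a retry of the transfer.
    sha = hashlib.sha256()
    data = bytearray()
    for chunk in chunks:
        sha.update(chunk)
        data.extend(chunk)
    calculated = "sha256:" + sha.hexdigest()
    if calculated != expected_digest:
        raise IOError("layer digest mismatch: calculated %s, expected %s"
                      % (calculated, expected_digest))
    return bytes(data)
```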

Reviewed: https://review.openstack.org/641856
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=7a5b2903e6d9383c7759df405a8168799ef9d348
Submitter: Zuul
Branch: master

commit 7a5b2903e6d9383c7759df405a8168799ef9d348
Author: Steve Baker <email address hidden>
Date: Fri Mar 8 11:50:06 2019 +1300

    Improve handling of layer transfer errors

    To improve the reliability of registry->registry layer transfers the
    following changes have been made:
    - raise an IOError to trigger a retry when the calculated digest does
      not match the expected digest
    - any error when writing the blob file will remove the file

    This change also ensures the tripleo-container-image-prepare writes
    any exception to the log file, it appears that errors are not being
    logged and the ansible task does not print its output.

    Change-Id: I98f25932a66bb7a9f04cdc1c906223e457fe44b1
    Closes-Bug: #1815576

Changed in tripleo:
status: In Progress → Fix Released
Steve Baker (steve-stevebaker) wrote :

Please reopen and link to the log files if this is seen again

This issue was fixed in the openstack/tripleo-common 10.6.0 release.
