Intermittent failures at tripleo-container-image-prepare at gates with MANIFEST_BLOB_UNKNOWN

Bug #1815576 reported by Quique Llorente on 2019-02-12
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Steve Baker

Bug Description

Gate standalone jobs are starting to fail more often at container preparation, but looking at the logs there is not much information:

standalone scn002

http://logs.openstack.org/78/633278/14/gate/tripleo-ci-centos-7-scenario002-standalone/dca82b9/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz#_2019-02-12_06_27_57

http://logs.openstack.org/78/633278/14/gate/tripleo-ci-centos-7-scenario002-standalone/dca82b9/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz

In the middle of these logs we can find:

{"errors":[{"code":"MANIFEST_BLOB_UNKNOWN","message":"blob unknown to registry","detail":"sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f"}]}

Also from journald http://logs.openstack.org/78/633278/14/gate/tripleo-ci-centos-7-scenario002-standalone/dca82b9/logs/undercloud/var/log/journal.txt.gz#_Feb_12_06_32_06

Feb 12 06:32:06 standalone.localdomain registry[20249]: time="2019-02-12T06:32:06Z" level=error msg="response completed with error" err.code="blob unknown" err.detail=sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f err.message="blob unknown to registry" go.version=go1.9.4 http.request.host="192.168.24.1:8787" http.request.id=8cfa411d-904b-4c94-9025-9f6dbade0a64 http.request.method=HEAD http.request.remoteaddr="192.168.24.1:34328" http.request.uri="/v2/tripleomaster/centos-binary-aodh-notifier/blobs/sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f" http.request.useragent="python-requests/2.19.1" http.response.contenttype="application/json; charset=utf-8" http.response.duration=3.428097ms http.response.status=404 http.response.written=157 instance.id=4071e315-a1b9-4d87-9b9b-11fc81da918b vars.digest="sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f" vars.name="tripleomaster/centos-binary-aodh-notifier" version="v2.6.2+unknown"

standalone scn003

http://logs.openstack.org/17/633617/10/gate/tripleo-ci-centos-7-scenario003-standalone/fae4151/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz#_2019-02-12_07_37_03

http://logs.openstack.org/17/633617/10/gate/tripleo-ci-centos-7-scenario003-standalone/fae4151/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz

Same here

{"errors":[{"code":"MANIFEST_BLOB_UNKNOWN","message":"blob unknown to registry","detail":"sha256:f1d2e3b7d076ab08269367d348980fda70468b45dfd2b82f4a713686f7582563"}]}

description: updated
summary: - Intermittent failures at tripleo-container-image-prepare at gates
+ Intermittent failures at tripleo-container-image-prepare at gates with
+ MANIFEST_BLOB_UNKNOWN
Changed in tripleo:
milestone: none → stein-3
Emilien Macchi (emilienm) wrote :

I think there is a bug in the python uploader, where a blob would be missing:

Feb 12 06:32:06 standalone.localdomain registry[20249]: time="2019-02-12T06:32:06Z" level=error msg="response completed with error" err.code="blob unknown" err.detail=sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f err.message="blob unknown to registry" go.version=go1.9.4 http.request.host="192.168.24.1:8787" http.request.id=8cfa411d-904b-4c94-9025-9f6dbade0a64 http.request.method=HEAD http.request.remoteaddr="192.168.24.1:34328" http.request.uri="/v2/tripleomaster/centos-binary-aodh-notifier/blobs/sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f" http.request.useragent="python-requests/2.19.1" http.response.contenttype="application/json; charset=utf-8" http.response.duration=3.428097ms http.response.status=404 http.response.written=157 instance.id=4071e315-a1b9-4d87-9b9b-11fc81da918b vars.digest="sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f" vars.name="tripleomaster/centos-binary-aodh-notifier" version="v2.6.2+unknown"
Feb 12 06:32:06 standalone.localdomain registry[20249]: 192.168.24.1 - - [12/Feb/2019:06:32:06 +0000] "HEAD /v2/tripleomaster/centos-binary-aodh-notifier/blobs/sha256:e7d3075c4205ccccbd4489430acc058e2958469a08829ae3906e65fd4da1551f HTTP/1.1" 404 157 "" "python-requests/2.19.1"

It returns a 404, which I don't think is treated as a requests.exceptions.RequestException.
If that's the case, _cross_repo_mount won't retry:
https://git.openstack.org/cgit/openstack/tripleo-common/tree/tripleo_common/image/image_uploader.py#n579

It would just move on to the next layer and, at the end, report a failure.

Change abandoned by Quique Llorente (<email address hidden>) on branch: master
Review: https://review.openstack.org/636317
Reason: Red herring, alright

Emilien Macchi (emilienm) wrote :

Error in my last comment: the 404 raises an HTTPError, which inherits from requests.exceptions.RequestException, so the retry would happen.
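[Editor's note] A minimal stand-in for the retry behaviour being discussed (pure stdlib; the decorator and function names here are invented for illustration and are not tripleo-common's actual API). Because HTTPError inherits from RequestException, which in turn inherits from IOError, retrying on the parent class also covers a 404:

```python
import time

def retry(exc_types, attempts=4, delay=0.0):
    """Re-run the wrapped call when it raises one of exc_types,
    re-raising only after the final attempt fails."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except exc_types:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"count": 0}

@retry(IOError, attempts=4)
def fetch_layer():
    # Simulated flaky fetch: the first two attempts drop the
    # connection, the third succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("connection reset")
    return "ok"
```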

Emilien Macchi (emilienm) wrote :

I've been trying to reproduce this in my local env, and so far I have failed. However, when I increase the number of workers uploading the layers, it fails to pull the images.
It makes me wonder whether our multi-threading for layer uploads is safely configured (the value is hardcoded to 4). A potential scenario: a layer fails to upload (flaky network), so later, when cross-repo mounting, it fails to download the blobs and build images after the modify-image role.

Emilien Macchi (emilienm) wrote :

With more debug:

Uploading layer: sha256:4c13f4fa9e39ad08a9478b7e4890e95ec8006e051f0f592e2f619964317187e3
Fetching layer: sha256:4c13f4fa9e39ad08a9478b7e4890e95ec8006e051f0f592e2f619964317187e3
Resetting dropped connection: mirror.bhs1.ovh.openstack.org

then later:

Uploading manifest of type application/vnd.docker.distribution.manifest.v2+json to: http://192.168.24.1:8787/v2/tripleomaster/centos-binary-glance-api/manifests/current-tripleo
http://192.168.24.1:8787 "PUT /v2/tripleomaster/centos-binary-glance-api/manifests/current-tripleo HTTP/1.1" 400 166
{"errors":[{"code":"MANIFEST_BLOB_UNKNOWN","message":"blob unknown to registry","detail":"sha256:4c13f4fa9e39ad08a9478b7e4890e95ec8006e051f0f592e2f619964317187e3"}]}

So it's really a network issue with infra, but our code should retry the fetch and make sure the layer is really uploaded.
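[Editor's note] A hedged sketch of that idea (hypothetical helper names, not the actual image_uploader code): upload the blob, then confirm the registry actually has it, the same check the registry serves as HEAD /v2/&lt;name&gt;/blobs/&lt;digest&gt;, and retry the upload while it is still reported missing:

```python
def ensure_blob_uploaded(upload, blob_exists, digest, attempts=3):
    """Upload a blob, then verify its existence (the equivalent of a
    HEAD blob check); retry the upload when the registry still
    reports the blob missing."""
    for _ in range(attempts):
        upload(digest)
        if blob_exists(digest):
            return True
    raise IOError("blob %s still missing after %d attempts"
                  % (digest, attempts))
```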

Fix proposed to branch: master
Review: https://review.openstack.org/636405

Changed in tripleo:
assignee: Quique Llorente (quiquell) → Emilien Macchi (emilienm)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/636405
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=c3c1f61c9a952fc7806ab8aec379bafd5bd87a94
Submitter: Zuul
Branch: master

commit c3c1f61c9a952fc7806ab8aec379bafd5bd87a94
Author: Emilien Macchi <email address hidden>
Date: Tue Feb 12 13:08:31 2019 -0500

    image_uplaoder: add retry to _layer_stream_registry

    We have seen that uploading a layer can be problematic if the fetch is
    dropped by network issues (like it happens in OpenStack Infra).

    This patch adds a retry to the function that fetch the layer and retry
    if an HTTP exception is raised.

    Change-Id: Ia9716447a7d037aaf32c3b01ab511c08e6598e07
    Closes-Bug: #1815576

Changed in tripleo:
status: In Progress → Fix Released
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
status: Fix Released → In Progress
tags: removed: alert

This issue was fixed in the openstack/tripleo-common 10.4.0 release.

Quique Llorente (quiquell) wrote :

I am starting to see it more often :-( it has been massive at the gates here https://review.openstack.org/#/c/636232/5

tox-py35 SUCCESS in 3m 49s
tripleo-ci-centos-7-scenario001-standalone FAILURE in 1h 00m 55s
tripleo-ci-centos-7-scenario002-standalone FAILURE in 59m 38s
tripleo-ci-centos-7-scenario003-standalone FAILURE in 1h 02m 54s
tripleo-ci-centos-7-scenario004-standalone FAILURE in 1h 01m 07s
openstack-tox-lower-constraints SUCCESS in 9m 52s
openstack-tox-pep8

Emilien Macchi (emilienm) wrote :

It also seems to happen with image-serve (without Docker Registry v2):

http://logs.openstack.org/61/637861/2/check/tripleo-ci-centos-7-scenario001-standalone/d8575f5/logs/undercloud/var/log/httpd/image_serve_access.log.txt.gz

192.168.24.1 - - [20/Feb/2019:00:36:48 +0000] "HEAD /v2/tripleomaster/centos-binary-heat-api/blobs/sha256:b84e07faf72d1581aff50907b7ae59a6a98fefcfacf69714ab83e9248c999251 HTTP/1.1" 404 - "-" "python-requests/2.21.0"
192.168.24.1 - - [20/Feb/2019:00:36:48 +0000] "HEAD /v2/tripleomaster/centos-binary-heat-api/blobs/sha256:36198514f16bc5c8bac0b6c89be92bab49e3999b536ac06ba26f7c823cb5f5c8 HTTP/1.1" 200 - "-" "python-requests/2.21.0"
192.168.24.1 - - [20/Feb/2019:00:36:48 +0000] "HEAD /v2/tripleomaster/centos-binary-heat-api/blobs/sha256:b84e07faf72d1581aff50907b7ae59a6a98fefcfacf69714ab83e9248c999251 HTTP/1.1" 404 - "-" "python-requests/2.21.0"

The blob "b84e07faf72d1581aff50907b7ae59a6a98fefcfacf69714ab83e9248c999251" isn't found.
What is weird is that, in this case, it doesn't show up in the error logs (only in the access log), nor in /var/log/tripleo-container-image-prepare.log.

Cédric Jeanneret (cjeanner) wrote :

So apparently the image centos-binary-heat-api could be pulled, even with the 2 missing hashes you pointed out in the access_log:
Pulling 192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo
Running podman pull --tls-verify=false 192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo
13be152e017149dfb12cec9732b2a09752f03c6d1759786bcc3e460bc9c8c416

Not sure the 404s we get in the access_log are really relevant (especially since the image is fetched without error) - still, it's not nice.

Fix proposed to branch: master
Review: https://review.openstack.org/638266

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Steve Baker (steve-stevebaker)

Reviewed: https://review.openstack.org/638266
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=67a3c4d8b31e6cac256012df65c36dcf3542e129
Submitter: Zuul
Branch: master

commit 67a3c4d8b31e6cac256012df65c36dcf3542e129
Author: Steve Baker <email address hidden>
Date: Thu Feb 21 09:24:05 2019 +1300

    Improve retry behaviour of _copy_layer_registry_to_registry

    There are failures occuring in CI which result in missing layers after
    a registry->registry transfer.

    A job.exception() check is added to ensure the whole image transfer
    fails if a single layer fails.

    _copy_layer_registry_to_registry now retries on IOError. This is
    the parent class of RequestException, and could plausably be raised
    during a layer transfer.

    Change-Id: Ibe9c046ed9172dd0bc276c6400496392b2c69dda
    Partial-Bug: #1815576
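[Editor's note] The job.exception() part of that change can be illustrated with a small concurrent.futures sketch (illustrative names only; this is not the tripleo-common code itself): each layer copy runs as a submitted job, and any job that raised makes the whole image transfer fail instead of leaving a layer silently missing.

```python
import concurrent.futures

def copy_image_layers(layers, copy_one_layer, workers=4):
    # Copy each layer in a worker thread; check job.exception() so one
    # failed layer fails the whole image transfer rather than being
    # silently skipped.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [pool.submit(copy_one_layer, layer) for layer in layers]
        for job in concurrent.futures.as_completed(jobs):
            if job.exception() is not None:
                raise job.exception()
    return len(layers)
```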

Steve Baker (steve-stevebaker) wrote :

Has this issue been witnessed again since the above change landed?

wes hayutin (weshayutin) wrote :

We have been seeing a lot of jobs failing in container image prep. The log is in debug mode, which makes it very large and, to be honest, has made it difficult for me to figure out what's going on.
I have not yet put enough time into rechecking this issue and parsing the logs. I am asking for a few more days; we've been busy as usual :(

tags: added: alert
wes hayutin (weshayutin) wrote :

adding alert just to make sure the ruck/rovers are aware as well

Steve Baker (steve-stevebaker) wrote :

OK, thanks. I'm going to propose a change which does the following:

For registry->registry layer transfers, validate that the calculated digest matches the expected digest and trigger a retry if it doesn't.

Add a final validation check that there is a layer file for every layer entry in the manifest (this would cause that image transfer to fail with a detailed error, with no retries).
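[Editor's note] The digest-validation step can be sketched roughly like this (hypothetical function; tripleo-common's real implementation differs): hash the layer as it streams through and raise IOError on a mismatch, so the existing retry machinery re-fetches the layer:

```python
import hashlib

def validate_layer_stream(chunks, expected_digest):
    # Hash the layer chunks as they are transferred and compare the
    # result with the digest the manifest promises; a mismatch raises
    # IOError to trigger a retry of the transfer.
    sha = hashlib.sha256()
    data = bytearray()
    for chunk in chunks:
        sha.update(chunk)
        data.extend(chunk)
    calculated = "sha256:" + sha.hexdigest()
    if calculated != expected_digest:
        raise IOError("layer digest mismatch: calculated %s, expected %s"
                      % (calculated, expected_digest))
    return bytes(data)
```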

Reviewed: https://review.openstack.org/641856
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=7a5b2903e6d9383c7759df405a8168799ef9d348
Submitter: Zuul
Branch: master

commit 7a5b2903e6d9383c7759df405a8168799ef9d348
Author: Steve Baker <email address hidden>
Date: Fri Mar 8 11:50:06 2019 +1300

    Improve handling of layer transfer errors

    To improve the reliability of registry->registry layer transfers the
    following changes have been made:
    - raise an IOError to trigger a retry when the calculated digest does
      not match the expected digest
    - any error when writing the blob file will remove the file

    This change also ensures the tripleo-container-image-prepare writes
    any exception to the log file, it appears that errors are not being
    logged and the ansible task does not print its output.

    Change-Id: I98f25932a66bb7a9f04cdc1c906223e457fe44b1
    Closes-Bug: #1815576

Changed in tripleo:
status: In Progress → Fix Released
Steve Baker (steve-stevebaker) wrote :

Please reopen and link to the log files if this is seen again

This issue was fixed in the openstack/tripleo-common 10.6.0 release.
