Container image builds in RDOcloud are randomly failing to be pushed to localhost

Bug #1729328 reported by Gabriele Cerami on 2017-11-01
This bug affects 1 person

Affects: tripleo
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

We are seeing errors like this:

INFO:kolla.image.build.swift-account:Trying to push the image
ERROR:kolla.image.build.swift-account:Unknown error when pushing
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 322, in run
    self.push_image(image)
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 342, in push_image
    insecure_registry=True):
  File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 302, in _stream_helper
    data = reader.read(1)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 401, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 307, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.

The error occurs on a different image each time during the container builds job.
This one was on swift-account, but it happens to other images too.

e.g. https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/5b90e48/kolla/logs/swift-account.log

From a first evaluation, it could be a storage performance problem in RDOcloud, but it may also be a problem with the docker registry itself. I've seen no registry errors in the logs.
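For reference, the push path in the traceback goes through docker-py's streaming client. Below is a minimal sketch (not kolla's actual code) of reproducing the same push with a larger client read timeout; the class name (docker.APIClient vs. the older docker.Client), the 600-second value, and the repository name are assumptions for illustration only.

    # Sketch: push an image through docker-py with an increased read timeout.
    # The ReadTimeoutError above comes from the client-side socket read, so
    # raising the timeout is one way to probe whether the registry is just slow.
    import docker

    client = docker.APIClient(
        base_url='unix://var/run/docker.sock',
        timeout=600,  # docker-py default is 60s
    )

    def push_with_logging(repository, tag):
        # stream=True yields one decoded status dict per line, similar to what
        # kolla's push_image() iterates over in build.py
        for line in client.push(repository, tag=tag, stream=True, decode=True):
            if 'error' in line:
                raise RuntimeError(line['error'])
            print(line.get('status', ''))

    # Hypothetical repository name, for illustration only
    push_with_logging('example-registry/centos-binary-swift-account', 'latest')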

Tags: ci
Changed in kolla:
status: New → Confirmed
affects: kolla → tripleo
Changed in tripleo:
assignee: nobody → Gabriele Cerami (gcerami)
importance: Undecided → Critical
tags: added: alert
Attila Darazs (adarazs) wrote :

This docker issue might be related: https://github.com/docker/compose/issues/3927

Changed in tripleo:
status: Confirmed → Triaged
milestone: none → queens-2
Alan Pevec (apevec) wrote :

Local docker is pushing to the remote RDO registry, where we see an I/O bottleneck.

Alfredo Moralejo (amoralej) wrote :

After removing old images from the registry, we haven't hit this issue. The problem was caused by poor performance in the registry, which is heavily impacted by the number of images/layers stored. We are implementing a policy to periodically remove old unused containers from the registry.
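As a rough illustration of such a cleanup policy, the sketch below walks the Docker Registry v2 HTTP API and deletes manifests for tags matching an "old" rule. The registry URL, the is_old() selection rule, and the assumption that the registry runs with deletion enabled (REGISTRY_STORAGE_DELETE_ENABLED=true) are all hypothetical; the actual RDO policy may differ.

    # Hedged sketch of a periodic registry cleanup against a Docker Registry v2 API.
    import requests

    REGISTRY_URL = 'https://registry.example.org'  # hypothetical endpoint
    MANIFEST_V2 = 'application/vnd.docker.distribution.manifest.v2+json'

    def is_old(tag):
        # Placeholder policy: the real criterion (age, promotion status, ...)
        # would be defined by the actual cleanup job.
        return tag.endswith('-old')

    def cleanup():
        repos = requests.get(REGISTRY_URL + '/v2/_catalog').json().get('repositories', [])
        for repo in repos:
            tags = requests.get(REGISTRY_URL + '/v2/%s/tags/list' % repo).json().get('tags') or []
            for tag in tags:
                if not is_old(tag):
                    continue
                # Deletion requires the manifest digest, returned in the
                # Docker-Content-Digest header of a HEAD on the manifest.
                resp = requests.head(
                    '%s/v2/%s/manifests/%s' % (REGISTRY_URL, repo, tag),
                    headers={'Accept': MANIFEST_V2})
                digest = resp.headers.get('Docker-Content-Digest')
                if digest:
                    requests.delete('%s/v2/%s/manifests/%s' % (REGISTRY_URL, repo, digest))

    cleanup()
    # Note: deleting manifests only unreferences layers; reclaiming disk space
    # still requires running the registry's garbage collector afterwards.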

tags: removed: alert promotion-blocker
Alfredo Moralejo (amoralej) wrote :

The following jobs failed with the same issue today (times in UTC):

Nov 6 20:10 periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset006-pike/builds/lastFailedBuild/log
Nov 6 20:10 periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset007-master/builds/lastFailedBuild/log
Nov 6 20:10 periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset007-pike/builds/lastFailedBuild/log
Nov 6 20:10 periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset008-pike/builds/lastFailedBuild/log
Nov 7 09:12 DLRN-rpmbuild-rpm-packaging/builds/lastFailedBuild/log
Nov 7 10:37 gate-tripleo-ci-centos-7-containers-multinode-upgrades-master/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-centos-7-master-containers-build/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset005-master/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset005-pike/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset006-master/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp_1ceph-featureset024-master/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp_1ceph-featureset024-pike/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-pike/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset002-master-upload/builds/lastFailedBuild/log
Nov 7 10:37 periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset002-pike-upload/builds/lastFailedBuild/log
Nov 7 10:37 rdoinfo-tripleo-pike-testing-centos-7-multinode-1ctlr-featureset005-nv/builds/lastFailedBuild/log

Alfredo Moralejo (amoralej) wrote :

We can check failures in the last day using Kibana; see http://bit.ly/2hhdx95

Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Alex Schultz (alex-schultz) wrote :

Closing for now. If this is still an issue, please feel free to reopen the bug.

Changed in tripleo:
status: Triaged → Fix Released