periodic: container build jobs are failing when pushing to rdo registry (500, 504, read timeout)

Bug #1771634 reported by Matt Young
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Quique Llorente

Bug Description

The master container build job is failing when pushing images to the RDO registry (500, 504, read timeouts).

DLRN hash under test: c75527260bf212b21ce57998f1f0c668e4c53059_d032039d

# failing job (most recent as of 5/16)
https://review.rdoproject.org/jenkins/job/periodic-tripleo-centos-7-master-containers-build/907/consoleFull

12:55:08 TASK [rdo-kolla-build : Build and push images]
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/6e68284/console.txt.gz#_2018-05-16_12_55_08_965

# Individual container build logs:
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/6e68284/kolla/logs/

---

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/6e68284/kolla/kolla-build.conf

https://trunk.rdoproject.org/centos7/c7/55/c75527260bf212b21ce57998f1f0c668e4c53059_d032039d/delorean.repo
https://trunk.rdoproject.org/centos7-master/delorean-deps.repo
tag = tripleo-ci-testing
template_override = /usr/share/openstack-tripleo-common-containers/container-images/tripleo_kolla_template_overrides.j2
template_override = /tmp/kolla/template-overrides.j2

---

ERROR:kolla.common.utils.neutron-l3-agent:received unexpected HTTP status: 504 Gateway Time-out
ERROR:kolla.common.utils.octavia-housekeeping:received unexpected HTTP status: 500 Internal Server Error

---

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/6e68284/kolla/logs/neutron-l3-agent.log

INFO:kolla.common.utils.neutron-l3-agent:Successfully built 21412a799e99
INFO:kolla.common.utils.neutron-l3-agent:Built
INFO:kolla.common.utils.neutron-l3-agent:Trying to push the image
ERROR:kolla.common.utils.neutron-l3-agent:received unexpected HTTP status: 504 Gateway Time-out
INFO:kolla.common.utils.neutron-l3-agent:Trying to push the image
INFO:kolla.common.utils.neutron-l3-agent:Trying to push the image
INFO:kolla.common.utils.neutron-l3-agent:Trying to push the image

---

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/6e68284/kolla/logs/octavia-housekeeping.log

INFO:kolla.common.utils.octavia-housekeeping:Successfully built d3aab0bf7d4b
INFO:kolla.common.utils.octavia-housekeeping:Built
INFO:kolla.common.utils.octavia-housekeeping:Trying to push the image
ERROR:kolla.common.utils.octavia-housekeeping:received unexpected HTTP status: 500 Internal Server Error
INFO:kolla.common.utils.octavia-housekeeping:Trying to push the image
INFO:kolla.common.utils.octavia-housekeeping:Trying to push the image
INFO:kolla.common.utils.octavia-housekeeping:Trying to push the image

Matt Young (halcyondude)
Changed in tripleo:
assignee: nobody → Matt Young (halcyondude)
Matt Young (halcyondude)
description: updated
description: updated
tags: added: alert
Revision history for this message
Matt Young (halcyondude) wrote :

https://review.rdoproject.org/jenkins/job/periodic-tripleo-centos-7-master-containers-build/908 is spinning now, watching that.

Also looking back through previous builds from last night/this morning to determine the frequency distribution of errors.

Note: it's curious that we're getting both 504 and 500...

Working hypothesis: we're running 16 push threads in parallel and could be swamping the registry, but this seems unlikely. More data will help here...
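
If the swamping hypothesis holds, one mitigation would be to cap push concurrency below the build thread count. A minimal sketch of the idea, assuming a hypothetical push_fn callable and an arbitrary cap of 4 (this is not kolla's actual implementation):

# Minimal sketch (not kolla's actual code): cap concurrent registry pushes
# with a semaphore so only a few uploads hit the registry at once.
import threading

PUSH_SLOTS = threading.BoundedSemaphore(4)  # hypothetical cap; tune as needed

def push_throttled(push_fn, image_name):
    """Run push_fn(image_name) while holding one of the limited push slots."""
    with PUSH_SLOTS:
        return push_fn(image_name)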

Revision history for this message
Matt Young (halcyondude) wrote :

Build 906 failed on openstack-base with a 504:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/92453a6/kolla/logs/openstack-base.log

INFO:kolla.common.utils.openstack-base:Built
INFO:kolla.common.utils.openstack-base:Trying to push the image
ERROR:kolla.common.utils.openstack-base:received unexpected HTTP status: 504 Gateway Time-out
INFO:kolla.common.utils.openstack-base:Trying to push the image
INFO:kolla.common.utils.openstack-base:Trying to push the image
INFO:kolla.common.utils.openstack-base:Trying to push the image

Revision history for this message
Matt Young (halcyondude) wrote :

Build 905 failed on glance-api and haproxy:

---

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/551e606/kolla/logs/haproxy.log

INFO:kolla.common.utils.haproxy:Trying to push the image
ERROR:kolla.common.utils.haproxy:Unknown error when pushing
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 323, in run
    self.push_image(image)
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 349, in push_image
    for response in self.dc.push(image.canonical_name, **kwargs):
  File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 302, in _stream_helper
    data = reader.read(1)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 401, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 307, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
INFO:kolla.common.utils.haproxy:Trying to push the image
ERROR:kolla.common.utils.haproxy:Unknown error when pushing
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 323, in run
    self.push_image(image)
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 349, in push_image
    for response in self.dc.push(image.canonical_name, **kwargs):
  File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 302, in _stream_helper
    data = reader.read(1)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 401, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 307, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
INFO:kolla.common.utils.haproxy:Trying to push the image
INFO:kolla.common.utils.haproxy:Trying to push the image

---

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/551e606/kolla/logs/glance-api.log

INFO:kolla.common.utils.glance-api:Built
INFO:kolla.common.utils.glance-api:Trying to push the image
ERROR:kolla.common.utils.glance-api:Unknown error when pushing
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 323, in run
    self.push_image(image)
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 349, in push_image
    for response in self.dc.push(image.canonical_name, **kwargs):
  File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 302, in _stream_helper
    data = reader.read(1)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/resp...


Revision history for this message
Matt Young (halcyondude) wrote :

Build 904 failed on collectd and nova-compute:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/2be2998/kolla/logs/collectd.log

INFO:kolla.common.utils.collectd:Successfully built ffa7d025da5c
INFO:kolla.common.utils.collectd:Built
INFO:kolla.common.utils.collectd:Trying to push the image
ERROR:kolla.common.utils.collectd:received unexpected HTTP status: 500 Internal Server Error
INFO:kolla.common.utils.collectd:Trying to push the image
INFO:kolla.common.utils.collectd:Trying to push the image

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/2be2998/kolla/logs/nova-compute.log

INFO:kolla.common.utils.nova-compute:Successfully built e9e0a2bf611e
INFO:kolla.common.utils.nova-compute:Built
INFO:kolla.common.utils.nova-compute:Trying to push the image
ERROR:kolla.common.utils.nova-compute:received unexpected HTTP status: 504 Gateway Time-out
INFO:kolla.common.utils.nova-compute:Trying to push the image
INFO:kolla.common.utils.nova-compute:Trying to push the image
INFO:kolla.common.utils.nova-compute:Trying to push the image

Revision history for this message
Matt Young (halcyondude) wrote :

Build 903 failed on gnocchi-metricd:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/a071d0e/kolla/logs/gnocchi-metricd.log

INFO:kolla.common.utils.gnocchi-metricd:Successfully built b6fc122b1d2e
INFO:kolla.common.utils.gnocchi-metricd:Built
INFO:kolla.common.utils.gnocchi-metricd:Trying to push the image
ERROR:kolla.common.utils.gnocchi-metricd:Unknown error when pushing
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 323, in run
    self.push_image(image)
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 349, in push_image
    for response in self.dc.push(image.canonical_name, **kwargs):
  File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 302, in _stream_helper
    data = reader.read(1)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 401, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 307, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.

Revision history for this message
Matt Young (halcyondude) wrote :

Build 902 failed on gnocchi-api:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/af4fcba/kolla/logs/gnocchi-api.log

INFO:kolla.common.utils.gnocchi-api:Successfully built 954afd1d9adc
INFO:kolla.common.utils.gnocchi-api:Built
INFO:kolla.common.utils.gnocchi-api:Trying to push the image
ERROR:kolla.common.utils.gnocchi-api:received unexpected HTTP status: 504 Gateway Time-out

Revision history for this message
Matt Young (halcyondude) wrote :
summary: - periodic: master container build is failing
+ periodic: container build jobs are failing when pushing to rdo registry
+ (500, 504, read timeout)
Revision history for this message
Matt Young (halcyondude) wrote :

dmsimard is running container layer cleanups in an attempt to help with the registry's slow I/O.

As a potential mitigation we could attempt to patch kolla where it does the docker push to add retry logic (rough sketch after the link below)...

https://github.com/openstack/kolla/blob/25f2e6b754366471aaff9faeeac0d7570bb006fc/kolla/image/build.py#L349
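
A rough sketch of what such retry logic could look like, assuming a hypothetical push_fn wrapper around the self.dc.push call linked above (this is not the actual kolla change):

# Rough sketch of the retry idea (not an actual kolla patch): retry a push
# callable a few times with backoff before giving up on transient 500/504s
# or read timeouts.
import time

def push_with_retries(push_fn, image_name, attempts=3, delay=30):
    """Call push_fn(image_name); retry on failure, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return push_fn(image_name)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff between attempts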

Revision history for this message
David Moreau Simard (dmsimard) wrote :

For context, my explanation about performance: http://eavesdrop.openstack.org/irclogs/%23tripleo/%23tripleo.2018-05-16.log.html#t2018-05-16T18:44:14

Old namespaces (master, queens, pike) have been purged now.
There's a prune running for tripleomaster, queens and pike to get rid of old tags.

Just to give you an idea, for tripleomaster:
===
2018-05-16 21:21:01,862 WARNING openshift_tag_pruner: ********** CONFIRMED DELETION OF TAGS **********
2018-05-16 21:21:06,863 INFO openshift_tag_pruner: Whitelisted tags: 00b405b072dd8bdf2e4c80bf61abdd6996395e62_eadc82cc, 1ba7734082acaef6e95d489e4c32cea52aa92c4c_de76e108, 24fd4bd776d47ab956490ff555c7471cb01c0b99_3b49aa87, 385a69f2686230d1c455497fee596ec4d4145d5d_2ef89dac, a7fbaca2a97adda856c3ba5d5166fb1665f02bc0_85b157a9, c06e315e87d6326f984bbbb038f7cdec9d1e187f_cd134cc0, c75527260bf212b21ce57998f1f0c668e4c53059_d032039d, current-tripleo, current-tripleo-rdo, current-tripleo-rdo-internal, d52ad67500aacdb4c2a1321363bfe87de4e6b518_88c9954e, f1aac46ba7aa26b4f2baf5da210dadda94aa39fe_c928cd3f, f3f166935b484683ba352f81dd7488f0e5ab53de_ddffc593, tripleo-ci-testing
2018-05-16 21:21:06,864 INFO openshift_tag_pruner: Deleting tags from tripleomaster older than 7 days
2018-05-16 21:21:17,784 INFO openshift_tag_pruner: 2221 tags found.
2018-05-16 21:21:17,850 INFO openshift_tag_pruner: 997 tags protected by whitelist.
2018-05-16 21:21:17,851 INFO openshift_tag_pruner: 816 tags will be deleted.
===
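
For illustration, the decision that pruner is making boils down to tag age plus a whitelist; a minimal sketch with hypothetical helper names (not the actual openshift_tag_pruner code):

# Illustrative sketch of the decision the pruner log above describes
# (hypothetical helper; not the actual openshift_tag_pruner code).
from datetime import datetime, timedelta

def tags_to_delete(tags, whitelist, max_age_days=7, now=None):
    """Return tag names older than max_age_days that are not whitelisted.

    `tags` maps tag name -> creation datetime.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, created in tags.items()
            if name not in whitelist and created < cutoff]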

Can you let me know if the performance looks better tomorrow?

Revision history for this message
Matt Young (halcyondude) wrote :

Checking on jobs from overnight... updates incoming.

wes hayutin (weshayutin)
tags: removed: alert
Revision history for this message
Matt Young (halcyondude) wrote :

The latest container build job failed with more 504s:

https://review.rdoproject.org/jenkins/job/periodic-tripleo-centos-7-master-containers-build/910

Failed pushing trunk.registry.rdoproject.org/tripleomaster/centos-binary-mistral-api (docker push returned a 504):

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/920d5f2/console.txt.gz#_2018-05-17_13_43_52_377

{
    "changed": true,
    "cmd": "docker push trunk.registry.rdoproject.org/tripleomaster/centos-binary-mistral-api:c75527260bf212b21ce57998f1f0c668e4c53059_d032039d",
    "delta": "0:01:23.355876",
    "end": "2018-05-17 13:43:52.339673",
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-05-17 13:42:28.983797",
    "stderr": "received unexpected HTTP status: 504 Gateway Time-out",
    "stderr_lines": [
        "received unexpected HTTP status: 504 Gateway Time-out"
    ],
    "stdout": "The push refers to a repository [trunk.registry.rdoproject.org/tripleomaster/centos-binary-mistral-api]
b4e33d29eee3: Preparing
3136df935e2b: Preparing
575e16e31e14: Preparing
65d20cb60231: Preparing
07c6d6ef2aee: Preparing
71d7ca70c075: Preparing
83f3503ed8a3: Preparing
155c52c6da42: Preparing
d436821105e1: Preparing
fa5ea5044bf4: Preparing
d00d873b695b: Preparing
edf39ff216ae: Preparing
fa66111e90b8: Preparing
1554127d68f4: Preparing
3363262f47c7: Preparing
9bf6289230da: Preparing
772be8b5ac9e: Preparing
603ca0c55cd8: Preparing
b49a7293399e: Preparing
4a93e98c7eaf: Preparing
5124e50365f5: Preparing
2977db5329cc: Preparing
8b11de0b5c21: Preparing
a330c99d4cdb: Preparing
044743a8c0d9: Preparing
f98e6ceed009: Preparing
8ae9e17c5c70: Preparing
608a75a4dec4: Preparing
7adf9f53a5c8: Preparing
fdab0c0efbbe: Preparing
43e653f84b79: Preparing
772be8b5ac9e: Waiting
603ca0c55cd8: Waiting
b49a7293399e: Waiting
4a93e98c7eaf: Waiting
5124e50365f5: Waiting
a330c99d4cdb: Waiting
044743a8c0d9: Waiting
2977db5329cc: Waiting
f98e6ceed009: Waiting
8ae9e17c5c70: Waiting
8b11de0b5c21: Waiting
608a75a4dec4: Waiting
7adf9f53a5c8: Waiting
fdab0c0efbbe: Waiting
43e653f84b79: Waiting
3136df935e2b: Layer already exists
83f3503ed8a3: Layer already exists
b4e33d29eee3: Layer already exists
65d20cb60231: Layer already exists
3363262f47c7: Layer already exists
155c52c6da42: Layer already exists
d436821105e1: Layer already exists
1554127d68f4: Layer already exists
fa5ea5044bf4: Layer already exists
07c6d6ef2aee: Layer already exists
71d7ca70c075: Layer already exists
575e16e31e14: Layer already exists
fa66111e90b8: Layer already exists
edf39ff216ae: Layer already exists
d00d873b695b: Layer already exists
9bf6289230da: Layer already exists
772be8b5ac9e: Layer already exists
8b11de0b5c21: Layer already exists
a330c99d4cdb: Layer already exists
5124e50365f5: Layer already exists
2977db5329cc: Layer already exists
603ca0c55cd8: Layer already exists
7adf9f53a5c8: Layer already exists
044743a8c0d9: Layer already exists
608a75a4dec4: Layer already exists
f98e6ceed009: Layer already exists
8ae9e17c5c70: Layer already exists
b49a7293399e: Layer already exists
fdab0c0efbbe: Layer already exists
4a93e98c7eaf: Layer already exists
43e653f84b79: Layer already exists",
    "stdout_lines": [
        "The push refers ...


Revision history for this message
Matt Young (halcyondude) wrote :

Per the CIX meeting this issue isn't occurring any more, but we should understand why (specifically) it was happening so we can prevent it in the future.

Changed in tripleo:
status: Triaged → Confirmed
assignee: Matt Young (halcyondude) → nobody
tags: removed: promotion-blocker quickstart
Changed in tripleo:
importance: Critical → High
Revision history for this message
Ronelle Landy (rlandy) wrote :

Marking this back to critical - a promotion job failed on 05/31.

Changed in tripleo:
importance: High → Critical
Revision history for this message
Matt Young (halcyondude) wrote :

Latest failure:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-centos-7-master-containers-build/2c2f5c7/kolla/logs/ceilometer-base.log

ERROR:kolla.common.utils.ceilometer-base:Unknown error when pushing
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 326, in run
    self.push_image(image)
  File "/usr/lib/python2.7/site-packages/kolla/image/build.py", line 352, in push_image
    for response in self.dc.push(image.canonical_name, **kwargs):
  File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 296, in _stream_helper
    for chunk in json_stream(self._stream_helper(response, False)):
  File "/usr/lib/python2.7/site-packages/docker/utils/json_stream.py", line 66, in split_buffer
    for data in stream_as_text(stream):
  File "/usr/lib/python2.7/site-packages/docker/utils/json_stream.py", line 22, in stream_as_text
    for data in stream:
  File "/usr/lib/python2.7/site-packages/docker/api/client.py", line 302, in _stream_helper
    data = reader.read(1)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 401, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 307, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
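
The ReadTimeoutError above is raised by the local docker-py client giving up on the daemon's push stream (docker-py's default timeout is 60 seconds). A sketch of one possible mitigation, a longer client timeout; whether kolla exposes a setting that reaches this constructor is an assumption:

# Sketch only: construct a docker-py API client with a longer read timeout
# than the 60-second default so slow registry pushes are less likely to
# surface locally as ReadTimeoutError. Whether kolla wires a setting
# through to this constructor is an assumption.
import docker

client = docker.APIClient(base_url='unix://var/run/docker.sock', timeout=300)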

tags: added: promotion-blocker
Changed in tripleo:
status: Confirmed → Triaged
Changed in tripleo:
milestone: rocky-2 → rocky-3
Ronelle Landy (rlandy)
Changed in tripleo:
assignee: nobody → Quique Llorente (quiquell)
Changed in tripleo:
status: Triaged → Fix Released