periodic container build (train) timing out

Bug #1850188 reported by Rafael Folco
This bug affects 1 person
Affects   Status         Importance   Assigned to           Milestone
tripleo   Fix Released   Critical     David Moreau Simard
Revision history for this message
Marios Andreou (marios-b) wrote :

The actual timeout happens during the build at [1][2], but I found one case where the build finished [3] and the job then timed out during the retag [4]. However, even in the 'good' logs it's very close to the timeout, e.g. the first and last lines in [5]:

        2019-10-27 09:14:33.009687 | Job console starting...
        2019-10-27 11:15:23.793588 | LOOP [upload-logs : Upload console log and json output]

same with [6]

        2019-10-26 21:12:29.716020 | Job console starting...
        2019-10-26 23:08:03.724051 | LOOP [upload-logs : Upload console log and json output]

The current timeout is 2 hours, defined via inheritance from [7].

So I just proposed a bump to 2.5 hours for now while we debug [8]. This seems to be train-specific, since master is running in a more reasonable time, e.g. the first/last lines at [9]:

        2019-10-28 00:15:23.245957 | Job console starting...
        2019-10-28 01:39:53.981084 | LOOP [upload-logs : Upload console log and json output]
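
To make the margins concrete, here is a minimal sketch that computes the console span of each run from the first/last timestamps quoted above and compares it with the 2 hour timeout. The helper names (`elapsed`, `report`) are made up for illustration and are not part of any CI tooling; the timestamps are copied from the logs above.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"
TIMEOUT = timedelta(hours=2)  # current job timeout, per [7]

# (label, first console timestamp, last console timestamp) from the logs above
RUNS = [
    ("train [5]", "2019-10-27 09:14:33.009687", "2019-10-27 11:15:23.793588"),
    ("train [6]", "2019-10-26 21:12:29.716020", "2019-10-26 23:08:03.724051"),
    ("master [9]", "2019-10-28 00:15:23.245957", "2019-10-28 01:39:53.981084"),
]


def elapsed(start: str, end: str) -> timedelta:
    """Time between two console timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def report(runs=RUNS, timeout=TIMEOUT):
    """Return (label, console span, margin in seconds) for each run."""
    rows = []
    for label, start, end in runs:
        span = elapsed(start, end)
        rows.append((label, span, (timeout - span).total_seconds()))
    return rows


for label, span, margin in report():
    print(f"{label}: console span {span}, margin vs timeout {margin:+.0f}s")
```

The two train runs come out at roughly 2h 01m and 1h 56m, while master is at roughly 1h 25m. The console span also covers steps such as the log upload, so it is only an approximation of what is counted against the job timeout.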

[1] http://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-7-train-containers-build-push/81a95b5/logs/build.log.txt.gz
[2] http://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-7-train-containers-build-push/f1b38fe/logs/build.log.txt.gz
[3] http://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-7-train-containers-build-push/63c980d/logs/build.log.txt.gz
[4] http://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-7-train-containers-build-push/63c980d/job-output.txt
[5] http://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-7-train-containers-build-push/684dffa/job-output.txt
[6] http://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-7-train-containers-build-push/2302cb6/job-output.txt
[7] https://github.com/openstack/tripleo-ci/blob/3e38ea023e357a73fe055722b83063d19158adf9/zuul.d/base.yaml#L200
[8] https://review.rdoproject.org/r/23503
[9] http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-7-master-containers-build-push/fa81adb/job-output.txt

Marios Andreou (marios-b) wrote :

Green in my test at https://review.rdoproject.org/r/#/c/23502/ after the timeout bump merged: SUCCESS in 2h 21m 44s (non-voting), against a timeout of 2.5 hours.

Marios Andreou (marios-b) wrote :

So the timeout bump is necessary. It could be related to the extra re-tagging we now have to do.

The master job is still running ok at 2 hours, so there is definitely something causing extra overhead for train. Not sure if we can get someone to look into that?

i.e. we are green now with the timeout bump, but there is still a difference between the time taken to build + tag + push containers in master and in train.

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The problem is that pushing containers in the train job takes longer than in the master job.
For example:

In master:
2019-11-04 18:27:25 | INFO:kolla.common.utils:Attempt number: 1 to run task: PushTask(base)
2019-11-04 18:27:25 | INFO:kolla.common.utils.base:Trying to push the image
2019-11-04 18:27:44 | INFO:kolla.common.utils.base:Pushed successfully

That's 19 seconds.

In train:
2019-11-05 09:28:07 | INFO:kolla.common.utils:Attempt number: 1 to run task: PushTask(base)
2019-11-05 09:28:07 | INFO:kolla.common.utils.base:Trying to push the image
2019-11-05 09:28:49 | INFO:kolla.common.utils.base:Pushed successfully

That's 42 seconds, more than double the master time.
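
As a cross-check, the push time can be read straight from the two timestamps in each excerpt. A small sketch, assuming one push per excerpt and the `timestamp | message` layout shown above (`push_seconds` is a hypothetical helper, not part of kolla):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

MASTER_LOG = """\
2019-11-04 18:27:25 | INFO:kolla.common.utils:Attempt number: 1 to run task: PushTask(base)
2019-11-04 18:27:25 | INFO:kolla.common.utils.base:Trying to push the image
2019-11-04 18:27:44 | INFO:kolla.common.utils.base:Pushed successfully
"""

TRAIN_LOG = """\
2019-11-05 09:28:07 | INFO:kolla.common.utils:Attempt number: 1 to run task: PushTask(base)
2019-11-05 09:28:07 | INFO:kolla.common.utils.base:Trying to push the image
2019-11-05 09:28:49 | INFO:kolla.common.utils.base:Pushed successfully
"""


def push_seconds(log: str) -> int:
    """Seconds between 'Trying to push the image' and 'Pushed successfully'."""
    start = end = None
    for line in log.splitlines():
        stamp, _, message = line.partition(" | ")
        if message.endswith("Trying to push the image"):
            start = datetime.strptime(stamp, FMT)
        elif message.endswith("Pushed successfully"):
            end = datetime.strptime(stamp, FMT)
    if start is None or end is None:
        raise ValueError("log excerpt does not contain a complete push")
    return int((end - start).total_seconds())


master = push_seconds(MASTER_LOG)
train = push_seconds(TRAIN_LOG)
print(f"master: {master}s, train: {train}s, delta: {train - master}s")
```

For the `base` image this gives 19 seconds on master and 42 seconds on train, a delta of 23 seconds.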

Each container takes about 30-50 seconds longer to push in train, and across all images this adds up to more than an hour of difference.
List of containers with the push-time delta (train minus master, in seconds):

aodh-api 31
aodh-base 28
aodh-evaluator 36
aodh-listener 51
aodh-notifier 45
barbican-api 32
barbican-base 35
barbican-keystone-listener 38
barbican-worker 27
base 23
ceilometer-base 27
ceilometer-central 31
ceilometer-compute 36
ceilometer-ipmi 33
ceilometer-notification 29
cinder-api 34
cinder-base 30
cinder-scheduler 35
collectd 30
cron 31
designate-api 33
designate-backend-bind9 32
designate-base 36
designate-central 33
designate-mdns 32
designate-producer 34
designate-sink 34
designate-worker 32
ec2-api 35
etcd 31
fluentd 40
glance-api 37
glance-base 35
gnocchi-api 39
gnocchi-base 37
gnocchi-metricd 37
gnocchi-statsd 38
haproxy 28
heat-all 35
heat-api 30
heat-api-cfn 32
heat-base 62
heat-engine 38
horizon 28
ironic-api 32
ironic-base 31
ironic-conductor 30
ironic-inspector 19
ironic-neutron-agent 35
ironic-pxe 31
iscsid 43
keepalived 14
keystone 25
keystone-base 35
keystone-fernet 40
keystone-ssh 40
manila-api 20
manila-base -4
manila-scheduler 21
manila-share 20
mariadb -8
memcached 17
mistral-api 35
mistral-base 26
mistral-engine 11
mistral-event-engine 33
multipathd 47
neutron-base 20
neutron-dhcp-agent 45
neutron-l3-agent 45
neutron-metadata-agent 38
neutron-metadata-agent-ovn 43
neutron-openvswitch-agent 33
neutron-server 37
neutron-server-ovn 38
neutron-sriov-agent 37
nova-api 24
nova-base 37
nova-conductor 35
nova-libvirt -2
nova-novncproxy 27
nova-scheduler 33
nova-serialproxy 35
novajoin-base 33
novajoin-notifier 33
novajoin-server 34
octavia-api 28
octavia-base 57
octavia-health-manager 29
octavia-housekeeping 43
octavia-worker 42
openstack-base 27
openvswitch-base 30
ovn-base 31
ovn-controller 38
ovn-nb-db-server 35
ovn-northd 40
ovn-sb-db-server 41
panko-api 35
panko-base 29
placement-api 38
placement-base 32
prometheus-base 39
prometheus-haproxy-exporter 56
prometheus-memcached-exporter 53
qdrouterd 30
rabbitmq 25
redis 37
redis-base 43
redis-sentinel 43
rsyslog 35
rsyslog-base 22
sahara-api 29
sahara-base 26
sensu-base 38
sensu-client 40
skydive-agent 37
skydive-analyzer 39
skydive-base 30
swift-account 40
swift-base 34
swift-container 47
swift-object 41
swift-object-expirer 45
swift-proxy-server 44
tempest 28
tripleoclient -6
zaqar-base 38
zaqar-wsgi 37
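
To check how the per-image deltas add up, here is a rough sketch that sums the list above. The name/delta pairs are copied from the list and packed several per line to keep the example short; `parse_deltas` is a made-up helper for illustration.

```python
# name/delta pairs copied from the list above (delta in seconds, train minus master)
DELTAS_TEXT = """
aodh-api 31 aodh-base 28 aodh-evaluator 36 aodh-listener 51 aodh-notifier 45
barbican-api 32 barbican-base 35 barbican-keystone-listener 38 barbican-worker 27 base 23
ceilometer-base 27 ceilometer-central 31 ceilometer-compute 36 ceilometer-ipmi 33 ceilometer-notification 29
cinder-api 34 cinder-base 30 cinder-scheduler 35 collectd 30 cron 31
designate-api 33 designate-backend-bind9 32 designate-base 36 designate-central 33 designate-mdns 32
designate-producer 34 designate-sink 34 designate-worker 32 ec2-api 35 etcd 31
fluentd 40 glance-api 37 glance-base 35 gnocchi-api 39 gnocchi-base 37
gnocchi-metricd 37 gnocchi-statsd 38 haproxy 28 heat-all 35 heat-api 30
heat-api-cfn 32 heat-base 62 heat-engine 38 horizon 28 ironic-api 32
ironic-base 31 ironic-conductor 30 ironic-inspector 19 ironic-neutron-agent 35 ironic-pxe 31
iscsid 43 keepalived 14 keystone 25 keystone-base 35 keystone-fernet 40
keystone-ssh 40 manila-api 20 manila-base -4 manila-scheduler 21 manila-share 20
mariadb -8 memcached 17 mistral-api 35 mistral-base 26 mistral-engine 11
mistral-event-engine 33 multipathd 47 neutron-base 20 neutron-dhcp-agent 45 neutron-l3-agent 45
neutron-metadata-agent 38 neutron-metadata-agent-ovn 43 neutron-openvswitch-agent 33 neutron-server 37 neutron-server-ovn 38
neutron-sriov-agent 37 nova-api 24 nova-base 37 nova-conductor 35 nova-libvirt -2
nova-novncproxy 27 nova-scheduler 33 nova-serialproxy 35 novajoin-base 33 novajoin-notifier 33
novajoin-server 34 octavia-api 28 octavia-base 57 octavia-health-manager 29 octavia-housekeeping 43
octavia-worker 42 openstack-base 27 openvswitch-base 30 ovn-base 31 ovn-controller 38
ovn-nb-db-server 35 ovn-northd 40 ovn-sb-db-server 41 panko-api 35 panko-base 29
placement-api 38 placement-base 32 prometheus-base 39 prometheus-haproxy-exporter 56 prometheus-memcached-exporter 53
qdrouterd 30 rabbitmq 25 redis 37 redis-base 43 redis-sentinel 43
rsyslog 35 rsyslog-base 22 sahara-api 29 sahara-base 26 sensu-base 38
sensu-client 40 skydive-agent 37 skydive-analyzer 39 skydive-base 30 swift-account 40
swift-base 34 swift-container 47 swift-object 41 swift-object-expirer 45 swift-proxy-server 44
tempest 28 tripleoclient -6 zaqar-base 38 zaqar-wsgi 37
"""


def parse_deltas(text: str) -> dict:
    """Turn alternating 'name delta' tokens into a {name: seconds} mapping."""
    tokens = text.split()
    return {name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])}


deltas = parse_deltas(DELTAS_TEXT)
total = sum(deltas.values())
slowest = max(deltas, key=deltas.get)
faster_in_train = sorted(name for name, d in deltas.items() if d < 0)

print(f"{len(deltas)} images, total extra push time {total}s ({total / 60:.1f} min)")
print(f"largest delta: {slowest} (+{deltas[slowest]}s)")
print(f"faster in train: {faster_in_train}")
```

The total comes out above 3600 seconds, which matches the "more than an hour" estimate, and only a handful of images push faster in train than in master.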

Ronelle Landy (rlandy) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Related bug 1847225. Please try with https://review.opendev.org/#/c/687288/ that hopefully alleviates the issue.

Bogdan Dobrelya (bogdando) wrote :

Although I'm not sure the aforementioned bug and patch have anything to do with improving results for "openstack overcloud container image build" ...

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/693225

Alex Schultz (alex-schultz) wrote :

@Bogdan, that job doesn't use the code patched in that review. The container build promotion jobs are basically 95% kolla (as seen in the logs). The larger issue is that the train job should be using buildah, and it doesn't look like it is.

Ronelle Landy (rlandy) wrote :

Possible registry issue:

<dmsimard> rlandy, sshnaidm: it doesn't look like the images are pruned in train
<rlandy> dmsimard: is that a train only problem?
<dmsimard> rlandy: yes
<dmsimard> somehow the namespace was added but it wasn't added to the script that iterates through the list of namespaces to prune images in

yatin (yatinkarel) wrote :

https://review.rdoproject.org/r/#/c/22124/ added the tripleotrain namespace to the pruner config. @dmsimard, do you mean that patch didn't work, or that it missed something? Or is the issue only with the rhel images and not the centos ones, since that patch is missing the rhel dlrn endpoint, as noted in the other bug https://bugs.launchpad.net/tripleo/+bug/1851440/comments/3?

Ronelle Landy (rlandy) wrote :

Closing this out - the infra team is on it - and train is promoting.

Changed in tripleo:
assignee: nobody → David Moreau Simard (dmsimard)
status: Triaged → Fix Released
Javier Peña (jpena-c) wrote :

The issue with https://review.rdoproject.org/r/#/c/22124/ is that the rdo-infra-playbooks role for the registry cannot be directly applied to the running registry. The playbooks work for OpenShift 3.11, while we have 3.7 running in registry.rdo.

The upgrade has been delayed many times due to the migration to a new cloud, so when doing updates to the playbooks we need to apply some of the changes manually. It looks like, in this case, the manual update missed adding the tripleotrain namespace to the pruning script.
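
A gap like this could be caught with a simple consistency check between the namespaces that exist in the registry and the ones the pruning script iterates over. The sketch below is only an illustration: the function name and both input lists are hypothetical, and only `tripleotrain` is taken from this report.

```python
def unpruned_namespaces(registry_namespaces, pruner_namespaces):
    """Return namespaces present in the registry but absent from the pruner list."""
    return sorted(set(registry_namespaces) - set(pruner_namespaces))


# Hypothetical inputs; in practice these would come from the registry
# and from the pruning script's configuration.
registry = ["tripleomaster", "tripleotrain"]
pruner = ["tripleomaster"]

missing = unpruned_namespaces(registry, pruner)
if missing:
    print(f"namespaces never pruned: {missing}")
```

Running such a check after each manual update of the registry would flag a namespace that was created but never added to the pruning list.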
