centos-7-master-containers-build-push time out pushing containers

Bug #1857884 reported by Marios Andreou
This bug affects 1 person
Affects    Status          Importance    Assigned to    Milestone
tripleo    Fix Released    Critical      Unassigned

Bug Description

The centos-7-master-containers-build-push job has been timing out since about 23 December, as can be seen at [1]. This is a legitimate timeout in the sense that the job is still pushing containers when it fails. For example, see [2], where the last lines in the log are:

        2019-12-30 02:13:13 | INFO:kolla.common.utils.sahara-engine:Pushed successfully
        2019-12-30 02:13:13 | INFO:kolla.common.utils:Attempt number: 1 to run task: PushTask(glance-api)
        2019-12-30 02:13:13 | INFO:kolla.common.utils.glance-api:Trying to push the image

This is a master promotion blocker, so the timeout is being bumped to 3 hours in [3].

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-centos-7-master-containers-build-push
[2] http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-centos-7-master-containers-build-push/d487ae4/logs/build.log.txt.gz
[3] https://review.opendev.org/700764

Marios Andreou (marios-b) wrote :

ykarel cleaned up tags and it looks like that helped, so we don't need the timeout bump.

13:10 < ykarel> marios, my test resulted good https://review.rdoproject.org/r/#/c/24321/
13:11 < ykarel> so next master run in 1 hour should not hit timeout in container build
13:11 < marios> ykarel: ack thanks but what did you change?
13:12 < marios> ykarel: i don't see depends-on at that test
13:12 < ykarel> marios, as said earlier issue is in infra side
13:12 < ykarel> that happened due to ppc jobs tags and component promotion pipeline in master
13:12 < marios> ykarel: but its pretty consistent. it has been timing out for almost a week now
13:13 < marios> https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-centos-7-master-containers-build-push
13:13 < ykarel> marios, so there were too many tags, i cleaned up tags older than 4 days for master
13:13 < ykarel> marios, if u notice timings ^^, u will find timings are increasing
13:13 < ykarel> from day by day
13:13 < marios> ykarel: ah ok, was there a change we can point to (for tags cleanup) or this is manual thing
13:14 < ykarel> marios, it's happens automatically daily, but due to too much tags push due to ppc and component promotion, this
                cleanup went insufficient
13:15 < ykarel> marios, /me will post the findings on bug itself in some time
13:15 < ykarel> after the next periodic run

yatin (yatinkarel) wrote :

Some background can be found in https://bugs.launchpad.net/tripleo/+bug/1850188.

I found that the job duration for master was much higher than for other releases, and that it had been growing over the last couple of days: https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-centos-7-master-containers-build-push. Since this affects master only, and master has extra ppc and component promotion jobs that push images, I suspected it was related to extra images in the registry, which also happened some days back with LP 1850188.

The pruner script deletes tags, across all releases, that are older than 7 days and are not pointed to; see https://github.com/rdo-infra/rdo-infra-playbooks/blob/master/roles/rdo-infra/registry-image-pruning/defaults/main.yml#L2-L6
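The pruning rule described above can be sketched roughly as follows. This is only an illustration and not the actual pruner code: the function name select_tags_to_delete, the shape of the tags mapping, and the precomputed set of protected (whitelisted) tags are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def select_tags_to_delete(tags, protected, max_age_days, now=None):
    """Return the tag names that an age-based pruning pass would delete.

    tags: dict mapping tag name -> creation time (timezone-aware datetime).
          This shape is hypothetical, not the registry's real API.
    protected: set of tag names that must be kept (the whitelist),
               assumed to be computed elsewhere.
    max_age_days: tags older than this many days are candidates.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        name
        for name, created in tags.items()
        if created < cutoff and name not in protected
    )
```

With this shape, lowering max_age_days from 7 to 4 for a single namespace corresponds to the manual cleanup that was done for master.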

Since master has multiple container push jobs (ppc, regular promotion and component promotion) running very frequently, there are too many tags in master. For example, I found the following in the pruner logs:-

Deleting tags from tripleomaster older than 7 days
14007 tags found.
1354 tags protected by whitelist.
1894 tags will be deleted.

As compared to train:-
Deleting tags from tripleotrain older than 7 days
7956 tags found.
827 tags protected by whitelist.
1056 tags will be deleted.
Finished.
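To compare the two pruner runs quoted above, the counts can be pulled out of the log text and the number of tags left after each run computed. This is a small sketch that assumes the log lines have exactly the format shown in this comment; the function name summarize_pruner_log and the dictionary keys are made up for the example.

```python
import re

# Pruner output as quoted in this comment.
PRUNER_LOG = """\
Deleting tags from tripleomaster older than 7 days
14007 tags found.
1354 tags protected by whitelist.
1894 tags will be deleted.
Deleting tags from tripleotrain older than 7 days
7956 tags found.
827 tags protected by whitelist.
1056 tags will be deleted.
Finished.
"""

# Map the wording used in the log to short (hypothetical) key names.
KEYS = {
    "found": "found",
    "protected by whitelist": "protected",
    "will be deleted": "deleted",
}


def summarize_pruner_log(text):
    """Return {namespace: counts} parsed from pruner log text."""
    summaries = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"Deleting tags from (\S+) older than (\d+) days", line)
        if header:
            current = {"max_age_days": int(header.group(2))}
            summaries[header.group(1)] = current
            continue
        count = re.match(
            r"(\d+) tags (found|protected by whitelist|will be deleted)\.", line
        )
        if count and current is not None:
            current[KEYS[count.group(2)]] = int(count.group(1))
    # Tags left in the registry once the deletions have been applied.
    for counts in summaries.values():
        if "found" in counts and "deleted" in counts:
            counts["remaining"] = counts["found"] - counts["deleted"]
    return summaries


if __name__ == "__main__":
    for namespace, counts in summarize_pruner_log(PRUNER_LOG).items():
        print(namespace, counts)
```

For the logs above this gives 12113 tags remaining in tripleomaster against 6900 in tripleotrain, so master still holds far more tags than train even after the 7-day pruning.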

For now I have deleted tags older than 4 days in master to unblock the master promotion jobs. I tested the job after the cleanup (https://review.rdoproject.org/r/#/c/24321/) and it finished in 1.5 hours. The next periodic run should have the container build job pass on CentOS 7.

The issues/fixes that need to be addressed:-
- The component job uses a different dlrnapi server (trunk-staging), so the whitelist is not working for it.
- We need to see whether we really need to keep non-whitelisted container tags for 7 days in the rdo registry, and whether it is possible to avoid unnecessarily pushing too many container images (ones that are not used by other jobs) until the rdo registry gets stable enough to handle that many images. I heard there are plans to upgrade it and to move it to some other infra.

wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: Triaged → Fix Released