CI: undercloud takes a long time, which causes jobs to fail with a timeout

Bug #1799895 reported by Sagi (Sergey) Shnaidman
This bug affects 2 people
Affects  Status   Importance  Assigned to  Milestone
tripleo  Invalid  Critical    Unassigned
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This is a known issue, https://bugs.launchpad.net/bugs/1797525; I moved the critical priority and the alert tag there.

tags: removed: alert
Bogdan Dobrelya (bogdando) wrote :

4:58:58 PM GMT+2 - sshnaidm|ruck: bogdando, I think it's a different reason: as I see in the logs, the yum update sometimes takes up to 10 minutes per container
4:59:20 PM GMT+2 - sshnaidm|ruck: bogdando, let's separate the bugs, but leave a comment in the last one that it could be related

tags: added: alert
Bogdan Dobrelya (bogdando) wrote :

It is possibly a duplicate of bug 1797525.

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

As I can see from the logs, the most expensive operations are the "yum update" runs in every container; each one can take from 1 to 10(!) minutes. Downloading updates is not the issue: it takes no longer than 2 seconds, and we use the proxy properly.
So what remains is the actual installation and removing the yum cache. I'd like to add timestamps to the "yum update" script to be sure which operations in the script are expensive. But it looks like installing packages sometimes takes a long time. Maybe it could be a disk issue(?)
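
A minimal sketch of what such timestamps could look like, assuming the update script boils down to three steps (update, clean, remove the cache directory). The step list, the `timed` and `run_update_steps` names, and the `/var/cache/yum` path are assumptions for illustration, not taken from the actual script:

```python
import subprocess
import sys
import time
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def timed(label, results, out=sys.stderr):
    """Print a UTC timestamp before and after a step and record its duration."""
    start = time.monotonic()
    print(f"{datetime.now(timezone.utc).isoformat()} START {label}", file=out)
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        results.append((label, elapsed))
        print(
            f"{datetime.now(timezone.utc).isoformat()} END {label} ({elapsed:.1f}s)",
            file=out,
        )


def run_update_steps(run=subprocess.check_call, out=sys.stderr):
    """Run the assumed update steps and return [(label, seconds), ...]."""
    results = []
    # Assumed steps: the real script may run different commands.
    steps = [
        ("yum update", ["yum", "-y", "update"]),
        ("yum clean all", ["yum", "clean", "all"]),
        ("remove yum cache", ["rm", "-rf", "/var/cache/yum"]),
    ]
    for label, command in steps:
        with timed(label, results, out=out):
            run(command)
    return results
```

Sorting the returned list by duration for each container would show whether installation or cache removal dominates, which is the open question above.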

Bogdan Dobrelya (bogdando) wrote :

See my previous attempt at adding **individual** caches to the container updates: https://review.openstack.org/#/c/575742/4/scripts/container-update.py@146

But I'm not sure it would be a good idea to use a common cache with concurrent workers doing updates there. Serializing the container updates would allow one cache shared by all of them... But updating 50+ containers one by one, even with a shared cache, might take even more time in the end.
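
The trade-off between concurrent and serialized updates can be put in rough numbers. This is only a back-of-the-envelope sketch: it takes the 50 containers and the 1 to 10 minute range per update reported in this bug, assumes a hypothetical 4 concurrent workers, and does not model any speed-up a shared cache might bring. The `wall_clock_minutes` helper is made up for illustration:

```python
import heapq


def wall_clock_minutes(durations, workers):
    """Estimate total wall-clock time when each update is handed to
    whichever worker becomes free first (greedy scheduling)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    free_at = [0.0] * workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# 50 containers at the best case of 1 minute each.
print(wall_clock_minutes([1] * 50, workers=1))  # 50.0 (serialized)
print(wall_clock_minutes([1] * 50, workers=4))  # 13.0 (4 workers, assumed)

# 50 containers at the worst case of 10 minutes each, serialized.
print(wall_clock_minutes([10] * 50, workers=1))  # 500.0
```

Under these assumptions, a shared cache would have to make each serialized update several times faster just to break even with concurrent workers.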

Bogdan Dobrelya (bogdando) wrote :

Another idea is to rebuild all kolla containers right in the job, from the tags that we have after the original images are prepared. Then perhaps we could install yum updates only for the base layer?

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/613575

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/613577

Bogdan Dobrelya (bogdando) wrote :

A shared yum cache for the individual containers' update-retry, with improved fallback for retries, should help as well (https://review.openstack.org/575742)

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.openstack.org/613577
Reason: https://review.openstack.org/613640

Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
assignee: Rafael Folco (rafaelfolco) → nobody
tags: removed: alert
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/613575
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=c073cdb80d8351606accc722c7d18eac1d3dabca
Submitter: Zuul
Branch: master

commit c073cdb80d8351606accc722c7d18eac1d3dabca
Author: Emilien Macchi <email address hidden>
Date: Fri Oct 26 09:02:09 2018 -0400

    undercloud/hieradata: configure nova-scheduler workers too

    The new parameter in puppet-nova will allow us to reduce the number of
    workers to 1 in our CI and eventually reduce the timeouts.

    Change-Id: I9c2ea60960f1652f62f7f05879ebebddf3f8664e
    Related-Bug: #1799895

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/575742

Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: Triaged → Invalid