Promotion jobs fail on timeout, ControllerServiceChain takes too long

Bug #1676250 reported by Emilien Macchi
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Promotions to Pike master fail; OVB jobs time out at step 5:
http://logs.openstack.org/15/359215/76/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/db6d3e6/

More details to come later in this bug report (I just noticed it this weekend).

Emilien Macchi (emilienm) wrote :

Set to critical because we haven't promoted TripleO CI for 2 weeks.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → pike-1
tags: added: alert ci promotion-blocker
Changed in tripleo:
importance: High → Critical
summary: - Promotion to OpenStack trunk fails (timeout at step3)
+ Promotion to OpenStack trunk fails (timeout at step5)
description: updated
Emilien Macchi (emilienm) wrote :

Based on my research and log analysis, something made ControllerServiceChain very slow (from 10 min to 50 min) between March 17th and March 25th.

summary: - Promotion to OpenStack trunk fails (timeout at step5)
+ Promotion jobs fail on timeout, ControllerServiceChain takes too long
Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/450481

Changed in tripleo:
status: Triaged → In Progress
Alfredo Moralejo (amoralej) wrote :

I'm not sure if it's related, but I've observed problems when registering nodes (waiting for messages in Zaqar queues) that I suspect may be related to https://review.openstack.org/#/c/442482/ (reported in https://bugs.launchpad.net/tripleo/+bug/1675384).

Emilien Macchi (emilienm) wrote :

Moved the bug to High (no longer Critical, since we found the reason why we weren't getting a promotion).
Moved the bug to tripleoclient, since we know it's related to the zaqar events that might consume too much CPU.
Unassigned the bug from myself, since I'm not going to work on it anymore.

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → nobody
importance: Critical → High
tags: added: tripleoclient
removed: alert ci promotion-blocker
OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (master)

Reviewed: https://review.openstack.org/450481
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=f7c032fb583edfe127572644fd76f10a7189a079
Submitter: Jenkins
Branch: master

commit f7c032fb583edfe127572644fd76f10a7189a079
Author: Emilien Macchi <email address hidden>
Date: Mon Mar 27 22:48:49 2017 +0000

    Revert "Use a Zaqar queue to get stack events"

    It seems like we have a ton of events in zaqar
    logs that might slow down the overcloud deployment
    runtime.

    This reverts commit f4f0e92d75e635f08bdcbe14327a2956ad647f22.

    Change-Id: I1f262666b565ec64084603d341169f51c6100dca
    Related-Bug: #1676250

Changed in tripleo:
milestone: pike-1 → pike-2
Changed in tripleo:
milestone: pike-2 → pike-3
Emilien Macchi (emilienm) wrote :

There are currently no open reviews on this bug, so I am changing the status back to the previous state and unassigning. If there are active reviews related to this bug, please include links in comments.

Changed in tripleo:
status: In Progress → Triaged
Changed in tripleo:
milestone: pike-3 → pike-rc1
Ben Nemec (bnemec) wrote :

It looks like this was fixed by the revert.

Changed in tripleo:
status: Triaged → Fix Released