OVB frequent timeouts on rh1 cloud

Bug #1722864 reported by Sagi (Sergey) Shnaidman
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: queens-rc1

Bug Description

OVB jobs have been timing out a lot recently (since the start of October). The HA job, gate-tripleo-ci-centos-7-ovb-ha-oooq, is affected the most; other jobs fail less often.
Since Monday 09 Oct the HA jobs have been failing much more frequently: http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22setup%20script%20run%20by%20this%20job%20failed%20-%20exit%20code:%20143%5C%22%20AND%20build_name:*tripleo-ci-*%20AND%20tags:console%20AND%20voting:1%20AND%20build_status:FAILURE

It could be related to disk I/O (and/or a network problem?). Downloading the image takes 5 minutes instead of 1, and closing the qcow2 image takes 10 minutes instead of 1:
http://logs.openstack.org/99/511199/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-oooq/5dc995f/console.html#_2017-10-11_10_30_40_250797

Tags: ci
Revision history for this message
Ben Nemec (bnemec) wrote :

I did some performance comparisons, running the undercloud install and image builds against 3 different clouds: rh1, my local dev cloud, and RDO cloud (which has known I/O issues). Those two operations are quite I/O and network sensitive because of all the package downloads, db syncs, and image manipulation. I intentionally did not use any proxies or mirrors, since not all the clouds have equivalent options there, but if anything that should make things faster. Here are the results:

rh1:
real 50m37.337s
user 37m32.825s
sys 4m59.338s

local:
real 55m53.872s
user 32m49.704s
sys 4m13.334s

RDO cloud:
real 100m5.553s (yes, really)
user 24m9.183s
sys 3m13.861s
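
(For reference, a minimal sketch of the kind of run being timed above; the commands are approximate and assume a plain tripleoclient setup rather than whatever wrapper the actual runs used:)

# Time the two I/O- and network-heavy operations back to back on one VM.
time {
  openstack undercloud install
  openstack overcloud image build \
    --config-file /usr/share/tripleo-common/image-yaml/overcloud-images-centos7.yaml \
    --config-file /usr/share/tripleo-common/image-yaml/overcloud-images.yaml
}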

Now, this was only one VM, so it only tested one compute node. I guess if we can come up with a list of instances that were part of the timeouts, we might be able to figure out whether the problem is limited to a few compute nodes, but at the very least there doesn't seem to be a cloud-wide I/O or networking issue.
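
A rough sketch of how that instance-to-compute-node mapping could be pulled together (this assumes admin credentials on rh1, since the hypervisor attribute is admin-only, and a hypothetical timed_out_instances.txt of instance UUIDs collected from the failed job logs):

# Count how many timed-out instances landed on each compute node; a heavy
# skew toward a few hosts would point at those nodes rather than the cloud.
while read -r uuid; do
  openstack server show "$uuid" -f value -c OS-EXT-SRV-ATTR:hypervisor_hostname
done < timed_out_instances.txt | sort | uniq -c | sort -rn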

I should also note that the close qcow2 image task has been taking 5+ minutes for a lot longer than we've had these timeouts: http://logs.openstack.org/56/506956/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-oooq/69f8656/console.html.gz#_2017-09-24_17_50_56_076016

Oh, and on the image download front, we should not be using the floating IP for CI jobs. That's causing a minimum 3x increase in download time for me (see below), and the cloud is lightly loaded right now; it's only going to get worse as we hit bottlenecks on the external connection from the cloud. Plus it appears to eliminate any advantage we gain from having 10 Gig connections between the servers in our cloud, because all traffic ends up bouncing off the bastion, which is likely only 1 Gig.

So I would suggest that our first change be to use the internal address of the mirror server, so we aren't routing traffic out of the cloud for no reason. That is likely the cause of the slow image transfer when the cloud is maxed out.

[centos@test ~]$ time curl -sfL -C- -o _overcloud-full.tar http://192.168.103.253/builds/current-tripleo/overcloud-full.tar

real 0m14.238s
user 0m0.619s
sys 0m12.085s
[centos@test ~]$ rm _overcloud-full.tar
[centos@test ~]$ time curl -sfL -C- -o _overcloud-full.tar http://66.187.229.139/builds/current-tripleo/overcloud-full.tar

real 0m44.452s
user 0m2.687s
sys 0m20.921s

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Configure OVB jobs to use local mirrors for images
https://review.openstack.org/#/c/511434
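
For illustration, this is roughly what the change amounts to: fetch the pre-built images from the mirror's internal address instead of the floating IP, so the transfer stays on the cloud's internal network. The variable name below is hypothetical; the review above is the actual change.

# Hypothetical override: pull images over the internal 10G path
# (192.168.103.253, from the curl test above) rather than via the bastion.
export OVERCLOUD_IMAGE_BASE_URL="http://192.168.103.253/builds/current-tripleo"
curl -sfL -C- -o overcloud-full.tar "${OVERCLOUD_IMAGE_BASE_URL}/overcloud-full.tar"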

Revision history for this message
Ben Nemec (bnemec) wrote :

Also, to follow up on something Derek mentioned a while back, why is overcloud-full.tar 3 GB? On the mirror server:

3.0G Sep 26 14:10 overcloud-full.tar

In my local environment the full qcow2 is only 1.3 GB:

1.3G Oct 11 19:40 overcloud-full.qcow2

Add ~10 MB for the kernel and ramdisk and you don't get anywhere near 3 GB. That much extra data is going to cause additional time converting qcow2 to raw.
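
A quick way to check where the extra space is going (a sketch, assuming shell access to the mirror server and that tar and qemu-img are available there):

# List the tarball's members and their sizes; the qcow2 should dominate.
tar -tvf overcloud-full.tar

# Compare the qcow2's virtual size with its allocated "disk size"; a much
# larger disk size than a local build would mean the image itself carries
# the extra data rather than the tarball packing.
qemu-img info overcloud-full.qcow2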

Revision history for this message
wes hayutin (weshayutin) wrote :

this is the command that's used to build the images:

#!/bin/bash
# script to build overcloud images

set -eux

sudo yum -y install python-tripleoclient

# DIB_YUM_REPO_CONF tells diskimage-builder which yum repo files to copy
# into the image build environment.
export DIB_YUM_REPO_CONF=""
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /etc/yum.repos.d/delorean*)"
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /etc/yum.repos.d/CentOS-Ceph-*)"
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /etc/yum.repos.d/centos-*)"
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /etc/yum.repos.d/quickstart-*)"
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /home/jenkins/web-gating.repo)"

openstack overcloud image build \
--config-file /usr/share/tripleo-common/image-yaml/overcloud-images-centos7.yaml \
--config-file /usr/share/tripleo-common/image-yaml/overcloud-images.yaml

Revision history for this message
Emilien Macchi (emilienm) wrote :

Steve is working on reducing the services deployed by default: https://review.openstack.org/#/c/500942/ as a PoC before disabling them just for OVB.

Revision history for this message
Ben Nemec (bnemec) wrote :

Dropping alert. This is still happening, but the relevant people are aware of it and it's not accomplishing much to keep spamming the channel with this bug.

tags: removed: alert
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Revision history for this message
Alex Schultz (alex-schultz) wrote :

Closing since we have moved off of RH1 to RDO cloud.

Changed in tripleo:
status: Triaged → Fix Released