OVB frequent timeouts on rh1 cloud

Bug #1722864 reported by Sagi (Sergey) Shnaidman
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: queens-rc1

Bug Description

OVB jobs have been timing out a lot recently (since the start of October). The HA job, gate-tripleo-ci-centos-7-ovb-ha-oooq, is affected the most; other jobs fail less often.
Since Monday 09 Oct the HA jobs have been failing much more frequently: http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22setup%20script%20run%20by%20this%20job%20failed%20-%20exit%20code:%20143%5C%22%20AND%20build_name:*tripleo-ci-*%20AND%20tags:console%20AND%20voting:1%20AND%20build_status:FAILURE

It could be related to disk I/O (and/or a network problem?). Downloading the image takes 5 minutes instead of 1, and closing the qcow2 image takes 10 minutes instead of 1:
http://logs.openstack.org/99/511199/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-oooq/5dc995f/console.html#_2017-10-11_10_30_40_250797

Tags: ci
Revision history for this message
Ben Nemec (bnemec) wrote :

I did some performance comparisons, running the undercloud install and image builds against 3 different clouds: rh1, my local dev cloud, and RDO cloud (which has known I/O issues). Those two operations are quite I/O and network sensitive because of all the package downloads, db syncs, and image manipulation. I intentionally did not use any proxies or mirrors, since not all the clouds have equivalent options there, but if anything that should make things faster. Here are the results:

rh1:
real 50m37.337s
user 37m32.825s
sys 4m59.338s

local:
real 55m53.872s
user 32m49.704s
sys 4m13.334s

RDO cloud:
real 100m5.553s (yes, really)
user 24m9.183s
sys 3m13.861s
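
(For reference, a minimal sketch of the kind of run being timed above; the commands are approximate and assume a plain tripleoclient setup rather than whatever wrapper the actual runs used:)

# Time the two I/O- and network-heavy operations back to back on one VM.
time {
  openstack undercloud install
  openstack overcloud image build \
    --config-file /usr/share/tripleo-common/image-yaml/overcloud-images-centos7.yaml \
    --config-file /usr/share/tripleo-common/image-yaml/overcloud-images.yaml
}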

Now, this was only one VM, so it only tested one compute node. I guess if we can come up with a list of instances that were part of the timeouts, we might be able to figure out whether the problem is limited to a few compute nodes, but at the very least there doesn't seem to be a cloud-wide I/O or networking issue.
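
A rough sketch of how that instance-to-compute-node mapping could be pulled together (this assumes admin credentials on rh1, since the hypervisor attribute is admin-only, and a hypothetical timed_out_instances.txt of instance UUIDs collected from the failed job logs):

# Count how many timed-out instances landed on each compute node; a heavy
# skew toward a few hosts would point at those nodes rather than the cloud.
while read -r uuid; do
  openstack server show "$uuid" -f value -c OS-EXT-SRV-ATTR:hypervisor_hostname
done < timed_out_instances.txt | sort | uniq -c | sort -rn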

I should also note that the close qcow2 image task has been taking 5+ minutes for a lot longer than we've had these timeouts: http://logs.openstack.org/56/506956/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-oooq/69f8656/console.html.gz#_2017-09-24_17_50_56_076016

Oh, and on the image download front, we should not be using the floating IP for CI jobs. That's causing a minimum 3x increase in download time for me (see below), and the cloud is lightly loaded right now; it's only going to get worse as we hit bottlenecks on the external connection from the cloud. Plus it appears to eliminate any advantage we gain from having 10 Gig connections between the servers in our cloud, because all traffic ends up bouncing off the bastion, which is likely only 1 Gig.

So I would suggest that our first change be to use the internal address of the mirror server, so we aren't routing traffic out of the cloud for no reason. That is likely the cause of the slow image transfer when the cloud is maxed out.

[centos@test ~]$ time curl -sfL -C- -o _overcloud-full.tar http://192.168.103.253/builds/current-tripleo/overcloud-full.tar

real 0m14.238s
user 0m0.619s
sys 0m12.085s
[centos@test ~]$ rm _overcloud-full.tar
[centos@test ~]$ time curl -sfL -C- -o _overcloud-full.tar http://66.187.229.139/builds/current-tripleo/overcloud-full.tar

real 0m44.452s
user 0m2.687s
sys 0m20.921s

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Configure OVB jobs to use local mirrors for images
https://review.openstack.org/#/c/511434
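
For illustration, this is roughly what the change amounts to: fetch the pre-built images from the mirror's internal address instead of the floating IP, so the transfer stays on the cloud's internal network. The variable name below is hypothetical; the review above is the actual change.

# Hypothetical override: pull images over the internal 10G path
# (192.168.103.253, from the curl test above) rather than via the bastion.
export OVERCLOUD_IMAGE_BASE_URL="http://192.168.103.253/builds/current-tripleo"
curl -sfL -C- -o overcloud-full.tar "${OVERCLOUD_IMAGE_BASE_URL}/overcloud-full.tar"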

Revision history for this message
Ben Nemec (bnemec) wrote :

Also, to follow up on something Derek mentioned a while back, why is overcloud-full.tar 3 GB? On the mirror server:

3.0G Sep 26 14:10 overcloud-full.tar

In my local environment the full qcow2 is only 1.3 GB:

1.3G Oct 11 19:40 overcloud-full.qcow2

Add ~10 MB for the kernel and ramdisk and you don't get anywhere near 3 GB. That much extra data is going to cause additional time converting qcow2 to raw.
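
A quick way to check where the extra space is going (a sketch, assuming shell access to the mirror server and that tar and qemu-img are available there):

# List the tarball's members and their sizes; the qcow2 should dominate.
tar -tvf overcloud-full.tar

# Compare the qcow2's virtual size with its allocated "disk size"; a much
# larger disk size than a local build would mean the image itself carries
# the extra data rather than the tarball packing.
qemu-img info overcloud-full.qcow2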

Revision history for this message
wes hayutin (weshayutin) wrote :

this is the command that's used to build the images:

#!/bin/bash
# script to build overcloud images

set -eux

sudo yum -y install python-tripleoclient

# DIB_YUM_REPO_CONF tells diskimage-builder which yum repo files to copy
# into the image build environment.
export DIB_YUM_REPO_CONF=""
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /etc/yum.repos.d/delorean*)"
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /etc/yum.repos.d/CentOS-Ceph-*)"
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /etc/yum.repos.d/centos-*)"
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /etc/yum.repos.d/quickstart-*)"
export DIB_YUM_REPO_CONF="$DIB_YUM_REPO_CONF $(ls /home/jenkins/web-gating.repo)"

openstack overcloud image build \
--config-file /usr/share/tripleo-common/image-yaml/overcloud-images-centos7.yaml \
--config-file /usr/share/tripleo-common/image-yaml/overcloud-images.yaml

Revision history for this message
Emilien Macchi (emilienm) wrote :

Steve is working on reducing the services deployed by default: https://review.openstack.org/#/c/500942/ as a PoC before disabling them just for OVB.

Revision history for this message
Ben Nemec (bnemec) wrote :

Dropping alert. This is still happening, but the relevant people are aware of it and it's not accomplishing much to keep spamming the channel with this bug.

tags: removed: alert
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Revision history for this message
Alex Schultz (alex-schultz) wrote :

Closing since we have moved off of RH1 to RDO cloud.

Changed in tripleo:
status: Triaged → Fix Released