Comment 1 for bug 1722864

Ben Nemec (bnemec) wrote:

I did some performance comparisons, running the undercloud install and image builds against three different clouds: rh1, my local dev cloud, and RDO cloud (which has known I/O issues). Those two operations are fairly I/O- and network-sensitive because of all the package downloads, db syncs, and image manipulation. I intentionally did not use any proxies or mirrors, since not all the clouds have equivalent options there, but if anything, that should have made things faster. Here are the results:

rh1:
real 50m37.337s
user 37m32.825s
sys 4m59.338s

local:
real 55m53.872s
user 32m49.704s
sys 4m13.334s

RDO cloud:
real 100m5.553s (yes, really)
user 24m9.183s
sys 3m13.861s
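
For reference, the runs were essentially just timing the two operations back to back, along these lines (illustrative only; `openstack undercloud install` is the standard undercloud install command, and the image-build step here is an assumption about how the images were built, since CI may use a different wrapper):

# Illustrative sketch, not the exact commands used for the numbers above:
time openstack undercloud install
time openstack overcloud image build   # assumed image-build step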

Now, this was only one VM, so it only tested one compute node. If we can come up with a list of the instances that were part of the timeouts, we might be able to figure out whether a few specific compute nodes are the problem, but at the very least there doesn't seem to be a cloud-wide I/O or networking issue.
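
If someone can pull that list of instance UUIDs, something like the following would show whether the failures cluster on particular compute nodes (instances.txt is a hypothetical file with one UUID per line, and the OS-EXT-SRV-ATTR:host field requires admin credentials):

# Count how many of the timed-out instances landed on each compute node:
while read uuid; do
  openstack server show "$uuid" -f value -c OS-EXT-SRV-ATTR:host
done < instances.txt | sort | uniq -c | sort -rn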

I should also note that the "close qcow2 image" task has been taking 5+ minutes for much longer than we've had these timeouts: http://logs.openstack.org/56/506956/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-oooq/69f8656/console.html.gz#_2017-09-24_17_50_56_076016

Oh, and on the image download front, we should not be using the floating IP for CI jobs. That's causing at least a 3x increase in download time for me (see below), and the cloud is lightly loaded right now; it's only going to get worse as we hit bottlenecks on the external connection from the cloud. It also appears to eliminate any advantage we gain from having 10 Gig connections between the servers in our cloud, because all the traffic ends up bouncing off the bastion, which is likely only 1 Gig.

So I would suggest that our first change be to use the internal address of the mirror server so we aren't routing traffic out of the cloud for no reason. That is likely the cause of the slow image transfers when the cloud is maxed out.

[centos@test ~]$ time curl -sfL -C- -o _overcloud-full.tar http://192.168.103.253/builds/current-tripleo/overcloud-full.tar

real 0m14.238s
user 0m0.619s
sys 0m12.085s
[centos@test ~]$ rm _overcloud-full.tar
[centos@test ~]$ time curl -sfL -C- -o _overcloud-full.tar http://66.187.229.139/builds/current-tripleo/overcloud-full.tar

real 0m44.452s
user 0m2.687s
sys 0m20.921s
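
In other words, the fix is just swapping the base URL the jobs download from: 192.168.103.253 is the mirror's internal address and 66.187.229.139 is the floating IP in the comparison above. A sketch of the change (MIRROR_BASE is a made-up variable name for illustration):

# Point jobs at the internal address instead of the floating IP:
MIRROR_BASE=http://192.168.103.253/builds   # was http://66.187.229.139/builds
curl -sfL -C- -o overcloud-full.tar "$MIRROR_BASE/current-tripleo/overcloud-full.tar"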