OVB frequent timeouts on rh1 cloud
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | Unassigned |
Bug Description
OVB jobs have had a lot of timeouts recently (since the start of October). This especially affects the ha job: gate-tripleo-
Other jobs fail less often.
Since Monday 09 Oct, ha jobs have been failing much more often: http://
It could be related to disk io and/or a network problem. Downloading the image takes 5 mins instead of 1, and closing the qcow2 image takes 10 mins instead of 1:
http://
Changed in tripleo: | |
milestone: | queens-2 → queens-3 |
Changed in tripleo: | |
milestone: | queens-3 → queens-rc1 |
I did some performance comparisons, running the undercloud install and image builds against 3 different clouds: rh1, my local dev cloud, and RDO cloud (which has known io issues). Those two operations are pretty io and network sensitive because of all the package downloads, db syncs, and image manipulation. I intentionally did not use any proxies or mirrors since not all the clouds have equivalent options there, though if anything, using them would only make things faster. Here are the results:
rh1:
real 50m37.337s
user 37m32.825s
sys 4m59.338s
local:
real 55m53.872s
user 32m49.704s
sys 4m13.334s
RDO cloud:
real 100m5.553s (yes, really)
user 24m9.183s
sys 3m13.861s
Now this was only one vm so it only tested one compute node. I guess if we can come up with a list of instances that were part of the timeouts we might be able to figure out if it's a few compute nodes that are a problem, but at the very least there doesn't seem to be a cloud-wide io or networking issue.
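For what it's worth, once such a list exists, tallying it per compute node is simple. A minimal sketch, assuming a hypothetical text file with one line per timed-out job in the form "<instance-id> <compute-host>" (that file format, the function name, and how the instance-to-host mapping gets collected are all assumptions, not something we have today):

```shell
#!/bin/sh
# Count timed-out instances per compute node, busiest node first.
# Hypothetical input format: "<instance-id> <compute-host>", one per line.
timeouts_by_host() {
    # Tally the second column (compute host), skipping malformed lines,
    # then sort by count descending and host name ascending.
    awk 'NF >= 2 { count[$2]++ } END { for (host in count) print count[host], host }' "$1" |
        sort -k1,1nr -k2,2
}
```

If a handful of hosts dominate the top of that output, that would point at specific compute nodes rather than a cloud-wide problem.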
I should also note that the close qcow2 image task has been taking 5+ minutes for a lot longer than we've had these timeouts: http://logs.openstack.org/56/506956/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-oooq/69f8656/console.html.gz#_2017-09-24_17_50_56_076016
Oh, and on the image download front, we should not be using the floating ip for ci jobs. That's causing a minimum 3x increase in download time for me (see below). And the cloud is lightly loaded right now, so that's only going to get worse as we hit bottlenecks on the external connection from the cloud. Plus it appears to eliminate any advantage we gain from having 10 Gig connections between the servers of our cloud, because all traffic ends up bouncing off the bastion, which is likely only 1 Gig.
So I would suggest that our first change be to use the internal address of the mirror server so we aren't routing traffic out of the cloud for no reason. That is likely the cause of the slow image transfer when the cloud is maxed out.
[centos@test ~]$ time curl -sfL -C- -o _overcloud-full.tar http://192.168.103.253/builds/current-tripleo/overcloud-full.tar
real 0m14.238s
user 0m0.619s
sys 0m12.085s
[centos@test ~]$ rm _overcloud-full.tar
[centos@test ~]$ time curl -sfL -C- -o _overcloud-full.tar http://66.187.229.139/builds/current-tripleo/overcloud-full.tar
real 0m44.452s
user 0m2.687s
sys 0m20.921s
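To put a number on the slowdown, the two "real" values above can be compared directly. A rough sketch, where the helper names to_seconds and slowdown are made up for illustration and the inputs are the wall-clock times from the two curl runs:

```shell
#!/bin/sh
# Convert a bash `time` value such as "0m44.452s" into seconds.
to_seconds() {
    # Split on "m" and "s": field 1 is minutes, field 2 is seconds.
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times slower the second run was than the first.
slowdown() {
    awk -v fast="$(to_seconds "$1")" -v slow="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", slow / fast }'
}

slowdown 0m14.238s 0m44.452s   # prints 3.12
```

That is a single sample on a lightly loaded cloud, so treat the ratio as a lower bound rather than a measurement.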