scenario001 timeout: ceph-ansible + config-download takes more time than before
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | High | John Fulton |
Bug Description
We recently switched Ceph services to use external tasks (the config-download interface) to deploy Ceph via ceph-ansible instead of the previous Mistral workflow.
Previously, the Ceph deployment took 5 minutes:
http://
And now it takes 16 minutes:
http://
Also, you'll notice that the Ansible logs are now missing the INSTALLER STATUS section, so we can't figure out which tasks took more time than before.
It's causing random timeouts of scenario001 (probably scenario004 too?) in the gate. The overcloud deploy takes +18 min with config-download enabled on scenario001 (this can be observed in the overcloud deploy logs of the two links above).
We need to look at:
1) Why does the Ceph deployment take ~10 more minutes?
2) How much time does config-download add to any deployment? If necessary, increase the timeouts and document it.
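Since INSTALLER STATUS is gone, one way to find the slow tasks is to parse the per-task timing lines that ansible already prints. This is only a sketch: the "(H:MM:SS.mmm)" timing format matches the log excerpts quoted later in this report, but the script and sample lines are illustrative, not part of the fix.

```python
import re

# Sketch: pull per-task durations out of an ansible log. The parenthesized
# "(H:MM:SS.mmm)" field is the elapsed time of the task that just finished,
# as seen in the log excerpts quoted below in this report.
TIMING = re.compile(r"\((\d+):(\d+):(\d+(?:\.\d+)?)\)")

def durations(log_text):
    """Return the duration in seconds of each timing line found."""
    out = []
    for h, m, s in TIMING.findall(log_text):
        out.append(int(h) * 3600 + int(m) * 60 + float(s))
    return out

# Two sample lines in the same format as the excerpts below:
sample = (
    "2018-04-19 10:31:24,787 p=29695 u=mistral | Thursday 19 April 2018 "
    "10:31:24 +0000 (0:00:00.066) 0:02:46.284 ********\n"
    "2018-04-26 18:08:11,411 p=3567 u=mistral | Thursday 26 April 2018 "
    "18:08:11 +0000 (0:00:25.770) 0:12:03.101 ********\n"
)
print(sorted(durations(sample), reverse=True)[:5])  # slowest first
```

Running this over both jobs' full logs and diffing the slowest tasks would show where the extra ~10 minutes go.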
One thing with config-download is that we are not using SSH ControlMaster/ControlPersist, since that fails when we bounce the network interfaces when we apply NetworkDeployment.
This could cause each task to take ~1-3 s longer on average. With several hundred tasks, that adds up to several minutes, and could explain the difference between ceph-ansible under config-download vs. under the Mistral workflow.
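As a rough back-of-the-envelope check (the task count here is an assumed figure, not measured; only the ~1-3 s per-task penalty comes from the comment above):

```python
# Rough estimate of cumulative SSH overhead per deploy when every task
# opens a fresh connection (no ControlPersist). The task count is an
# illustrative assumption, not a measurement.
TASKS = 400                # assumed number of tasks in the ceph-ansible run
EXTRA_PER_TASK_S = (1, 3)  # ~1-3 s extra per task, per the comment above

low = TASKS * EXTRA_PER_TASK_S[0] / 60   # minutes
high = TASKS * EXTRA_PER_TASK_S[1] / 60  # minutes
print(f"extra time: {low:.0f}-{high:.0f} minutes")  # → extra time: 7-20 minutes
```

That range comfortably covers the observed ~10-minute regression, so connection multiplexing is a plausible suspect.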
We could change the ceph-ansible execution to use a new ansible.cfg that does
not disable ControlMaster, since by that point we shouldn't be doing anything
to network interfaces.
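A minimal sketch of what that alternate ansible.cfg could look like (the exact `ssh_args` values are an assumption, not the settings TripleO ships; the point is simply to re-enable connection multiplexing for the ceph-ansible step):

```ini
# Hypothetical ansible.cfg used only for the ceph-ansible execution,
# after NetworkDeployment has finished touching the interfaces.
[ssh_connection]
# Re-enable SSH connection multiplexing, which the main
# config-download run avoids.
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
pipelining = True
```

`pipelining = True` is an optional extra that further reduces per-task SSH round trips; it is independent of the ControlMaster question.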
Other than that, the other thing I spotted was:
mistral workflow job:
2018-04-19 10:31:24,787 p=29695 u=mistral | TASK [ceph-docker-common : pulling docker.io/ceph/daemon:v3.0.1-stable-3.0-luminous-centos-7-x86_64 image] ***
2018-04-19 10:31:24,787 p=29695 u=mistral | Thursday 19 April 2018 10:31:24 +0000 (0:00:00.066) 0:02:46.284 ********
2018-04-19 10:31:25,569 p=29695 u=mistral | ok: [192.168.24.14] =>
(took 1 sec)
vs.
config-download job:
2018-04-26 18:07:45,641 p=3567 u=mistral | TASK [ceph-docker-common : pulling 192.168.24.1:8787/ceph/daemon:v3.0.1-stable-3.0-luminous-centos-7-x86_64 image] ***
2018-04-26 18:08:11,411 p=3567 u=mistral | ok: [centos-7-rax-dfw-0003755461] =>
(took 26s)
and notice docker.io vs. 192.168.24.1. I can't explain that difference other
than differences in hosted cloud networking, although in the task output it
looks like pulling from 192.168.24.1 downloaded a lot more layers. Maybe there's
some configuration where we get only the final image from docker.io,
but we get all individual layers from the undercloud registry.
(added same comment to bug).