scenario001 timeout: ceph-ansible + config-download takes more time than before

Bug #1767252 reported by Emilien Macchi
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
High
John Fulton

Bug Description

We recently switched Ceph services to use external tasks (config-download interface) to deploy ceph via ceph-ansible instead of the previous mistral workflow.

Previously, the Ceph deployment took 5 minutes:
http://logs.openstack.org/32/555732/6/check/tripleo-ci-centos-7-scenario001-multinode-oooq-container/5569578/logs/undercloud/var/log/mistral/ceph-install-workflow.log.txt.gz

And now it takes 16 minutes:
http://logs.openstack.org/31/537931/5/check/tripleo-ci-centos-7-scenario001-multinode-oooq-container/66462ea/logs/undercloud/var/lib/mistral/5793f4d1-afab-4ad7-8c67-e3e035e60ec7/ceph-ansible/ceph_ansible_command.log.txt.gz
Also, you'll notice that ansible logs now miss INSTALLER STATUS so we can't figure out which tasks took more time than before.

It's causing random timeouts of scenario001 (probably scenario004 too?) in the gate. The overcloud takes +18 min with config-download enabled on scenario001 (can be observed on the 2 previous links, in overcloud deploy logs).

We need to look at:
1) Why the ceph deployments takes 10 more minutes?
2) Investigate how much time config download would add to any deployment, and eventually increase timeouts and document it.

Tags: ci
Revision history for this message
James Slagle (james-slagle) wrote :

One thing with config-download is that we are not using SSH
ControlMaster/ControlPersist since that fails when we bounce the networking
interfaces when we apply NetworkDeployment.

This could cause each task to take ~1-3s longer on average. With several
hundred tasks, this is several minutes, and could explain the difference
between ceph-ansible under config-download vs under the mistral worfklow.

We could change the ceph-ansible execution to use a new ansible.cfg that does
not disable ControlMaster, since by that point we shouldn't be doing anything
to network interfaces.

Other than that, the other thing I spotted was:

mistral workflow job:
2018-04-19 10:31:24,787 p=29695 u=mistral | TASK [ceph-docker-common : pulling docker.io/ceph/daemon:v3.0.1-stable-3.0-luminous-centos-7-x86_64 image] ***
2018-04-19 10:31:24,787 p=29695 u=mistral | Thursday 19 April 2018 10:31:24 +0000 (0:00:00.066) 0:02:46.284 ********
2018-04-19 10:31:25,569 p=29695 u=mistral | ok: [192.168.24.14] =>
(took 1 sec)

vs.

(config-download job)
2018-04-26 18:07:45,641 p=3567 u=mistral | TASK [ceph-docker-common : pulling 192.168.24.1:8787/ceph/daemon:v3.0.1-stable-3.0-luminous-centos-7-x86_64 image] ***
2018-04-26 18:08:11,411 p=3567 u=mistral | ok: [centos-7-rax-dfw-0003755461] =>
(took 26s)

and notice docker.io vs 192.168.24.1. I can't explain that difference other
than differences in hosted cloud networking, although in the task output it
looks like downloading 192.168.24.1 pulled a lot more layers. Maybe there's
some configuration where we are getting only the final image from docker.io,
but we get all individual layers from the undercloud registry.

(added same comment to bug).

Revision history for this message
John Fulton (jfulton-org) wrote :
Download full text (7.8 KiB)

I wrote a quick script [1] to compare the timestamps from the logs to answer the following two questions for all tasks which took 1 second or greater:

1. Which tasks took the longest from the direct mistral?

[fultonj@skagra lp1767252]$ ./compare.py mistral_tasks
0:00:26.284000: [ceph-osd : get default vm.min_free_kbytes]
0:00:13.623000: [ceph-mon : assign rbd application to pool(s)]
0:00:11.325000: [ceph-osd : automatic prepare ceph containerized osd disk collocated]
0:00:07.031000: [ceph-mon : create openstack cephx key(s)]
0:00:05.668000: [gather and delegate facts]
0:00:05.186000: [ceph-mon : fetch openstack cephx key(s)]
0:00:05.185000: [set ceph monitor install 'Complete']
0:00:04.929000: [ceph-client : slurp client cephx key(s)]
0:00:02.920000: [ceph-docker-common : install python-docker-py on red hat / centos]
0:00:02.779000: [ceph-mon : create ceph rest api keyring when mon is containerized]
0:00:02.642000: [ceph-mgr : add modules to ceph-mgr]
0:00:02.573000: [ceph-docker-common : install python-docker on ubuntu]
0:00:02.273000: [ceph-mon : try to copy ceph keys]
0:00:02.008000: [set ceph manager install 'Complete']
0:00:01.847000: [ceph-mon : populate kv_store with default ceph.conf]
0:00:01.762000: [set ceph osd install 'Complete']
0:00:01.675000: [ceph-mgr : generate systemd unit file]
0:00:01.564000: [ceph-osd : increase aio-max-nr for bluestore]
0:00:01.471000: [ceph-mon : systemd start mon container]
0:00:01.451000: [ceph-config : create ceph conf directory]
0:00:01.436000: [pull ceph/daemon image]
0:00:01.371000: [ceph-mgr : try to fetch ceph config]
0:00:01.232000: [ceph-osd : systemd start osd container]
0:00:01.217000: [ceph-mon : wait for monitor socket to exist]
0:00:01.206000: [ceph-osd : generate ceph osd docker run script]
0:00:01.165000: [ceph-docker-common : set_fact ceph_version ceph_version.stdout.split]
0:00:01.119000: [ceph-mon : setfacl for openstack cephx key(s) on the other mons and this mon]
0:00:01.107000: [ceph-mgr : systemd start mgr container]
0:00:01.090000: [ceph-mon : copy to other mons the openstack cephx key(s)]
0:00:01.013000: [ceph-mon : stat for ceph mgr key(s)]
0:00:00.989000: [ceph-osd : generate systemd unit file]

2. Which tasks took the longest from config_download?

[fultonj@skagra lp1767252]$ ./compare.py config-download_tasks
0:00:29.793000: [ceph-docker-common : fail if systemd is not present]
0:00:27.049000: [ceph-mon : assign rbd application to pool(s)]
0:00:18.522000: [ceph-mon : create openstack cephx key(s)]
0:00:17.241000: [Gathering Facts]
0:00:15.249000: [ceph-mon : create ceph rest api keyring when mon is containerized]
0:00:14.135000: [pull ceph/daemon image]
0:00:13.543000: [ceph-osd : automatic prepare ceph containerized osd disk collocated]
0:00:12.448000: [ceph-config : create ceph conf directory]
0:00:11.758000: [ceph-mon : fetch openstack cephx key(s)]
0:00:10.365000: [ceph-osd : increase aio-max-nr for bluestore]
0:00:08.043000: [ceph-client : slurp client cephx key(s)]
0:00:08.038000: [set ceph osd install 'Complete']
0:00:08.022000: [ceph-docker-common : install python-docker-py on red hat / centos]
0:00:07.708000: [set ceph manager install 'Complete']
0:00:07.557...

Read more...

Revision history for this message
John Fulton (jfulton-org) wrote :

James wrote:
> We could change the ceph-ansible execution to use a new ansible.cfg that does
> not disable ControlMaster, since by that point we shouldn't be doing anything
> to network interfaces.

This seems to fit the data since of the ~300 tasks each job runs:
- With config download 72 tasks took 2 or more seconds
- With direct mistral 14 tasks took 2 or more seconds + 16 tasks took between 1 or 2 seconds

ceph-ansible ships with its own ansible.cfg [1] and the mistral triggered job used it.

[1] https://github.com/ceph/ceph-ansible/blob/master/ansible.cfg

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/564833

Changed in tripleo:
assignee: nobody → John Fulton (jfulton-org)
status: Triaged → In Progress
Revision history for this message
John Fulton (jfulton-org) wrote :

https://review.openstack.org/564833 brought the runtime down to 7 minutes:

2018-04-27 18:33:16,965
2018-04-27 18:40:41,120

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/564833
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=cf8c443ffb2c5afaa2257d1834c391204c100759
Submitter: Zuul
Branch: master

commit cf8c443ffb2c5afaa2257d1834c391204c100759
Author: John Fulton <email address hidden>
Date: Fri Apr 27 12:27:47 2018 -0400

    Execute ceph-ansible with its own ansible.cfg

    Modify the ceph-ansible execution so that it uses the ansible.cfg
    that is shipped with ceph-ansible. ceph-ansible's own testing uses
    this file and the project's maintainers recommend it be used.

    This configuration file also uses ControlMaster=auto which is not
    used by config download at this time. Adding this setting to the
    ceph-ansible execution should make it take less time.

    Change-Id: I774132eb8b7a43c921c0696635d56b4136301346
    Closes-Bug: 1767252

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.0.0.0b3

This issue was fixed in the openstack/tripleo-heat-templates 9.0.0.0b3 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.