Bug #1767252 “scenario001 timeout: ceph-ansible + config-downloa...” : Bugs : tripleo

Revision history for this message

James Slagle (james-slagle) wrote on 2018-04-27:

#1

One thing with config-download is that we are not using SSH
ControlMaster/ControlPersist since that fails when we bounce the networking
interfaces when we apply NetworkDeployment.

This could cause each task to take ~1-3s longer on average. With several
hundred tasks, this is several minutes, and could explain the difference
between ceph-ansible under config-download vs under the mistral worfklow.

We could change the ceph-ansible execution to use a new ansible.cfg that does
not disable ControlMaster, since by that point we shouldn't be doing anything
to network interfaces.

Other than that, the other thing I spotted was:

mistral workflow job:
2018-04-19 10:31:24,787 p=29695 u=mistral | TASK [ceph-docker-common : pulling docker.io/ceph/daemon:v3.0.1-stable-3.0-luminous-centos-7-x86_64 image] ***
2018-04-19 10:31:24,787 p=29695 u=mistral | Thursday 19 April 2018 10:31:24 +0000 (0:00:00.066) 0:02:46.284 ********
2018-04-19 10:31:25,569 p=29695 u=mistral | ok: [192.168.24.14] =>
(took 1 sec)

vs.

(config-download job)
2018-04-26 18:07:45,641 p=3567 u=mistral | TASK [ceph-docker-common : pulling 192.168.24.1:8787/ceph/daemon:v3.0.1-stable-3.0-luminous-centos-7-x86_64 image] ***
2018-04-26 18:08:11,411 p=3567 u=mistral | ok: [centos-7-rax-dfw-0003755461] =>
(took 26s)

and notice docker.io vs 192.168.24.1. I can't explain that difference other
than differences in hosted cloud networking, although in the task output it
looks like downloading 192.168.24.1 pulled a lot more layers. Maybe there's
some configuration where we are getting only the final image from docker.io,
but we get all individual layers from the undercloud registry.

(added same comment to bug).

Revision history for this message

John Fulton (jfulton-org) wrote on 2018-04-27:

#2

Download full text (7.8 KiB)

I wrote a quick script [1] to compare the timestamps from the logs to answer the following two questions for all tasks which took 1 second or greater:

1. Which tasks took the longest from the direct mistral?

[fultonj@skagra lp1767252]$ ./compare.py mistral_tasks
0:00:26.284000: [ceph-osd : get default vm.min_free_kbytes]
0:00:13.623000: [ceph-mon : assign rbd application to pool(s)]
0:00:11.325000: [ceph-osd : automatic prepare ceph containerized osd disk collocated]
0:00:07.031000: [ceph-mon : create openstack cephx key(s)]
0:00:05.668000: [gather and delegate facts]
0:00:05.186000: [ceph-mon : fetch openstack cephx key(s)]
0:00:05.185000: [set ceph monitor install 'Complete']
0:00:04.929000: [ceph-client : slurp client cephx key(s)]
0:00:02.920000: [ceph-docker-common : install python-docker-py on red hat / centos]
0:00:02.779000: [ceph-mon : create ceph rest api keyring when mon is containerized]
0:00:02.642000: [ceph-mgr : add modules to ceph-mgr]
0:00:02.573000: [ceph-docker-common : install python-docker on ubuntu]
0:00:02.273000: [ceph-mon : try to copy ceph keys]
0:00:02.008000: [set ceph manager install 'Complete']
0:00:01.847000: [ceph-mon : populate kv_store with default ceph.conf]
0:00:01.762000: [set ceph osd install 'Complete']
0:00:01.675000: [ceph-mgr : generate systemd unit file]
0:00:01.564000: [ceph-osd : increase aio-max-nr for bluestore]
0:00:01.471000: [ceph-mon : systemd start mon container]
0:00:01.451000: [ceph-config : create ceph conf directory]
0:00:01.436000: [pull ceph/daemon image]
0:00:01.371000: [ceph-mgr : try to fetch ceph config]
0:00:01.232000: [ceph-osd : systemd start osd container]
0:00:01.217000: [ceph-mon : wait for monitor socket to exist]
0:00:01.206000: [ceph-osd : generate ceph osd docker run script]
0:00:01.165000: [ceph-docker-common : set_fact ceph_version ceph_version.stdout.split]
0:00:01.119000: [ceph-mon : setfacl for openstack cephx key(s) on the other mons and this mon]
0:00:01.107000: [ceph-mgr : systemd start mgr container]
0:00:01.090000: [ceph-mon : copy to other mons the openstack cephx key(s)]
0:00:01.013000: [ceph-mon : stat for ceph mgr key(s)]
0:00:00.989000: [ceph-osd : generate systemd unit file]

2. Which tasks took the longest from config_download?

[fultonj@skagra lp1767252]$ ./compare.py config-download_tasks
0:00:29.793000: [ceph-docker-common : fail if systemd is not present]
0:00:27.049000: [ceph-mon : assign rbd application to pool(s)]
0:00:18.522000: [ceph-mon : create openstack cephx key(s)]
0:00:17.241000: [Gathering Facts]
0:00:15.249000: [ceph-mon : create ceph rest api keyring when mon is containerized]
0:00:14.135000: [pull ceph/daemon image]
0:00:13.543000: [ceph-osd : automatic prepare ceph containerized osd disk collocated]
0:00:12.448000: [ceph-config : create ceph conf directory]
0:00:11.758000: [ceph-mon : fetch openstack cephx key(s)]
0:00:10.365000: [ceph-osd : increase aio-max-nr for bluestore]
0:00:08.043000: [ceph-client : slurp client cephx key(s)]
0:00:08.038000: [set ceph osd install 'Complete']
0:00:08.022000: [ceph-docker-common : install python-docker-py on red hat / centos]
0:00:07.708000: [set ceph manager install 'Complete']
0:00:07.557...

I wrote a quick script [1] to compare the timestamps from the logs to answer the following two questions for all tasks which took 1 second or greater:

1. Which tasks took the longest from the direct mistral?

[fultonj@skagra lp1767252]$ ./compare.py mistral_tasks 
0:00:26.284000: [ceph-osd : get default vm.min_free_kbytes]
0:00:13.623000: [ceph-mon : assign rbd application to pool(s)]
0:00:11.325000: [ceph-osd : automatic prepare ceph containerized osd disk collocated]
0:00:07.031000: [ceph-mon : create openstack cephx key(s)]
0:00:05.668000: [gather and delegate facts]
0:00:05.186000: [ceph-mon : fetch openstack cephx key(s)]
0:00:05.185000: [set ceph monitor install 'Complete']
0:00:04.929000: [ceph-client : slurp client cephx key(s)]
0:00:02.920000: [ceph-docker-common : install python-docker-py on red hat / centos]
0:00:02.779000: [ceph-mon : create ceph rest api keyring when mon is containerized]
0:00:02.642000: [ceph-mgr : add modules to ceph-mgr]
0:00:02.573000: [ceph-docker-common : install python-docker on ubuntu]
0:00:02.273000: [ceph-mon : try to copy ceph keys]
0:00:02.008000: [set ceph manager install 'Complete']
0:00:01.847000: [ceph-mon : populate kv_store with default ceph.conf]
0:00:01.762000: [set ceph osd install 'Complete']
0:00:01.675000: [ceph-mgr : generate systemd unit file]
0:00:01.564000: [ceph-osd : increase aio-max-nr for bluestore]
0:00:01.471000: [ceph-mon : systemd start mon container]
0:00:01.451000: [ceph-config : create ceph conf directory]
0:00:01.436000: [pull ceph/daemon image]
0:00:01.371000: [ceph-mgr : try to fetch ceph config]
0:00:01.232000: [ceph-osd : systemd start osd container]
0:00:01.217000: [ceph-mon : wait for monitor socket to exist]
0:00:01.206000: [ceph-osd : generate ceph osd docker run script]
0:00:01.165000: [ceph-docker-common : set_fact ceph_version ceph_version.stdout.split]
0:00:01.119000: [ceph-mon : setfacl for openstack cephx key(s) on the other mons and this mon]
0:00:01.107000: [ceph-mgr : systemd start mgr container]
0:00:01.090000: [ceph-mon : copy to other mons the openstack cephx key(s)]
0:00:01.013000: [ceph-mon : stat for ceph mgr key(s)]
0:00:00.989000: [ceph-osd : generate systemd unit file]

2. Which tasks took the longest from config_download?

[fultonj@skagra lp1767252]$ ./compare.py config-download_tasks 
0:00:29.793000: [ceph-docker-common : fail if systemd is not present]
0:00:27.049000: [ceph-mon : assign rbd application to pool(s)]
0:00:18.522000: [ceph-mon : create openstack cephx key(s)]
0:00:17.241000: [Gathering Facts]
0:00:15.249000: [ceph-mon : create ceph rest api keyring when mon is containerized]
0:00:14.135000: [pull ceph/daemon image]
0:00:13.543000: [ceph-osd : automatic prepare ceph containerized osd disk collocated]
0:00:12.448000: [ceph-config : create ceph conf directory]
0:00:11.758000: [ceph-mon : fetch openstack cephx key(s)]
0:00:10.365000: [ceph-osd : increase aio-max-nr for bluestore]
0:00:08.043000: [ceph-client : slurp client cephx key(s)]
0:00:08.038000: [set ceph osd install 'Complete']
0:00:08.022000: [ceph-docker-common : install python-docker-py on red hat / centos]
0:00:07.708000: [set ceph manager install 'Complete']
0:00:07.557000: [ceph-mon : copy to other mons the openstack cephx key(s)]
0:00:07.518000: [gather and delegate facts]
0:00:07.347000: [ceph-client : create ceph pool(s)]
0:00:06.958000: [set ceph monitor install 'In Progress']
0:00:06.877000: [set ceph monitor install 'Complete']
0:00:06.640000: [ceph-mgr : add modules to ceph-mgr]
0:00:05.822000: [ceph-mon : systemd start mon container]
0:00:05.558000: [ceph-mon : wait for monitor socket to exist]
0:00:05.459000: [ceph-mgr : systemd start mgr container]
0:00:05.452000: [ceph-osd : systemd start osd container]
0:00:05.263000: [ceph-osd : generate systemd unit file]
0:00:05.047000: [ceph-docker-common : allow apt to use a repository over https (debian)]
0:00:04.964000: [ceph-osd : get default vm.min_free_kbytes]
0:00:04.955000: [ceph-docker-common : install python-docker on ubuntu]
0:00:04.949000: [ceph-osd : copy ceph key(s) if needed]
0:00:04.946000: [ceph-mgr : set mgr key permissions]
0:00:04.155000: [ceph-config : set fsid fact when generate_fsid = true]
0:00:04.092000: [ceph-osd : set_fact ceph_disk_cli_options '--cluster ceph --bluestore']
0:00:03.454000: [ceph-docker-common : set_fact ceph_version ceph_version.stdout.split]
0:00:03.448000: [ceph-osd : generate ceph osd docker run script]
0:00:03.270000: [ceph-mgr : disable ceph mgr enabled modules]
0:00:03.211000: [ceph-mon : stat for ceph mgr key(s)]
0:00:03.137000: [ceph-defaults : check if /var/lib/mistral/5793f4d1-afab-4ad7-8c67-e3e035e60ec7/ceph-ansible/fetch_dir directory exists]
0:00:02.961000: [ceph-osd : check the partition status of the osd disks]
0:00:02.949000: [ceph-client : set docker_exec_client_cmd for containers]
0:00:02.918000: [ceph-docker-common : pause after docker install before starting (on openstack vms)]
0:00:02.885000: [ceph-mgr : get enabled modules from ceph-mgr]
0:00:02.880000: [ceph-mon : configure ceph profile.d aliases]
0:00:02.797000: [ceph-mon : ipv6 - force peer addition as potential bootstrap peer for cluster bringup - monitor_interface]
0:00:02.771000: [ceph-client : get client cephx keys]
0:00:02.736000: [ceph-mon : ipv4 - force peer addition as potential bootstrap peer for cluster bringup - monitor_interface]
0:00:02.692000: [ceph-docker-common : inspect ceph osd container]
0:00:02.665000: [set_fact is_atomic]
0:00:02.662000: [ceph-docker-common : inspecting ceph rbd mirror container image before pulling]
0:00:02.648000: [ceph-defaults : check for a mds container]
0:00:02.624000: [ceph-docker-common : set_fact ceph_config_keys]
0:00:02.620000: [ceph-docker-common : inspecting ceph rgw container image before pulling]
0:00:02.619000: [ceph-docker-common : inspecting ceph osd container image before pulling]
0:00:02.616000: [ceph-mon : fetch ceph mgr key(s)]
0:00:02.606000: [ceph-docker-common : inspecting 192.168.24.1:8787/ceph/daemon:v3.0.1-stable-3.0-luminous-centos-7-x86_64 image after pulling]
0:00:02.588000: [ceph-defaults : check for an osd container]
0:00:02.578000: [ceph-docker-common : inspect ceph rbd mirror container]
0:00:02.578000: [ceph-docker-common : inspect ceph mds container]
0:00:02.574000: [ceph-docker-common : install ntp]
0:00:02.572000: [ceph-mgr : install ceph-mgr package on RedHat or SUSE]
0:00:02.565000: [ceph-docker-common : set_fact image_repodigest_after_pulling]
0:00:02.563000: [ceph-mon : configure crush hierarchy]
0:00:02.561000: [ceph-osd : include scenarios/collocated.yml]
0:00:02.540000: [ceph-defaults : set_fact build devices from resolved symlinks]
0:00:02.536000: [ceph-defaults : check for a rbd mirror container]
0:00:02.532000: [ceph-config : generate ceph.conf configuration file]
0:00:02.531000: [ceph-docker-common : set_fact ceph_docker_version ceph_docker_version.stdout.split]
0:00:02.527000: [ceph-osd : create gpt disk label]
0:00:02.512000: [ceph-osd : set_fact vm_min_free_kbytes]
0:00:02.512000: [ceph-mgr : copy ceph keyring(s) if needed]
0:00:02.503000: [ceph-osd : disable transparent hugepage]
0:00:02.430000: [ceph-defaults : set_fact is_atomic]
0:00:02.181000: [ceph-mon : try to copy ceph keys]
0:00:00.573000: [set ceph client install 'In Progress']

[1] 
import sys
my_file = sys.argv[1]

from dateutil import parser
counts = {}
with open(my_file) as f:
    prev_time = 0
    for line in f:
        line_list = line.split()
        this_time = line_list[0]
        line_list.pop(0)
        key = ' '.join(line_list)
        if prev_time != 0:
            prev_time_dt = parser.parse(prev_time)
            this_time_dt = parser.parse(this_time)
            diff_dt = this_time_dt - prev_time_dt
            counts[key] = str(diff_dt)
        prev_time = this_time # for the next loop
f.close()

for key, value in sorted(counts.iteritems(), key=lambda (k,v): (v,k), reverse=True):
    print "%s: %s" % (value, key)

Revision history for this message

John Fulton (jfulton-org) wrote on 2018-04-27:

#3

James wrote:
> We could change the ceph-ansible execution to use a new ansible.cfg that does
> not disable ControlMaster, since by that point we shouldn't be doing anything
> to network interfaces.

This seems to fit the data since of the ~300 tasks each job runs:
- With config download 72 tasks took 2 or more seconds
- With direct mistral 14 tasks took 2 or more seconds + 16 tasks took between 1 or 2 seconds

ceph-ansible ships with its own ansible.cfg [1] and the mistral triggered job used it.

[1] https://github.com/ceph/ceph-ansible/blob/master/ansible.cfg

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-04-27: Fix proposed to tripleo-heat-templates (master)

#4

Fix proposed to branch: master
Review: https://review.openstack.org/564833

Changed in tripleo:
assignee:	nobody → John Fulton (jfulton-org)
status:	Triaged → In Progress

Revision history for this message

John Fulton (jfulton-org) wrote on 2018-04-27:

#5

https://review.openstack.org/564833 brought the runtime down to 7 minutes:

2018-04-27 18:33:16,965
2018-04-27 18:40:41,120

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-05-01: Fix merged to tripleo-heat-templates (master)

#6

Reviewed: https://review.openstack.org/564833
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=cf8c443ffb2c5afaa2257d1834c391204c100759
Submitter: Zuul
Branch: master

commit cf8c443ffb2c5afaa2257d1834c391204c100759
Author: John Fulton <email address hidden>
Date: Fri Apr 27 12:27:47 2018 -0400

Execute ceph-ansible with its own ansible.cfg

    Modify the ceph-ansible execution so that it uses the ansible.cfg
    that is shipped with ceph-ansible. ceph-ansible's own testing uses
    this file and the project's maintainers recommend it be used.

    This configuration file also uses ControlMaster=auto which is not
    used by config download at this time. Adding this setting to the
    ceph-ansible execution should make it take less time.

Change-Id: I774132eb8b7a43c921c0696635d56b4136301346
Closes-Bug: 1767252

Changed in tripleo:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-06-05: Fix included in openstack/tripleo-heat-templates 9.0.0.0b3

#7

This issue was fixed in the openstack/tripleo-heat-templates 9.0.0.0b3 development milestone.

tripleo

scenario001 timeout: ceph-ansible + config-download takes more time than before

Bug Description

Other bug subscribers

Remote bug watches