tripleo

image_uploader: default multiprocessing.Pool is killing small underclouds

Bug #1746305 reported by Emilien Macchi on 2018-01-30

Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   High         Steve Baker   tripleo queens-rc1

Bug Description

This bug was found by Red Hat QE when testing Queens on internal CI.
The original bug report is here: https://bugzilla.redhat.com/show_bug.cgi?id=1538338

The issue occurs when running `openstack overcloud container image upload`, which randomly fails after a few uploads.
The command needs to be run again, and it fails again after a few more uploads.
CPU usage sometimes goes over 150%, with the dockerd-current process consuming a lot of resources for short periods.

Flavor of the undercloud: 16 GB of RAM and 6 vCPUs (the undercloud is a VM).

When the multiprocessing.Pool sizes in tripleo_common/image/image_uploader.py are reduced to 2 (instead of 16 and 4), the image upload works fine.

So maybe we could reduce the defaults to sane values (even if uploads are slower) and add options to increase them, perhaps in undercloud.conf?
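
The workaround above can be sketched roughly as follows. This is only an illustration, not the code in tripleo_common/image/image_uploader.py: `upload_image`, `upload_all` and the `workers` parameter are hypothetical names, and `ThreadPool` (which shares the `multiprocessing.Pool` interface) is swapped in so the example stays self-contained.

```python
from multiprocessing.pool import ThreadPool


def upload_image(name):
    # Hypothetical stand-in for the real per-image pull/push task.
    return "uploaded " + name


def upload_all(image_names, workers=2):
    # A small pool bounds how many uploads run at the same time, which
    # in turn bounds the load placed on dockerd. The default of 2 is the
    # value reported to work on the 6 vCPU undercloud.
    pool = ThreadPool(workers)
    try:
        # map() returns results in the same order as image_names.
        return pool.map(upload_image, image_names)
    finally:
        pool.close()
        pool.join()
```

Exposing `workers` as a parameter is one way to keep a conservative default while still letting larger underclouds raise it.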

Emilien Macchi (emilienm) on 2018-01-30

Changed in tripleo:
assignee:	nobody → Emilien Macchi (emilienm)

Emilien Macchi (emilienm) on 2018-01-30

description:	updated

Steve Baker (steve-stevebaker) wrote on 2018-01-30:

It looks like upload is CPU bound on dockerd. There is one dockerd worker thread per core, and I suspect one "docker pull" call will split work between all available workers by downloading layers in parallel.

I've got an 8-core undercloud here, so I'll come up with a heuristic for pinning the upload workers to the number of cores.

Changed in tripleo:
assignee:	Emilien Macchi (emilienm) → Steve Baker (steve-stevebaker)

OpenStack Infra (hudson-openstack) wrote on 2018-01-31: Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/539383

Changed in tripleo:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2018-01-31: Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/539383
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=e2dac0ef415c51792f1f779c05a9c439ae5f5265
Submitter: Zuul
Branch: master

commit e2dac0ef415c51792f1f779c05a9c439ae5f5265
Author: Steve Baker <email address hidden>
Date: Wed Jan 31 13:45:13 2018 +1300

Make container image upload more resilient

    This change implements a retry loop for the image push, just like the
    existing one for pull. This will be necessary for some non-undercloud
    registries which might periodically fail on some pushes.

    This change also determines the upload worker count to be half of the
    CPU count, with a minimum of two. This is to make CI more resilient on
    smaller (6 core) flavors. A docker pull causes high CPU load in
    dockerd, as the download of each layer is spread across dockerd's
    workers.

Change-Id: Ia30658e3283d4b69d2bd8b0dddd375e1918169d3
Closes-Bug: #1746305
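
As a rough illustration of the two ideas in the commit message above (a worker count of half the CPU count with a minimum of two, and a retry loop around a push), here is a minimal standard-library sketch. It is not the merged implementation: the function names, the use of integer division for "half", and the `attempts`/`delay` parameters are all assumptions made for the example.

```python
import multiprocessing
import time


def upload_worker_count(cpu_count=None):
    # Half of the CPU count, but never fewer than two workers.
    # Rounding down with integer division is an assumption.
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    return max(2, cpu_count // 2)


def call_with_retries(func, attempts=3, delay=0.0):
    # Call func() up to `attempts` times and return its first
    # successful result; re-raise the last error if all attempts fail.
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

Under these assumptions, the 6 vCPU flavor from the bug description gets 3 upload workers and an 8-core undercloud gets 4.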

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2018-03-03: Fix included in openstack/tripleo-common 8.5.0

This issue was fixed in the openstack/tripleo-common 8.5.0 release.
