image_uploader: default multiprocessing.Pool is killing small underclouds

Bug #1746305 reported by Emilien Macchi
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Steve Baker

Bug Description

This bug was found by Red Hat QE while testing Queens on internal CI.
The original bug report is here: https://bugzilla.redhat.com/show_bug.cgi?id=1538338

The issue occurs when running `openstack overcloud container image upload`, which randomly fails after a few uploads.
The command has to be run again, and it fails again after a few more uploads.
CPU usage sometimes goes above 150%, with the dockerd-current process consuming a lot of resources for short periods.

Undercloud flavor: 16 GB of RAM and 6 vCPUs (the undercloud is a VM).

When the multiprocessing.Pool size in tripleo_common/image/image_uploader.py is reduced to 2, instead of 16 and 4, the image upload works fine.

So maybe we could reduce the defaults to sane values (even if slower) and add options to increase them, perhaps in undercloud.conf?
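
The proposal above can be sketched roughly as follows. This is only an illustration, not the code in tripleo_common/image/image_uploader.py: `resolve_worker_count`, `upload_all`, `_upload_one` and the `configured` value (standing in for whatever an option in undercloud.conf might supply) are hypothetical names. The default of 2 is the value reported to work in this bug.

```python
import multiprocessing

# Conservative default: the pool size reported to work on a 6 vCPU undercloud.
DEFAULT_WORKERS = 2


def resolve_worker_count(configured=None, default=DEFAULT_WORKERS):
    """Return the pool size, preferring an operator-supplied value.

    `configured` is a hypothetical setting (string or int) that an
    operator could use to raise the worker count on larger hosts.
    """
    if configured is None or str(configured).strip() == "":
        return default
    # Never go below one worker, whatever the operator configured.
    return max(1, int(configured))


def _upload_one(image_name):
    # Placeholder for the real per-image pull/tag/push work.
    return "uploaded %s" % image_name


def upload_all(image_names, configured=None):
    """Upload images using a pool sized by resolve_worker_count()."""
    pool = multiprocessing.Pool(resolve_worker_count(configured))
    try:
        return pool.map(_upload_one, image_names)
    finally:
        pool.close()
        pool.join()
```

With no setting, `upload_all(images)` runs two workers; passing `configured="8"` would let a larger undercloud opt in to more parallelism.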

Tags: containers
Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
description: updated
Steve Baker (steve-stevebaker) wrote :

It looks like the upload is CPU-bound on dockerd. There is one dockerd worker thread per core, and I suspect a single "docker pull" call splits its work across all available workers by downloading layers in parallel.

I've got an 8-core undercloud here, so I'll come up with a heuristic for pinning the number of upload workers to the number of cores.

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Steve Baker (steve-stevebaker)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/539383

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/539383
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=e2dac0ef415c51792f1f779c05a9c439ae5f5265
Submitter: Zuul
Branch: master

commit e2dac0ef415c51792f1f779c05a9c439ae5f5265
Author: Steve Baker <email address hidden>
Date: Wed Jan 31 13:45:13 2018 +1300

    Make container image upload more resilient

    This change implements a retry loop for the image push, just like the
    existing one for pull. This will be necessary for some non-undercloud
    registries which might periodically fail on some pushes.

    This change also determines the upload worker count to be half of the
    CPU count, with a minimum of two. This is to make CI more resilient on
    smaller (6 core) flavors. A docker pull causes high CPU load in
    dockerd, as the download of each layer is spread across dockerd's
    workers.

    Change-Id: Ia30658e3283d4b69d2bd8b0dddd375e1918169d3
    Closes-Bug: #1746305
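
The two behaviours described in the commit message can be sketched like this. It is a minimal illustration under assumptions, not the merged code: `upload_worker_count` and `push_with_retry` are hypothetical names, the rounding (integer division) is assumed because the commit message does not specify it, and the attempt count and delay are arbitrary.

```python
import multiprocessing
import time


def upload_worker_count(cpu_count=None):
    """Half of the CPU count, with a minimum of two workers."""
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    # Integer division is an assumption; a 6-core flavor gives 3 workers.
    return max(2, cpu_count // 2)


def push_with_retry(push, attempts=5, delay=1.0):
    """Call push() until it succeeds or the attempts are exhausted.

    `push` is any zero-argument callable that raises on failure.
    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return push()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                # Brief pause before trying the push again.
                time.sleep(delay)
    raise last_error
```

Under these assumptions the 6 vCPU undercloud from the bug description would get 3 upload workers, and the 8-core undercloud mentioned in the comments would get 4.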

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.5.0

This issue was fixed in the openstack/tripleo-common 8.5.0 release.
