After the switch to podman & skopeo, undercloud deployment takes +60% longer

Bug #1797525 reported by Bogdan Dobrelya
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Key points to keep in mind:

* https://review.openstack.org/#/c/600517/ (Oct 3) fs027 switched to podman runtime
* https://review.openstack.org/#/c/604664/ (Oct 9) skopeo switch for preparing container images

After the switch to podman, undercloud deployment takes +60% longer, and even longer after the skopeo switch.

See the profiling data https://pastebin.com/pYPUp0sC showing the comparison of https://review.openstack.org/#/c/606220 (podman, not yet skopeo)
vs
https://review.openstack.org/#/c/604664/ (no podman yet; CI job results for that patch taken on Oct 3)

Note that CI job results for the same https://review.openstack.org/#/c/604664/ patch taken on Oct 9, after the podman switch, show the same +60% time increase.

There is also profiling data comparing https://pastebin.com/pYPUp0sC to
https://pastebin.com/wJfYT9e, the latter with the skopeo fix https://review.openstack.org/#/c/609586/ in place.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note that this issue also seems like a major showstopper for other multi-node CI jobs, which are now hitting timeouts for Mistral workflows and in other places (see also https://bugs.launchpad.net/tripleo/+bug/1789680 )

Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
milestone: none → stein-1
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Small thing regarding skopeo: it will more than probably be dropped, due to several bugs we had, especially when running skopeo from within the mistral containers.
The following patch drops skopeo and re-implements its basic functions in plain Python:
https://review.openstack.org/609586

Regarding podman, there are some things to keep in mind, especially regarding performance: it uses strict locking on its database, preventing any concurrent access.
A "good" way to see that is to just run "podman ps" while an undercloud is being deployed with podman; it usually hangs until it can get access.
Imagine what happens with the 30+ containers being created during a deploy... That can explain a lot (maybe not the whole 60%, but at least a good part of it).
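
A minimal sketch (not from this bug thread) of how one could observe that lock contention while a deploy is running, assuming podman is on PATH; the probe count and interval are arbitrary:

    #!/usr/bin/env python3
    # Sample how long `podman ps` takes while a deployment is creating
    # containers, to observe libpod database lock contention.
    import subprocess
    import time

    SAMPLES = 20    # number of probes, arbitrary for illustration
    INTERVAL = 5    # seconds between probes

    for i in range(SAMPLES):
        start = time.monotonic()
        # `podman ps` blocks until it can acquire the database lock
        subprocess.run(["podman", "ps", "--format", "{{.Names}}"],
                       stdout=subprocess.DEVNULL, check=False)
        print("probe %2d: podman ps took %.2fs" % (i, time.monotonic() - start))
        time.sleep(INTERVAL)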

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart (master)

Fix proposed to branch: master
Review: https://review.openstack.org/609963

Changed in tripleo:
assignee: nobody → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
tags: added: alert
description: updated
tags: added: containers
description: updated
description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

It seems there is no need to revert the skopeo switch; there is an alternative fix: https://review.openstack.org/#/c/609586/

description: updated
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Steve Baker (steve-stevebaker)
wes hayutin (weshayutin)
tags: added: promotion-blocker
Revision history for this message
Steven Baker (srbaker) wrote : Re: [Bug 1797525] [NEW] After the switch to podman & skopeo, undercloud deployment takes +60% longer

Hi,

I'm the wrong Steven Baker, but I got an email about this bug. I can't figure out how to remove myself from the bug.

-Steven


Changed in tripleo:
assignee: Steve Baker (steve-stevebaker) → Bogdan Dobrelya (bogdando)
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Steve Baker (steve-stevebaker)
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

BTW, the skopeo uploader has landed but is not used until https://review.openstack.org/#/c/590087/ lands. That change should not land yet because it will be really slow.

The change https://review.openstack.org/#/c/609586/ replaces the "skopeo inspect" calls for a nice speed improvement, but the skopeo uploader uses "skopeo copy" to transfer images.

skopeo copy from local storage to the undercloud registry is slow because it does not detect when a layer is already in the registry, so there are a lot of duplicate transfers. This is being tracked upstream in the GitHub containers/skopeo and containers/image projects.
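
For illustration only (this is not code referenced in this bug): the missing deduplication corresponds to the Registry HTTP API v2 blob-existence check, where a client issues a HEAD request for a layer digest before uploading it. The registry URL and image name below are hypothetical:

    # Check whether a layer blob already exists in the target registry
    # before transferring it (the check skopeo copy was not doing).
    import requests

    REGISTRY = "http://192.168.24.1:8787"   # assumed undercloud registry URL
    IMAGE = "tripleomaster/centos-binary-nova-api"  # example image name

    def blob_exists(digest):
        """Return True if the registry already has this layer blob."""
        url = "%s/v2/%s/blobs/%s" % (REGISTRY, IMAGE, digest)
        return requests.head(url).status_code == 200

    # Example usage: skip the upload when the blob is already present.
    digest = "sha256:" + "0" * 64   # placeholder digest
    if blob_exists(digest):
        print("layer already present, skipping", digest)
    else:
        print("would upload layer", digest)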

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/609941
Reason: we have https://review.openstack.org/#/c/609586/ merged and hopefully no longer need to revert

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Given that we do not see timeouts across all jobs because of these podman and remaining skopeo layer issues, we can probably reduce the priority to High. When we come back to switching all jobs to podman, let's re-evaluate whether this still blocks us.

Changed in tripleo:
importance: Critical → High
tags: added: tech-debt
removed: alert promotion-blocker
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/609963
Reason: this is not what makes the tripleo-ci-centos-7-undercloud-containers job time out, so there is no need to revert it

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The upstream podman issue, with benchmarking results: https://github.com/containers/libpod/issues/1656

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

As benchmarking shows (see https://github.com/containers/libpod/issues/1656), ``podman ps`` is a bottleneck: it locks for seconds or tens of seconds while containers are being operated on concurrently. And paunch relies a lot on the ps results. Perhaps, to speed things up, we should avoid using ps for podman as much as possible and instead just use try/except with retries or something (a rough sketch follows).
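
A rough sketch of that idea (hypothetical helper names, not actual paunch code): act on the container directly and retry on failure, instead of doing a ``podman ps`` lookup first:

    # Stop a container without a preceding `podman ps` state check;
    # retry with backoff if podman cannot get its lock in time.
    import subprocess
    import time

    def podman(*args):
        return subprocess.run(["podman"] + list(args),
                              capture_output=True, text=True)

    def stop_container_with_retries(name, attempts=3, delay=2):
        for attempt in range(1, attempts + 1):
            result = podman("stop", name)
            if result.returncode == 0:
                return True
            # Treat "no such container" as success: there is nothing left
            # to stop (the exact error wording is an assumption here).
            if "no such container" in result.stderr.lower():
                return True
            time.sleep(delay * attempt)   # back off before retrying
        return False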

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to paunch (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/611312

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

So, what we have now and the future work (to optimize long-running ps calls, e.g. via caching; see the caching sketch after this list):

BaseRunner.discover_container_name (only used for the exec action): https://review.openstack.org/#/c/611312/

Future work:
BaseRunner.rename_containers: not supported yet for podman, so omit it
BaseRunner.containers_in_config: can it consume cached ps results, or do these always need to be current?
BaseRunner.list_configs: can it consume cached ps results, or do these always need to be current?
BaseRunner.remove_containers: make it delete in bulk instead of one by one via self.runner.remove_container (not really podman ps related)
BaseRunner.current_config_ids: can it consume cached ps results, or do these always need to be current?
BaseRunner.container_names: can we cache ps results for future use by the class methods, or do they need to be kept current?
init.delete: make it delete in bulk? (not really podman ps related)
BaseRunner.delete_missing_configs: make it delete in bulk? (not really podman ps related)
__
BaseBuilder.apply: cache the ps call from runner.container_names and do not repeat it for self.delete_missing_and_updated?
BaseBuilder.delete_missing_and_updated:
 * make it delete in bulk and call it only once at the end? (not really podman ps related)
 * cache the ps call from runner.container_names and do not repeat it for self.runner.rename_containers? (rename_containers is not supported for podman, omit it)
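
A minimal sketch of the caching idea above (illustrative only, not the actual paunch implementation or the patch under review): cache the name list and invalidate it after any operation that changes the set of containers:

    # Cache `podman ps -a` names between calls; invalidate after any
    # create/rename/remove so callers never act on stale results.
    import subprocess

    class CachedPs(object):
        def __init__(self):
            self._names = None

        def container_names(self):
            if self._names is None:
                out = subprocess.run(
                    ["podman", "ps", "-a", "--format", "{{.Names}}"],
                    capture_output=True, text=True, check=False).stdout
                self._names = [n for n in out.splitlines() if n]
            return self._names

        def invalidate(self):
            self._names = None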

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/611911

tags: added: alert
Changed in tripleo:
importance: High → Critical
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The related fixes for bug 1799902 should also help with debugging the original issues

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A 'podman ps' caching (WIP) patch https://review.openstack.org/611911 addresses some of the proposals listed in https://bugs.launchpad.net/tripleo/+bug/1797525/comments/14 as well, but it's not there yet!

Changed in tripleo:
importance: Critical → High
tags: removed: alert
Changed in tripleo:
assignee: Steve Baker (steve-stevebaker) → Sagi (Sergey) Shnaidman (sshnaidm)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The podman performance issue seems to be addressed as of podman 0.10.2-dev

https://github.com/containers/libpod/issues/1656#issuecomment-433109737

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on paunch (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/611911
Reason: The podman performance seems addressed for the podman 0.10.2-dev so we can let this poor thing go now

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/611312
Reason: it fails my local testing see https://pastebin.com/eYKhKwYA

Changed in tripleo:
milestone: stein-1 → stein-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/609963
Reason: https://review.openstack.org/#/c/614537/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/609941

Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
Changed in tripleo:
assignee: Sagi (Sergey) Shnaidman (sshnaidm) → nobody
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
status: In Progress → Fix Released