multiple container images get pulled

Bug #1763963 reported by james beedy
This bug affects 2 people

Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips

Bug Description

Seeing the lxd image get pulled down multiple times per machine http://paste.ubuntu.com/p/xsW7ZrTkB6/

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1763963] [NEW] container image is pulled down > 1 time per host

According to that status, you are pulling down 2 different images. One is
grabbing the 'daily' image and the other is grabbing the 'released' image.

I'm not sure how you would get that separation. I know we add daily to the
search path, but IIRC we should only use it if we didn't find the alias in
released.

We also probably need to check that even if we are supporting parallel
provisioning we still critical section the part that finds and downloads a
given image.
Likely this is only happening because both instances are starting at the same
time so we don't end up seeing the cached alias when we start the second
one.
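
A minimal sketch of that critical section, in Python rather than Juju's own code. `ImageCache`, `ensure`, and the `download` callable are hypothetical names used only for illustration. The assumption is that the lookup and the download share one lock per alias, so a second container starting at the same moment waits and then finds the cached alias.

```python
import threading


class ImageCache:
    """Toy model of one host's local image store (hypothetical, not Juju's or LXD's API)."""

    def __init__(self, download):
        self._download = download      # callable(alias) -> fingerprint; assumed to be slow
        self._aliases = {}             # alias -> fingerprint of the cached image
        self._locks = {}               # alias -> lock guarding find-and-download
        self._locks_guard = threading.Lock()

    def _lock_for(self, alias):
        # Creating the per-alias lock must itself be serialized.
        with self._locks_guard:
            return self._locks.setdefault(alias, threading.Lock())

    def ensure(self, alias):
        # Critical section: lookup and download happen under the same lock,
        # so concurrent callers for one alias trigger at most one download.
        with self._lock_for(alias):
            if alias not in self._aliases:
                self._aliases[alias] = self._download(alias)
            return self._aliases[alias]
```

Locking per alias rather than globally keeps parallel provisioning possible for containers that need different images.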

John
=:->

On Sat, Apr 14, 2018, 19:45 james beedy <email address hidden> wrote:

> Public bug reported:
>
> Seeing the lxd image get pulled down multiple times per machine
> http://paste.ubuntu.com/p/xsW7ZrTkB6/
>
> ** Affects: juju
> Importance: Undecided
> Status: New

Revision history for this message
james beedy (jamesbeedy) wrote : Re: container image is pulled down > 1 time per host

@jameinel

"we still critical section the part that finds and downloads a given image"

+1

"Likely this only happening because both instances are starting at the same
time so we don't end up seeing the cached alias when we start the second
one."

There is another error that pops up in the status message, something about "alias not found", when the LXD containers are trying to provision. I'll post it here the next time I run into it.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1763963] Re: container image is pulled down > 1 time per host

Is it listed as an error? That seems odd. We should search for
"juju/xenial/amd64" and then if we don't find it, hit
cloud-images.ubuntu.com to get it, and then create it locally with that
alias. If you're getting an error, then it sounds like something where we
are doing the download and then after completing that action, somehow the
alias we downloaded it as is not actually being created.
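
The intended flow can be sketched as follows. This is an illustration in Python under stated assumptions, not Juju's implementation: `ensure_image`, `local_aliases`, and `fetch_remote` are hypothetical stand-ins for the local LXD image store and the remote image server. The final check corresponds to the suspected failure, where the download completes but the alias still does not resolve.

```python
class AliasMissingAfterDownload(Exception):
    """Raised when a download finished but the alias still does not resolve."""


def ensure_image(local_aliases, alias, fetch_remote):
    """Return the fingerprint for `alias`, downloading once on a cache miss.

    local_aliases: dict-like alias -> fingerprint (stand-in for the local store)
    fetch_remote:  callable(alias) -> fingerprint (stand-in for the remote server)
    """
    fingerprint = local_aliases.get(alias)
    if fingerprint is not None:
        return fingerprint                  # cache hit: nothing to download

    fingerprint = fetch_remote(alias)       # cache miss: download the image
    local_aliases[alias] = fingerprint      # create the local alias

    # Post-condition: the alias must resolve now. If it does not, the download
    # succeeded but the alias was never actually created.
    if local_aliases.get(alias) is None:
        raise AliasMissingAfterDownload(alias)
    return fingerprint
```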

John
=:->

On Sun, Apr 15, 2018 at 6:54 AM, james beedy <email address hidden> wrote:

> @jameinel
>
> "we still critical section the part that finds and downloads a given
> image"
>
> +1
>
> "Likely this only happening because both instances are starting at the same
> time so we don't end up seeing the cached alias when we start the second
> one."
>
> There is another error that pops up in the message that says something
> about "alias not found" when the lxd are trying to provision, I'll get
> it in here next time I cross the path.

Revision history for this message
james beedy (jamesbeedy) wrote : Re: container image is pulled down > 1 time per host
james beedy (jamesbeedy)
summary: - container image is pulled down > 1 time per host
+ multiple container images get pullled
Revision history for this message
John A Meinel (jameinel) wrote :

So one thing to be aware of,
http://paste.ubuntu.com/p/45GHJQwzPR/ lists 0/lxd/5 at 100%, but that is for the 'metadata' file.
The next one,
http://paste.ubuntu.com/p/8dXCYc3kYS/ lists 0/lxd/5 at 97% but that is for the 'rootfs' file.
(there are 2 files that get downloaded for each image).
So it is perfectly expected that we would download 2x for a given container.

Now the next one:
http://paste.ubuntu.com/p/2cRQMGjhpr/

Says that 0/lxd/3 is trying to launch, but didn't find the alias, but 0/lxd/5 seems to still be finishing up its copy. (IIRC, after downloading the 2 files, the LXD agent does a bit of work to combine them and validate the final installed image.)

Now, what is really confusing is:
http://paste.ubuntu.com/p/wgJVtnTdHn/

Which seems to say that
 a) We clearly have *some* sort of image already downloaded, because 0/lxd/2 and 0/lxd/5 have started.
 b) After we successfully did all the work to download cloud-images.ubuntu.com/releases, rootfs, we are now downloading cloud-images.ubuntu.com/daily, rootfs. If we copied an image, why are we copying it again?

Of course, the next thing is http://paste.ubuntu.com/p/X7HZrfBqjJ/ which then seems to complain precisely that we tried to create the same alias 2x.

There was a fairly recent change (2.3?) to how we launch containers. Previously we looked for an image matching the request, ensured that it existed locally, set the alias, and then launched. Now we just call launch with the potential alias.
That change partly fixed the problem of "auto-updates", because the former method disconnected the upstream alias from the local alias. (So the image had the flag to keep it up-to-date, but didn't have an upstream source to stay up-to-date with.)

I wonder if that change interacts poorly with the multiple-image sources that we used to use when searching for an image. (Because we started passing in multiple potential sources, LXD started interpreting that as meaning "download from all of them", rather than "download from the first one that matches".)
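
To make the two interpretations concrete, here is a small Python sketch. It models each source as a plain dictionary and does not reflect LXD's actual behaviour or API. The function names and the sample catalogs are assumptions for illustration only.

```python
def first_matching_source(sources, alias):
    """Return (source_name, fingerprint) from the first source that has `alias`.

    sources: ordered list of (name, catalog) pairs; catalog maps alias -> fingerprint.
    """
    for name, catalog in sources:
        fingerprint = catalog.get(alias)
        if fingerprint is not None:
            return name, fingerprint    # stop here: later sources are not consulted
    return None


def all_matching_sources(sources, alias):
    """The suspected behaviour: every source that has `alias` yields a download."""
    return [(name, catalog[alias]) for name, catalog in sources if alias in catalog]


# Made-up catalogs: both sources carry the same alias.
sources = [
    ("releases", {"xenial": "fp-released"}),
    ("daily", {"xenial": "fp-daily"}),
]
one = first_matching_source(sources, "xenial")   # only the 'releases' image
many = all_matching_sources(sources, "xenial")   # both 'releases' and 'daily'
```

With the first function, 'daily' is only reached when 'releases' lacks the alias, which is the fallback order described earlier in this thread.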

I don't see anything in these pastes that clearly states that we downloaded for 0/lxd/2 and then also downloaded for 0/lxd/5. A possible source of that ordering is that in:
http://paste.ubuntu.com/p/wgJVtnTdHn/

When we see 0/lxd/2 and 0/lxd/6 start, it is because 0/lxd/5 has just finished downloading the released image and created the alias correctly, so they come in and immediately see the alias is ready to launch.

So there is definitely an issue where we are downloading both the 'released' and the 'daily' image.

That said, the original paste for this bug, http://paste.ubuntu.com/p/xsW7ZrTkB6/, does clearly show 58/lxd/0 downloading the daily image and 58/lxd/1 downloading the released image, which would indicate we *can* get into the situation where we download it 2x.

Revision history for this message
John A Meinel (jameinel) wrote :

FWIW, I was able to mostly reproduce this with this bundle.yaml:

series: xenial
services:
  ul:
    charm: "cs:~jameinel/ubuntu-lite-7"
    num_units: 6
    to:
      - "0"
      - "lxd:0"
      - "lxd:0"
      - "lxd:0"
      - "lxd:0"
      - "lxd:0"
machines:
  "0":
    series: xenial
    constraints: "arch=amd64"

And just doing "juju deploy ./bundle.yaml".

I then ran:
  for i in `seq 100`; do juju status | tee -a out.txt; time sleep 1; done

Trimming that output gives the attached log.
You can see that only 0/lxd/2 ever copies the 'releases' image, but for some reason it then says it cannot find the alias, and will be retried. And at that point 0/lxd/0 gets in and starts downloading the daily image.
And then once that finishes, it then fails with the "UNIQUE constraint failed".
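
"UNIQUE constraint failed" is the wording SQLite uses for a uniqueness violation, which fits two provisioning attempts each trying to create the same alias. The Python sketch below reproduces that error class in isolation. The table name and columns are a hypothetical schema, and `create_alias` is an illustrative helper, not an LXD or Juju function.

```python
import sqlite3


def create_alias(conn, name, fingerprint):
    """Insert an alias row; return None on success or the error text on a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO images_aliases (name, fingerprint) VALUES (?, ?)",
                (name, fingerprint),
            )
        return None
    except sqlite3.IntegrityError as err:
        return str(err)


conn = sqlite3.connect(":memory:")
# Hypothetical schema: the only property that matters here is UNIQUE on the name.
conn.execute(
    "CREATE TABLE images_aliases (name TEXT UNIQUE NOT NULL, fingerprint TEXT NOT NULL)"
)

first = create_alias(conn, "juju/xenial/amd64", "released-fp")   # succeeds
second = create_alias(conn, "juju/xenial/amd64", "daily-fp")     # rejected as a duplicate
```

In this model the second download is wasted work: the image arrives, but the alias it should be stored under already exists.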

Revision history for this message
John A Meinel (jameinel) wrote :

I believe Joe has started looking at this a bit with our refactoring of the LXD code. I don't know if it will make the 2.4beta2 release, but we should address this in 2.4

Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
importance: Undecided → High
milestone: none → 2.4-beta2
status: New → Triaged
Revision history for this message
Joseph Phillips (manadart) wrote :

In the course of reworking the LXD container and provider code under https://github.com/juju/juju/pull/8656, I can verify from my system testing that images were:
 1) Cached with the correct alias.
 2) Acquired once per host.

I included testing along the same lines as John - multiple simultaneous unit deployments to containers on a single machine.

Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released