newly created LXD container has apparent corrupt apt hashes

Bug #1803950 reported by Jeff Hillman
This bug affects 1 person
Affects: Canonical Juju
Status: Expired
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

juju 2.4.5 via snap

Offline - air gapped environment

Created a local ubuntu mirror for bionic and xenial
Created a local LXD mirror and juju agents mirror

When deploying LXD containers for an OpenStack cloud from a bundle, all charms fail inside of the LXD containers. We are seeing a Python error about installing apt packages.

Running apt update inside of the container gives the following errors (this is a secure environment and I cannot copy files off, so I had to retype this by hand)

---

Get:10 http://<mirror ip>:<port>/ubuntu bionic-security/universe Translations-en [58.4 kb]
Err:10 http://<mirror ip>:<port>/ubuntu bionic-security/universe Translations-en

Get:11 http://<mirror ip>:<port>/ubuntu bionic-security/multiverse amd64 Packages [1448 b]
Err:11 http://<mirror ip>:<port>/ubuntu bionic-security/multiverse amd64 Packages

Fetched 242 kb in 0s (498kb/s)
Reading package lists... Done
E: Failed to fetch http://<mirror ip>:<port>/ubuntu/dists/bionic-updates/universe/binary-amd64/Packages.xz File has unexpected size (557052 != 571124). Mirror sync in progress? [IP: <mirror ip> <port>]
   Hashes of expected file:
    - Filesize:571124 [weak]
    - SHA256:<hash>
    - SHA1:<hash>
    - MD5Sum:<hash>
   Release file created at: Thu, 01 Nov 2018 20:25:49 +0000

---
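
For reading the size-mismatch error above, a minimal Python sketch follows. It is not part of the original report; the function name is hypothetical, and it assumes the first number is the size actually received and the second is the size listed in the Release file, which the "Filesize" line in the output suggests.

```python
import re

# Matches the size-mismatch text apt prints, for example:
# "File has unexpected size (557052 != 571124). Mirror sync in progress?"
SIZE_MISMATCH = re.compile(r"File has unexpected size \((\d+) != (\d+)\)")


def parse_size_mismatch(message):
    """Return (received, expected, missing) byte counts, or None if absent.

    Assumption: the first number is what was received and the second is
    the size recorded in the Release file.
    """
    match = SIZE_MISMATCH.search(message)
    if match is None:
        return None
    received = int(match.group(1))
    expected = int(match.group(2))
    return received, expected, expected - received
```

For the message quoted above this reports 557052 bytes received against 571124 expected, a shortfall of 14072 bytes.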

It's not just bionic-security that gives this; a few others do as well. I just didn't retype the entire thing.

Bare-metal hosts being deployed do not have this issue. We first set container-inherit-properties to "ca-certs, apt-primary", and when that didn't work we also added "apt-sources, apt-security" to that list, with no change.

If we run 'rm -rf /var/lib/apt/lists/*' and then 'apt clean', we can then run apt update cleanly and it works.

So that tells me the sources.list being passed is good, but something is corrupting the apt cache as the container is being created.

I will be limited in what information I can pull out of this environment, but feel free to ask and I can try.

Tags: cpe-onsite
Jeff Hillman (jhillman) wrote :

We redeployed the cloud, adding the cleaning of /var/lib/apt/lists and apt clean to the juju cloud-init postruncmds, and charms appear to have no issue upon initial LXD creation and installation.

Joseph Phillips (manadart) wrote :

Any chance you can get the contents of /var/lib/apt/lists/... for comparison of metal vs one of the containers?

Jeff Hillman (jhillman) wrote :

I can (from a visual perspective), but it will be some time. We have a deploy running and need to debug some other things. Since we have a workaround this is a lower priority, but we will eventually circle back to it.

Thanks for investigating.

Jeff Hillman (jhillman) wrote :

Interestingly, not all machines appear to work out of the box. Aodh, for example, worked just fine.

The Ceilometer LXD container, however, required us to go in and run 'apt update' ourselves before 'juju resolved' would work to fix the install hook failure.

We're adding 'apt update' to the end of our postruncmds to work around this. Hopefully this afternoon I can get the requested information about diffs on /var/lib/apt/lists

Jeff Hillman (jhillman) wrote :

Ok, so here's what is different about bare-metal vs. LXD in /var/lib/apt/lists...

The ls of the directory seems to be the same; however, a du on the bare-metal gives a total size of 99696 and on the LXD container it is 140660.

Looking at individual files we see:

The LXD container gets the deb-src repos whereas bare-metal does not.

File sizes for the entries in that directory are smaller by a couple of kilobytes in the container compared to bare-metal, but only for security and updates; the main repo files are the same size.
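
A comparison like the one above could be scripted rather than done by eye. The following is a minimal Python sketch, not something used in this thread; `list_sizes` and `diff_sizes` are hypothetical names, and it assumes a listing is collected on each machine separately (for example by running `list_sizes` on each and comparing the results by hand or after copying).

```python
import os


def list_sizes(directory):
    """Map each regular file name in `directory` to its size in bytes."""
    sizes = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes


def diff_sizes(metal, container):
    """Compare two {name: size} listings.

    Returns (only_on_metal, only_in_container, changed), where `changed`
    maps a file name to its (metal_size, container_size) pair.
    """
    only_on_metal = sorted(set(metal) - set(container))
    only_in_container = sorted(set(container) - set(metal))
    changed = {
        name: (metal[name], container[name])
        for name in sorted(set(metal) & set(container))
        if metal[name] != container[name]
    }
    return only_on_metal, only_in_container, changed
```

Files that exist only in the container (such as the deb-src indexes mentioned above) would show up in the second list, and size differences in the third value.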

Jeff Hillman (jhillman) wrote :

FYI the previous workaround doesn't work. We're running into race conditions when trying to do either apt clean or apt update via cloud-init where the charm is already running apt commands.

Jeff Hillman (jhillman) wrote :

We tried this config in the juju model-config.

cloudinit-userdata: |
  apt:
    preserve_sources_list: false
    primary:
      - arches:
        - amd64
        uri: "http://<mirror ip>:<port>/ubuntu"
        search_dns: false
    security:
      - arches:
        - amd64
        uri: "http://<mirror ip>:<port>/ubuntu"
        search_dns: false
    conf: |
      APT {
        Get {
          Assume-Yes "true";
          Fix-Broken "true";
        }
      }

And it is still giving us the same problem. No matter what we do it always has the sources in the container. For this step we also disabled container-inherit-properties completely since I figured that cloud-init would take over, but I can't see that it does.

Jeff Hillman (jhillman) wrote :

I realized we still had apt-mirror set in the model-config.yaml. We are removing that and re-deploying.

I did confirm via /var/lib/cloud/instances/<container-name>/user-data.txt that our apt config is in there.

It can also be seen in lxc config show <container-name>, but apt_mirror is right after it.

Jeff Hillman (jhillman) wrote :

2 things

1) I had ('s instead of {'s in my apt config for cloud-init, fixing that

2) regardless of what I set for apt config in cloud-init, unless I have container-inherit-properties set explicitly, the containers won't follow apt rules, even though I can see in /var/lib/cloud/instances/<container name>/user-data.txt.1 that the cloud config is there

fixing both of the above and testing.

Jeff Hillman (jhillman)
tags: added: field-critical
Nobuto Murata (nobuto) wrote :

@Jeff,

Can you please double-check that your local mirror is not corrupted? I know you are pursuing other possibilities, but it's still good to make sure it's not broken.

Please execute the following commands and make sure the SHA256 sums match.

$ curl -s http://<mirror ip>:<port>/ubuntu/dists/bionic-updates/InRelease | grep universe/binary-amd64/Packages.xz

$ curl -s http://<mirror ip>:<port>/ubuntu/dists/bionic-updates/universe/binary-amd64/Packages.xz | sha256sum

For example with the upstream archive server:

$ curl -s http://archive.ubuntu.com/ubuntu/dists/bionic-updates/InRelease | grep universe/binary-amd64/Packages.xz
 758bb065ba93e9ae7aed834ef951c635 575200 universe/binary-amd64/Packages.xz
 64033879d6f69491fc5c0bf6410ccbc32ed75016 575200 universe/binary-amd64/Packages.xz
 d68f433c84a8d6beadc8a7413cafdab044ca79738289014a10ad93c001d844b9 575200 universe/binary-amd64/Packages.xz

$ curl -s http://archive.ubuntu.com/ubuntu/dists/bionic-updates/universe/binary-amd64/Packages.xz | sha256sum
d68f433c84a8d6beadc8a7413cafdab044ca79738289014a10ad93c001d844b9 -

Both match d68f433c84a8d6beadc8a7413cafdab044ca79738289014a10ad93c001d844b9.
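
The same check can be sketched in Python for cases where comparing hashes by eye is error-prone. This is only an illustration under stated assumptions: `find_index_entries` and `check_index` are hypothetical names, the InRelease text and the Packages.xz bytes are assumed to have been downloaded already (for example with curl as above), and the signature on InRelease is not verified here.

```python
import hashlib


def find_index_entries(inrelease_text, path):
    """Return (digest, size) pairs listed for `path` in Release/InRelease text."""
    entries = []
    for line in inrelease_text.splitlines():
        parts = line.split()
        # Index lines look like: "<digest> <size> <relative path>"
        if len(parts) == 3 and parts[2] == path and parts[1].isdigit():
            entries.append((parts[0], int(parts[1])))
    return entries


def check_index(inrelease_text, path, data):
    """Return (size_ok, sha256_ok) for the downloaded bytes `data`."""
    entries = find_index_entries(inrelease_text, path)
    sha256 = hashlib.sha256(data).hexdigest()
    size_ok = any(size == len(data) for _, size in entries)
    sha256_ok = any(digest == sha256 for digest, _ in entries)
    return size_ok, sha256_ok
```

A result of (False, False) for a path that is listed in InRelease would point at the served file differing from what the Release metadata describes, which is the situation the apt error reports.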

Jeff Hillman (jhillman) wrote :

RE my comment #9, that still did not fix the problem.

My next step (after performing the hash checks nobuto asked for) is to disable the maas-proxy. We don't need it on anyway.

If that doesn't work, another workaround test will be to set enable-os-{refresh-update|upgrade} to False in the model and try the postruncmds again to see if it stops the race condition of dpkg being locked.

Jeff Hillman (jhillman) wrote :

We just ran the curl commands on a bare-metal machine, and the checksums matched for both updates and security (security being one of the repos that is failing).

Jeff Hillman (jhillman) wrote :

Disabling MAAS proxy has no impact on this problem.

I am now moving to testing disabling the apt refresh and using the previous method of postruncmds to remove /var/lib/apt/lists and manually update the cache (apt update).

Richard Harding (rharding) wrote :

Marking incomplete as we're awaiting new word on latest testing and we don't have a root cause to be able to work towards at the moment. However, I did mark critical in that this is a blocking issue and once we get more info we should do everything we can to address.

Changed in juju:
status: New → Incomplete
importance: Undecided → Critical
Jeff Hillman (jhillman) wrote :

That did not work either. Per /var/log/cloud-init-output.log, the initial apt update before the postruncmds is also failing, which is causing some core components that cloud-init is asking for to fail installation.

If we can't find the root of this problem, then we need to at least find a way to inject the cleaning of the cache earlier in the cloud-init tasks, preferably as the first task.

I am open to suggestion.

Jeff Hillman (jhillman) wrote :

What more info is needed? I will do my best to gather whatever I can. I have run out of tests to try at this point.

Dean Henrichsmeyer (dean) wrote :

@jhillman, this doesn't look like a Juju bug, this looks like an environment issue. Please work with @nobuto or your colleagues to verify the environment. Can you deploy a machine, start a container and run an apt update inside it? If you can't, it's definitely an environmental issue.

Jeff Hillman (jhillman) wrote : Re: [Bug 1803950] Re: newly created LXD container has apparent corrupt apt hashes

I'll give that a go. If it works at least we narrow it down to juju lxc
something. If not, I've got a bigger problem.

On Tue, Nov 20, 2018, 7:01 PM Dean Henrichsmeyer <<email address hidden>
wrote:

> @jhillman, this doesn't look like a Juju bug, this looks like an
> environment issue. Please work with @nobuto or your colleagues to verify
> the environment. Can you deploy a machine, start a container and run an
> apt update inside it? If you can't, it's definitely an environmental
> issue.

Jeff Hillman (jhillman) wrote :

Following Dean's suggestion, and going by some suggestions from Nobuto, I have the following results.

I connected to a bare-metal machine that was deployed by juju and hosts juju-created LXD containers that are all affected by this problem. The following commands were run:

lxc image list - shows juju/bionic/amd64

lxc launch juju/bionic/amd64 test - launches fine

lxc exec test bash

  within the container:

apt update - fails because it didn't get any of the juju cloud-init stuff (like inheriting apt config from bare-metal)

Manually changed /etc/apt/sources.list from archive|security.ubuntu.com to <mirror ip>:<port>

apt update - works flawlessly

The /var/lib/cloud/user-data.txt.1 in the test instance is almost empty, whereas the juju machines have a thousand or so lines of commands and output that juju is doing.

If there is another way to verify the mirror other than what nobuto suggested, I'm all ears.

To break this down:

bare-metal and KVM VMs work consistently with the mirror provided
containers created by hand, with the sources.list hand-edited to point to the mirror (no other commands run), work fine

containers created by juju fail consistently

juju model settings are currently:

container-inherit-properties: "ca-certs, apt-primary, apt-sources, apt-security"
enable-os-refresh-update: False
enable-os-upgrade: False

We have tried apt-mirror by itself and with these settings in the past and it failed.

Jeff Hillman (jhillman) wrote :

Some preliminary tests of deploying more units (aodh in this case) to controller nodes show I might have a solid workaround.

instead of:

 - rm -rf /var/lib/apt/lists/*
 - apt clean
 - apt update

being in postruncmd:, we now have it under preruncmd:

3 consecutive tests worked; attempting a full deploy now.
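
For reference, the three cleanup steps listed above can be sketched in Python as follows. This is an illustration only, not what was deployed here (the thread uses plain shell commands in preruncmd); the function names are hypothetical, and the `run` parameter exists only so the apt calls can be swapped out when trying the sketch outside a real container.

```python
import os
import shutil
import subprocess


def clear_apt_lists(lists_dir="/var/lib/apt/lists"):
    """Delete everything inside lists_dir but keep the directory itself.

    Note: unlike the shell glob 'lists/*', this also removes hidden entries.
    """
    removed = []
    # Snapshot the listing first so entries are not deleted mid-iteration.
    for entry in list(os.scandir(lists_dir)):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)
        removed.append(entry.name)
    return sorted(removed)


def refresh_apt_cache(lists_dir="/var/lib/apt/lists", run=subprocess.check_call):
    """Clear cached indexes, then run 'apt clean' and 'apt update' in order."""
    removed = clear_apt_lists(lists_dir)
    run(["apt", "clean"])
    run(["apt", "update"])
    return removed
```

The ordering matters in the same way as in the shell version: the stale index files are removed before apt is asked to fetch fresh ones.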

Jeff Hillman (jhillman) wrote :

Doing a full deploy, it looks like my workaround is good. I'll remove the field-critical tag.

tags: removed: field-critical
Changed in juju:
milestone: none → 2.4.7
Jeff Hillman (jhillman) wrote :

UPDATE:

We discovered an MTU mismatch in our environment, specifically on the VLANs the containers were using.

We have since corrected this problem, and it solved a few other issues we were having, but sadly, this issue still remains.

Changed in juju:
milestone: 2.4.7 → 2.4.8
John A Meinel (jameinel) wrote :

Given this is no longer field-critical, is it still considered a Critical bug for Juju?

It does feel like there is something missing in Juju's support for inheriting container settings from the baremetal, but it isn't quite clear to me what the actual fix would be. Adding "rm -rf /var/lib..." doesn't seem like something we would do in a standard mechanism. Do we have a clear list of steps that we should be doing?

Jeff Hillman (jhillman) wrote :

I don't think anyone has ever been able to replicate this anywhere; it may have been specific to this air-gapped environment.

Tim Penhey (thumper)
Changed in juju:
importance: Critical → Medium
Changed in juju:
milestone: 2.4.8 → none
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired