8+ containers makes one get stuck in "pending" on joyent

Bug #1626725 reported by Aaron Bentley
This bug affects 1 person
Affects          Status        Importance  Assigned to
Canonical Juju   Fix Released  High        James Tunnicliffe

Bug Description

When we add eight containers to a Joyent machine, one gets stuck in pending. Eventually, the test script raises AgentsNotStarted.

We are seeing this in our long-running industrial/reliability tests.

e.g. http://juju-ci.vapour.ws/job/industrial-test-joyent/184/consoleText

It happens almost every time, but not every time. It is usually the last container (e.g. 3/lxd/7), but not always. Sometimes it's the seventh or even the first.

It does not happen on AWS, even though AWS machines are no better (and in some regards worse) than Joyent machines in terms of their cpu/memory/storage.

I reproduced this using our juju-ci-tools industrial_test script.
./industrial_test.py parallel-joyent `jver 2.0-rc1-4405` density ~/sandbox/logz/ --single --attempts 1 --json-file results.json --new-agent-url https://us-east.manta.joyent.com/cpcjoyentsupport/public/juju-dist/parallel-testing/agents --agent-stream revision-build-4405

An example run is attached.

Revision history for this message
Aaron Bentley (abentley) wrote :
tags: added: jujuqa
description: updated
Changed in juju:
milestone: 2.0-rc2 → 2.0.0
assignee: nobody → Richard Harding (rharding)
Revision history for this message
James Tunnicliffe (dooferlad) wrote :

That job has gone, but the problem remains. I just did a bunch of add-machines, and eventually machine 0 (the host) stopped responding. I can ping it, but not SSH to it, and the agent state shows as down.

On MAAS I added 50 LXDs and got bored of waiting for something bad to happen.

Nothing was crying out to me from the logs, but that isn't much of a surprise at this stage in the investigation.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

I ran a script that added a container every time juju status showed everything as started. I got up to waiting for 0/lxd/13.
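Roughly, the loop looked like this (a sketch rather than the actual script; the jq query and the 60-second poll interval are just illustrative):

  # Add one container to machine 0 whenever nothing is still pending.
  while true; do
      pending=$(juju status --format=json \
          | jq '[.machines[] | .containers // {} | .[]
                 | select(."juju-status".current != "started")] | length')
      if [ "$pending" -eq 0 ]; then
          juju add-machine lxd:0
      fi
      sleep 60
  done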

There is only 400MB of space on /, but I can still write a text file.

In 0/lxc/13 cloud-init ran, but didn't find any configuration files. This may be because of a data race (juju hasn't written them yet) or something else.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

The API call to LXD does show the right systemd init job being sent.

Just restarted this experiment and it failed again on 0/lxd/13 (14th container).

tune2fs -l /dev/vda1 shows 0 reserved blocks and >0 free blocks, so we aren't out of disk space.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

...but we may have been out of disk space sometimes. I just ran 'juju remove-machine 0/lxd/n' for n in 1..5, then did a 'juju add-machine lxd:0' and ran out of disk space. Maybe the joys of sparse file systems? Something grew but didn't shrink?

Given this has happened twice on the 14th container when doing the slow and deliberate route, and the machine is basically full, I am going to leave the slow/reliable path and switch back to the fast path to try to identify what is happening there.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

OK, got it. From what I can tell the disk fills up and this causes the LXD ZFS pool to be taken offline due to I/O errors. At that point it looks like things shrink back to a size where you can log in, but you can't do anything with the containers because their disks are offline.

I tried starting 5 machines with 11 guests on each, which leaves enough disk space for all the containers to start, and they all started fine. So it seems like this isn't a bug, but it's not a helpful user experience.

From dmesg:

[ 1090.253680] WARNING: Pool 'lxd' has encountered an uncorrectable I/O failure and has been suspended.

http://paste.ubuntu.com/23274269/

root@c0b55d45-188c-45dc-8efd-17c6766a5425:~# zpool status -x
  pool: lxd
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: none requested
config:

  NAME                    STATE     READ WRITE CKSUM
  lxd                     ONLINE       0   640     0
    /var/lib/lxd/zfs.img  ONLINE       0 1.30K     0

errors: 640 data errors, use '-v' for a list

This particular issue is documented here:
http://zfsonlinux.org/msg/ZFS-8000-HC/

Data errors took the pool offline.
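
For the record, the recovery documented in that status output is to free some space on the host and then clear the pool's error state (pool name 'lxd' as above):

  sudo zpool clear lxd      # clear the I/O failure state once space is available
  sudo zpool status -x lxd  # confirm the pool is healthy again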

Changed in juju:
status: Triaged → Won't Fix
status: Won't Fix → Invalid
assignee: Richard Harding (rharding) → James Tunnicliffe (dooferlad)
Revision history for this message
Aaron Bentley (abentley) wrote :

James, thank you for your investigation.

Maybe it's not possible to prevent this ZFS error from occurring, but I think it's still incorrect to list the status as "pending" if, in fact, it will never start. I think the status should indicate that user intervention is required, especially if (as it appears) other containers on this machine are affected by this issue.

Changed in juju:
status: Invalid → Triaged
Revision history for this message
James Tunnicliffe (dooferlad) wrote :

My reproducer was:

Bootstrap on <cloud>
Add <machines>
For each machine, add <containers>

This is encoded in:
https://github.com/dooferlad/jujuWand/blob/abb8a297bd7837298c8c3cbbb536b757cd3931ab/add-lxd.py

This passed:
./add-lxd.py --controller joyent --guests=11 --hosts 5

Currently testing with --deploy (i.e. deploy the Ubuntu charm to a container n times rather than starting n containers).
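
In shell terms the reproducer is roughly the following (a sketch of what add-lxd.py drives; the counts match the passing run above, and the machine IDs are whatever 'juju status' reports):

  juju bootstrap joyent                  # Bootstrap on <cloud>
  for i in 1 2 3 4 5; do                 # Add <machines>
      juju add-machine
  done
  for host in 0 1 2 3 4; do              # For each machine, add <containers>
      for j in $(seq 1 11); do
          juju add-machine lxd:$host
      done
  done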

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

When deploying the Ubuntu charm instead of just adding containers, I ran out of disk space after 8 containers. Tested with this:

./add-lxd.py --controller joyent --guests=11 --deploy

Revision history for this message
Aaron Bentley (abentley) wrote :

Here's another thing that I don't get: the Joyent machines hit this, and they have a root-disk of "51200M". The AWS machines didn't hit this, and they have a smaller root disk of "8192M".

AWS: arch=amd64 cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-zone=eu-west-1a
Joyent: arch=amd64 cores=1 mem=3840M root-disk=51200M

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Ah, interesting. I only got 7.4G, which must be the default machine that Juju gets.

ubuntu@55830fc2-2f00-4adf-a770-78f8aa2302fb:~$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                868M     0  868M   0% /dev
tmpfs                               175M  5.1M  170M   3% /run
/dev/vda1                           7.4G  2.3G  5.2G  31% /
tmpfs                               875M     0  875M   0% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               875M     0  875M   0% /sys/fs/cgroup
/dev/vdb                             50G   52M   47G   1% /mnt
tmpfs                               175M     0  175M   0% /run/user/1000
lxd/containers/juju-1593c9-0-lxd-0   97G  837M   96G   1% /var/lib/lxd/containers/juju-1593c9-0-lxd-0.zfs

With 97G per LXD container plus some overhead for everything else, it is entirely obvious where that space is going. I am surprised that LX[CD] doesn't fail early when there isn't enough space to create a container. Filing a bug now.

Revision history for this message
Richard Harding (rharding) wrote :

So the space isn't all used right away. It does some deduping and, I believe, only uses up the space as it's required. I think what's going on is that as the filesystem does its work and notices duplicate files etc., the space allocation ends up flexing back and forth over time.
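
The over-commit comes from the loop-backed pool being a sparse file: each dataset reports the full pool size, but the host only pays for blocks actually written. A standalone illustration (hypothetical file name and size, assuming zfsutils-linux is installed):

  truncate -s 20G /tmp/demo.img        # sparse backing file: 20G apparent size
  sudo zpool create demo /tmp/demo.img
  sudo zfs create demo/c1
  df -h /demo/c1                       # the dataset reports ~20G available
  du -h /tmp/demo.img                  # ...but almost nothing is allocated yet
  sudo zpool destroy demo              # clean up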

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

This is the root cause of at least one problem: https://bugs.launchpad.net/juju/+bug/1630571

If non-space related issues exist once that is fixed up then we can revisit. See also: https://github.com/lxc/lxd/issues/2458

Changed in juju:
milestone: 2.0.0 → 2.1.0
Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Oh, this is great:

2016-10-06 12:18:26 INFO juju.utils.packaging.manager utils.go:57 Running: apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::options::=--force-unsafe-io --assume-yes --quiet install --no-install-recommends zfsutils-linux
2016-10-06 12:18:37 INFO juju.utils.packaging.manager utils.go:98 Retrying: &{/usr/bin/apt-get [apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::options::=--force-unsafe-io --assume-yes --quiet install --no-install-recommends zfsutils-linux] [] <nil> Reading package lists...
  libuutil1linux libzfs2linux libzpool2linux python python-minimal python2.7
  python2.7-minimal zfs-doc
  nfs-kernel-server zfs-initramfs
  zfs-zed
  libuutil1linux libzfs2linux libzpool2linux python python-minimal python2.7
  python2.7-minimal zfs-doc zfsutils-linux
Err:8 http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu xenial-updates/main amd64 zfs-doc all 0.6.5.6-0ubuntu12
Err:12 http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu xenial-updates/main amd64 libzfs2linux amd64 0.6.5.6-0ubuntu12
Err:13 http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu xenial-updates/main amd64 zfsutils-linux amd64 0.6.5.6-0ubuntu12
E: Failed to fetch http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu/pool/main/z/zfs-linux/zfs-doc_0.6.5.6-0ubuntu12_all.deb 404 Not Found
E: Failed to fetch http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu/pool/main/z/zfs-linux/libuutil1linux_0.6.5.6-0ubuntu12_amd64.deb 404 Not Found
E: Failed to fetch http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu/pool/main/z/zfs-linux/libnvpair1linux_0.6.5.6-0ubuntu12_amd64.deb 404 Not Found
E: Failed to fetch http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu/pool/main/z/zfs-linux/libzpool2linux_0.6.5.6-0ubuntu12_amd64.deb 404 Not Found
E: Failed to fetch http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu/pool/main/z/zfs-linux/libzfs2linux_0.6.5.6-0ubuntu12_amd64.deb 404 Not Found
E: Failed to fetch http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu/pool/main/z/zfs-linux/zfsutils-linux_0.6.5.6-0ubuntu12_amd64.deb 404 Not Found
  libuutil1linux libzfs2linux libzpool2linux python python-minimal python2.7
  python2.7-minimal zfs-doc
  nfs-kernel-server zfs-initramfs
  zfs-zed
  libuutil1linux libzfs2linux libzpool2linux python python-minimal python2.7
  python2.7-minimal zfs-doc zfsutils-linux
Err:8 http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu xenial-updates/main amd64 zfs-doc all 0.6.5.6-0ubuntu12
Err:12 http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu xenial-updates/main amd64 libzfs2linux amd64 0.6.5.6-0ubuntu12
Err:13 http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu xenial-updates/main amd64 zfsutils-linux amd64 0.6.5.6-0ubuntu12
E: Failed to fetch http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu/pool/main/z/zfs-linux/zfs-doc_0.6.5.6-0ubuntu12_all.deb 404 Not Found
E: Failed to fetch http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu/pool/main/z/zfs-linux/libuutil1linux_0.6.5.6-0ubuntu12_amd64.deb 404 Not Found
E: Failed to fetch http://eu-ams-1.joyent.clouds.archive.ubuntu.com/ubuntu/pool/main/z/zfs-linux/libnvpair1linux_0.6.5.6-0ubuntu12_amd64.deb ...


Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Arg, and we don't perform an apt-get update before retrying, which would have fixed the problem!

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Right. I have a patch that will create a zpool that uses 90% of the free space on the host's root file system. This sounds drastic, but since it is using a sparse file it won't use the space on the host FS until it is used in the container. This is an imperfect fix because the host file system could fill up to the point where the sparse file can't expand, but it is less broken than before.
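
The sizing amounts to something like this (a sketch of the approach; the actual patch does this inside Juju rather than in shell, and the lxd init flags are the ones quoted later in this bug):

  # Size the LXD loop-backed zpool at 90% of the free space on /.
  free_gb=$(df --output=avail -BG / | tail -1 | tr -dc '0-9')
  pool_gb=$((free_gb * 90 / 100))
  lxd init --auto --storage-backend zfs --storage-pool lxd --storage-create-loop "$pool_gb"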

Juju needs to grow monitoring and active-management abilities to really fix this, so that the zpool LXD uses can be allowed to grow to mostly fill the host disk while the host doesn't use up space that has been promised to the zpool. The right way to do this is to not use sparse files for the zpool and to grow it 1GB (or whatever increment seems reasonable) at a time as needed.

At the same time I found that if you don't do OS upgrades as part of bootstrap, apt-get update never gets run, which results in the stale-package problem above. I have updated the apt wrapper code to perform an update before retrying when an install fails; this removes the need for manual intervention to fix out-of-date package list issues.
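
The apt change behaves roughly like this wrapper (a sketch; the real change is in Juju's packaging code, and the install flags are the ones from the log above):

  # Retry a failed install after refreshing the (possibly stale) package lists.
  pkg=zfsutils-linux
  if ! apt-get --assume-yes --quiet install --no-install-recommends "$pkg"; then
      apt-get update
      apt-get --assume-yes --quiet install --no-install-recommends "$pkg"
  fi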

Changed in juju:
status: Triaged → In Progress
Revision history for this message
James Tunnicliffe (dooferlad) wrote :

I just asked Joyent for a machine:

juju add-machine --constraints "mem=16G cores=4 root-disk=200G"

This gave me a machine with 200G on /dev/sdb, not a larger root disk, so LXD still ran out of space. I don't know if we have a way of configuring the LXD storage location... investigating.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

No, can't do anything about this one without changing Juju - we perform:

lxd init --auto --storage-backend zfs --storage-pool lxd --storage-create-loop <size>

...without any option to do anything else. https://github.com/lxc/lxd covers other options. Since it varies by provider how and where storage is mounted, this is work that needs thinking through and scheduling.

Revision history for this message
John A Meinel (jameinel) wrote :

Joyent treating "root-disk" constraint as just-another-disk seems a bit off.

Changed in juju:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.1.0 → 2.1-beta1
status: Fix Committed → Fix Released