LXD pool already exists

Bug #1738614 reported by John A Meinel
This bug affects 1 person
Affects          Status      Importance  Assigned to  Milestone
Canonical Juju   Incomplete  Low         Unassigned

Bug Description

I'm trying to do some scale testing, creating lots of models in Juju concurrently, and trying it against the latest snap version of LXD.
After doing:

$ snap install lxd
$ export LXD_DIR=/var/snap/lxd/common/lxd
$ juju bootstrap lxd --debug
$ for j in `seq 0 2`; do for i in `seq 0 9`; do juju add-model m$j$i --no-switch --debug & done; time wait; done

Every so often one of them fails with:

ERROR failed to create new model: failed to create new model: adding default storage pools for "lxd": creating default pool "lxd-zfs": validating storage provider config: creating LXD storage pool "juju-zfs": creating storage pool "juju-zfs": Failed to create the ZFS pool: cannot create 'juju-zfs': pool already exists

14:06:37 DEBUG cmd supercommand.go:459 error stack:
failed to create new model: failed to create new model: adding default storage pools for "lxd": creating default pool "lxd-zfs": validating storage provider config: creating LXD storage pool "juju-zfs": creating storage pool "juju-zfs": Failed to create the ZFS pool: cannot create 'juju-zfs': pool already exists

github.com/juju/juju/rpc/client.go:149:
github.com/juju/juju/api/apiclient.go:925:
github.com/juju/juju/api/modelmanager/modelmanager.go:74:
github.com/juju/juju/cmd/juju/controller/addmodel.go:237:

I'm guessing the issue is with storage pools and the new LXD. We seem to only sometimes handle when the pool already exists. I don't know if it is a concurrent initialization problem. Maybe when we have 10 things creating the pool at the same time, one of them will fail its retry count?

Anyway, this always used to work, so we should at least understand what we changed.

$ juju --version
2.3.1-xenial-amd64

(this is 2.3.1 but with Tim's txn log watcher patch merged into 2.3)

John A Meinel (jameinel) wrote :

If I add the models 1 at a time, none fail. It seems to have something specifically to do with adding them in parallel.

John A Meinel (jameinel) wrote :

I tried doing just 5 in parallel at the same time (rather than 10) and I still had about a 10% failure rate.

Eric Claude Jones (ecjones) wrote :

Hello John,

When creating a model, Juju interprets an HTTP 500 status code received from LXD's "create storage pool" endpoint as an "Already Created" error (which is safe to ignore) iff the controller state has already made note of that storage pool.

In essence, Juju consults two sources to decide whether an error from LXD is safe to ignore, and those sources are not updated atomically with respect to each other, so creating models in parallel can fail just as you have seen above.

I can see a couple of solutions to this problem:

1) We can ask to have the LXD API updated to return a more specific HTTP code for "Already Exists" (many popular APIs, e.g. GitHub, return 422 for already exists)

2) We can search the error string returned by the LXD API to determine the cause of the error - which as of LXD 2.21 will contain "pool already exists" when the pool has already been created

I do understand that literature exists which advises against processing error message strings.
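A minimal sketch of option 2, for illustration only. The names `lxdError`, `isPoolAlreadyExists` and `ensurePool` are hypothetical and are not taken from Juju's code; the only detail drawn from this thread is the "pool already exists" substring. Matching on message text is brittle, so this shows the shape of the check rather than a recommended implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// lxdError is a hypothetical stand-in for an error returned by the LXD API.
type lxdError string

func (e lxdError) Error() string { return string(e) }

// poolExistsMarker is the substring that, per this thread, LXD 2.21 includes
// in its error text when the pool has already been created.
const poolExistsMarker = "pool already exists"

// isPoolAlreadyExists reports whether err's message contains the marker.
func isPoolAlreadyExists(err error) bool {
	return err != nil && strings.Contains(err.Error(), poolExistsMarker)
}

// ensurePool calls create and treats an "already exists" failure as success.
// Any other error is returned to the caller with context added.
func ensurePool(name string, create func(name string) error) error {
	err := create(name)
	if err == nil || isPoolAlreadyExists(err) {
		return nil
	}
	return fmt.Errorf("creating storage pool %q: %w", name, err)
}

func main() {
	exists := func(string) error {
		return lxdError("Failed to create the ZFS pool: cannot create 'juju-zfs': pool already exists")
	}
	missingTool := func(string) error {
		return lxdError(`the "zfs" tool is not enabled`)
	}
	fmt.Println(ensurePool("juju-zfs", exists))      // treated as success
	fmt.Println(ensurePool("juju-zfs", missingTool)) // still reported as an error
}
```

With this shape the decision depends only on LXD's own response, so no second source has to agree with it.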

Eric Claude Jones (ecjones) wrote :

It might be good to note that a machine-level mutex is probably not a solution here, for a pair of reasons:

1) Almost the entire operation, as it is currently implemented, would need to be held under the mutex

2) While this would work with a machine-level mutex solution:

parallel 'juju add-model m{1}{2} --no-switch --debug; time wait' ::: `seq 0 9` ::: `seq 0 9`

the following would not:

parallel -S machine1,machine2 'juju add-model m{1}{2} --no-switch --debug; time wait' ::: `seq 0 9` ::: `seq 0 9`
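As an illustration of why an idempotent create sidesteps the locking question, here is a small simulation. Everything in it is an assumption made for the sketch: `fakeLXD` is an in-memory stand-in whose create is atomic on the server side, and `ensurePool` and `addModels` are hypothetical helpers, not Juju or LXD code. Callers take no client-side lock, so it makes no difference which machine they run on.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// fakeLXD is a hypothetical in-memory stand-in for the LXD API.
// Its createPool is atomic: exactly one caller per pool name succeeds.
type fakeLXD struct {
	mu    sync.Mutex
	pools map[string]bool
}

func (f *fakeLXD) createPool(name string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.pools[name] {
		return fmt.Errorf("cannot create %q: pool already exists", name)
	}
	f.pools[name] = true
	return nil
}

// ensurePool treats "pool already exists" as success, so losing the race
// to another caller is not reported as a failure.
func ensurePool(f *fakeLXD, name string) error {
	err := f.createPool(name)
	if err != nil && !strings.Contains(err.Error(), "pool already exists") {
		return err
	}
	return nil
}

// addModels runs n concurrent ensurePool calls, standing in for n parallel
// add-model requests, and returns how many of them failed.
func addModels(f *fakeLXD, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	failures := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := ensurePool(f, "juju-zfs"); err != nil {
				mu.Lock()
				failures++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return failures
}

func main() {
	f := &fakeLXD{pools: map[string]bool{}}
	fmt.Println("failures:", addModels(f, 10))
}
```

The simulation assumes the backend reports the conflict reliably; it does not model a backend that races internally.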

Changed in juju:
assignee: nobody → Eric Claude Jones (ecjones)
Eric Claude Jones (ecjones) wrote :

See reference implementation: https://github.com/juju/juju/pull/8257

Eric Claude Jones (ecjones) wrote :

Juju is not actually the culprit here. LXD itself is allowing race conditions against its own internal SQLite database and in its communication with lxc. An issue has been opened:

https://github.com/lxc/lxd/issues/4150

Anastasia (anastasia-macmood) wrote :

Marking as Invalid for 'juju' as per comment #6. Please track the issue in the linked GitHub project.

Changed in juju:
status: Triaged → Invalid
Tim Penhey (thumper) wrote :

While LXD may indeed have an issue, it is Juju's responsibility to make things work even against flaky substrates.

What does Juju need to do to work in this situation?

Changed in juju:
status: Invalid → Incomplete
John A Meinel (jameinel) wrote : Re: [Bug 1738614] Re: LXD pool already exists

Here's my understanding.

1) This was only a problem because I switched from the Deb LXD to the Snap
LXD, which left me with a ZFS pool "juju-zfs" but no LXD storage pool
record. This meant that CreatePool always failed.
2) Juju allows there to be no storage pool (it just doesn't let you do
special attachments).
3) We were failing because the error you get changes when you're
concurrently calling Create and Get (I don't know whether you need multiple
Creates in parallel).

So the ultimate issue is that the underlying provider was in a bad state.
I would argue that Juju shouldn't really be creating a "juju-zfs" pool
unless the user has asked for special storage. (we don't create ebs volumes
until you want them).
(Juju can't know how you want to use it, so we're just as likely to be
wasting space because you don't use it, or not using enough because you
want something really big)

It is nice to be able to just play with the storage primitives. But it does
feel like the pool shouldn't be allocated during AddModel.

Unless we were using that storage for the container images themselves. But
as I bootstrapped and then saw that juju-zfs was not in use, I don't think
that is the case. (we also create a juju-btrfs which appears to go unused
as well).

The "fix" is to fix my underlying provider (get LXD to either use the
existing pool or delete the pool and have us create it from scratch).
The ultimate fix feels more like asking why we're doing it in the first
place.

John
=:->

Liam Young (gnuoy) wrote :

I think I am hitting this bug, or at least a variant of it. I am trying my best to configure an LXD deployment that uses directory-backed storage. If I create 10 concurrent models, ~50% fail. My storage pools, followed by the failing command and its error:

$ sudo lxc storage list
+------------+-------------+--------+------------------------------------------------+---------+
| NAME       | DESCRIPTION | DRIVER | SOURCE                                         | USED BY |
+------------+-------------+--------+------------------------------------------------+---------+
| default    |             | dir    | /var/snap/lxd/common/lxd/storage-pools/default | 3       |
+------------+-------------+--------+------------------------------------------------+---------+
| juju-btrfs |             | btrfs  | /var/snap/lxd/common/lxd/disks/juju-btrfs.img  | 0       |
+------------+-------------+--------+------------------------------------------------+---------+

$ for i in `seq 0 9`; do juju add-model m$i --no-switch & done
...
ERROR failed to create new model: failed to create new model: adding default storage pools for "lxd": creating default pool "lxd-zfs": validating storage provider config: creating LXD storage pool "juju-zfs": creating storage pool "juju-zfs": the "zfs" tool is not enabled

I can work around this by pre-creating a dummy pool called juju-zfs:

$ sudo mkdir /var/snap/lxd/common/lxd/storage-pools/juju-zfs
$ sudo lxc storage create juju-zfs dir source=/var/snap/lxd/common/lxd/storage-pools/juju-zfs
Storage pool juju-zfs created
$ for i in `seq 0 9`; do juju add-model m$i --no-switch & done

All 10 models are then created without a problem.

John A Meinel (jameinel) wrote :

What version of Juju and LXD?

Richard Harding (rharding) wrote :
Simon Richardson (simonrichardson) wrote :
Changed in juju:
assignee: Eric Claude Jones (ecjones) → Simon Richardson (simonrichardson)
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot
Changed in juju:
assignee: Simon Richardson (simonrichardson) → nobody