Canonical Juju

LXD pool already exists

Bug #1738614 reported by John A Meinel on 2017-12-17

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Incomplete	Low	Unassigned

Bug Description

I'm trying to do some scale testing, creating lots of models in Juju concurrently, and trying it against the latest snap version of LXD.
After doing:

$ snap install lxd
$ export LXD_DIR=/var/snap/lxd/common/lxd
$ juju bootstrap lxd --debug
$ for j in `seq 0 2`; do for i in `seq 0 9`; do juju add-model m$j$i --no-switch --debug & done; time wait; done

Every so often one of them fails with:

ERROR failed to create new model: failed to create new model: adding default storage pools for "lxd": creating default pool "lxd-zfs": validating storage provider config: creating LXD storage pool "juju-zfs": creating storage pool "juju-zfs": Failed to create the ZFS pool: cannot create 'juju-zfs': pool already exists

14:06:37 DEBUG cmd supercommand.go:459 error stack:
failed to create new model: failed to create new model: adding default storage pools for "lxd": creating default pool "lxd-zfs": validating storage provider config: creating LXD storage pool "juju-zfs": creating storage pool "juju-zfs": Failed to create the ZFS pool: cannot create 'juju-zfs': pool already exists

github.com/juju/juju/rpc/client.go:149:
github.com/juju/juju/api/apiclient.go:925:
github.com/juju/juju/api/modelmanager/modelmanager.go:74:
github.com/juju/juju/cmd/juju/controller/addmodel.go:237:

I'm guessing the issue is with storage pools and the new LXD. We seem to only sometimes handle when the pool already exists. I don't know if it is a concurrent initialization problem. Maybe when we have 10 things creating the pool at the same time, one of them will fail its retry count?

Anyway, this always used to work, so we should at least understand what changed.

$ juju --version
2.3.1-xenial-amd64

(this is 2.3.1 but with Tim's txn log watcher patch merged into 2.3)

Tags:

John A Meinel (jameinel) wrote on 2017-12-17:

If I add the models 1 at a time, none fail. It seems to have something specifically to do with adding them in parallel.

John A Meinel (jameinel) wrote on 2017-12-19:

I tried doing just 5 in parallel at the same time (rather than 10) and I still had about a 10% failure rate.

Eric Claude Jones (ecjones) wrote on 2018-01-03:

Hello John,

When creating a model, Juju interprets an HTTP 500 status code received from LXD's "create storage pool" endpoint as an "Already Created" error (which is safe to ignore) iff the controller state has already made note of that storage pool.

In essence, Juju consults two sources (LXD and the controller state) to decide whether an error from LXD is safe to ignore. Because the two are not updated atomically, creating models in parallel can fail just as you have seen above.
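
The window described above can be reproduced deterministically with a small sketch. It is only an illustration, not Juju's code: `FakeLXD`, `recorded_pools`, and `create_reports_failure` are hypothetical names, and the fake raises a generic error where LXD would return an HTTP 500.

```python
class FakeLXD:
    """Stand-in for LXD (hypothetical): re-creating a pool yields a generic error."""

    def __init__(self):
        self.pools = set()

    def create_pool(self, name):
        if name in self.pools:
            raise RuntimeError("500: pool already exists")
        self.pools.add(name)


def create_reports_failure(lxd, recorded_pools, name):
    """Return True if the caller would surface an error to the user.

    A create error is ignored only when the (hypothetical) controller-state
    record already lists the pool, mirroring the check described above.
    """
    try:
        lxd.create_pool(name)
    except RuntimeError:
        return name not in recorded_pools
    return False


lxd = FakeLXD()
recorded_pools = set()  # stand-in for what the controller state knows

# Caller A creates the pool in LXD but has not yet recorded it.
a_failed = create_reports_failure(lxd, recorded_pools, "juju-zfs")
# Caller B runs in the gap: LXD says "exists", the record says "unknown".
b_failed = create_reports_failure(lxd, recorded_pools, "juju-zfs")
# Caller A's record finally lands.
recorded_pools.add("juju-zfs")
# Caller C arrives after the record is in place, so the error is ignored.
c_failed = create_reports_failure(lxd, recorded_pools, "juju-zfs")

print(a_failed, b_failed, c_failed)  # False True False
```

Only the caller that lands between the LXD create and the state update fails, which is consistent with failures appearing only when models are added in parallel.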

I can see a couple of solutions to this problem:

1) We can ask to have the LXD API updated to return a more specific HTTP code for "Already Exists" (many popular APIs, e.g. GitHub, return 422 for already exists).

2) We can search the error string returned by the LXD API to determine the cause of the error - which as of LXD 2.21 will contain "pool already exists" when the pool has already been created

I do understand that literature exists which advises against processing error message strings.
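
Option 2 could look roughly like the sketch below. It is a minimal illustration under stated assumptions: `PoolCreateError`, `is_already_exists`, and `ensure_pool` are hypothetical names, not part of Juju or the LXD client, and the error strings are the ones quoted in this report. Matching on message text is brittle, since the wording may change between LXD versions.

```python
class PoolCreateError(Exception):
    """Hypothetical error carrying the HTTP status and message from the backend."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status
        self.message = message


# Substring this report says LXD 2.21 includes when the pool already exists.
ALREADY_EXISTS_MARKER = "pool already exists"


def is_already_exists(err):
    """Classify an error by its message text alone (no second data source)."""
    return ALREADY_EXISTS_MARKER in err.message


def ensure_pool(create, name):
    """Call create(name); treat 'already exists' as success.

    Returns True if this call created the pool, False if it was already there.
    Any other error is re-raised unchanged.
    """
    try:
        create(name)
    except PoolCreateError as err:
        if is_already_exists(err):
            return False
        raise
    return True


def already_there(name):
    # Simulates the backend rejecting a duplicate create.
    raise PoolCreateError(
        500,
        "Failed to create the ZFS pool: cannot create 'juju-zfs': pool already exists",
    )


print(ensure_pool(already_there, "juju-zfs"))  # False
```

Because the decision depends only on the response to the create call itself, it does not rely on a second record being updated in step with LXD.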

Eric Claude Jones (ecjones) wrote on 2018-01-03:

It is worth noting that a machine-level mutex is probably not a solution here, for two reasons:

1) Almost the entire operation would need to be locked by the mutex as it is currently implemented.

2) While this would work with a machine-level mutex:

parallel 'juju add-model m{1}{2} --no-switch --debug; time wait' ::: `seq 0 9` ::: `seq 0 9`

the following would not:

parallel -S machine1,machine2 'juju add-model m{1}{2} --no-switch --debug; time wait' ::: `seq 0 9` ::: `seq 0 9`

Changed in juju:
assignee:	nobody → Eric Claude Jones (ecjones)

Eric Claude Jones (ecjones) wrote on 2018-01-05:

See reference implementation: https://github.com/juju/juju/pull/8257

Eric Claude Jones (ecjones) wrote on 2018-01-09:

Juju is not actually the culprit here. LXD is allowing race conditions against its own internal SQLite database and its communication with lxc. An issue has been opened:

https://github.com/lxc/lxd/issues/4150

Anastasia (anastasia-macmood) wrote on 2018-01-09:

Marking as Invalid for 'juju' as per comment #6. Please track the issue in the linked GitHub project.

Changed in juju:
status:	Triaged → Invalid

Tim Penhey (thumper) wrote on 2018-01-09:

While LXD may indeed have an issue, it is Juju's responsibility to make things work even against flakey substrates.

What does Juju need to do to work in this situation?

Changed in juju:
status:	Invalid → Incomplete

John A Meinel (jameinel) wrote on 2018-01-11:

Here's my understanding.

1) This was only a problem because I switched from Deb LXD to Snap LXD,
which left me with a zfs pool juju-zfs but no LXD storage pool record. This
meant that CreatePool always failed.
2) Juju allows there to be no storage pool (it just doesn't let you do
special attachments).
3) We were failing because the error you get changes when you're
concurrently calling Create and Get (I don't know if you need multiple
creates in parallel).

So the ultimate issue is that the underlying provider was in a bad state.
I would argue that Juju shouldn't really be creating a "juju-zfs" pool
unless the user has asked for special storage. (we don't create ebs volumes
until you want them).
(Juju can't know how you want to use it, so we're just as likely to be
wasting space because you don't use it, or not using enough because you
want something really big)

It is nice to be able to just play with the storage primitives. But it does
feel like the pool shouldn't be allocated during AddModel.

Unless we were using that storage for the container images themselves. But
as I bootstrapped and then saw that juju-zfs was not in use, I don't think
that is the case. (we also create a juju-btrfs which appears to go unused
as well).

The "fix" is to fix my underlying provider (get LXD to either use the
existing pool or delete the pool and have us create it from scratch).
The ultimate fix feels more like asking why we're doing it in the first
place.

John
=:->
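
The idea of deferring pool creation until storage is actually requested can be sketched as follows. This is a hypothetical illustration, not Juju's provider interface: `LazyPoolProvider`, `add_model`, and `request_storage` are made-up names, and the in-process set used here does not by itself protect against concurrent callers.

```python
class LazyPoolProvider:
    """Hypothetical provider that creates a pool only on first storage request."""

    def __init__(self, create_pool):
        self._create_pool = create_pool  # callable that creates a backend pool
        self._created = set()            # pools this provider has already created

    def add_model(self, model_name):
        # Deliberately no pool creation here: adding a model costs no storage.
        return model_name

    def request_storage(self, pool_name):
        # The pool is created the first time someone asks for storage from it.
        if pool_name not in self._created:
            self._create_pool(pool_name)
            self._created.add(pool_name)
        return pool_name


create_calls = []
provider = LazyPoolProvider(create_calls.append)

for i in range(10):
    provider.add_model(f"m{i}")
print(create_calls)  # []

provider.request_storage("juju-zfs")
provider.request_storage("juju-zfs")
print(create_calls)  # ['juju-zfs']
```

With this shape, adding ten models never touches the storage backend, so a pool left in a bad state would only surface when storage is first requested.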

Liam Young (gnuoy) wrote on 2018-06-04:

#10

I think I am hitting this bug, or at least a variant of it. I am trying my best to configure an LXD deployment that uses directory-backed storage. If I create 10 concurrent models, ~50% fail with:

$ for i in `seq 0 9`; do juju add-model m$i --no-switch & done
...
ERROR failed to create new model: failed to create new model: adding default storage pools for "lxd": creating default pool "lxd-zfs": validating storage provider config: creating LXD storage pool "juju-zfs": creating storage pool "juju-zfs": the "zfs" tool is not enabled

I can work around this by pre-creating a dummy pool called juju-zfs:

$ sudo mkdir /var/snap/lxd/common/lxd/storage-pools/juju-zfs
$ sudo lxc storage create juju-zfs dir source=/var/snap/lxd/common/lxd/storage-pools/juju-zfs
Storage pool juju-zfs created
$ for i in `seq 0 9`; do juju add-model m$i --no-switch & done

All 10 models are then created with no problem.

John A Meinel (jameinel) wrote on 2018-06-04:

#11

What version of Juju and LXD?

On Mon, Jun 4, 2018, 08:21 Liam Young <email address hidden> wrote:

> I think I am hitting this bug, or at least a variant of it. I am trying
> my best to configure a lxd deployment that use directory backed storage.
> If I create 10 concurrent models, ~50% fail with:
>
> $ sudo lxc storage list
> +------------+-------------+--------+------------------------------------------------+---------+
> |    NAME    | DESCRIPTION | DRIVER |                     SOURCE                     | USED BY |
> +------------+-------------+--------+------------------------------------------------+---------+
> | default    |             | dir    | /var/snap/lxd/common/lxd/storage-pools/default | 3       |
> +------------+-------------+--------+------------------------------------------------+---------+
> | juju-btrfs |             | btrfs  | /var/snap/lxd/common/lxd/disks/juju-btrfs.img  | 0       |
> +------------+-------------+--------+------------------------------------------------+---------+
>
> $ for i in `seq 0 9`; do juju add-model m$i --no-switch & done
> ...
> ERROR failed to create new model: failed to create new model: adding
> default storage pools for "lxd": creating default pool "lxd-zfs":
> validating storage provider config: creating LXD storage pool "juju-zfs":
> creating storage pool "juju-zfs": the "zfs" tool is not enabled
>
> I can work around this by pre-creating a dummy pool called juju-zfs:
>
> $ sudo mkdir /var/snap/lxd/common/lxd/storage-pools/juju-zfs
> $ sudo lxc storage create juju-zfs dir source=/var/snap/lxd/common/lxd/storage-pools/juju-zfs
> Storage pool juju-zfs created
> $ for i in `seq 0 9`; do juju add-model m$i --no-switch & done
>
> all 10 models are created no problem

Richard Harding (rharding) wrote on 2018-09-04:

#12

It looks like we've hit this in our CI
http://10.125.0.203:8080/job/nw-updateseries-amd64-lxd/428/console

Simon Richardson (simonrichardson) wrote on 2019-06-19:

#13

So I'm running into this a lot with the new python-libjuju integration tests.

http://ci.jujucharms.com/job/integration-tests-pylibjuju/13/console
http://ci.jujucharms.com/job/integration-tests-pylibjuju/12/console

etc.

Changed in juju:
assignee:	Eric Claude Jones (ecjones) → Simon Richardson (simonrichardson)

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

#14

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance:	High → Low
tags:	added: expirebugs-bot

Simon Richardson (simonrichardson) on 2023-05-03

Changed in juju:
assignee:	Simon Richardson (simonrichardson) → nobody

Remote bug watches

lxd #4150
[closed Bug Easy]