Storage provisioner timeouts spawning extra volumes on AWS

Bug #1479546 reported by Adam Israel
Affects: juju-core
Status: Fix Released
Importance: High
Assigned to: Andrew Wilkins
Milestone: 1.25.0

Bug Description

I'm testing integration of the new storage support in 1.24, and I've run into an issue deploying to Amazon where it appears the storage provisioner is timing out and then re-requesting a new volume be created.

I attempted to create a three-node cluster; by the time I killed the environment, it had created 12 SSD volumes according to Amazon's console, but none of the units had gotten past the allocating state.

Environment:
Vagrant VM w/trusty
Juju 1.24.3-trusty-amd64

Steps to recreate:
juju switch amazon
juju bootstrap
juju deploy -n3 --repository=/charms local:trusty/cassandra --storage data=ebs-ssd,10G --constraints "instance-type=i2.8xlarge"

debug-log: http://pastebin.ubuntu.com/11962738/

The charm used (though I suspect it's more the instance type+storage request at fault): https://code.launchpad.net/~aisrael/charms/trusty/cassandra/storage

A screenshot of the Volumes console, several minutes after the juju environment was destroyed:
http://i.imgur.com/xGf6Pqz.png

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Storage is a bit "experimental" in 1.24; 1.25 is much more solid. Can you try with master? I'll see if I can repro with 1.24 later today.

Revision history for this message
Adam Israel (aisrael) wrote :

Yep, I'll build master and give it another try and report back.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

BTW: just took a look at the charm changes, and wanted to point out that mounting in "storage-attached" is not enough. If the agent is restarted, "storage-attached" is not re-run. I would suggest using "type: filesystem" for the storage, unless you need to be able to control the filesystem type and other bits.
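
For reference, a minimal sketch of what the storage stanza in metadata.yaml could look like with that change (the "data" store name matches your deploy command; the rest is illustrative):

    storage:
      data:
        type: filesystem

With that, Juju formats and mounts the store for you, rather than handing the charm a raw block device as "type: block" does.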

Andrew Wilkins (axwalk)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Andrew Wilkins (axwalk)
milestone: none → 1.25.1
Revision history for this message
Adam Israel (aisrael) wrote :

Thanks for the tip! I figured the storage-attached hook would need to do something for the mount to be persistent -- adding the mount to fstab, perhaps.

I went with block initially because I do want to control the filesystem type (ext3/4 vs. xfs, perhaps), as well as various mount options, like noatime. It all adds a little complexity, but goes a long way toward benchmarking and performance-tuning. Is that something "type: filesystem" can do?

Still working on getting master built; I'll have more on that when I start my day.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

So there is a serious issue here: if the provisioning of one volume fails, other volumes cannot be provisioned. I will look into making this more robust.

The other issue is fixed in 1.25: destroying a volume will first ensure the volume is detached.

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Andrew Wilkins (axwalk) wrote :

> I went with block initially because I do want to control the filesystem type (ext3/4 vs. xfs, perhaps), as well as various mount options, like noatime. It all adds a little complexity, but goes a long way toward benchmarking and performance-tuning. Is that something "type: filesystem" can do?

Not at the moment. All you get with "type: filesystem" is ext4. Juju will make sure it's formatted and mounted before installing, and will ensure the mount is maintained whenever the agent restarts. Control over filesystems and mount options was originally in the plan, but was removed due to complexity around required vs. preferred options. If you find yourself repeating this work a lot, it may be worth us reconsidering that position.

Adding an fstab entry would be fine. Just be careful about persistent naming; block device names can change across machine reboots. I suggest using the filesystem UUID.
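
For a block store, the storage-attached hook might do something along these lines (a rough sketch; the mount point and ext4 are illustrative assumptions, and storage-get is the hook tool that reports where the device was attached):

    device=$(storage-get location)              # e.g. /dev/xvdf
    mkfs.ext4 -q "$device"
    uuid=$(blkid -s UUID -o value "$device")    # UUID is stable across reboots
    mkdir -p /srv/cassandra
    echo "UUID=$uuid /srv/cassandra ext4 defaults,noatime 0 2" >> /etc/fstab
    mount /srv/cassandra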

Revision history for this message
John George (jog) wrote :
Revision history for this message
Andrew Wilkins (axwalk) wrote :

John, that's a separate issue; that error does not involve the storage provisioner. We probably need to wait a bit longer for the volume to be associated.

Revision history for this message
Adam Israel (aisrael) wrote :

Hi Andrew,

Confirming that 1.25 trunk fixes the spawning of extra volumes, but it still times out while attempting to create one.

Ian Booth (wallyworld)
Changed in juju-core:
milestone: 1.25.1 → 1.25.0
Revision history for this message
Andrew Wilkins (axwalk) wrote :

Adam, if you could test again with master that would be great. It should be fixed now.

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Adam Israel (aisrael) wrote :

Hey Andrew,

I pulled master this morning and re-ran. I'm still seeing timeouts occur, unfortunately. `juju debug-log` started spewing:

machine-0: 2015-08-10 15:26:18 ERROR juju.worker.storageprovisioner storageprovisioner.go:173 processing pending volumes: creating volumes: creating volumes from source "ebs": attaching vol-746fec95 to i-b7e05865: timed out waiting for volume vol-746fec95 to become available

The AWS console shows that I have three standard 8GB volumes. I checked again a couple of minutes later and a new volume had appeared: my 10GB SSD. I let that spin for several minutes, waiting for Amazon to bring the volume online, and when I checked again it was being deleted and re-initialized.

juju switch amazon
juju bootstrap
juju deploy -n3 --repository=/charms local:trusty/cassandra --storage data=ebs-ssd,10G --constraints "instance-type=i2.8xlarge"

I ran this once, deploying three units, and a second time with just one. I see the SSD being initialized and then deleted. Here's the relevant debug-log from a one-unit deployment:

machine-0: 2015-08-10 16:02:50 ERROR juju.state.unit unit.go:738 unit cassandra/0 cannot get assigned machine: unit "cassandra/0" is not assigned to a machine
machine-0: 2015-08-10 16:03:08 ERROR juju.worker.storageprovisioner volumes.go:407 failed to create volume 0: cannot attach to non-running instance i-7a9a22a8
machine-0: 2015-08-10 16:03:44 ERROR juju.worker.storageprovisioner volumes.go:407 failed to create volume 0: attaching vol-800a8961 to i-7a9a22a8: timed out waiting for volume vol-800a8961 to become available
machine-0: 2015-08-10 16:04:50 ERROR juju.worker.storageprovisioner volumes.go:407 failed to create volume 0: attaching vol-5c098abd to i-7a9a22a8: timed out waiting for volume vol-5c098abd to become available
machine-0: 2015-08-10 16:06:56 ERROR juju.worker.storageprovisioner volumes.go:407 failed to create volume 0: attaching vol-6e088b8f to i-7a9a22a8: timed out waiting for volume vol-6e088b8f to become available
machine-0: 2015-08-10 16:11:01 ERROR juju.worker.storageprovisioner volumes.go:407 failed to create volume 0: attaching vol-1f0c8ffe to i-7a9a22a8: timed out waiting for volume vol-1f0c8ffe to become available

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Indeed, confirmed there's still something not quite right. Investigating.

Changed in juju-core:
status: Fix Committed → In Progress
Revision history for this message
Andrew Wilkins (axwalk) wrote :
Revision history for this message
Adam Israel (aisrael) wrote :

Hey Andrew,

Success! The volume mounted on the first try, and I've been able to use it as a block device, formatting and mounting it.

I do get this message in debug-log, immediately after the volume switched to "in-use", after each time the config-changed hook finished, and periodically while idle:

machine-1[3941]: 2015-08-11 06:09:23 ERROR juju.worker.storageprovisioner volumes.go:481 attaching volume: querying instance details: Credential must have exactly 5 slash-delimited elements, e.g. keyid/date/region/service/term, got 'not' (AuthFailure)

Everything seems to be working, other than that message. Thanks for all your work tracking this down!

Revision history for this message
Adam Israel (aisrael) wrote :

Side note:

$ juju destroy-environment -y amazon
ERROR failed to destroy environment "amazon"

If the environment is unusable, then you may run

    juju destroy-environment --force

to forcefully destroy the environment. Upon doing so, review
your environment provider console for any resources that need
to be cleaned up. Using force will also by-pass destroy-envrionment block.

ERROR environment destruction failed: destroying environment: failed to destroy environment: Environment cannot be destroyed until all persistent volumes have been destroyed.
Run "juju storage list" to display persistent storage volumes.

--

There's a typo: destroy-envrionment should be destroy-environment

Also, it's not clear from `juju storage help` or any of the storage subcommands *how* to destroy a volume. I found a workaround mentioned on the WIP storage doc page (run destroy-environment with --force). Is it safe to assume that functionality is targeted for 1.25.0, and that the -detach hook, when fired, should unmount the device?

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Great, thanks for confirming.

The error message you noted is due to lp:1483492, which I just noticed today as well. It's not actively harmful, it's just log spam.

The destroy-environment restriction should be removed soon. At the moment you must destroy the machines that the volumes are attached to, which causes the volumes to be destroyed as well. Since we destroy all volumes when the environment is destroyed anyway, I'll be removing that check from destroy-environment.
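
In the meantime, the sequence is roughly this (service name and machine IDs per your deployment; adjust as needed):

    juju destroy-service cassandra       # remove the units first
    juju destroy-machine 1 2 3           # destroying the machines destroys their volumes
    juju storage list                    # confirm no volumes remain
    juju destroy-environment -y amazon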

Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Andrew Wilkins (axwalk) wrote :

FYI, on master destroy-environment will no longer be prevented by the presence of volumes.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released