juju-core

juju does not consider whether it has permission to an availability zone

Bug #1380557 reported by Evan on 2014-10-13

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	juju-core	Fix Released	High	Ian Booth	juju-core 1.21-alpha2
	1.20	Fix Released	High	Ian Booth	juju-core 1.20.11

Bug Description

I've split our Bootstack cloud into two aggregates such that we can have different cpu_allocation_ratios for each:
- One aggregate named 'development' that's using the default 'nova' AZ. Everyone has access to this.
- The other using an AZ and aggregate named 'production'. This is locked down to specific tenants using filter_tenant_id.

Juju sees that there are two AZs and, once the bootstrap is up on 'development', tries to deploy all services to 'production' despite not having access to it. This fails:

| fault | {u'message': u'No valid host was found. ', u'code': 500, u'created': u'2014-10-13T08:54:26Z'} |

Weirdly it seems intent on using the production AZ. Even if I pre-populate the environment using `juju add-machine zone=nova --constraints="mem=1024M"`, juju ignores the ready instances and tries to spawn more under production. Even when explicitly placing the bootstrap on the nova AZ (--to zone=nova) and again pre-populating there, it tries to place units on production.

Evan (ev) on 2014-10-13

tags:	added: ubuntu-engineering

Evan (ev) wrote on 2014-10-13:

I've just confirmed this bug does not occur on juju 1.18.4. The deployment sticks to the nova AZ.

Evan (ev) wrote on 2014-10-14:

We're working around this right now by using juju 1.18 for deployments to the nova AZ, and 1.20 for deployments to the production AZ.

John George (jog) on 2014-10-15

Changed in juju-core:
importance:	Undecided → High
status:	New → Triaged
tags:	added: add-machine

Jorge Niedbalski (niedbalski) on 2014-10-15

tags:	added: cts

Curtis Hovey (sinzui) on 2014-10-15

Changed in juju-core:
milestone:	none → 1.21-alpha2

Andrew Wilkins (axwalk) wrote on 2014-10-16:

I think it's possible to do something similar to what we do in the ec2 provider, where we check the availability zone's health. In this case, we'd check the host aggregate's metadata for filter_tenant_id metadata.

It'd probably be easier and more future-proof if we just tried each of the AZs in turn, like in the ec2 provider, though.

Ian Booth (wallyworld) on 2014-10-21

Changed in juju-core:
assignee:	nobody → Ian Booth (wallyworld)
status:	Triaged → In Progress

Ian Booth (wallyworld) on 2014-10-21

Changed in juju-core:
status:	In Progress → Fix Committed

Curtis Hovey (sinzui) on 2014-10-23

Changed in juju-core:
status:	Fix Committed → Fix Released

Paul Larson (pwlars) wrote on 2014-11-11:

I'm on 1.20.11 on trusty and still getting the error: {u'message': u'No valid host was found. ', u'code': 500, u'created'...
I'm having to use the workaround of juju add-machine zone=production and then deploy my services by hand to the machine

Evan (ev) wrote on 2014-11-11:

To clarify Paul's comment, with the two aggregates mentioned in the bug description (nova and production), he's able to explicitly bootstrap to production, as was always the case. However, when he tries to deploy further units it fails. It's still trying to place them on nova and not continuing on to production.

Ian Booth (wallyworld) wrote on 2014-11-12:

Would it be possible to get the state server log attached? There could be a few reasons why this has failed. The openstack provider attempts to list the availability zones using the "os-availability-zone" api. It then cycles through these and if they are marked as available, will try allocating a new instance on each one until it succeeds. But the openstack implementation could return a not implemented error. The logic used to do the placement is the same as used for the EC2 provider.

Assuming the zones extension is enabled and there are valid availability zones, the log file should contain messages like:

"no valid hosts available in zone <zonename>, trying another availability zone"

It may be that we need to get access to the cloud you are trying to deploy on to be able to reproduce the problem.
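
The placement behaviour described in this comment can be sketched roughly as follows. This is only an illustrative sketch in Go, not the actual juju-core provider code: the `Zone` type, the `startFunc` callback, `errNoValidHost`, and `startInstance` are hypothetical names invented here, and the real provider may treat errors differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Zone is a hypothetical, simplified view of an availability zone as
// reported by the cloud (not a juju-core type).
type Zone struct {
	Name      string
	Available bool
}

// errNoValidHost is a stand-in for the "No valid host was found" fault.
var errNoValidHost = errors.New("no valid host was found")

// startFunc is a placeholder for the provider call that boots an
// instance in the named zone.
type startFunc func(zone string) (instanceID string, err error)

// startInstance tries each available zone in turn and returns the
// instance ID and zone of the first successful attempt.
func startInstance(zones []Zone, start startFunc) (string, string, error) {
	for _, z := range zones {
		if !z.Available {
			continue // skip zones the cloud marks as unavailable
		}
		id, err := start(z.Name)
		if err == nil {
			return id, z.Name, nil
		}
		if !errors.Is(err, errNoValidHost) {
			return "", "", err // unrelated failure: do not keep trying
		}
		fmt.Printf("no valid hosts available in zone %s, trying another availability zone\n", z.Name)
	}
	return "", "", errors.New("no availability zone could host the instance")
}

func main() {
	zones := []Zone{
		{Name: "production", Available: true},
		{Name: "nova", Available: true},
	}
	// Simulate a tenant that is not permitted to use "production".
	start := func(zone string) (string, error) {
		if zone == "production" {
			return "", errNoValidHost
		}
		return "instance-in-" + zone, nil
	}
	id, zone, err := startInstance(zones, start)
	fmt.Println(id, zone, err)
}
```

In this sketch a tenant locked out of 'production' falls through to 'nova' after one failed attempt, which is the behaviour the fix is expected to produce.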

Paul Larson (pwlars) wrote on 2014-11-17:

juju-debug.log (19.1 KiB, text/plain)

I think this should have what you need. Let me know if you need anything else.

Paul Larson (pwlars) wrote on 2014-11-17:

juju-debug2.log (93.5 KiB, text/html)

Here's another deployment log with juju set-env "logging-config=<root>=DEBUG;unit=DEBUG"

Ian Booth (wallyworld) wrote on 2014-11-18:

The latest attached log file contains the following:

The bootstrap machine was started as instance bfd3e2ef-ba7b-4fdb-961c-65c6d55a8b06

A new machine was provisioned and was started as instance 1fc27598-849a-4346-b776-44101863b281
This would have been in response to a request to deploy charm cs:precise/apache2-25

The logs did not show any issues with availability zones, nor what zone was chosen as I don't think we log that. However, that the instance was started means it got past the zone selection part.

What then happens is that Juju will poll openstack to obtain address information about the newly created instance. It would call the servers/1fc27598-849a-4346-b776-44101863b281 API. The result of this call was "no servers found". That means that the openstack cloud did not respond to that api call with an instance in ACTIVE or BUILD state. That could also explain the 500 error observed.

So, it looks like Juju has managed to start an instance on which to place charm apache2-25, but the openstack cloud never marks that instance as active. Some more diagnostics on the cloud and the instance itself would be required to find out why. On what basis do we think this is still an issue with availability zones? It may be that this is a new issue.

One other point is apparent - the instance above to host the apache charm is tagged by juju as machine-2. I'm not sure what happened to machine-1. There's nothing in the logs.

Paul Larson (pwlars) wrote on 2014-11-19:

Right, as I mentioned, this is a demo environment that I can safely reproduce in without affecting our production services. Machine-1 was the first attempt; when you asked me to add the debug settings, I deployed another with that set.

We still see the issue in our production environment as well, and always have to run 'juju add-machine zone=production' before trying to deploy or we see this problem.

If it helps, I'm happy to tear down this whole environment and run it from scratch. Just let me know whatever additional information you need me to gather.

Ian Booth (wallyworld) wrote on 2014-11-21:

Would it be possible to give us credentials to access the demo environment and steps to reproduce? We can then experiment and add ad hoc debugging to find out what is happening. It's hard to see exactly what's going on with the level of debugging currently available in the code.
