juju does not consider whether it has permission to an availability zone

Bug #1380557 reported by Evan
This bug affects 2 people
Affects     Status         Importance   Assigned to   Milestone
juju-core   Fix Released   High         Ian Booth
1.20        Fix Released   High         Ian Booth

Bug Description

I've split our Bootstack cloud into two aggregates such that we can have different cpu_allocation_ratios for each:
 - One aggregate named 'development' that's using the default 'nova' AZ. Everyone has access to this.
 - The other using an AZ and aggregate named 'production'. This is locked down to specific tenants using filter_tenant_id.

Juju sees that there are two AZs and, once the bootstrap is up on 'development', tries to deploy all services to 'production' despite not having access to it. This fails:

| fault | {u'message': u'No valid host was found. ', u'code': 500, u'created': u'2014-10-13T08:54:26Z'} |

Weirdly it seems intent on using the production AZ. Even if I pre-populate the environment using `juju add-machine zone=nova --constraints="mem=1024M"`, juju ignores the ready instances and tries to spawn more under production. Even when explicitly placing the bootstrap on the nova AZ (--to zone=nova) and again pre-populating there, it tries to place units on production.

Evan (ev)
tags: added: ubuntu-engineering
Evan (ev) wrote :

I've just confirmed this bug does not occur on juju 1.18.4. The deployment sticks to the nova AZ.

Evan (ev) wrote :

We're working around this right now by using juju 1.18 for deployments to the nova AZ, and 1.20 for deployments to the production AZ.

John George (jog)
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
tags: added: add-machine
tags: added: cts
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.21-alpha2
Andrew Wilkins (axwalk) wrote :

I think it's possible to do something similar to what we do in the ec2 provider, where we check the availability zone's health. In this case, we'd check the host aggregate's metadata for filter_tenant_id metadata.

It'd probably be easier and more future-proof if we just tried each of the AZs in turn, like in the ec2 provider, though.
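
For illustration only, the first option (checking host aggregate metadata) could look roughly like the sketch below. This is not juju code. The `Aggregate` type and `zoneAllowsTenant` function are hypothetical names, and treating `filter_tenant_id` as a comma-separated list is an assumption. It also assumes the aggregate metadata is readable with the tenant's credentials, which may not be the case on a given cloud.

```go
package main

import (
	"fmt"
	"strings"
)

// Aggregate is a hypothetical, simplified view of a nova host aggregate.
type Aggregate struct {
	AvailabilityZone string            // empty when the aggregate sets no AZ
	Metadata         map[string]string // e.g. {"filter_tenant_id": "tenant-a"}
}

// zoneAllowsTenant reports whether tenantID appears to be allowed to
// schedule into zone, judging only by filter_tenant_id metadata.
// A zone with no matching aggregate is treated as unrestricted.
func zoneAllowsTenant(aggs []Aggregate, zone, tenantID string) bool {
	sawZone := false
	for _, a := range aggs {
		if a.AvailabilityZone != zone {
			continue
		}
		sawZone = true
		filter, restricted := a.Metadata["filter_tenant_id"]
		if !restricted {
			return true // an unrestricted aggregate backs this zone
		}
		// Assumption: the value may list several tenants, comma-separated.
		for _, id := range strings.Split(filter, ",") {
			if strings.TrimSpace(id) == tenantID {
				return true
			}
		}
	}
	return !sawZone
}

func main() {
	aggs := []Aggregate{
		{AvailabilityZone: "", Metadata: nil}, // "development", default nova AZ
		{AvailabilityZone: "production", Metadata: map[string]string{"filter_tenant_id": "tenant-a"}},
	}
	fmt.Println(zoneAllowsTenant(aggs, "production", "tenant-b"))
	fmt.Println(zoneAllowsTenant(aggs, "nova", "tenant-b"))
}
```

The second option avoids depending on this metadata at all, which is why it is likely the more future-proof of the two.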

Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Paul Larson (pwlars) wrote :

I'm on 1.20.11 on trusty and still getting the error: {u'message': u'No valid host was found. ', u'code': 500, u'created'...
I'm having to use the workaround of running `juju add-machine zone=production` and then deploying my services by hand to that machine.

Evan (ev) wrote :

To clarify Paul's comment, with the two aggregates mentioned in the bug description (nova and production), he's able to explicitly bootstrap to production, as was always the case. However, when he tries to deploy further units it fails. It's still trying to place them on nova and not continuing on to production.

Ian Booth (wallyworld) wrote :

Would it be possible to get the state server log attached? There could be a few reasons why this has failed. The openstack provider attempts to list the availability zones using the "os-availability-zone" api. It then cycles through these and if they are marked as available, will try allocating a new instance on each one until it succeeds. But the openstack implementation could return a not implemented error. The logic used to do the placement is the same as used for the EC2 provider.

Assuming the zones extension is enabled and there are valid availability zones, the log file should contain messages like:

"no valid hosts available in zone <zonename>, trying another availability zone"
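
A minimal sketch of the zone fallback described above, for illustration only. It is not the provider's actual code: `Zone`, `StartFunc`, `ErrNoValidHost` and `startInFirstUsableZone` are hypothetical names, and the real provider talks to the cloud rather than to a callback.

```go
package main

import (
	"errors"
	"fmt"
)

// Zone is a hypothetical, simplified view of an availability zone.
type Zone struct {
	Name      string
	Available bool
}

// ErrNoValidHost stands in for the cloud's "No valid host was found" fault.
var ErrNoValidHost = errors.New("no valid host was found")

// StartFunc is a hypothetical callback that tries to start an instance in a zone.
type StartFunc func(zone string) (instanceID string, err error)

// startInFirstUsableZone tries each available zone in order and returns the
// first instance that starts. A "no valid host" failure moves on to the next
// zone; any other error aborts immediately.
func startInFirstUsableZone(zones []Zone, start StartFunc) (id, zone string, err error) {
	for _, z := range zones {
		if !z.Available {
			continue
		}
		id, err := start(z.Name)
		if err == nil {
			return id, z.Name, nil
		}
		if !errors.Is(err, ErrNoValidHost) {
			return "", "", err
		}
		fmt.Printf("no valid hosts available in zone %s, trying another availability zone\n", z.Name)
	}
	return "", "", ErrNoValidHost
}

func main() {
	zones := []Zone{
		{Name: "production", Available: true},
		{Name: "nova", Available: true},
	}
	// Simulate a tenant that is filtered out of "production".
	start := func(zone string) (string, error) {
		if zone == "production" {
			return "", ErrNoValidHost
		}
		return "instance-in-" + zone, nil
	}
	id, zone, err := startInFirstUsableZone(zones, start)
	fmt.Println(id, zone, err)
}
```

With this shape, a tenant locked out of one zone still ends up with an instance as long as some later zone accepts the request.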

It may be that we need to get access to the cloud you are trying to deploy on to be able to reproduce the problem.

Paul Larson (pwlars) wrote :

I think this should have what you need. Let me know if you need anything else.

Paul Larson (pwlars) wrote :

Here's another deployment log, this time with `juju set-env "logging-config=<root>=DEBUG;unit=DEBUG"` applied.

Ian Booth (wallyworld) wrote :

The latest attached log file contains the following:

The bootstrap machine was started as instance bfd3e2ef-ba7b-4fdb-961c-65c6d55a8b06

A new machine was provisioned and was started as instance 1fc27598-849a-4346-b776-44101863b281
This would have been in response to a request to deploy charm cs:precise/apache2-25

The logs did not show any issues with availability zones, nor what zone was chosen as I don't think we log that. However, that the instance was started means it got past the zone selection part.

What then happens is that Juju will poll openstack to obtain address information about the newly created instance. It would call the servers/1fc27598-849a-4346-b776-44101863b281 API. The result of this call was "no servers found". That means that the openstack cloud did not respond to that api call with an instance in ACTIVE or BUILD state. That could also explain the 500 error observed.
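
The polling step described above can be pictured roughly as follows. This is a simplified sketch, not juju's implementation: `StatusFunc`, `waitForServer` and `ErrNoServersFound` are hypothetical names, and the attempt count and delay are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrNoServersFound mirrors the "no servers found" result seen in the log.
var ErrNoServersFound = errors.New("no servers found")

// StatusFunc is a hypothetical callback that asks the cloud for an
// instance's status, e.g. via the servers/<id> API.
type StatusFunc func(instanceID string) (status string, err error)

// waitForServer polls until the instance is reported as ACTIVE or BUILD,
// and gives up with ErrNoServersFound after the given number of attempts.
func waitForServer(id string, get StatusFunc, attempts int, delay time.Duration) (string, error) {
	for i := 0; i < attempts; i++ {
		status, err := get(id)
		if err == nil && (status == "ACTIVE" || status == "BUILD") {
			return status, nil
		}
		time.Sleep(delay)
	}
	return "", ErrNoServersFound
}

func main() {
	calls := 0
	// Simulate a cloud that only reports the instance on the third call.
	get := func(string) (string, error) {
		calls++
		if calls < 3 {
			return "", errors.New("not visible yet")
		}
		return "BUILD", nil
	}
	status, err := waitForServer("1fc27598-849a-4346-b776-44101863b281", get, 5, 10*time.Millisecond)
	fmt.Println(status, err)
}
```

If the cloud never reports the instance in one of those two states, the caller sees only the final "no servers found" error, which matches what the log shows.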

So, it looks like Juju has managed to start an instance on which to place charm apache2-25, but the openstack cloud never marks that instance as active. Some more diagnostics on the cloud and the instance itself would be required to find out why. On what basis do we think this is still an issue with availability zones? It may be that this is a new issue.

One other point is apparent - the instance above to host the apache charm is tagged by juju as machine-2. I'm not sure what happened to machine-1. There's nothing in the logs.

Paul Larson (pwlars) wrote :

Right, as I mentioned, this is a demo environment that I can safely reproduce in without affecting our production services. machine-1 was the first attempt; when you asked me to add the debug settings, I deployed another with that set.

We still see the issue in our production environment as well, and always have to run 'juju add-machine zone=production' before trying to deploy or we see this problem.

If it helps, I'm happy to tear down this whole environment and run it from scratch. Just let me know whatever additional information you need me to gather.

Ian Booth (wallyworld) wrote :

Would it be possible to give us credentials to access the demo environment and steps to reproduce? We can then experiment and add ad hoc debugging to find out what is happening. It's hard to see exactly what's going on with the level of debugging currently available in the code.
