OpenStack provider doesn't try another AZ if the scheduler fails to find a valid host
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju | | High | Heather Lanigan | |
| juju-core | | High | Unassigned | |
| 1.25 | | Undecided | Unassigned | |
Bug Description
Juju's OpenStack provider doesn't fall back to an alternate AZ if an instance in the picked AZ fails to find a host to start on.
In my case, my development tenant isn't allowed to start instances in one of the two AZs. Juju tries to distribute a unit onto the bad AZ, the instance gets created successfully, but is then quickly set to ERROR when nova-scheduler fails to find any valid hosts. There's no way to deploy services in this environment without a manual "juju add-machine zone=foo" and "juju deploy bar --to 1234".
The provider is OpenStack (icehouse from trusty-updates) and my client is juju-core 1.21.3-
Console transcript: http://
machine-0.log: http://
environments.yaml: http://
| William Grant (wgrant) wrote : | #1 |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → 1.23 |
| Changed in juju-core: | |
| milestone: | 1.23 → none |
| importance: | High → Medium |
| tags: | added: openstack-provider |
| Changed in juju-core: | |
| importance: | Medium → High |
Did a cursory look into this to see how difficult it would be to fix. It looks like we attempt to handle the case of one AZ failing, but if this bug is still valid, we obviously don't do so successfully.
Once we determine why that is, this should indeed be an easy one to fix. The thing which may be at fault is this: https:/
The code which appears to try and do the correct thing is this: https:/
| Changed in juju-core: | |
| assignee: | nobody → Katherine Cox-Buday (cox-katherine-e) |
Per James, they last saw this in early 1.2x's. We need reconfirmation that this is happening with fresh logs.
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| Anastasia (anastasia-macmood) wrote : | #4 |
@wgrant - Please confirm that this is still an issue on 1.25.x and 2.0.betaX.
If it's still an issue, please provide newer logs \o/
| Changed in juju-core: | |
| assignee: | Katherine Cox-Buday (cox-katherine-e) → nobody |
| tags: | added: canonical-is |
| Anastasia (anastasia-macmood) wrote : | #5 |
From elmo: this is a very hard bug to test - it requires breaking a cloud.
So, unless something has been explicitly done to cater for this scenario, we need to get an OpenStack that would allow us to reproduce it: one where the tenant isn't allowed to start instances in one of the two AZs, as per the bug description.
| Changed in juju-core: | |
| status: | Incomplete → Triaged |
| Changed in juju-core: | |
| milestone: | none → 2.0.0 |
| Richard Harding (rharding) wrote : Re: [Bug 1425808] Re: OpenStack provider doesn't try another AZ if the scheduler fails to find a valid host | #6 |
We should be able to work with folks on an Orangebox to help validate this.
I don't think we've actually done anything to mitigate this to date, however.
On Wed, Aug 3, 2016 at 7:41 AM Anastasia <email address hidden>
wrote:
> From elmo: this is a very hard bug to test - it requires breaking a
> cloud.
>
> So, unless something has been explicitly done to cater for this
> scenario, we need to get an Openstack that would allow us to reproduce -
> where tenant isn't allowed to start instances in one of the two AZs as
> per bug description.
>
> ** Changed in: juju-core/1.25
> Status: Incomplete => Triaged
>
> ** Changed in: juju-core
> Status: Incomplete => Triaged
>
> ** Changed in: juju-core
> Milestone: None => 2.0.0
>
> ** Changed in: juju-core/1.25
> Milestone: None => 1.25.7
>
| affects: | juju-core → juju |
| Changed in juju: | |
| milestone: | 2.0.0 → none |
| milestone: | none → 2.0.0 |
| Changed in juju-core: | |
| importance: | Undecided → High |
| status: | New → Triaged |
| Changed in juju-core: | |
| status: | Triaged → Won't Fix |
| Changed in juju: | |
| milestone: | 2.0.0 → 2.0.1 |
| William Grant (wgrant) wrote : | #7 |
This is still present in 1.25.6. If one AZ fills up, the Nova instance will fall into error quickly and the Juju machine remains pending forever.
| Changed in juju: | |
| milestone: | 2.0.1 → none |
| Matt Jarvis (matt-jarvis) wrote : | #8 |
I think this is still present in 2.0.0
I have an OpenStack public cloud where there is an AZ with a tenant filter on it. I've managed to bootstrap by specifying the --to flag, but when I try and deploy a bundle, the instances error out trying to deploy into the unavailable AZ, and then everything just sits in pending.
| Changed in juju: | |
| milestone: | none → 2.2.0 |
| Changed in juju: | |
| milestone: | 2.2-beta1 → 2.2-beta2 |
| Changed in juju: | |
| milestone: | 2.2-beta2 → 2.2-beta3 |
| John A Meinel (jameinel) wrote : | #9 |
Are we getting a clear error that it is failing because there are 'no available hosts in this az', or are we just getting an 'unable to provision instances at this time' failure?
| Changed in juju: | |
| status: | Triaged → Incomplete |
| milestone: | 2.2-beta3 → none |
| Heather Lanigan (hmlanigan) wrote : | #10 |
Part of this issue is resolved with: https:/
It allows the OpenStack provider to look at the fault message while the instance is in
an error state to determine whether the error was caused by 'No valid host'. It was tested against an OpenStack
producing the 'No valid host' message; however, there were not multiple AZs available to complete testing.
There is existing code to try another AZ if the current one fails with 'No valid host'.
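As an illustration of the fallback described above, here is a minimal Python sketch. It is not Juju's actual Go code: `NoValidHostError`, `start_with_zone_fallback`, `start_in_zone` and the zone names are all hypothetical names chosen for the example. It only shows the idea of moving to the next AZ when the failure is specifically a "No valid host" one.

```python
from typing import Callable, Iterable


class NoValidHostError(Exception):
    """Hypothetical error: the scheduler found no host in the requested zone."""


def start_with_zone_fallback(
    zones: Iterable[str],
    start_in_zone: Callable[[str], str],
) -> str:
    """Try each zone in order and return the first instance id that starts.

    `start_in_zone` is a hypothetical callable that returns an instance id
    or raises NoValidHostError. Any other exception propagates unchanged.
    """
    failed = []
    for zone in zones:
        try:
            return start_in_zone(zone)
        except NoValidHostError:
            # Only this failure is treated as zone-specific, so move on.
            failed.append(zone)
    raise NoValidHostError(f"no valid host in any zone: {failed}")


def fake_start(zone: str) -> str:
    """Stand-in for a real provisioning call; zone "first" always fails."""
    if zone == "first":
        raise NoValidHostError(zone)
    return f"instance-in-{zone}"


print(start_with_zone_fallback(["first", "second"], fake_start))
```

Other errors are deliberately not retried in this sketch, since they may not be specific to the zone.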
| Matt Jarvis (matt-jarvis) wrote : | #11 |
The exact error from OpenStack is "No valid host was found. There are not enough hosts available."
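For illustration only, a check against that message could look like the Python sketch below. The helper name `is_no_valid_host` is an assumption, and matching on the leading phrase rather than the full sentence is a design choice of this sketch, not something confirmed by the Juju source.

```python
NO_VALID_HOST = "No valid host"


def is_no_valid_host(fault_message: str) -> bool:
    """Return True if a fault message reports a scheduler placement failure."""
    # Case-insensitive substring match, so trailing detail in the message
    # (such as "There are not enough hosts available.") does not matter.
    return NO_VALID_HOST.lower() in fault_message.lower()


print(is_no_valid_host(
    "No valid host was found. There are not enough hosts available."
))
```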
| Tim Kuhlman (timkuhlman) wrote : | #12 |
I believe the problem here is that Nova processes the request asynchronously and Juju doesn't check back for errors. The initial server creation request returns success, and then, as William mentioned, the instance transitions to an error state once nova-scheduler starts working on it. The error handling in Juju only covers the case where Nova returns the error as part of the original request, so Juju never picks up on the later failure.
The first step of fixing this is to have goose return the fault when getting server details.
https:/
Then Juju can grab the server details and check for the fault.
https:/
There is just one major problem: I have not been able to get a test environment working that replicates the issue, so I can't validate my fix. I'll keep working on that, but any help in replicating it would be appreciated.
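The approach described in this comment (create the server, then keep checking its details for a fault) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the goose or Juju implementation: `wait_for_active`, `InstanceFailed` and `get_server` are hypothetical names, and the dict shape mirrors the `status` and `fault.message` fields of a Nova server details response.

```python
import time
from typing import Callable


class InstanceFailed(Exception):
    """Raised when the server ends up in ERROR; carries the fault message."""


def wait_for_active(
    get_server: Callable[[], dict],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll server details until ACTIVE, raising on ERROR or timeout.

    `get_server` is a hypothetical callable returning a dict shaped like
    {"status": "...", "fault": {"message": "..."}}.
    """
    deadline = time.monotonic() + timeout
    while True:
        server = get_server()
        status = server.get("status")
        if status == "ACTIVE":
            return server
        if status == "ERROR":
            # The create call already succeeded; the failure only shows up here.
            fault = server.get("fault") or {}
            raise InstanceFailed(fault.get("message", "unknown fault"))
        if time.monotonic() >= deadline:
            raise TimeoutError("server did not become ACTIVE in time")
        sleep(interval)
```

The `sleep` parameter exists only so the loop can be exercised without real delays; a caller would catch `InstanceFailed` and inspect the message to decide whether to retry elsewhere.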
| Heather Lanigan (hmlanigan) wrote : | #13 |
@timkuhlman,
Agreed. The update to goose to return fault data in ServerDetail was merged today via https:/
The PR for the juju piece is currently under review.
| Heather Lanigan (hmlanigan) wrote : | #14 |
Here is the juju pr: https:/
| Heather Lanigan (hmlanigan) wrote : | #15 |
I was able to partially reproduce/simulate the "No valid host" try another AZ code path with juju built with the above PRs:
17:41:43 INFO juju.provider.
17:41:54 INFO juju.provider.
17:41:54 INFO juju.provider.
17:41:54 INFO juju.provider.
17:41:54 INFO juju.provider.
17:42:05 ERROR juju.cmd.
Unfortunately the instance in availability zone "second" failed to build due to how I was able to trigger the error. The debug output at least confirms that the code to try another AZ if the one tried failed with "No valid host" is working, even if we don't have a clean reproducer yet.
| Heather Lanigan (hmlanigan) wrote : | #16 |
Without an openstack where this issue is reproducible, we've gotten as far as possible towards resolution with the commit of PR 7300. We have confirmed that the juju provider will correctly find "No valid host" errors and try another AZ if that error is hit.
| Changed in juju: | |
| assignee: | nobody → Heather Lanigan (hmlanigan) |
| milestone: | none → 2.2-beta4 |
| status: | Incomplete → In Progress |
| status: | In Progress → Fix Committed |
| Matt Jarvis (matt-jarvis) wrote : | #17 |
I have a cloud on which I can reproduce this. Are there nightly builds of Juju I can install? I did try building it but had some issues with dependencies. I am at ODS in Boston until next week but can test when I get back.
| Heather Lanigan (hmlanigan) wrote : | #18 |
@matt-jarvis, wonderful! I believe the 'edge' channel of the juju snap is a nightly build.
| Matt Jarvis (matt-jarvis) wrote : | #19 |
@hmlanigan - so with juju 2.2-rc1 from snap the availability zone problem seems to be fixed :
16:01:16 INFO juju.provider.
16:01:17 INFO juju.provider.
16:01:17 INFO juju.provider.
16:01:18 INFO juju.provider.
16:01:18 INFO juju.provider.
However, I now have an issue with assigning a floating IP :
16:01:39 DEBUG juju.provider.
16:01:40 ERROR juju.cmd.
16:01:40 DEBUG juju.cmd.
There is definitely an external network defined in that tenant, and it has floating IPs available. I will have a dig into the code and see if I can work out what it's looking for at that point.
| Matt Jarvis (matt-jarvis) wrote : | #20 |
@hmlanigan I suspect from a brief look at the code that this may also be availability zone related, although I haven't had a chance to dig into it
matt@frankenpad
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
404 Not Found
The resource could not be found.
Neutron server returns request_ids: ['req-c12d2048-
matt@frankenpad
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+------
| id | name | subnets |
+------
| 6751cb30-
| | | b839c2c8-
| | | e71eb7f6-
+------
There is definitely an external network there as you can see.
| Heather Lanigan (hmlanigan) wrote : | #21 |
@matt-jarvis, sounds like you're hitting https:/
| Matt Jarvis (matt-jarvis) wrote : | #22 |
Yup, sounds like it.
| Changed in juju: | |
| status: | Fix Committed → Fix Released |


(Ignore the fact that both the good and the bad instances appear to be in the "nova" AZ. juju-bootstack-ci-machine-1 is in "production", but an OpenStack bug causes instances with scheduler failures to show up in the default AZ.)