OpenStack provider doesn't try another AZ if the scheduler fails to find a valid host

Bug #1425808 reported by William Grant
This bug affects 8 people
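+----------------+--------------+------------+-----------------+-----------+
| Affects        | Status       | Importance | Assigned to     | Milestone |
+----------------+--------------+------------+-----------------+-----------+
| Canonical Juju | Fix Released | High       | Heather Lanigan |           |
| juju-core      | Won't Fix    | High       | Unassigned      |           |
| 1.25           | Won't Fix    | Undecided  | Unassigned      |           |
+----------------+--------------+------------+-----------------+-----------+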

Bug Description

Juju's OpenStack provider doesn't fall back to an alternate AZ if an instance in the picked AZ fails to find a host to start on.

In my case, my development tenant isn't allowed to start instances in one of the two AZs. Juju tries to distribute a unit onto the bad AZ, the instance gets created successfully, but is then quickly set to ERROR when nova-scheduler fails to find any valid hosts. There's no way to deploy services in this environment without a manual "juju add-machine zone=foo" and "juju deploy bar --to 1234".

The provider is OpenStack (icehouse from trusty-updates) and my client is juju-core 1.21.3-0ubuntu1~15.04.1~juju1 from ppa:juju/stable.

Console transcript: http://pastebin.ubuntu.com/10422240/
machine-0.log: http://pastebin.ubuntu.com/10422270/
environments.yaml: http://paste.ubuntu.com/10422300/

Revision history for this message
William Grant (wgrant) wrote :

(Ignore the fact that both the good and the bad instances appear to be in the "nova" AZ. juju-bootstack-ci-machine-1 is in "production", but an OpenStack bug causes instances with scheduler failures to show up in the default AZ.)

Andrew Wilkins (axwalk)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.23
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.23 → none
importance: High → Medium
tags: added: openstack-provider
Changed in juju-core:
importance: Medium → High
Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

I took a cursory look at this to see how difficult it would be to fix. It looks like we attempt to handle the case of one AZ failing, but if this bug is still valid, we obviously don't do so successfully.

Once we determine why that is, this should indeed be an easy one to fix. The thing which may be at fault is this: https://github.com/juju/juju/blob/master/provider/openstack/provider.go#L1074

The code which appears to try and do the correct thing is this: https://github.com/juju/juju/blob/master/provider/openstack/provider.go#L1010-L1019

Changed in juju-core:
assignee: nobody → Katherine Cox-Buday (cox-katherine-e)
Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

Per James, they last saw this in early 1.2x's. We need reconfirmation that this is happening with fresh logs.

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@wgrant - Please confirm that this is still an issue on 1.25.x and 2.0.betaX.

If it's still an issue, please provide newer logs \o/

Changed in juju-core:
assignee: Katherine Cox-Buday (cox-katherine-e) → nobody
Paul Gear (paulgear)
tags: added: canonical-is
Revision history for this message
Anastasia (anastasia-macmood) wrote :

From elmo: this is a very hard bug to test - it requires breaking a cloud.

So, unless something has been explicitly done to cater for this scenario, we need to get an OpenStack that would allow us to reproduce it: one where the tenant isn't allowed to start instances in one of the two AZs, as per the bug description.

Changed in juju-core:
status: Incomplete → Triaged
Changed in juju-core:
milestone: none → 2.0.0
Revision history for this message
Richard Harding (rharding) wrote : Re: [Bug 1425808] Re: OpenStack provider doesn't try another AZ if the scheduler fails to find a valid host

We should be able to work with folks on an Orangebox to help validate this.
I don't think we've actually done anything to mitigate this to date,
however.

On Wed, Aug 3, 2016 at 7:41 AM Anastasia <email address hidden>
wrote:

> From elmo: this is a very hard bug to test - it requires breaking a
> cloud.
>
> So, unless something has been explicitly done to cater for this
> scenario, we need to get an Openstack that would allow us to reproduce -
> where tenant isn't allowed to start instances in one of the two AZs as
> per bug description.
>
> ** Changed in: juju-core/1.25
> Status: Incomplete => Triaged
>
> ** Changed in: juju-core
> Status: Incomplete => Triaged
>
> ** Changed in: juju-core
> Milestone: None => 2.0.0
>
> ** Changed in: juju-core/1.25
> Milestone: None => 1.25.7

affects: juju-core → juju
Changed in juju:
milestone: 2.0.0 → none
milestone: none → 2.0.0
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
Changed in juju-core:
status: Triaged → Won't Fix
Changed in juju:
milestone: 2.0.0 → 2.0.1
Revision history for this message
William Grant (wgrant) wrote :

This is still present in 1.25.6. If one AZ fills up, the Nova instance will fall into error quickly and the Juju machine remains pending forever.

Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0.1 → none
Revision history for this message
Matt Jarvis (matt-jarvis) wrote :

I think this is still present in 2.0.0

I have an OpenStack public cloud where there is an AZ with a tenant filter on it. I've managed to bootstrap by specifying the --to flag, but when I try and deploy a bundle, the instances error out trying to deploy into the unavailable AZ, and then everything just sits in pending.

Changed in juju:
milestone: none → 2.2.0
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta1 → 2.2-beta2
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta2 → 2.2-beta3
Revision history for this message
John A Meinel (jameinel) wrote :

Are we getting a clear error saying that it is failing because there are 'no available hosts in this AZ', or are we just getting an 'unable to provision instances at this time' failure?

Changed in juju:
status: Triaged → Incomplete
milestone: 2.2-beta3 → none
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Part of this issue is resolved with: https://github.com/juju/juju/pull/7300

It will allow the OpenStack provider to look at the fault message while the instance is in
an error state to determine whether the error was caused by 'No valid host'. It was tested in an OpenStack
that produces the 'No valid host' message; however, there were not multiple AZs available to complete testing.

There is existing code to try another AZ if the one being tried fails with 'No valid host'.

Revision history for this message
Matt Jarvis (matt-jarvis) wrote :

The exact error from OpenStack is "No valid host was found. There are not enough hosts available."

Revision history for this message
Tim Kuhlman (timkuhlman) wrote :

I believe the problem here is that Nova processes the request asynchronously and Juju doesn't check back for errors. The initial server creation request returns success, and then, as William mentioned, the instance transitions to an error state once nova-scheduler starts working. The error handling in Juju only covers the case where Nova returns the error as part of the original request, so Juju never picks up on the failure.

The first step of fixing this is to have goose return the fault when getting server details.
https://github.com/tkuhlman/goose/commit/9fc1d747db6d01dfbabecfccf87f6b6bf765b457

Then Juju can grab the server details and check for the fault.
https://github.com/tkuhlman/juju/commit/7dccfcad814421372b8b2139a1ed47588a816dbb

There is just one major problem: I have not been able to get a test environment working that replicates the problem, so I cannot validate my fix. I'll keep working on that, but any help in replicating it would be appreciated.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

@timkuhlman,

Agreed. The update to goose to return fault data in ServerDetail was merged today via https://github.com/go-goose/goose/pull/45. I have a config where I can get "No valid host was found" errors to test part of the code, though there are no additional availability zones.

The PR for the juju piece is currently under review.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

I was able to partially reproduce/simulate the "No valid host" try another AZ code path with juju built with the above PRs:

17:41:43 INFO juju.provider.openstack provider.go:1117 trying to build instance in availability zone "nova"
17:41:54 INFO juju.provider.openstack provider.go:1097 Instance "acfd0706-85b8-4da2-b506-3460d90d5474" in ERROR state with fault "No valid host was found. There are not enough hosts available."
17:41:54 INFO juju.provider.openstack provider.go:1098 Deleting instance "acfd0706-85b8-4da2-b506-3460d90d5474" in ERROR state
17:41:54 INFO juju.provider.openstack provider.go:1126 failed to build instance in availability zone "nova"
17:41:54 INFO juju.provider.openstack provider.go:1117 trying to build instance in availability zone "second"
17:42:05 ERROR juju.cmd.juju.commands bootstrap.go:491 failed to bootstrap model: cannot start bootstrap instance: cannot run instance: No valid host was found. There are not enough hosts available.

Unfortunately the instance in availability zone "second" failed to build due to how I was able to trigger the error. The debug output at least confirms that the code to try another AZ if the one tried failed with "No valid host" is working, even if we don't have a clean reproducer yet.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Without an OpenStack where this issue is reproducible, we've gotten as far as possible towards resolution with the commit of PR 7300. We have confirmed that the juju provider will correctly find "No valid host" errors and try another AZ if that error is hit.

Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
milestone: none → 2.2-beta4
status: Incomplete → In Progress
status: In Progress → Fix Committed
Revision history for this message
Matt Jarvis (matt-jarvis) wrote :

I have a cloud on which I can reproduce this. Are there nightly builds of Juju I can install? I did try building it but had some issues with dependencies. I am at ODS in Boston until next week but can test when I get back.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

@matt-jarvis, wonderful! I believe the 'edge' channel of the juju snap is a nightly build.

Revision history for this message
Matt Jarvis (matt-jarvis) wrote :

@hmlanigan - so with juju 2.2-rc1 from the snap, the availability zone problem seems to be fixed:

16:01:16 INFO juju.provider.openstack provider.go:1141 trying to build instance in availability zone "Maintenance"
16:01:17 INFO juju.provider.openstack provider.go:1121 Instance "8dd7580e-6d03-48bd-8ec2-84c86b767450" in ERROR state with fault "No valid host was found. There are not enough hosts available."
16:01:17 INFO juju.provider.openstack provider.go:1122 Deleting instance "8dd7580e-6d03-48bd-8ec2-84c86b767450" in ERROR state
16:01:18 INFO juju.provider.openstack provider.go:1150 failed to build instance in availability zone "Maintenance"
16:01:18 INFO juju.provider.openstack provider.go:1141 trying to build instance in availability zone "Production"

However, I now have an issue with assigning a floating IP:

16:01:39 DEBUG juju.provider.openstack provider.go:1188 allocating public IP address for openstack node
16:01:40 ERROR juju.cmd.juju.commands bootstrap.go:492 failed to bootstrap model: cannot start bootstrap instance: cannot allocate a public IP as needed: could not find an external network in availablity zone
16:01:40 DEBUG juju.cmd.juju.commands bootstrap.go:493 (error details: [{github.com/juju/juju/cmd/juju/commands/bootstrap.go:584: failed to bootstrap model} {github.com/juju/juju/provider/common/bootstrap.go:50: } {github.com/juju/juju/provider/common/bootstrap.go:185: cannot start bootstrap instance} {github.com/juju/juju/provider/openstack/provider.go:1190: cannot allocate a public IP as needed} {github.com/juju/juju/provider/openstack/networking.go:168: could not find an external network in availablity zone}])

There is definitely an external network defined in that tenant, and it has floating IPs available. I will dig into the code and see if I can work out what it's looking for at that point.

Revision history for this message
Matt Jarvis (matt-jarvis) wrote :

@hmlanigan I suspect from a brief look at the code that this may also be availability zone related, although I haven't had a chance to dig into it

matt@frankenpad:~/openstack$ neutron availability-zone-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
404 Not Found

The resource could not be found.

Neutron server returns request_ids: ['req-c12d2048-83d9-4ffa-8261-1269c6a8beac']

matt@frankenpad:~/openstack$ neutron net-external-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------+---------------------------------------+
| id | name | subnets |
+--------------------------------------+----------+---------------------------------------+
| 6751cb30-0aef-4d7e-94c3-ee2a09e705eb | external | 2af591ca-48ac-42b7-afc6-e691b3aa4c8a |
| | | b839c2c8-94b9-4445-858d-1800b5fe3bbb |
| | | e71eb7f6-400b-4d1f-a65b-9315ade67fe7 |
+--------------------------------------+----------+---------------------------------------+

There is definitely an external network there as you can see.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

@matt-jarvis, sounds like you're hitting https://bugs.launchpad.net/juju/+bug/1689683

Revision history for this message
Matt Jarvis (matt-jarvis) wrote :

Yup, sounds like it.

Changed in juju:
status: Fix Committed → Fix Released