failed to deploy bundle with "suitable availability zone for machine <num> not found"

Bug #1860083 reported by Yoshi Kadokawa on 2020-01-17
This bug affects 9 people

Affects: juju
Importance: High
Assigned to: Harry Pidcock

Bug Description

It looks like when the machines list is not ordered by zone, Juju fails to deploy the machines with the error message "suitable availability zone for machine <num> not found".

It is easily reproducible on aws.
The following bundle will fail on machine '2'.

series: bionic
machines:
  "0":
    constraints: zones=ap-northeast-1a
  "1":
    constraints: zones=ap-northeast-1d
  "2":
    constraints: zones=ap-northeast-1a
applications:
  ubuntu1-1a:
    charm: cs:ubuntu
    num_units: 1
    to:
    - '0'
  ubuntu2-1d:
    charm: cs:ubuntu
    num_units: 1
    to:
    - '1'
  ubuntu3-1a:
    charm: cs:ubuntu
    num_units: 1
    to:
    - '2'

However, this bundle will succeed.

series: bionic
machines:
  "0":
    constraints: zones=ap-northeast-1a
  "1":
    constraints: zones=ap-northeast-1a
  "2":
    constraints: zones=ap-northeast-1d
applications:
  ubuntu1-1a:
    charm: cs:ubuntu
    num_units: 1
    to:
    - '0'
  ubuntu2-1d:
    charm: cs:ubuntu
    num_units: 1
    to:
    - '2'
  ubuntu3-1a:
    charm: cs:ubuntu
    num_units: 1
    to:
    - '1'

Is there actually a limitation on how machines can be ordered in the machines list?

Pedro Guimarães (pguimaraes) wrote :

Hi, we had a similar issue when deploying on top of vSphere.
The "zones" constraints we applied were totally ignored.
The only way to have it correctly deployed on that provider was to use:

juju add-machine zone=ZONE_!

Mark Maglana (mmaglana) wrote :

I'm still going through the logs for any hints but it would be good to get some guidance on what I should look out for.

Richard Harding (rharding) wrote :

I don't know that there's anything to this other than it looks like a bug in our resolution of zones as we process the bundle. We'll have to take a peek at it and make sure to correct it.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Nobuto Murata (nobuto) wrote :

I'm collecting more test cases in:
https://git.launchpad.net/~nobuto/+git/juju-testcase-zones/tree/
as it's getting harder and harder to avoid this issue in the field deployments.

Nobuto Murata (nobuto) wrote :

Subscribing ~field-high.

The current workaround is reordering machines in a bundle or to remove some machines and units from a bundle until Juju succeeds. But it's not straightforward to come up with a workaround, and it's really hard to debug which part of a bundle is "wrong" to the current build of Juju.
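As an illustration of the reordering workaround, here is a minimal Python sketch (the helper names `zone_of` and `regroup_by_zone` are hypothetical, not part of Juju) that renumbers a bundle's machines so that machines sharing a zone constraint end up adjacent:

```python
# Sketch of the reordering workaround: renumber a bundle's machines so that
# machines with the same "zones=" constraint appear consecutively.
# Helper names here are illustrative only, not part of Juju.

def zone_of(constraints: str) -> str:
    """Extract the zone name from a constraints string like 'zones=ap-northeast-1a'."""
    for part in constraints.split():
        if part.startswith("zones="):
            return part.split("=", 1)[1]
    return ""

def regroup_by_zone(machines: dict) -> dict:
    """Return a machines mapping renumbered so same-zone machines are adjacent."""
    # Stable sort keeps the original relative order within each zone.
    ordered = sorted(machines.values(), key=lambda m: zone_of(m.get("constraints", "")))
    return {str(i): m for i, m in enumerate(ordered)}

# The failing bundle's machines section from the bug description:
machines = {
    "0": {"constraints": "zones=ap-northeast-1a"},
    "1": {"constraints": "zones=ap-northeast-1d"},
    "2": {"constraints": "zones=ap-northeast-1a"},
}
print(regroup_by_zone(machines))
```

Note that renumbering machines this way also requires updating every `to:` placement directive in the applications section to match the new machine numbers.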

We have created 4 test cases in total (two cases succeed, and another two cases fail with Juju 2.7.2), and we expect all of 4 cases should pass for reliable deployments in the field.
https://git.launchpad.net/~nobuto/+git/juju-testcase-zones/tree/

For Charmed Kubernetes deployments on top of clouds (mainly OpenStack), it is critical to ensure rack-level and AZ-level redundancy for etcd and kubernetes-master units for high availability. It's not a confirmed theory, but the issue seems reproducible on other cloud providers such as AWS (see the bug description), so it doesn't look like a single-provider issue.

I have an OpenStack testbed, so let me know if you need further logs and how to increase the log level for this issue. There are some logs in the link already, though.

Tim Penhey (thumper) on 2020-02-20
Changed in juju:
importance: Medium → High
milestone: none → 2.7.4
Ian Booth (wallyworld) on 2020-02-20
Changed in juju:
assignee: nobody → Harry Pidcock (hpidcock)
Harry Pidcock (hpidcock) wrote :

One issue I've identified is that apne1-az3 (ap-northeast-1a for our AWS account) appears to be a small AZ, has reduced capacity, or its hardware is being decommissioned. It returns `Your requested instance type (c4.large) is not supported in your requested Availability Zone (ap-northeast-1a)` even though the AZ nominally supports that instance type. A quick search suggests AWS returns this error when there is no capacity available for that instance type.

Can you please check what az-id your ap-northeast-1a is?

Nobuto Murata (nobuto) wrote :

@Harry,

Here is my mapping:
AZ Name         | AZ ID
ap-northeast-1a | apne1-az4
ap-northeast-1c | apne1-az1
ap-northeast-1d | apne1-az2

And the issue is reproducible with this bundle (the same as the one in the bug description):
https://git.launchpad.net/~nobuto/+git/juju-testcase-zones/tree/reproducible-aws.yaml

Nobuto Murata (nobuto) wrote :

$ juju machines
Machine State DNS Inst id Series AZ Message
0 started 52.198.87.176 i-01883e9bb48c46d6d bionic ap-northeast-1a running
1 started 3.112.110.182 i-00ad81037b3090e53 bionic ap-northeast-1d running
2 down pending bionic suitable availability zone for machine 2 not found

Harry Pidcock (hpidcock) wrote :

I think this is a 3-in-1 bug (at least through my testing).

But I believe I've found the main culprit: we don't handle constraints in AZ retries.
I'm working on a fix now. I've also added additional logging that should highlight retries in the logs.

The other two issues are:
- apne1-az3 capacity/unsupported instance types. Nothing I can do about this.
- We select instance types that match constraints, but we use AWS pricing data, which is only granular down to region, not AZ. I've fixed this already; we now only select instance types (for AWS) that are supported across all AZs in a region (unless the user passes an instance-type constraint).
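The filtering described in that last point can be sketched as a set intersection. This is an illustrative Python sketch, not Juju's actual implementation (Juju is written in Go), and the offerings data below is made up for the example:

```python
# Illustrative sketch of the fix described above: keep only the instance
# types offered in EVERY AZ of a region, so that a retry in a different AZ
# cannot fail with an unsupported instance type. Not Juju's actual code.

def types_supported_everywhere(offerings: dict) -> set:
    """offerings maps AZ name -> set of instance types offered in that AZ."""
    if not offerings:
        return set()
    per_az = iter(offerings.values())
    common = set(next(per_az))
    for types in per_az:
        common &= types  # intersect: drop types missing from any AZ
    return common

# Hypothetical per-AZ offerings (values invented for illustration):
offerings = {
    "ap-northeast-1a": {"t3.large", "m5.large"},                # no c4.large here
    "ap-northeast-1c": {"t3.large", "m5.large", "c4.large"},
    "ap-northeast-1d": {"t3.large", "m5.large", "c4.large"},
}
print(sorted(types_supported_everywhere(offerings)))  # types common to all AZs
```

With this policy, c4.large would be excluded from automatic selection in the example region because one AZ does not offer it, unless the user explicitly asks for it via an instance-type constraint.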

Harry Pidcock (hpidcock) on 2020-02-27
Changed in juju:
status: Triaged → Fix Committed
Nobuto Murata (nobuto) wrote :

@Harry,

I tried to verify the fix in pull request #11239, but the issue is still reproducible for me. Am I missing something? I put my notes here, FWIW:
https://git.launchpad.net/~nobuto/+git/juju-testcase-zones/tree/validation-pull-11239-failed.md

Harry Pidcock (hpidcock) wrote :

@Nobuto

Thank-you for the follow-up.

Can you please verify what instance type was selected?
Also, does your region have a default VPC?

Nobuto Murata (nobuto) wrote :

> Can you please verify what instance type was selected?

It's c4.large.

juju-default-machine-1 i-087da7ab2ee2843e2 c4.large ap-northeast-1d running
juju-controller-machine-0 i-08f44586efaf16a6d c4.large ap-northeast-1a running
juju-default-machine-0 i-0d075da9dbb188ac3 c4.large ap-northeast-1a running

> Also, does your region have a default VPC?

I believe so. I got this from the dashboard, FWIW.

Subnet ID       | State     | VPC          | IPv4 CIDR      | Available IPv4 Addresses | IPv6 CIDR | Availability Zone | Availability Zone ID
subnet-2fc58974 | available | vpc-425d0d25 | 172.31.0.0/20  | 4091                     | -         | ap-northeast-1c   | apne1-az1
subnet-a4b1a3ed | available | vpc-425d0d25 | 172.31.32.0/20 | 4089                     | -         | ap-northeast-1a   | apne1-az4
subnet-d145a0fa | available | vpc-425d0d25 | 172.31.16.0/20 | 4090                     | -         | ap-northeast-1d   | apne1-az2

Changed in juju:
milestone: 2.7.4 → 2.8-beta1
Nobuto Murata (nobuto) wrote :

I've tested a build with the patch set in pull request #11296. It works as expected both on AWS with 2 test cases and on OpenStack with 4 test cases. Thanks for the great work there and looking forward to getting those fixes in the coming stable builds.

https://git.launchpad.net/~nobuto/+git/juju-testcase-zones/tree/validation-pull-11296-aws.md
https://git.launchpad.net/~nobuto/+git/juju-testcase-zones/tree/validation-pull-11296-openstack.md

Nobuto Murata (nobuto) wrote :

Escalating this to ~field-critical on behalf of @pguimaraes.

The fix[1] was merged into the 2.7 branch once, but I understand there were some urgent bugs and the branch had to be cut for the 2.7.4 release without the fix.

Pedro needs this fix to unblock his project, but the snap builds with the fix have disappeared from all snap channels because of the rebase that happened on the 2.7 branch. We would like to request this fix as part of 2.7.5 right after 2.7.4, or as part of 2.7.4 itself if that's possible.

[1] https://github.com/juju/juju/commit/e2fd905ea6c8d5c9a4b1b6c10837a4b97a8dcd52

Ian Booth (wallyworld) wrote :

The commit landed in the 2.7 branch so the change of milestone to 2.8-beta1 seems like a mistake.

The fix won't make 2.7.4, as that's already gone to Solutions QA for testing and is due to be released tomorrow. It would take another day to retool things after that, and by then we're entering the weekend, which would delay 2.7.4 till next week.

But it will be in 2.7.5. It will hit the 2.7.5 edge snap as soon as 2.7.4 goes out (hopefully within a day). Would that be sufficient to unblock things pending a formal 2.7.5 release?

Changed in juju:
milestone: 2.8-beta1 → 2.7.5
Richard Harding (rharding) wrote :

Yes, the move to 2.8-beta was a mistake; I was moving bugs to 2.7.5 to set up the 2.7.4 hotfix. Apologies for the confusion.

Nobuto Murata (nobuto) wrote :

> The fix won't make 2.7.4 as that's already gone to Solutions QA for testing and is due to be released tomorrow. It would take another day to retool things after that and by then we're entering the timeframe of the weekend and that would delay 2.7.4 till next week.
>
> But it will be in 2.7.5. It will hit the 2.7.5 edge snap as soon as 2.7.4 goes out (hopefully within a day). Would that be sufficient to unblock things pending a formal 2.7.5 release?

It sounds reasonable to me. I will leave it to @pguimaraes for further comments since his project is the most urgent one needing the patch.

Changed in juju:
status: Fix Committed → Fix Released