retry-provisioning doesn't retry failed deployments on MAAS

Bug #1645422 reported by Adam Collard on 2016-11-28
This bug affects 12 people
Affects: juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

Using MAAS 2.1.2 (bzr 5555) and Juju 2.0.1:

I tried deploying 6 units of Ubuntu, each with an LXD container also running Ubuntu. Two of the machines failed to deploy (because of bug 1635560, but that is not important here; just note that the failure is transient). When I tried retry-provisioning, nothing happened.

⟫ juju status
Model Controller Cloud/Region Version
default hare hare 2.0.1

App Version Status Scale Charm Store Rev OS Notes
ubuntu 16.04 waiting 8/12 ubuntu jujucharms 8 ubuntu

Unit Workload Agent Machine Public address Ports Message
ubuntu/0 active idle 0 10.2.0.54 ready
ubuntu/1* active idle 1 10.2.0.55 ready
ubuntu/2 active idle 2 10.2.0.56 ready
ubuntu/3 active idle 3 10.2.0.57 ready
ubuntu/4 waiting allocating 4 10.2.0.52 waiting for machine
ubuntu/5 waiting allocating 5 10.2.0.53 waiting for machine
ubuntu/6 active idle 0/lxd/0 10.2.0.61 ready
ubuntu/7 active idle 1/lxd/0 10.2.0.58 ready
ubuntu/8 active idle 2/lxd/0 10.2.0.60 ready
ubuntu/9 active idle 3/lxd/0 10.2.0.59 ready
ubuntu/10 waiting allocating 4/lxd/0 waiting for machine
ubuntu/11 waiting allocating 5/lxd/0 waiting for machine

Machine State DNS Inst id Series AZ
0 started 10.2.0.54 4y3hbp xenial Raphael
0/lxd/0 started 10.2.0.61 juju-d0b4d0-0-lxd-0 xenial
1 started 10.2.0.55 4y3hbq xenial default
1/lxd/0 started 10.2.0.58 juju-d0b4d0-1-lxd-0 xenial
2 started 10.2.0.56 abnf8x xenial Raphael
2/lxd/0 started 10.2.0.60 juju-d0b4d0-2-lxd-0 xenial
3 started 10.2.0.57 x7nfeg xenial default
3/lxd/0 started 10.2.0.59 juju-d0b4d0-3-lxd-0 xenial
4 down 10.2.0.52 4y3h7x xenial Raphael
4/lxd/0 pending pending xenial
5 down 10.2.0.53 4y3h7y xenial default
5/lxd/0 pending pending xenial

⟫ juju retry-provisioning 5 --debug
18:07:46 INFO juju.cmd supercommand.go:63 running juju [2.0.1 gc go1.6.2]
18:07:46 DEBUG juju.cmd supercommand.go:64 args: []string{"juju", "retry-provisioning", "5", "--debug"}
18:07:46 INFO juju.juju api.go:72 connecting to API addresses: [10.2.0.51:17070]
18:07:46 INFO juju.api apiclient.go:530 dialing "wss://10.2.0.51:17070/model/5a113b53-5bf4-42cd-8d8f-4dd933d0b4d0/api"
18:07:47 INFO juju.api apiclient.go:466 connection established to "wss://10.2.0.51:17070/model/5a113b53-5bf4-42cd-8d8f-4dd933d0b4d0/api"
18:07:47 DEBUG juju.juju api.go:263 API hostnames unchanged - not resolving
18:07:47 INFO cmd supercommand.go:465 command finished

Changed in juju:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.1.0
Curtis Hovey (sinzui) wrote :

Why can't Juju automatically retry provisioning? In many cases it knows that provisioning failed. Juju retries hooks automatically now, so users rarely need to retry those manually.

Curtis Hovey (sinzui) on 2016-12-01
tags: added: maas-provider retry-privisioning
Changed in juju:
importance: Critical → High
Anastasia (anastasia-macmood) wrote :

Removing 2.1 milestone as we will not be addressing this issue in 2.1.

tags: added: retry-provisioning
removed: retry-privisioning
Changed in juju:
milestone: 2.1-rc2 → none
Sandor Zeestraten (szeestraten) wrote :

I hit this today on Juju 2.1.1 and MAAS 2.1.3.
retry-provisioning does nothing and the machine is just down/pending.

John A Meinel (jameinel) wrote :

I believe the underlying issue is that MAAS has handed us an 'instance-id', which means we think we have a concrete instance that is running. That is different from failing to get an instance at all. It's possible that retry-provisioning handles the latter but not the former.

John A Meinel (jameinel) wrote :

I should also note that MAAS doesn't hand back 'an instance for the request you made' in the abstract; it always hands back an exact identifier for a specific machine. So we have to be a bit careful that 'retry-provisioning' properly decommissions the existing instance ID, and that it can cope with being handed back exactly the same instance ID a second time, this time meaning that it is trying again.
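
To make the needed behaviour concrete, here is a minimal sketch of such a retry path. This is not Juju's actual provisioner code; the MaasClient interface and its method names are invented for illustration, under the assumption that retrying on MAAS means releasing the old allocation and tolerating the same system_id coming back.

package provisioner

import "fmt"

// MaasClient is a hypothetical, heavily simplified view of the MAAS API,
// used only to illustrate the retry flow described in the comments above;
// it is not Juju's real provider interface.
type MaasClient interface {
	Release(systemID string) error               // hand the failed node back to the pool
	Allocate(constraints string) (string, error) // returns a concrete system_id
	Deploy(systemID string) error
}

// RetryProvisioning sketches what retry-provisioning would have to do on
// MAAS: the machine already has a concrete instance ID, so the old
// allocation must be released first, and the very same system_id may
// legitimately be handed back on the next allocation.
func RetryProvisioning(maas MaasClient, oldSystemID, constraints string) (string, error) {
	if oldSystemID != "" {
		// We already hold an instance ID, so simply "retrying" is a no-op
		// unless the node is given back to MAAS first.
		if err := maas.Release(oldSystemID); err != nil {
			return "", fmt.Errorf("releasing %s: %v", oldSystemID, err)
		}
	}
	newSystemID, err := maas.Allocate(constraints)
	if err != nil {
		return "", err
	}
	// MAAS may return exactly the same system_id; that is fine, it now
	// means "try deploying this node again", not "already provisioned".
	if err := maas.Deploy(newSystemID); err != nil {
		return "", err
	}
	return newSystemID, nil
}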

tags: added: 4010
tags: added: cdo-qa foundation-engine
tags: added: foundations-engine
removed: foundation-engine
tags: removed: foundations-engine
Dmitrii Shcherbakov (dmitriis) wrote :

The same applies to tags updated after 'juju deploy'.

In my view, retry-provisioning should re-query machine metadata when explicitly asked to. It is a manual action, so the user presumably knows what they are doing.

Instead, one currently has to remove-machine --force and add-unit again.
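
For reference, the manual workaround described above, using the machine number and application name from the status output in this report, looks roughly like this:

⟫ juju remove-machine 5 --force
⟫ juju add-unit ubuntu

Both commands exist in Juju 2.x; the point of this bug is that the retry-provisioning path should make them unnecessary.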

tags: added: cpe-onsite

FWIW, I think the internal issue is that MAAS has already given us an instance-id, so we think the machine is provisioned. For other providers, 'juju retry-provisioning' probably does do some of what you want, but only when an instance hasn't yet been assigned.

Dmitrii Shcherbakov (dmitriis) wrote :

I think we just need to define what it means to "provision" better.

Conceptually, I would use the following definition:

provisioning = <matching a machine by constraints & other criteria> + <successfully deploying once and installing a machine agent>

At least for MAAS it is intuitive in my view.

If I have to reconfigure a machine, doing retry-provisioning also makes sense but with the following logic:

1. a machine ID is obtained;
2. the deployment fails, either automatically or via a manual action, before the machine/unit agents have started;
3. the user releases the machine in MAAS;
4. the user reconfigures the machine, swaps out hardware, etc.;
5. a manual retry-provisioning detects that the given ID is no longer allocated and tries to allocate a new one.

The target idea here is that one could write an orchestrator/automation that talks to Juju, checks whether a deployment has failed, queries MAAS to determine whether the failure is recoverable, and runs retry-provisioning without affecting the Juju model unit-wise or application-wise.

If a node is not suitable, the orchestrator would mark it as broken in MAAS and a different node would be picked, without the remove-machine --force && add-unit steps.
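
A rough sketch of what such an orchestrator loop could look like follows. The JujuAPI and MaasAPI interfaces and all of their methods are invented for illustration and do not correspond to real client libraries; the sketch only shows the control flow being argued for.

package orchestrator

// JujuAPI and MaasAPI are hypothetical interfaces used only to sketch the
// orchestration idea above; they are not real client libraries.
type JujuAPI interface {
	FailedMachines() ([]string, error)           // machine IDs stuck in down/pending
	InstanceID(machineID string) (string, error) // MAAS system_id backing the machine
	RetryProvisioning(machineID string) error
}

type MaasAPI interface {
	IsRecoverable(systemID string) (bool, error) // e.g. the deploy failure was transient
	MarkBroken(systemID string) error
}

// Reconcile walks failed machines and either retries provisioning or first
// marks the node broken in MAAS so a different node gets picked, without
// touching units or applications in the Juju model.
func Reconcile(juju JujuAPI, maas MaasAPI) error {
	machines, err := juju.FailedMachines()
	if err != nil {
		return err
	}
	for _, m := range machines {
		sysID, err := juju.InstanceID(m)
		if err != nil {
			return err
		}
		recoverable, err := maas.IsRecoverable(sysID)
		if err != nil {
			return err
		}
		if !recoverable {
			// Take the bad node out of the pool so a retry allocates a
			// different one.
			if err := maas.MarkBroken(sysID); err != nil {
				return err
			}
		}
		if err := juju.RetryProvisioning(m); err != nil {
			return err
		}
	}
	return nil
}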

tags: added: canonical-bootstack