retry-provisioning doesn't retry failed deployments on MAAS

Bug #1645422 reported by Adam Collard
This bug affects 22 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

Using MAAS 2.1.2 (bzr 5555) and Juju 2.0.1:

I tried deploying 6 units of Ubuntu, each with an LXD container also running Ubuntu. Two of the machines failed to deploy (because of bug 1635560, but that is not important here; just note that it is a transient failure). When I tried retry-provisioning, nothing happened.

⟫ juju status
Model Controller Cloud/Region Version
default hare hare 2.0.1

App Version Status Scale Charm Store Rev OS Notes
ubuntu 16.04 waiting 8/12 ubuntu jujucharms 8 ubuntu

Unit Workload Agent Machine Public address Ports Message
ubuntu/0 active idle 0 10.2.0.54 ready
ubuntu/1* active idle 1 10.2.0.55 ready
ubuntu/2 active idle 2 10.2.0.56 ready
ubuntu/3 active idle 3 10.2.0.57 ready
ubuntu/4 waiting allocating 4 10.2.0.52 waiting for machine
ubuntu/5 waiting allocating 5 10.2.0.53 waiting for machine
ubuntu/6 active idle 0/lxd/0 10.2.0.61 ready
ubuntu/7 active idle 1/lxd/0 10.2.0.58 ready
ubuntu/8 active idle 2/lxd/0 10.2.0.60 ready
ubuntu/9 active idle 3/lxd/0 10.2.0.59 ready
ubuntu/10 waiting allocating 4/lxd/0 waiting for machine
ubuntu/11 waiting allocating 5/lxd/0 waiting for machine

Machine State DNS Inst id Series AZ
0 started 10.2.0.54 4y3hbp xenial Raphael
0/lxd/0 started 10.2.0.61 juju-d0b4d0-0-lxd-0 xenial
1 started 10.2.0.55 4y3hbq xenial default
1/lxd/0 started 10.2.0.58 juju-d0b4d0-1-lxd-0 xenial
2 started 10.2.0.56 abnf8x xenial Raphael
2/lxd/0 started 10.2.0.60 juju-d0b4d0-2-lxd-0 xenial
3 started 10.2.0.57 x7nfeg xenial default
3/lxd/0 started 10.2.0.59 juju-d0b4d0-3-lxd-0 xenial
4 down 10.2.0.52 4y3h7x xenial Raphael
4/lxd/0 pending pending xenial
5 down 10.2.0.53 4y3h7y xenial default
5/lxd/0 pending pending xenial

⟫ juju retry-provisioning 5 --debug
18:07:46 INFO juju.cmd supercommand.go:63 running juju [2.0.1 gc go1.6.2]
18:07:46 DEBUG juju.cmd supercommand.go:64 args: []string{"juju", "retry-provisioning", "5", "--debug"}
18:07:46 INFO juju.juju api.go:72 connecting to API addresses: [10.2.0.51:17070]
18:07:46 INFO juju.api apiclient.go:530 dialing "wss://10.2.0.51:17070/model/5a113b53-5bf4-42cd-8d8f-4dd933d0b4d0/api"
18:07:47 INFO juju.api apiclient.go:466 connection established to "wss://10.2.0.51:17070/model/5a113b53-5bf4-42cd-8d8f-4dd933d0b4d0/api"
18:07:47 DEBUG juju.juju api.go:263 API hostnames unchanged - not resolving
18:07:47 INFO cmd supercommand.go:465 command finished

Changed in juju:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.1.0
Revision history for this message
Curtis Hovey (sinzui) wrote :

Why can't Juju automatically retry provisioning? It knows about many of the cases where provisioning failed. Juju now retries hooks automatically; users rarely need to retry those by hand.

Curtis Hovey (sinzui)
tags: added: maas-provider retry-privisioning
Changed in juju:
importance: Critical → High
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Removing 2.1 milestone as we will not be addressing this issue in 2.1.

tags: added: retry-provisioning
removed: retry-privisioning
Changed in juju:
milestone: 2.1-rc2 → none
Revision history for this message
Sandor Zeestraten (szeestraten) wrote :

I hit this today on Juju 2.1.1 and MAAS 2.1.3.
retry-provisioning does nothing and the machine is just down/pending.

Revision history for this message
John A Meinel (jameinel) wrote :

I believe the underlying issue is that MAAS has handed us an 'instance-id', which I think means that we think we have a concrete instance that is running. That is different from failing to get an instance at all. It's possible retry-provisioning handles the latter but not the former.

Revision history for this message
John A Meinel (jameinel) wrote :

I should also note that MAAS doesn't hand back 'an instance for the request you made'; it always hands back an exact identifier for a specific machine. So we have to be a bit careful that 'retry-provisioning' properly decommissions the existing instance ID, and can cope with being handed back exactly the same instance ID a second time, this time meaning that the deployment is being tried again.
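A quick way to check this hypothesis on an affected model is to look at whether Juju has already recorded an instance ID for the down machine. A minimal sketch, using machine 5 from the report above (output fields vary slightly between Juju versions):

⟫ juju show-machine 5 --format yaml
# If the output already lists an instance-id (4y3h7y in the report above)
# together with a "down" machine state, Juju believes it holds a concrete,
# running instance, which is the case retry-provisioning appears not to handle.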

tags: added: 4010
tags: added: cdo-qa foundation-engine
tags: added: foundations-engine
removed: foundation-engine
tags: removed: foundations-engine
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The same applies for machine tags updated after 'juju deploy'.

In my view, retry-provisioning should re-query the machine metadata when asked to. It is a manual action, so the user presumably knows what they are doing.

Instead, one currently has to remove-machine --force and add-unit again.
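For reference, a minimal sketch of that workaround, assuming machine 5 hosts a unit of the ubuntu application as in the original report; the names are illustrative:

# Drop the stuck machine from the model, then let Juju allocate a fresh node.
⟫ juju remove-machine 5 --force
⟫ juju add-unit ubuntu
# The replacement unit lands on a newly acquired machine, but the original
# machine number and any placement that referenced it are lost.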

tags: added: cpe-onsite
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1645422] Re: retry-provisioning doesn't retry failed deployments on MAAS

FWIW, I think the internal issue is that MAAS has already given us an instance-id, so we think the machine is provisioned. Normally, for other providers, 'juju retry-provisioning' probably does some of what you want, but only when an instance hasn't yet been assigned.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I think we just need to better define what it means to "provision".

Conceptually, I would use the following definition:

provisioning = <matching a machine by constraints & other criteria> + <successfully deploying once and installing a machine agent>

At least for MAAS, that is intuitive in my view.

If I have to reconfigure a machine, retry-provisioning also makes sense, but with the following logic:

1. Juju gets a machine ID;
2. the deployment fails, either automatically or via a manual action, before the machine/unit agents have started;
3. the user releases the machine in MAAS;
4. the user reconfigures the machine, swaps out hardware, etc.;
5. a manual retry-provisioning detects that the given ID is no longer allocated and tries to allocate a new one.

The idea here is that one could write an orchestrator/automation that talks to Juju, sees whether a deployment has failed, checks MAAS to determine whether the failure is recoverable, and then runs retry-provisioning without affecting the Juju model unit-wise or application-wise.

If a node is not suitable, the orchestrator would mark it as broken in MAAS and a different node would be picked, without the remove-machine --force && add-unit steps.
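As a rough, CLI-driven sketch of the flow proposed here (not something the current retry-provisioning supports; the MAAS profile name 'admin' and the system ID are illustrative):

# 1. Spot machines stuck in a provisioning error.
⟫ juju status --format yaml | grep -B 2 'Failed deployment'
# 2. If the node is unsuitable, mark it broken in MAAS so it is not re-picked.
⟫ maas admin machine mark-broken 4y3h7y
# 3. The missing piece: have Juju retry into a different node without touching
#    units or applications.
⟫ juju retry-provisioning 5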

tags: added: canonical-bootstack
Revision history for this message
Frode Nordahl (fnordahl) wrote :

This is still an issue with Juju 2.7.8 and MAAS 2.8.2.

My occurrence is a transient MAAS failed deployment because of *reasons*, and I want Juju to retry so that I can get a working machine.

I see from the bug discussion history that there is some disagreement about what retry-provisioning means or does. I'll add my expectation to the scale: I expected it to mean that Juju could reuse the machine slot it has in its model and either fill it with a new instance, or reach out to MAAS and do a release+deploy dance with the instance ID it already has.

Right now nothing happens and there is zero feedback to the user.

Frode Nordahl (fnordahl)
tags: added: ps5
Revision history for this message
Frode Nordahl (fnordahl) wrote :

Typo in comment #9: the Juju version is 2.8.7, not 2.7.8.

Revision history for this message
Pen Gale (pengale) wrote :

Moving importance to Medium to accurately reflect that this is a legitimate issue, but one that is not in scope for the current roadmap.

(I agree that it would be very nice to fix.)

Changed in juju:
importance: High → Medium
Revision history for this message
Boris Lukashev (rageltman) wrote :

This is a legitimate issue for us as well (currently contributing to a descent into madness). Without this, failed nodes get renumbered, and targeted placements of units aimed at those nodes go wonky despite the --map-machines flag on iterative overlays (up to 5 here, for OpenStack HA with Vault and a bunch of other things).
The problem also exists with LXD containers: Juju has no way to retry those either, and that is entirely within its scope of control.

Revision history for this message
Simon Déziel (sdeziel) wrote :

I can confirm this with juju 2.9.11 interacting with MAAS 3.1.0~alpha1.

Simon Déziel (sdeziel)
tags: added: lxd-cloud
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot
Revision history for this message
Adam Vest (foxmulder2004) wrote :

I can confirm that this is still an issue with Juju 2.9.45 and MAAS 3.3.4. Given that this bug was reported seven years ago, is the retry-provisioning option known to work on ANY cloud? Should consideration be given to removing this feature from Juju entirely?

Revision history for this message
Boris Lukashev (rageltman) wrote :

... having chewed my way out of the straitjacket, I find myself back at this issue on current/stable revisions of Juju and MAAS. Frustrating, to say the least.

Revision history for this message
Ian Booth (wallyworld) wrote :

retry-provisioning works when the cloud reports an error that Juju considers a provisioning error. For MAAS, Juju currently can only look at the error string returned when something fails; when the error is reported as "failed deployment", Juju considers that machine eligible for retry.

The code which handles MAAS errors is several years old; it is apparent that it needs updating.
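One way to see the exact string Juju has to work with, assuming the eligibility check really is based on the machine-status message as described above:

⟫ juju status --format yaml | grep -A 2 'machine-status:'
#   current: provisioning error
#   message: 'Failed deployment: Performing PXE boot'
# Per the comment above, Juju keys off a "failed deployment" error in this
# message when deciding whether the machine is eligible for retry.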

Changed in juju:
importance: Low → High
tags: removed: expirebugs-bot
Revision history for this message
Ian Booth (wallyworld) wrote :

In your most recent case, can you share the relevant juju log lines where the actual maas error is logged so we can see what it is?
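A minimal sketch of how those lines could be pulled from the controller; the logger name juju.provisioner is an assumption and may differ between versions:

# Raise provisioner logging on the model, then replay the log.
⟫ juju model-config logging-config="<root>=INFO;juju.provisioner=DEBUG"
⟫ juju debug-log --replay --include-module juju.provisioner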

Revision history for this message
Federico Bosi (rhxto) wrote :

It still happens with MAAS 3.5.
My machines look like this when the TFTP connection breaks while downloading the initrd:
2 down my:ipv6::address aldo ubuntu@22.04 gen10 Failed deployment: Performing PXE boot

juju retry-provisioning 2 does nothing.

Adding the --debug and --verbose flags just shows it connecting to the controller and disconnecting:
juju retry-provisioning 2 --debug --verbose
INFO juju.cmd supercommand.go:56 running juju [3.3.3.1 944e4076456009e6f220bb22c3e71e4ce2020c03 gc go1.21.0]
DEBUG juju.cmd supercommand.go:57 args: []string{"/root/dev/juju/_build/linux_amd64/bin/juju", "retry-provisioning", "2", "--debug", "--verbose"}
INFO juju.juju api.go:86 connecting to API addresses: [[controller-ipv6-1]:17070 [controller-ipv6-2]:17070 controller-ipv4:17070]
DEBUG juju.api apiclient.go:1172 successfully dialed "wss://controller-ipv4:17070/model/8c66a8a0-f08a-42aa-8f90-0e2e7b9ec519/api"
INFO juju.api apiclient.go:707 connection established to "wss://controller-ipv4:17070/model/8c66a8a0-f08a-42aa-8f90-0e2e7b9ec519/api"
DEBUG juju.api monitor.go:35 RPC connection died
INFO cmd supercommand.go:556 command finished

The version string is odd because this is a local build I made to fix another bug; the MAAS code is untouched.

Revision history for this message
Ian Booth (wallyworld) wrote :

Can we get the relevant juju log lines where the actual maas error is logged so we can see what it is?
Does juju status --format yaml show the full error message received from maas? Once we know that, we can figure out why juju doesn't consider the error retryable.

Revision history for this message
Federico Bosi (rhxto) wrote :

This is an example:
"1":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 28 Mar 2024 16:26:15Z
    dns-name: my:ipv6_1
    ip-addresses:
    - my:ipv6_1
    - my:ipv6_2
    instance-id: fs8dqm
    display-name: giovanni
    machine-status:
      current: provisioning error
      message: 'Failed deployment: Loading ephemeral'
      since: 28 Mar 2024 16:26:15Z
    modification-status:
      current: idle
      since: 28 Mar 2024 16:09:37Z
    base:
      name: ubuntu
      channel: "22.04"
    containers:
...
