MAAS node has "failed deployment", juju just says "pending" when using juju add-machine

Bug #1472711 reported by Andreas Hasenack
This bug affects 6 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Horacio Durán

Bug Description

Looks like https://bugs.launchpad.net/juju-core/+bug/1376246 wasn't fully fixed, or it regressed.

I have a case with maas 1.7.5+bzr3369-0ubuntu1~trusty1 and juju 1.24.1 (tools and client) where 3 maas nodes failed, yet juju just reports them as pending:
  "1":
    agent-state: pending
    dns-name: barley.scapestack
    instance-id: /MAAS/api/1.0/nodes/node-65d52b5c-546c-11e4-821d-2c59e54ace74/
    series: trusty
    hardware: arch=amd64 cpu-cores=4 mem=16384M
  "2":
    agent-state: pending
    dns-name: kyogre.scapestack
    instance-id: /MAAS/api/1.0/nodes/node-9f389a16-1b7a-11e5-846c-2c59e54ace74/
    series: trusty
    hardware: arch=amd64 cpu-cores=12 mem=16384M
  "3":
    agent-state: pending
    dns-name: bibarel.scapestack
    instance-id: /MAAS/api/1.0/nodes/node-2ff8a47e-1b7b-11e5-846c-2c59e54ace74/
    series: trusty
    hardware: arch=amd64 cpu-cores=12 mem=16384M

But in MAAS:
$ maas admin-atlas nodes list hostname=barley hostname=kyogre hostname=bibarel|grep status
        "status": 6,
        "substatus": 11,
        "status": 6,
        "substatus": 11,
        "status": 6,
        "substatus": 11,
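
The same check can be scripted rather than grepped. This is only a minimal sketch: it assumes the `nodes list` output is a JSON array of objects carrying `hostname`, `status` and `substatus` fields, and that substatus 11 is the failed-deployment code, as this report implies. The helper name and the sample data are hypothetical.

```python
import json

# Assumption: substatus 11 means "Failed deployment", as this report implies.
FAILED_DEPLOYMENT_SUBSTATUS = 11


def failed_hostnames(nodes_json):
    """Return the hostnames whose substatus equals the assumed failure code."""
    nodes = json.loads(nodes_json)
    return [
        node["hostname"]
        for node in nodes
        if node.get("substatus") == FAILED_DEPLOYMENT_SUBSTATUS
    ]


# Hypothetical, trimmed-down stand-in for the real `nodes list` output.
sample = json.dumps([
    {"hostname": "barley.scapestack", "status": 6, "substatus": 11},
    {"hostname": "example.scapestack", "status": 6, "substatus": 6},
])

print(failed_hostnames(sample))
```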

Full juju status and maas nodes list output attached.

Andreas Hasenack (ahasenack) wrote :
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.25.0
status: New → Triaged
importance: Undecided → High
Curtis Hovey (sinzui) wrote :

We can see the fix was delivered in https://github.com/juju/juju/commit/9a7f42182e9311cfa0775488468d80a416856deb . Maybe the fix is incomplete, or a more recent change broke the new behaviour.

tags: added: maas-provider
Jesse Meek (waigani) wrote :

gomaasapi relies on the 'deployment_status' method on the 'nodes' endpoint to determine the status of deployment. Could you please make a request to that endpoint and paste the results back here? https://maas.ubuntu.com/docs/api.html#nodes

Andreas Hasenack (ahasenack) wrote :

$ maas admin-atlas nodes deployment-status nodes=node-60c21846-546c-11e4-821d-2c59e54ace74 nodes=node-37170c68-546c-11e4-821d-2c59e54ace74
{
    "node-60c21846-546c-11e4-821d-2c59e54ace74": "Failed deployment",
    "node-37170c68-546c-11e4-821d-2c59e54ace74": "Deployed"
}

And that status is not reflected in juju:

$ juju status
environment: scapestack-trusty
machines:
  "0":
    agent-state: started
    agent-version: 1.24.2
    dns-name: correja.scapestack
    instance-id: /MAAS/api/1.0/nodes/node-37170c68-546c-11e4-821d-2c59e54ace74/
    series: trusty
    containers:
(...)
  "1":
    agent-state: pending
    dns-name: squier.scapestack
    instance-id: /MAAS/api/1.0/nodes/node-60c21846-546c-11e4-821d-2c59e54ace74/
    series: trusty
    hardware: arch=amd64 cpu-cores=4 mem=16384M
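
To make the mismatch explicit, the two outputs can be cross-checked. The following is a rough sketch, not anything Juju does itself: it assumes both outputs have already been loaded into dictionaries shaped like the data above, and the function name is hypothetical.

```python
import re

# Pulls the node system ID out of a Juju MAAS instance-id path.
NODE_ID_RE = re.compile(r"/nodes/(node-[0-9a-f-]+)/?$")


def mismatches(deployment_status, machines):
    """Return machines MAAS reports as failed while juju still says pending."""
    found = []
    for machine, info in sorted(machines.items()):
        match = NODE_ID_RE.search(info["instance-id"])
        if not match:
            continue
        node_id = match.group(1)
        maas_state = deployment_status.get(node_id)
        juju_state = info["agent-state"]
        if maas_state == "Failed deployment" and juju_state == "pending":
            found.append((machine, node_id, maas_state, juju_state))
    return found


# Values copied from the two outputs above.
deployment_status = {
    "node-60c21846-546c-11e4-821d-2c59e54ace74": "Failed deployment",
    "node-37170c68-546c-11e4-821d-2c59e54ace74": "Deployed",
}
machines = {
    "0": {
        "agent-state": "started",
        "instance-id": "/MAAS/api/1.0/nodes/node-37170c68-546c-11e4-821d-2c59e54ace74/",
    },
    "1": {
        "agent-state": "pending",
        "instance-id": "/MAAS/api/1.0/nodes/node-60c21846-546c-11e4-821d-2c59e54ace74/",
    },
}

for row in mismatches(deployment_status, machines):
    print(row)
```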

Andreas Hasenack (ahasenack) wrote :

With debug mode, so you can verify I'm making the call you requested, even though I'm using the maas cli:
$ maas admin-atlas nodes deployment-status nodes=node-60c21846-546c-11e4-821d-2c59e54ace74 nodes=node-37170c68-546c-11e4-821d-2c59e54ace74 -d
200 OK

   Content-Location: http://10.96.0.10/MAAS/api/1.0/nodes/?op=deployment_status&nodes=node-60c21846-546c-11e4-821d-2c59e54ace74&nodes=node-37170c68-546c-11e4-821d-2c59e54ace74
       Content-Type: application/json; charset=utf-8
               Date: Thu, 09 Jul 2015 13:35:29 GMT
             Server: Apache/2.4.7 (Ubuntu)
             Status: 200
  Transfer-Encoding: chunked
               Vary: Authorization,Cookie
    X-Frame-Options: SAMEORIGIN
    X-Maas-Api-Hash: 5b869d7d239dedd467c6481ff504b65f28577342

{
    "node-60c21846-546c-11e4-821d-2c59e54ace74": "Failed deployment",
    "node-37170c68-546c-11e4-821d-2c59e54ace74": "Deployed"
}

Jesse Meek (waigani) wrote :

Please ignore the last comment. The PR just modified the test service. In an effort to debug this I mocked out the gomaasapi testservice to return exactly the same json as above.

We call the deployment_status endpoint here: provider/maas/environ.go:1036 deploymentStatusCall
and interpret the results here: provider/maas/environ.go:1008 waitForNodeDeployment

In my testing, it correctly interpreted the first node to have failed. I am not able to produce a failing test. Are there any logs of Juju making the API call?

I'm unassigning myself as it's EOD and I'm off for a few days.

Andreas Hasenack (ahasenack) wrote :

Here are the Apache logs, starting from when I ran "juju add-machine" to select a node I had prepared to fail:

10.96.10.86 - - [10/Jul/2015:20:10:46 +0000] "GET /MAAS/api/1.0/nodes/?agent_name=4a4b73ac-1417-4b66-869e-9e335f194486&op=list HTTP/1.1" 200 844 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:10:48 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 200 825 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:10:49 +0000] "GET /MAAS/api/1.0/nodes/node-65d52b5c-546c-11e4-821d-2c59e54ace74/?op=details HTTP/1.1" 200 9149 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:10:49 +0000] "GET /MAAS/api/1.0/networks/?node=node-65d52b5c-546c-11e4-821d-2c59e54ace74 HTTP/1.1" 200 559 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:10:49 +0000] "GET /MAAS/api/1.0/networks/maas-eth0/?op=list_connected_macs HTTP/1.1" 200 898 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:10:49 +0000] "GET /MAAS/api/1.0/networks/maas-eth0/?op=list_connected_macs HTTP/1.1" 200 898 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:10:49 +0000] "POST /MAAS/api/1.0/nodes/node-65d52b5c-546c-11e4-821d-2c59e54ace74/?op=start HTTP/1.1" 200 842 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:11:01 +0000] "GET /MAAS/api/1.0/nodes/?agent_name=4a4b73ac-1417-4b66-869e-9e335f194486&id=node-65d52b5c-546c-11e4-821d-2c59e54ace74&op=list HTTP/1.1" 200 855 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:11:17 +0000] "GET /MAAS/api/1.0/nodes/?agent_name=4a4b73ac-1417-4b66-869e-9e335f194486&id=node-65d52b5c-546c-11e4-821d-2c59e54ace74&op=list HTTP/1.1" 200 855 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:11:49 +0000] "GET /MAAS/api/1.0/nodes/?agent_name=4a4b73ac-1417-4b66-869e-9e335f194486&id=node-65d52b5c-546c-11e4-821d-2c59e54ace74&op=list HTTP/1.1" 200 855 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:12:51 +0000] "GET /MAAS/api/1.0/nodes/?agent_name=4a4b73ac-1417-4b66-869e-9e335f194486&id=node-5c4d39da-546c-11e4-821d-2c59e54ace74&op=list HTTP/1.1" 200 844 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:12:54 +0000] "GET /MAAS/api/1.0/nodes/?agent_name=4a4b73ac-1417-4b66-869e-9e335f194486&id=node-65d52b5c-546c-11e4-821d-2c59e54ace74&op=list HTTP/1.1" 200 855 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:15:02 +0000] "GET /MAAS/api/1.0/nodes/?agent_name=4a4b73ac-1417-4b66-869e-9e335f194486&id=node-65d52b5c-546c-11e4-821d-2c59e54ace74&op=list HTTP/1.1" 200 856 "-" "Go 1.1 package http"
10.96.10.86 - - [10/Jul/2015:20:17:07 +0000] "GET /MAAS/api/1.0/nodes/?agent_name=4a4b73ac-1417-4b66-869e-9e335f194486&id=node-5c4d39da-546c-11e4-821d-2c59e54ace74&op=list HTTP/1.1" 200 844 "-" "Go 1.1 package http"
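
One quick way to see which API operations appear in a log like this is to tally the `op` query parameter. This is a minimal sketch assuming Apache combined-format lines like the ones above; the function name is hypothetical and only three of the lines are reproduced as sample input. In the excerpt shown, no request carries `op=deployment_status`.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the request part of an Apache access log line.
REQUEST_RE = re.compile(r'"(GET|POST) (\S+) HTTP/[\d.]+"')


def count_ops(log_lines):
    """Count requests per MAAS API operation (the `op` query parameter)."""
    ops = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        query = parse_qs(urlsplit(match.group(2)).query)
        # Requests without an explicit op are counted under "(none)".
        for op in query.get("op", ["(none)"]):
            ops[op] += 1
    return ops


# Three lines copied from the log excerpt above.
log_lines = [
    '10.96.10.86 - - [10/Jul/2015:20:10:48 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 200 825 "-" "Go 1.1 package http"',
    '10.96.10.86 - - [10/Jul/2015:20:10:49 +0000] "POST /MAAS/api/1.0/nodes/node-65d52b5c-546c-11e4-821d-2c59e54ace74/?op=start HTTP/1.1" 200 842 "-" "Go 1.1 package http"',
    '10.96.10.86 - - [10/Jul/2015:20:11:01 +0000] "GET /MAAS/api/1.0/nodes/?agent_name=4a4b73ac-1417-4b66-869e-9e335f194486&id=node-65d52b5c-546c-11e4-821d-2c59e54ace74&op=list HTTP/1.1" 200 855 "-" "Go 1.1 package http"',
]

ops = count_ops(log_lines)
print(dict(ops))
print("deployment_status seen:", "deployment_status" in ops)
```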

At this point, the node has failed already:
$ maas admin-atlas nodes deployment-status nodes=node-65d52b5c-546c-11e4-821d-2c59e54ace74
{
    "node-65d52b5c-546c-11e4-821d-2c59e54ace74": "Failed deployment"
}

But not according to juju:
$ juju status
environment: scapestack-trusty
machines:
  "0":
    agent-state: started
    agent-version: 1.24.2
    dns-name: sekine.scapestack
    instance-id: /MAAS/api/1.0/nodes/node-5c4d39da-546c-11e4-821d-2c59e54ace74/
    series: trusty
    hardware: arch=amd64 ...


Andreas Hasenack (ahasenack) wrote :

Here is a packet capture, run on the MAAS node, with the filter "host <ip-of-juju-bootstrap>".

Andreas Hasenack (ahasenack) wrote :

The server is MAAS 1.7, btw, not 1.8, if that matters.

Ian Booth (wallyworld) wrote :

It looks like the use of the deployment_status API is currently restricted to bootstrap, which is synchronous. Thus when the bootstrap node is started, the deployment status is polled until it becomes Deployed or Failed Deployment.

StartInstance(), however, is an async call to the provider. A node is acquired and started and the call returns; the node status stays Pending until the node's machine agent starts and updates the status. This has always been Juju behaviour. There's no mechanism in Juju currently to recognise that StartInstance() may not immediately return an error. So for all providers, including MAAS, if StartInstance() initially returns OK but then subsequently fails, the Juju status will stay Pending. The status only goes to error if the StartInstance() call itself fails.

To address this issue, it may be necessary to scope the work as part of the current Observability initiative since, as far as I can see, it is not a MAAS-specific issue but rather a fundamental Juju limitation.
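
For illustration, the synchronous polling described above for bootstrap can be sketched as follows. This is not Juju's actual code (which is Go, in provider/maas/environ.go); it is a hedged sketch in Python where `fetch_status` is a hypothetical callable standing in for the deployment_status API call, the interval and timeout values are arbitrary, and "Deploying" is just an assumed placeholder for any non-terminal status.

```python
import time


def wait_for_deployment(fetch_status, node_id, interval=10.0, timeout=1800.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the node is Deployed; raise if it fails or the wait times out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status(node_id)
        if status == "Deployed":
            return status
        if status == "Failed deployment":
            # Surfacing this is what the async add-machine path lacks.
            raise RuntimeError(f"{node_id}: {status}")
        if clock() >= deadline:
            raise TimeoutError(f"{node_id}: still {status!r} after {timeout}s")
        sleep(interval)


# Simulated responses; a real caller would query MAAS instead.
responses = iter(["Deploying", "Deploying", "Failed deployment"])

try:
    wait_for_deployment(lambda node_id: next(responses), "node-example",
                        sleep=lambda seconds: None)
    outcome = "deployed"
except RuntimeError as err:
    outcome = str(err)

print(outcome)
```

With add-machine there is no equivalent loop after StartInstance() returns, so a failure like the simulated one here never reaches the machine status.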

Curtis Hovey (sinzui)
tags: added: feature
no longer affects: juju-core/1.24
Ante Karamatić (ivoks)
tags: added: cpec
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.0 → 1.25.1
tags: added: landscape
David Britton (dpb)
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Changed in juju-core:
milestone: 1.25.1 → 1.26.0
Cheryl Jennings (cherylj) wrote :

This should be addressed in the observability work I'm doing for 2.0.

Changed in juju-core:
milestone: 1.26.0 → 2.0-alpha4
Changed in juju-core:
milestone: 2.0-alpha4 → 2.0-alpha3
summary: - MAAS node has "failed deployment", juju just says "pending"
+ MAAS node has "failed deployment", juju just says "pending" when using
+ juju add-machine
tags: added: kanban-cross-team
Changed in juju-core:
milestone: 2.0-alpha3 → 2.0-beta3
Changed in juju-core:
status: Triaged → Fix Committed
assignee: nobody → Horacio Durán (hduran-8)
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta3 → none
milestone: none → 2.0-beta3