node set to "failed deployment" for no visible reason

Bug #1604962 reported by Jason Hobbs
This bug affects 5 people
Affects     Status         Importance   Assigned to    Milestone
MAAS        Fix Released   Critical     Blake Rouse
MAAS 2.1    Fix Released   Critical     Blake Rouse

Bug Description

A node reached the end of installation and was marked Failed Deployment, but I can't find the reason it was marked that way.

I've attached logs from the MAAS server, the event log, and the install log.

In the event log we see this:
        {
            "type": "Node changed status",
            "level": "INFO",
            "node": "4y3hdg",
            "hostname": "hayward-11",
            "id": 1436389,
            "description": "From 'Deploying' to 'Failed deployment'",
            "created": "Wed, 20 Jul. 2016 01:01:00"
        },
        {
            "type": "Node installation",
            "level": "DEBUG",
            "node": "4y3hdg",
            "hostname": "hayward-11",
            "id": 1436388,
            "description": "'cloudinit' running modules for final",
            "created": "Wed, 20 Jul. 2016 01:01:00"
        },

In the install log we see this:
Length: unspecified [text/plain]
Saving to: '/dev/null'

     0K 138K=0s

2016-07-20 01:00:50 (138 KB/s) - '/dev/null' saved [2]

curtin: Installation finished.

This is from this run:

http://10.245.162.43:8080/job/pipeline_deploy/8151/console

This is with juju 2.0 beta 12 and maas 2.0 RC2.

Related Bugs:
 * bug 1606999: reporting messages can slow down operations greatly
 * bug 1674734: [curtin] if invoked by init, should wait until cloud-init is finished for reboot

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Jason,

Can you confirm a couple of things please?

 - What's the curtin version
 - What's the cloud-init version that your image is using?

Thanks!

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1604962] Re: node set to "failed deployment" for no visible reason

This is with curtin 0.1.0~bzr399-0ubuntu1~16.04.1 and
cloud-init 0.7.7~bzr1246-0ubuntu1~16.04.1

Revision history for this message
Larry Michel (lmic) wrote :

I am not sure whether this is related, but yesterday I saw a node which was marked as failed deployment in juju status while it was in the deployed state on the MAAS server. I thought the issue was with juju, but in light of this bug I am wondering whether MAAS could initially return failed deployment to juju before the node is marked as deployed. I will monitor to see whether I can recreate it and capture logs.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Larry - thanks. That's different from this bug, where MAAS clearly had the node marked as failed deployment, not just in juju status output. If you see that separate issue again, please file a separate bug.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

I think the reason for the failure is this:

Jul 20 02:18:55 hayward-11 pollinate[2033]: WARNING: Network communication failed [0]\n % Total % Received % Xferd Average Speed Time Time Time Current#012 Dload Upload Total Spent Left Speed#012#015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 002:18:52.147979 * Trying 91.189.94.10...#012#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
Jul 20 02:18:55 hayward-11 pollinate[2033]: Jul 20 02:18:55 hayward-11 <13>Jul 20 02:18:55 pollinate[2033]: WARNING: Network communication failed [0]\n % Total % Received % Xferd Average Speed Time Time Time Current
Jul 20 02:18:55 hayward-11 pollinate[2033]: Dload Upload Total Spent Left Speed
Jul 20 02:18:55 hayward-11 pollinate[2033]: #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 002:18:52.147979 * Trying 91.189.94.10...
Jul 20 02:18:55 hayward-11 pollinate[2033]: #015 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0

cloud-init cannot reach the pollinate server, and I'm guessing that's what's causing MAAS to mark the machine as failed deployment. However, I don't see a message being posted to MAAS about this that would actually cause the failure... nor do I see anything in the event log.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

However, this is from an older log.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

OK, so I debugged this a bit and came across something in the logs which may (or may not) show what the issue is. On the failed node I see:

Jul 20 16:55:21 hayward-11 [CLOUDINIT] util.py[DEBUG]: cloud-init mode 'modules' took 153.244 seconds (153.24)
Jul 20 16:55:21 hayward-11 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final
Jul 20 16:55:21 hayward-11 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg' with {'allow_redirects': True, 'url': 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg', 'method': 'POST', 'headers': {'Authorization': 'OAuth oauth_nonce="158422460373674945521469033721", oauth_timestamp="1469033721", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="QNpkS3KULgYwj9m6Xu", oauth_token="RbHTw3U4qqWaqtzyK9", oauth_signature="%26ZfXD8J77vZU5UZjdwLPwbzqXZf59na76"'}} configuration
Jul 20 16:55:21 hayward-11 [CLOUDINIT] url_helper.py[DEBUG]: Read from http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg (200, 2b) after 1 attempts
Jul 20 16:55:21 hayward-11 cloud-init[2242]: Cloud-init v. 0.7.7 running 'modules:final' at Wed, 20 Jul 2016 16:52:48 +0000. Up 120.72 seconds.
Jul 20 16:55:21 hayward-11 cloud-init[2242]: Cloud-init v. 0.7.7 finished at Wed, 20 Jul 2016 16:55:21 +0000. Datasource DataSourceMAAS [http://10.244.192.10:5240/MAAS/metadata/curtin]. Up 273.60 seconds

Whereas on a node that successfully deployed in my local cluster:

Jul 20 23:52:19 node05 [CLOUDINIT] util.py[DEBUG]: cloud-init mode 'modules' took 63.922 seconds (63.91)
Jul 20 23:52:19 node05 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.90.90.254:5240/MAAS/metadata/status/4y3had' with {'allow_redirects': True, 'url': 'http://10.90.90.254:5240/MAAS/metadata/status/4y3had', 'method': 'POST', 'headers': {'Authorization': 'OAuth oauth_nonce="50531188898511683361469058739", oauth_timestamp="1469058739", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="QbbS8Ybs8pgz9fN7Ub", oauth_token="rQWr6gYvafJvULaWjp", oauth_signature="%26U4yvEWxbSzgEwQFrYhNQ4KHX2RK27zjx"'}} configuration
Jul 20 23:52:19 node05 [CLOUDINIT] url_helper.py[DEBUG]: Read from http://10.90.90.254:5240/MAAS/metadata/status/4y3had (200, 2b) after 1 attempts
Jul 20 23:52:19 node05 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final
Jul 20 23:52:19 node05 cloud-init[1141]: Cloud-init v. 0.7.7 running 'modules:final' at Wed, 20 Jul 2016 23:51:15 +0000. Up 28.21 seconds.
Jul 20 23:52:19 node05 cloud-init[1141]: Cloud-init v. 0.7.7 finished at Wed, 20 Jul 2016 23:52:19 +0000. Datasource DataSourceMAAS [http://10.90.90.254:5240/MAAS/metadata/curtin]. Up 91.87 seconds

I don't know if this is really related, but in the failed node's output we see "url_helper.py[DEBUG]: [0/1] open" after "handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final", while in the successful run it is the other way around.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Matching the failure with this:

Jul 20 01:01:00 maas2-integration maas.node: [INFO] hayward-11: Status transition from DEPLOYING to FAILED_DEPLOYMENT
Jul 20 01:01:00 maas2-integration maas.node: [ERROR] hayward-11: Marking node failed: Installation failed (refer to the installation log for more information).

That would mean cloud-init sent a message saying that something "FAILED", which caused MAAS to mark it as failed; this may well be related to the fact that we see the access to the metadata *after*:

"Jul 20 16:55:21 hayward-11 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final"

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Jul 20 16:55:21 hayward-11 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg' with {'allow_redirects': True, 'url': 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg', 'method': 'POST', 'headers': {'Authorization': 'OAuth oauth_nonce="158422460373674945521469033721", oauth_timestamp="1469033721", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="QNpkS3KULgYwj9m6Xu", oauth_token="RbHTw3U4qqWaqtzyK9", oauth_signature="%26ZfXD8J77vZU5UZjdwLPwbzqXZf59na76"'}} configuration

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Actually, it may even be this:

Jul 20 16:52:47 hayward-11 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-config/config-runcmd: SUCCESS: config-runcmd ran successfully
Jul 20 16:52:47 hayward-11 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg' with {'headers': {'Authorization': 'OAuth oauth_nonce="118413150381654594601469033567", oauth_timestamp="1469033567", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="QNpkS3KULgYwj9m6Xu", oauth_token="RbHTw3U4qqWaqtzyK9", oauth_signature="%26ZfXD8J77vZU5UZjdwLPwbzqXZf59na76"'}, 'url': 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg', 'allow_redirects': True, 'method': 'POST'} configuration
Jul 20 16:52:47 hayward-11 pollinate[2090]: WARNING: Network communication failed [0]\n % Total % Received % Xferd Average Speed Time Time Time Current#012 Dload Upload Total Spent Left Speed#012#015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 016:52:44.818437 * Trying 91.189.94.10...#012#015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
Jul 20 16:52:47 hayward-11 pollinate[2090]: Jul 20 16:52:47 hayward-11 <13>Jul 20 16:52:47 pollinate[2090]: WARNING: Network communication failed [0]\n % Total % Received % Xferd Average Speed Time Time Time Current
Jul 20 16:52:47 hayward-11 pollinate[2090]: Dload Upload Total Spent Left Speed
Jul 20 16:52:47 hayward-11 pollinate[2090]: #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 016:52:44.818437 * Trying 91.189.94.10...
Jul 20 16:52:47 hayward-11 pollinate[2090]: #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0

Revision history for this message
Scott Moser (smoser) wrote :

Hi,
To debug this, please follow
https://gist.github.com/smoser/2610e9b78b8d7b54319675d9e3986a1b

Stop the node from rebooting, and then grab /var/log/cloud-init*,
/run/cloud-init*, and /var/log/cloud

There is probably a WARN message in /var/log/cloud-init.log that indicates the failure.

Ideally, collect the /tmp/install.log that would be there if curtin ran.
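
As a rough illustration (not part of the gist above), something like the following could be used to pull the warning/error lines out of a collected cloud-init log; it assumes the usual cloud-init log format where the level appears in brackets (e.g. util.py[WARNING]):

    # Sketch only: print WARNING/ERROR/CRITICAL lines from a collected
    # cloud-init log.  Pass the path to the copied log, or default to
    # the usual location on the node.
    import re
    import sys

    log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cloud-init.log"
    level = re.compile(r"\[(WARNING|ERROR|CRITICAL)\]")

    with open(log_path, errors="replace") as log:
        for line in log:
            if level.search(line):
                print(line.rstrip())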

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Scott,

That isn't going to work. This is a failure in an automated system - one of
hundreds of installs a day. maas/curtin/cloud-init should be capturing the
cause of the failure and logging it - the fact that it isn't is a bug that
needs to be fixed. Please investigate why that failure isn't being logged
to MAAS and how it can be in an automated fashion. Why is the WARN message
from cloud-init.log not being pushed to MAAS? Why isn't cloud-init.log
being pushed to MAAS? Is there some way to automate that?

tags: added: conjure
Revision history for this message
Scott Moser (smoser) wrote :

Currently, I believe the problem we're seeing is due to the way in which curtin implements its reboot.
Basically, when curtin is done with the install, it backgrounds a process that does:
   sleep 5
   reboot

As I understand this problem, the race condition is that the 'reboot' happens before cloud-init is finished. reboot tells systemd to kill all processes and shut down; systemd kills cloud-init, and cloud-init exits (and reports failure) because it is not done.

If that is the case, then the situation can be avoided simply by adding a longer 'delay' to the curtin configuration that MAAS sends.

In /etc/maas/preseeds/curtin_userdata, there is a section here that looks like:
  power_state:
    mode: reboot

You can change this to say:
  power_state:
    mode: reboot
    delay: 30

Revision history for this message
Scott Moser (smoser) wrote :

Does this issue happen with 1.9?

Revision history for this message
Scott Moser (smoser) wrote :

Further investigation and working with Adam indicate that my assessment in comment 14 is probably correct.
Looking at cloud-init logs, we see things like:

Jul 22 14:46:53 ubuntu [CLOUDINIT] handlers.py[DEBUG]: start: init-network/config-ubuntu-init-switch: running config-ubuntu-init-switch with frequency once-per-instance
Jul 22 14:46:53 ubuntu [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://192.168.10.114:5240/MAAS/metadata/status/4y3h84' with {'allow_redirects': True, 'url': 'http://192.168.10.114:5240/MAAS/metadata/status/4y3h84', 'headers': {'Authorization': 'OAuth oauth_nonce="107803251774003364951469198813", oauth_timestamp="1469198813", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="KCpd3JJTTbtFMjKz7D", oauth_token="9kJS9LRLK7ZaQELPCv", oauth_signature="%26JXB592yRU6RaWGa7GTLegfqWN8H6B9ZE"'}, 'method': 'POST'} configuration
Jul 22 14:47:00 ubuntu [CLOUDINIT] url_helper.py[DEBUG]: Read from http://192.168.10.114:5240/MAAS/metadata/status/4y3h84 (200, 2b) after 1 attempts
Jul 22 14:47:00 ubuntu [CLOUDINIT] util.py[DEBUG]: Writing to /var/lib/cloud/instances/4y3h84/sem/config_ubuntu_init_switch - wb: [420] 25 bytes
Jul 22 14:47:00 ubuntu [CLOUDINIT] helpers.py[DEBUG]: Running co

There, the event post to MAAS took 7 seconds.

So what happened was: curtin finished, cloud-init was in the middle of posting an event, and curtin's reboot fired. cloud-init got the kill signal from systemd and reported its failure.

We can definitely look to improve how curtin interacts with cloud-init for rebooting so that this doesn't happen, but it's not good if an event takes 7 seconds to post. Say a post event took 2 seconds and a deploy-install-boot reported 120 events; that'd be 4 minutes of wall clock time spent waiting for API responses.

I don't have any good ideas on how we could handle that. We could background / batch the status posts and move on, but we should probably still wait for them to come back and check their results at some point, rather than just ignoring a possible post failure.
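
As a rough sketch of that idea (hypothetical code, not cloud-init's actual reporting implementation; the class name, methods, and endpoint are made up), the posts could be queued to a background worker and then flushed and checked once before exit:

    # Hypothetical sketch: background the status posts, but still flush
    # the queue and check for failures once, right before exit/reboot.
    import queue
    import threading
    import urllib.request

    class StatusReporter:
        def __init__(self, endpoint):
            self.endpoint = endpoint
            self.queue = queue.Queue()
            self.failures = []
            self.worker = threading.Thread(target=self._drain, daemon=True)
            self.worker.start()

        def post(self, payload):
            # Returns immediately; the background worker sends it.
            self.queue.put(payload)

        def _drain(self):
            while True:
                payload = self.queue.get()
                if payload is None:
                    return
                try:
                    urllib.request.urlopen(self.endpoint, data=payload, timeout=10)
                except Exception as exc:
                    self.failures.append(exc)
                finally:
                    self.queue.task_done()

        def finish(self):
            # Wait for all queued posts and surface any failures instead
            # of silently dropping them.
            self.queue.join()
            self.queue.put(None)
            self.worker.join()
            return self.failures

The point is only that the final wait and failure check happen once, at the end, instead of blocking the boot on every individual post.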

Changed in curtin:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Scott,

So in that case, I have a question. Why do we have to tell curtin to reboot at all if, at the end of the day, cloud-init still needs to run and do stuff? Couldn't cloud-init simply reboot the machine when it is done? We could tell cloud-init to do that instead of curtin. What do you think?

Revision history for this message
Andres Rodriguez (andreserl) wrote :
Revision history for this message
Blake Rouse (blake-rouse) wrote :

@Andres

The above paste should work and fix the race condition; have you tried deploying with that change?

Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: [Bug 1604962] [NEW] node set to "failed deployment" for no visible reason

I have tried it and it seems to work just fine. The question I have for you is why we added the power off stuff in the curtin preseed rather than in cloud-init. Is it because precise needs it, since its cloud-init doesn't provide the power off stanza?

Revision history for this message
Scott Moser (smoser) wrote :

You added it to curtin for a few reasons:
a.) cloud-init in precise does not support the poweroff stanza (we could SRU that).
b.) MAAS prior to 2.0 only sent a single configuration "part" to cloud-init. That part was the curtin blob to execute (it started with '#!'), so there was no channel to carry the cloud-init poweroff stanza.

Scott Moser (smoser)
description: updated
Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

We started seeing this with maas 2.0 and added the workaround from comment #14 (delay: 30). That fixed the problem for us.

I'm attaching the output of tail -f /var/log/cloud-init* while the installation was happening.

Revision history for this message
Jon Grimm (jgrimm) wrote :

Scott is looking at ways to make curtin a bit more robust in this area. Rather than waiting an arbitrary timeout (currently 5 seconds), it will attempt to watch for when things are done. Note: the question of why MAAS is so busy / has limited responsiveness is a different question (bug?); we are just teaching curtin to be smarter than an arbitrary timeout value.
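
One possible shape for that (an illustration only, not necessarily the change curtin actually lands) is to poll for cloud-init's completion marker before issuing the reboot; this assumes cloud-init writes /run/cloud-init/result.json when the final stage completes, and falls back to rebooting after a timeout:

    # Illustration only: wait for cloud-init's "final" stage to complete
    # before rebooting, instead of sleeping an arbitrary number of seconds.
    import os
    import subprocess
    import time

    RESULT_FILE = "/run/cloud-init/result.json"  # written when modules:final completes

    def wait_for_cloud_init(timeout=600, poll=2):
        deadline = time.time() + timeout
        while time.time() < deadline:
            if os.path.exists(RESULT_FILE):
                return True
            time.sleep(poll)
        return False  # give up waiting and reboot anyway

    if __name__ == "__main__":
        wait_for_cloud_init()
        subprocess.call(["reboot"])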

Revision history for this message
Larry Michel (lmic) wrote :

I have been hitting this even with the workaround in place:

Node changed status - From 'Deploying' to 'Failed deployment' Fri, 04 Nov. 2016 17:10:16
Marking node failed - Installation failed (refer to the installation log for more information). Fri, 04 Nov. 2016 17:10:16
Node installation failure - 'cloudinit' running modules for final Fri, 04 Nov. 2016 17:10:16
Installation complete - Node disabled netboot Fri, 04 Nov. 2016 17:09:26
PXE Request - installation Fri, 04 Nov. 2016 16:59:24
Node powered on Fri, 04 Nov. 2016 16:56:12
Powering node on Fri, 04 Nov. 2016 16:56:07
User starting deployment - (oil) Fri, 04 Nov. 2016 16:56:04
User acquiring node - (oil) Fri, 04 Nov. 2016 16:56:02

I hit it yesterday on one of the systems and tried increasing the delay to 45 seconds:

power_state:
  mode: reboot
  delay: 45

But I am seeing it again.

This is with maas 2.1.

ubuntu@maas2-production:~$ dpkg -l|grep curtin
ii curtin-common 0.1.0~bzr418-0ubuntu1~ubuntu16.04.1 all Library and tools for curtin installer
ii python-curtin 0.1.0~bzr418-0ubuntu1~ubuntu16.04.1 all Library and tools for curtin installer
ii python3-curtin 0.1.0~bzr418-0ubuntu1~ubuntu16.04.1 all Library and tools for curtin installer
ubuntu@maas2-production:~$ dpkg -l|grep maas
ii maas 2.1.0+bzr5480-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS client and command-line interface
ii maas-common 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.1.0+bzr5480-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.1.0+bzr5480-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.1.0+bzr5480-0ubuntu1~16.04.1 all Region Controller for MAAS
ii python3-django-maas 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Larry,

Compare the versions of cloud-init in the images vs. before, and so on.

David Britton (dpb)
Changed in landscape:
milestone: none → 16.11
tags: added: landscape
David Britton (dpb)
Changed in landscape:
status: New → Confirmed
David Britton (dpb)
tags: added: cdo-qa-blocker
no longer affects: maas/2.0
Changed in landscape:
milestone: 16.11 → 16.12
Changed in landscape:
milestone: 16.12 → 17.01
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.01 → 17.02
Revision history for this message
Chad Smith (chad.smith) wrote :

Landscape system-tests haven't seen this issue in quite a while; scapestack is running MAAS 2.1.3.

Changed in landscape:
status: Confirmed → Incomplete
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.02 → 17.03
Revision history for this message
Scott Moser (smoser) wrote :

I deleted the curtin task. The reasoning is that a change went into MAAS to stop using the faulty code in curtin and to use cloud-init instead.

no longer affects: curtin
Scott Moser (smoser)
description: updated
David Britton (dpb)
no longer affects: landscape