node set to "failed deployment" for no visible reason

Bug #1604962 reported by Jason Hobbs on 2016-07-20
This bug affects 5 people
Affects   Importance   Assigned to
MAAS      Critical     Blake Rouse
2.1       Critical     Blake Rouse
Trunk     Critical     Blake Rouse

Bug Description

A node reached the end of installation and was marked Failed Deployment, but I can't discover the reason it was marked that.

I've attached logs from the MAAS server, and the event log, and the install log.

In the event log we see this:
        {
            "type": "Node changed status",
            "level": "INFO",
            "node": "4y3hdg",
            "hostname": "hayward-11",
            "id": 1436389,
            "description": "From 'Deploying' to 'Failed deployment'",
            "created": "Wed, 20 Jul. 2016 01:01:00"
        },
        {
            "type": "Node installation",
            "level": "DEBUG",
            "node": "4y3hdg",
            "hostname": "hayward-11",
            "id": 1436388,
            "description": "'cloudinit' running modules for final",
            "created": "Wed, 20 Jul. 2016 01:01:00"
        },

In the install log we see this:
Length: unspecified [text/plain]
Saving to: '/dev/null'

     0K 138K=0s

2016-07-20 01:00:50 (138 KB/s) - '/dev/null' saved [2]

curtin: Installation finished.

This is from this run:

http://10.245.162.43:8080/job/pipeline_deploy/8151/console

This is with juju 2.0 beta 12 and maas 2.0 RC2.

Related Bugs:
 * bug 1606999: reporting messages can slow down operations greatly
 * bug 1674734: [curtin] if invoked by init, should wait until cloud-init is finished for reboot

Jason Hobbs (jason-hobbs) wrote :
Andres Rodriguez (andreserl) wrote :

Hi Jason,

Can you confirm a couple of things please?

 - What's the curtin version
 - What's the cloud-init version that your image is using?

Thanks!

This is with curtin 0.1.0~bzr399-0ubuntu1~16.04.1 and
cloud-init 0.7.7~bzr1246-0ubuntu1~16.04.1

Larry Michel (lmic) wrote :

I am not sure whether this is related, but yesterday I saw a node that was marked as failed deployment in juju status while on the MAAS server it was in the deployed state. I thought the issue was with juju, but in light of this bug I wonder whether MAAS could initially return failed deployment to juju before the node is marked as deployed. I will monitor to see whether I can recreate it and capture logs.

Jason Hobbs (jason-hobbs) wrote :

Larry - thanks. That's different from this bug, where MAAS clearly had the node marked as failed deployment, not just in juju status output. If you see that separate issue again, please file a separate bug.

Andres Rodriguez (andreserl) wrote :

I think the failure is due to:

Jul 20 02:18:55 hayward-11 pollinate[2033]: WARNING: Network communication failed [0]\n % Total % Received % Xferd Average Speed Time Time Time Current#012 Dload Upload Total Spent Left Speed#012#015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 002:18:52.147979 * Trying 91.189.94.10...#012#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
Jul 20 02:18:55 hayward-11 pollinate[2033]: Jul 20 02:18:55 hayward-11 <13>Jul 20 02:18:55 pollinate[2033]: WARNING: Network communication failed [0]\n % Total % Received % Xferd Average Speed Time Time Time Current
Jul 20 02:18:55 hayward-11 pollinate[2033]: Dload Upload Total Spent Left Speed
Jul 20 02:18:55 hayward-11 pollinate[2033]: #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 002:18:52.147979 * Trying 91.189.94.10...
Jul 20 02:18:55 hayward-11 pollinate[2033]: #015 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0

cloud-init cannot reach the pollinate server, and I'm guessing that's what is causing MAAS to mark the machine as failed deployment. However, I don't see a message being posted to MAAS about this that would actually cause this failure, nor do I see anything in the event log.

Andres Rodriguez (andreserl) wrote :

However, this is an older log.

Andres Rodriguez (andreserl) wrote :

Ok, so I debugged this a bit and came across something in the logs which may or may not show what the issue is. In the failed node I see:

Jul 20 16:55:21 hayward-11 [CLOUDINIT] util.py[DEBUG]: cloud-init mode 'modules' took 153.244 seconds (153.24)
Jul 20 16:55:21 hayward-11 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final
Jul 20 16:55:21 hayward-11 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg' with {'allow_redirects': True, 'url': 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg', 'method': 'POST', 'headers': {'Authorization': 'OAuth oauth_nonce="158422460373674945521469033721", oauth_timestamp="1469033721", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="QNpkS3KULgYwj9m6Xu", oauth_token="RbHTw3U4qqWaqtzyK9", oauth_signature="%26ZfXD8J77vZU5UZjdwLPwbzqXZf59na76"'}} configuration
Jul 20 16:55:21 hayward-11 [CLOUDINIT] url_helper.py[DEBUG]: Read from http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg (200, 2b) after 1 attempts
Jul 20 16:55:21 hayward-11 cloud-init[2242]: Cloud-init v. 0.7.7 running 'modules:final' at Wed, 20 Jul 2016 16:52:48 +0000. Up 120.72 seconds.
Jul 20 16:55:21 hayward-11 cloud-init[2242]: Cloud-init v. 0.7.7 finished at Wed, 20 Jul 2016 16:55:21 +0000. Datasource DataSourceMAAS [http://10.244.192.10:5240/MAAS/metadata/curtin]. Up 273.60 seconds

Whereas in a node that successfully deployed in my local cluster:

Jul 20 23:52:19 node05 [CLOUDINIT] util.py[DEBUG]: cloud-init mode 'modules' took 63.922 seconds (63.91)
Jul 20 23:52:19 node05 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.90.90.254:5240/MAAS/metadata/status/4y3had' with {'allow_redirects': True, 'url': 'http://10.90.90.254:5240/MAAS/metadata/status/4y3had', 'method': 'POST', 'headers': {'Authorization': 'OAuth oauth_nonce="50531188898511683361469058739", oauth_timestamp="1469058739", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="QbbS8Ybs8pgz9fN7Ub", oauth_token="rQWr6gYvafJvULaWjp", oauth_signature="%26U4yvEWxbSzgEwQFrYhNQ4KHX2RK27zjx"'}} configuration
Jul 20 23:52:19 node05 [CLOUDINIT] url_helper.py[DEBUG]: Read from http://10.90.90.254:5240/MAAS/metadata/status/4y3had (200, 2b) after 1 attempts
Jul 20 23:52:19 node05 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final
Jul 20 23:52:19 node05 cloud-init[1141]: Cloud-init v. 0.7.7 running 'modules:final' at Wed, 20 Jul 2016 23:51:15 +0000. Up 28.21 seconds.
Jul 20 23:52:19 node05 cloud-init[1141]: Cloud-init v. 0.7.7 finished at Wed, 20 Jul 2016 23:52:19 +0000. Datasource DataSourceMAAS [http://10.90.90.254:5240/MAAS/metadata/curtin]. Up 91.87 seconds

I don't know if this is really related, but in the failed node's output we see "url_helper.py[DEBUG]: [0/1] open" after "Jul 20 16:55:21 hayward-11 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final", while in the successful run it is the other way around.
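The ordering difference described above can be checked mechanically. This is a minimal sketch that assumes syslog-style lines like the ones quoted in this comment; the helper names `first_index` and `post_follows_final` are illustrative and not part of cloud-init or MAAS, and the check only looks at the first occurrence of each marker, so it is meant for short excerpts rather than a full log.

```python
def first_index(lines, needle):
    """Return the index of the first line containing needle, or None."""
    for i, line in enumerate(lines):
        if needle in line:
            return i
    return None


def post_follows_final(lines):
    """True if the status POST is opened after modules-final reports finish.

    Returns None when either marker is missing from the excerpt.
    """
    finish = first_index(lines, "finish: modules-final: SUCCESS")
    post = first_index(lines, "url_helper.py[DEBUG]: [0/1] open")
    if finish is None or post is None:
        return None
    return post > finish


# Abbreviated excerpts of the two logs quoted above.
failed_node = [
    "hayward-11 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final",
    "hayward-11 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg'",
]
good_node = [
    "node05 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.90.90.254:5240/MAAS/metadata/status/4y3had'",
    "node05 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final",
]

print(post_follows_final(failed_node))  # True
print(post_follows_final(good_node))    # False
```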

Andres Rodriguez (andreserl) wrote :

Matching the failure with this:

Jul 20 01:01:00 maas2-integration maas.node: [INFO] hayward-11: Status transition from DEPLOYING to FAILED_DEPLOYMENT
Jul 20 01:01:00 maas2-integration maas.node: [ERROR] hayward-11: Marking node failed: Installation failed (refer to the installation log for more information).

That would mean that cloud-init sent a message saying that something "FAILED", which made MAAS mark the node as failed. That might well be related to the fact that we see the access to the metadata *after*:

"Jul 20 16:55:21 hayward-11 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final"

Andres Rodriguez (andreserl) wrote :

Jul 20 16:55:21 hayward-11 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg' with {'allow_redirects': True, 'url': 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg', 'method': 'POST', 'headers': {'Authorization': 'OAuth oauth_nonce="158422460373674945521469033721", oauth_timestamp="1469033721", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="QNpkS3KULgYwj9m6Xu", oauth_token="RbHTw3U4qqWaqtzyK9", oauth_signature="%26ZfXD8J77vZU5UZjdwLPwbzqXZf59na76"'}} configuration

Andres Rodriguez (andreserl) wrote :

Actually, it may even be this:

Jul 20 16:52:47 hayward-11 [CLOUDINIT] handlers.py[DEBUG]: finish: modules-config/config-runcmd: SUCCESS: config-runcmd ran successfully
Jul 20 16:52:47 hayward-11 [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg' with {'headers': {'Authorization': 'OAuth oauth_nonce="118413150381654594601469033567", oauth_timestamp="1469033567", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="QNpkS3KULgYwj9m6Xu", oauth_token="RbHTw3U4qqWaqtzyK9", oauth_signature="%26ZfXD8J77vZU5UZjdwLPwbzqXZf59na76"'}, 'url': 'http://10.244.192.10:5240/MAAS/metadata/status/4y3hdg', 'allow_redirects': True, 'method': 'POST'} configuration
Jul 20 16:52:47 hayward-11 pollinate[2090]: WARNING: Network communication failed [0]\n % Total % Received % Xferd Average Speed Time Time Time Current#012 Dload Upload Total Spent Left Speed#012#015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 016:52:44.818437 * Trying 91.189.94.10...#012#015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
Jul 20 16:52:47 hayward-11 pollinate[2090]: Jul 20 16:52:47 hayward-11 <13>Jul 20 16:52:47 pollinate[2090]: WARNING: Network communication failed [0]\n % Total % Received % Xferd Average Speed Time Time Time Current
Jul 20 16:52:47 hayward-11 pollinate[2090]: Dload Upload Total Spent Left Speed
Jul 20 16:52:47 hayward-11 pollinate[2090]: #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 016:52:44.818437 * Trying 91.189.94.10...
Jul 20 16:52:47 hayward-11 pollinate[2090]: #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0

Scott Moser (smoser) wrote :

Hi,
to debug this please follow
https://gist.github.com/smoser/2610e9b78b8d7b54319675d9e3986a1b

stop the node from rebooting, and then grab /var/log/cloud-init*
and /run/cloud-init* and /var/log/cloud

there is probably a WARN message in /var/log/cloud-init.log that indicates the failure.

ideally collect a /tmp/install.log that would be there if curtin ran.

Jason Hobbs (jason-hobbs) wrote :

Scott,

That isn't going to work. This is a failure in an automated system - one of
hundreds of installs a day. maas/curtin/cloud-init should be capturing the
cause of the failure and logging it - the fact that it isn't is a bug that
needs to be fixed. Please investigate why that failure isn't being logged
to MAAS and how it can be in an automated fashion. Why is the WARN message
from cloud-init.log not being pushed to MAAS? Why isn't cloud-init.log
being pushed to MAAS? Is there some way to automate that?

tags: added: conjure
Scott Moser (smoser) wrote :

I believe the problem we're seeing is due to the way curtin implements its reboot.
Basically, when curtin is done with the install, it backgrounds a process that does:
   sleep 5
   reboot

As I understand this problem, the race condition is that the 'reboot' is happening before cloud-init is finished. reboot tells systemd to kill all processes and shut down. systemd kills cloud-init, and cloud-init exits (and reports) failure because it is not done.

If that is the case, then the situation can be avoided simply by adding a longer 'delay' to the curtin configuration that maas sends.

In /etc/maas/preseeds/curtin_userdata, there is a section here that looks like:
  power_state:
    mode: reboot

You can change this to say:
  power_state:
    mode: reboot
    delay: 30
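
To make the race concrete, here is a toy model. It is not curtin's or cloud-init's actual code: it only compares the moment the backgrounded reboot fires with the moment cloud-init would finish. The function name and parameters are hypothetical; the 5 second value comes from the `sleep 5` above, and the 7 second figure is just an example of a slow status post.

```python
def reboot_interrupts_cloud_init(reboot_delay, remaining_work):
    """Return True if the reboot fires before cloud-init is done.

    reboot_delay:   seconds the background process waits before 'reboot'
    remaining_work: seconds cloud-init still needs after curtin finishes
                    (for example, time spent posting status events)
    """
    return reboot_delay < remaining_work


# With the 5 second sleep, a post that takes 7 seconds loses the race.
print(reboot_interrupts_cloud_init(5, 7))   # True
# With delay: 30 the same post completes before the reboot.
print(reboot_interrupts_cloud_init(30, 7))  # False
```

A longer delay only widens the window. It does not remove the race if cloud-init needs longer than the delay.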

Scott Moser (smoser) wrote :

Does this issue happen with 1.9?

Scott Moser (smoser) wrote :

Further investigation and work with Adam indicate that my assessment in comment 14 is probably correct.
Looking at cloud-init logs, we see things like:

Jul 22 14:46:53 ubuntu [CLOUDINIT] handlers.py[DEBUG]: start: init-network/config-ubuntu-init-switch: running config-ubuntu-init-switch with frequency once-per-instance
Jul 22 14:46:53 ubuntu [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://192.168.10.114:5240/MAAS/metadata/status/4y3h84' with {'allow_redirects': True, 'url': 'http://192.168.10.114:5240/MAAS/metadata/status/4y3h84', 'headers': {'Authorization': 'OAuth oauth_nonce="107803251774003364951469198813", oauth_timestamp="1469198813", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="KCpd3JJTTbtFMjKz7D", oauth_token="9kJS9LRLK7ZaQELPCv", oauth_signature="%26JXB592yRU6RaWGa7GTLegfqWN8H6B9ZE"'}, 'method': 'POST'} configuration
Jul 22 14:47:00 ubuntu [CLOUDINIT] url_helper.py[DEBUG]: Read from http://192.168.10.114:5240/MAAS/metadata/status/4y3h84 (200, 2b) after 1 attempts
Jul 22 14:47:00 ubuntu [CLOUDINIT] util.py[DEBUG]: Writing to /var/lib/cloud/instances/4y3h84/sem/config_ubuntu_init_switch - wb: [420] 25 bytes
Jul 22 14:47:00 ubuntu [CLOUDINIT] helpers.py[DEBUG]: Running co

There, the event post to MAAS took 7 seconds.

So what happened was curtin finished, cloud-init was in the middle of posting an event, and curtin's reboot fired. cloud-init got the kill signal from systemd and reported its failure.

We can definitely look to improve how curtin interacts with cloud-init for rebooting so that this doesn't happen, but it's not good if an event takes 7 seconds to post. Say a post event took 2 seconds, and a deploy-install-boot reported 120 events: that would be 4 minutes of wall clock time spent waiting for API responses.
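
The arithmetic behind that estimate, as a small sketch. It assumes every post is made serially and blocks until the response arrives; the function name is illustrative.

```python
def total_post_wait(seconds_per_post, event_count):
    """Wall-clock seconds spent waiting if every post is made serially."""
    return seconds_per_post * event_count


wait = total_post_wait(2, 120)
print(f"{wait} seconds = {wait / 60} minutes")  # 240 seconds = 4.0 minutes

# The same 120 events at the 7 seconds per post seen in the log above.
slow = total_post_wait(7, 120)
print(f"{slow} seconds = {slow / 60} minutes")  # 840 seconds = 14.0 minutes
```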

I don't have any good ideas on how we could handle that. We could background or batch the status posts and carry on, but we should probably still wait for them to come back and check their results at some point, rather than just ignoring a possible post failure.
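
One way the background-and-check-later idea could look, as a rough sketch only. `post_event` is a hypothetical stand-in for the real HTTP POST, and `BackgroundReporter` is an illustrative name; neither exists in cloud-init or curtin.

```python
from concurrent.futures import ThreadPoolExecutor


def post_event(event):
    """Hypothetical stand-in for the HTTP POST of one status event.

    Returns True on success. A real implementation would send the
    event to the status URL and check the response code.
    """
    return event.get("ok", True)


class BackgroundReporter:
    """Queue status posts on a worker thread; collect results at the end."""

    def __init__(self, post=post_event):
        self._post = post
        # A single worker keeps events in the order they were submitted.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def report(self, event):
        """Submit an event without waiting for the post to complete."""
        self._pending.append((event, self._pool.submit(self._post, event)))

    def finish(self):
        """Wait for all posts and return the events that failed to post."""
        failed = []
        for event, future in self._pending:
            try:
                ok = future.result()
            except Exception:
                ok = False
            if not ok:
                failed.append(event)
        self._pool.shutdown()
        self._pending = []
        return failed


reporter = BackgroundReporter()
reporter.report({"name": "modules-config", "ok": True})
reporter.report({"name": "modules-final", "ok": False})
print(reporter.finish())  # [{'name': 'modules-final', 'ok': False}]
```

The caller keeps working while posts are in flight and calls `finish()` once, before exiting, so a failed post is still noticed instead of being silently dropped.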

Changed in curtin:
importance: Undecided → Medium
status: New → Confirmed
Andres Rodriguez (andreserl) wrote :

Hi Scott,

So in that case, I have a question. Why do we have to tell curtin to reboot at all if, at the end of the day, cloud-init still needs to run and do stuff? Couldn't cloud-init simply reboot the machine when it is done? We could tell cloud-init to do that instead of curtin. What do you think?

Andres Rodriguez (andreserl) wrote :
Blake Rouse (blake-rouse) wrote :

@Andres

The above paste should work and fix the race condition, have you tried deploying with that change?

I have tried it and it seems to work just fine. The question I have for you is
why did we add the power off stuff in the curtin preseed rather than
cloud-init? Does precise need it as cloud-init doesn't provide the power
off stanza?

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

Scott Moser (smoser) wrote :

You added it to curtin for a few reasons.
a.) cloud-init in precise does not support the poweroff stanza (we could SRU that).
b.) maas prior to 2.0 only sent a single configuration "part" to cloud-init. That part was the curtin blob to execute (started with '#!') so there was no channel to carry the cloud-init poweroff stanza.

Scott Moser (smoser) on 2016-07-27
description: updated
Changed in maas:
status: Fix Committed → Fix Released
Andreas Hasenack (ahasenack) wrote :

We started seeing this with maas 2.0 and added the workaround from comment #14 (delay: 30). That fixed the problem for us.

I'm attaching the output of tail -f /var/log/cloud-init* while the installation was happening.

Jon Grimm (jgrimm) wrote :

Scott is looking at ways to make curtin a bit more robust in this area. Rather than waiting an arbitrary timeout value (currently 5 seconds), curtin will attempt to watch for when things are done. Note: why MAAS is so busy or slow to respond is a separate question (bug?); we are just teaching curtin to be a bit smarter than an arbitrary timeout value.
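
As an illustration of watching for completion instead of sleeping a fixed time, here is a generic polling sketch. It is not what curtin does; the marker path is an assumption chosen for the example, and `wait_for_path` is a hypothetical helper.

```python
import os
import time


def wait_for_path(path, timeout=120.0, poll_interval=0.5):
    """Poll until path exists or timeout seconds have elapsed.

    Returns True if the path appeared, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Assumption: some file signals that cloud-init has finished. The path
# below is illustrative; substitute whatever marker your setup provides.
FINISHED_MARKER = "/run/cloud-init/result.json"

if wait_for_path(FINISHED_MARKER, timeout=1.0, poll_interval=0.1):
    print("cloud-init finished; safe to reboot")
else:
    print("timed out waiting; reboot anyway or report an error")
```

A timeout is still needed as an upper bound, but the common case no longer depends on guessing how long cloud-init will take.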

Larry Michel (lmic) wrote :

I have been hitting this even with the workaround in place:

Node changed status - From 'Deploying' to 'Failed deployment' Fri, 04 Nov. 2016 17:10:16
Marking node failed - Installation failed (refer to the installation log for more information). Fri, 04 Nov. 2016 17:10:16
Node installation failure - 'cloudinit' running modules for final Fri, 04 Nov. 2016 17:10:16
Installation complete - Node disabled netboot Fri, 04 Nov. 2016 17:09:26
PXE Request - installation Fri, 04 Nov. 2016 16:59:24
Node powered on Fri, 04 Nov. 2016 16:56:12
Powering node on Fri, 04 Nov. 2016 16:56:07
User starting deployment - (oil) Fri, 04 Nov. 2016 16:56:04
User acquiring node - (oil) Fri, 04 Nov. 2016 16:56:02

I hit it yesterday on one of the systems and tried to increase to 45 seconds:

power_state:
  mode: reboot
  delay: 45

But seeing it again.

This is with maas 2.1.

ubuntu@maas2-production:~$ dpkg -l|grep curtin
ii curtin-common 0.1.0~bzr418-0ubuntu1~ubuntu16.04.1 all Library and tools for curtin installer
ii python-curtin 0.1.0~bzr418-0ubuntu1~ubuntu16.04.1 all Library and tools for curtin installer
ii python3-curtin 0.1.0~bzr418-0ubuntu1~ubuntu16.04.1 all Library and tools for curtin installer
ubuntu@maas2-production:~$ dpkg -l|grep maas
ii maas 2.1.0+bzr5480-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS client and command-line interface
ii maas-common 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.1.0+bzr5480-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.1.0+bzr5480-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.1.0+bzr5480-0ubuntu1~16.04.1 all Region Controller for MAAS
ii python3-django-maas 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.1.0+bzr5480-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Andres Rodriguez (andreserl) wrote :

Larry,

Compare the versions of cloud-init in the images now versus before, and so on.

Changed in landscape:
milestone: none → 16.11
tags: added: landscape
Changed in landscape:
status: New → Confirmed
tags: added: cdo-qa-blocker
no longer affects: maas/2.0
Changed in landscape:
milestone: 16.11 → 16.12
Changed in landscape:
milestone: 16.12 → 17.01
Chad Smith (chad.smith) on 2017-02-10
Changed in landscape:
milestone: 17.01 → 17.02
Chad Smith (chad.smith) wrote :

Landscape system-tests haven't seen this issue in quite a while; scapestack is running MAAS 2.1.3.

Changed in landscape:
status: Confirmed → Incomplete
Chad Smith (chad.smith) on 2017-03-16
Changed in landscape:
milestone: 17.02 → 17.03
Scott Moser (smoser) wrote :

I deleted the curtin task. The reasoning is that a change went into maas to not use the faulty code in curtin, but to use cloud-init instead.

no longer affects: curtin
Scott Moser (smoser) on 2017-03-21
description: updated
no longer affects: landscape