MAAS

deploying node re-enlists. regiond.log shows 'Unable to determine purpose for node'

Bug #1473167 reported by Scott Moser on 2015-07-09

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
MAAS	Fix Released	Critical	Lee Trager	MAAS 1.9.0

Bug Description

We upgraded yesterday (old version, new version):
python-maas-provisioningserver:amd64 (1.8.0~rc3+bzr4000-0ubuntu1~trusty1, 1.8.0+bzr4001-0ubuntu2~trusty1)

That upgrade is potentially the cause of this.

I have a node in MAAS. I deploy it via the UI or the command line and it goes into enlisting mode. The node enlists, seemingly successfully (although MAAS already knew about it). The enlistment process changes the IPMI password, so MAAS can no longer turn the node on or off.

/etc/maas/maas_cluster.conf has:
MAAS_URL="http://10.245.168.2/MAAS"
CLUSTER_UUID="9a4dbe50-1015-4fe1-92ab-d37c34052733"

/var/log/maas/clusterd.log shows:
2015-07-09 17:53:17+0000 [TFTP (UDP)] Datagram received from ('10.245.168.10', 25305): <RRQDatagram(filename=/grub/grub.cfg-ec:b1:d7:75:81:a0, mode=octet, options={'blksize': '1024', 'tsize': '0'})>
2015-07-09 17:53:17+0000 [HTTPPageGetter,client] Starting TFTP back-end failed.
Traceback (most recent call last):
Failure: twisted.web.error.Error: 500 INTERNAL SERVER ERROR

2015-07-09 17:53:17+0000 [TFTP (UDP)] Datagram received from ('10.245.168.10', 25306): <RRQDatagram(filename=/grub/grub.cfg-default-amd64, mode=octet, options={'blksize': '1024', 'tsize': '0'})>

/var/log/maas/maas-django.log shows:

ERROR 2015-07-09 17:53:17,798 maasserver Unable to determine purpose for node: 'horsea.dellstack'
ERROR 2015-07-09 17:53:17,800 maasserver ################################ Exception: (u"Unable to determine purpose for node: '%s'", u'horsea.dellstack') ################################
ERROR 2015-07-09 17:53:17,802 maasserver Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/pxeconfig.py", line 185, in pxeconfig
    if node is None or node.get_boot_purpose() == "commissioning":
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 1856, in get_boot_purpose
    preseed_type = get_deploying_preseed_type_for(self)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 379, in get_deploying_preseed_type_for
    purpose = get_available_purpose_for_node(purpose_order, node)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 348, in get_available_purpose_for_node
    "Unable to determine purpose for node: '%s'", node.fqdn)
PreseedError: (u"Unable to determine purpose for node: '%s'", u'horsea.dellstack')

regiond.log also has those errors.

2015-07-09 17:53:17 [maasserver] ERROR: Unable to determine purpose for node: 'horsea.dellstack'
2015-07-09 17:53:17 [maasserver] ERROR: ################################ Exception: (u"Unable to determine purpose for node: '%s'", u'horsea.dellstack') ################################
2015-07-09 17:53:17 [maasserver] ERROR: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/pxeconfig.py", line 185, in pxeconfig
    if node is None or node.get_boot_purpose() == "commissioning":
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 1856, in get_boot_purpose
    preseed_type = get_deploying_preseed_type_for(self)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 379, in get_deploying_preseed_type_for
    purpose = get_available_purpose_for_node(purpose_order, node)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 348, in get_available_purpose_for_node
    "Unable to determine purpose for node: '%s'", node.fqdn)
PreseedError: (u"Unable to determine purpose for node: '%s'", u'horsea.dellstack')

Related branches

lp:~ltrager/maas/hwe_backend

Merged into lp:~maas-committers/maas/trunk at revision 4160

Blake Rouse (community): Approve on 2015-08-05

Lee Trager (community): Needs Resubmitting on 2015-08-04

Mike Pontillo (community): Abstain on 2015-08-03

Jason Hobbs: Pending requested 2015-08-01

Ryan Beisner (1chb1n) wrote on 2015-07-09:

FYI, also see potentially-related bug https://bugs.launchpad.net/maas/+bug/1460097.

Scott Moser (smoser) wrote on 2015-07-09:

OK. I poked around a bit in the maas shell:

> from maasserver.models import Node as mnode
> horsea = [f for f in mnode.objects.all() if 'horsea' in f.__repr__()][0]
> horsea.architecture
u'amd64/hwe-w'
> horsea.get_distro_series()
trusty
> horsea.osystem
u'ubuntu'
> horsea.split_arch()
(u'amd64', u'hwe-w')
> horsea.get_boot_purpose()
>
... stacktrace ...

To illustrate, we can call get_boot_images_for directly:
get_boot_images_for(nodegroup=horsea.nodegroup, osystem=u'ubuntu', architecture=u'amd64', subarchitecture=u'hwe-w', series=u'vivid')

That call produces a stack trace as well.

The stack trace takes you into the default boot mode, which is enlistment.

My system had gotten 'amd64/hwe-w' set via my 'maas-deploy-node' script:
http://bazaar.launchpad.net/~virtual-maasers/+junk/maas-libvirt-utils/view/head:/maas-deploy-node

That script works around the inability to declare a need for an hwe-X kernel by setting the node's arch before deployment (LP: #1459762).

So it was partly operator error that got me here.
However, this is a very hittable problem for a user. The path to reproduce it is:
* import Ubuntu images for trusty, utopic, vivid. This gives the user the ability to set 'hwe-v'.
* set arch to 'amd64/hwe-v'
* deploy utopic

Then the system boots into enlistment, which resets the system's IPMI password, breaking MAAS.
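
The fall-through visible in clusterd.log above (a 500 on the node-specific grub.cfg request, followed by a request for grub.cfg-default-amd64) can be modelled roughly as follows. This is a toy sketch only, not MAAS or GRUB code: `first_served_config` and `fetch` are hypothetical names, and the "enlist" body simply stands in for whatever the default config contains.

```python
def first_served_config(candidates, fetch):
    """Return (name, body) for the first candidate that fetch() can serve."""
    for name in candidates:
        try:
            return name, fetch(name)
        except LookupError:
            # The request failed; the client moves on to the next filename.
            continue
    raise LookupError("no boot config could be served")


def fetch(name):
    # Hypothetical stand-in for the TFTP back-end: the node-specific
    # config fails (as with the 500 above), the default one is served.
    if name == "grub.cfg-ec:b1:d7:75:81:a0":
        raise LookupError("500 INTERNAL SERVER ERROR")
    if name == "grub.cfg-default-amd64":
        return "enlist"
    raise LookupError("not found")


name, body = first_served_config(
    ["grub.cfg-ec:b1:d7:75:81:a0", "grub.cfg-default-amd64"], fetch)
```

In this model a node that MAAS already knows about still ends up with the default (enlistment) config, simply because the error on its own config is treated like "no config for this node".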

description: updated

Andres Rodriguez (andreserl) on 2015-07-09

Changed in maas:
importance:	Undecided → Critical
milestone:	none → 1.9.0

Scott Moser (smoser) on 2015-07-09

summary:

- deploying node re-enlists regiond.log shows 'Unable to determine purpose
- for node'
+ deploying node re-enlists. regiond.log shows 'Unable to determine
+ purpose for node'

Lee Trager (ltrager) wrote on 2015-07-14:

So it looks like MAAS generates an invalid pxelinux.cfg file whenever you select a kernel newer than the base OS. If you inhibit the node from booting and grab pxelinux.cfg from MAAS by running

curl tftp://10.0.0.1/pxelinux.cfg/01-52-54-00-d9-e0-5e

You'll see

KERNEL ubuntu/amd64/hwe-v/utopic/no-such-image/boot-kernel
INITRD ubuntu/amd64/hwe-v/utopic/no-such-image/boot-initrd

I was able to reproduce this with hwe-u/trusty, hwe-v/trusty and hwe-v/utopic. The root cause of the incorrect path is that we do not have hwe kernels for Vivid, and Utopic doesn't have a hwe kernel for Trusty.

After the node is given an invalid kernel and initrd, it continues booting to disk. If the disk is blank, nothing boots. I haven't figured out why the enlistment code is being rerun.

I'm going to look into patching MAAS to throw an error when the user instructs MAAS to boot something which MAAS does not have boot files for.
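
A minimal sketch of the difference between the current behaviour and the proposed one, assuming the "no-such-image" path segment is a placeholder label substituted when no image matches. The `IMAGE_LABELS` table, the "release" label, and all function and class names here are hypothetical, not the MAAS implementation.

```python
# Hypothetical stand-in for the imported boot images:
# (osystem, arch, subarch, release) -> image label.
IMAGE_LABELS = {
    ("ubuntu", "amd64", "generic", "utopic"): "release",
}

PLACEHOLDER_LABEL = "no-such-image"


class NoBootImage(Exception):
    """Raised when no imported image matches the requested combination."""


def kernel_and_initrd(osystem, arch, subarch, release, strict=True):
    """Build the KERNEL and INITRD paths for a boot request.

    With strict=False a missing image yields the placeholder path seen
    above; with strict=True the request is rejected instead.
    """
    key = (osystem, arch, subarch, release)
    label = IMAGE_LABELS.get(key)
    if label is None:
        if strict:
            raise NoBootImage("no boot image for %s/%s/%s/%s" % key)
        label = PLACEHOLDER_LABEL
    base = "/".join(key + (label,))
    return base + "/boot-kernel", base + "/boot-initrd"
```

Calling `kernel_and_initrd("ubuntu", "amd64", "hwe-v", "utopic", strict=False)` gives the two paths shown in the pxelinux.cfg above, while the default strict mode raises `NoBootImage` before any config is handed to the node.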

Mike Pontillo (mpontillo) wrote on 2015-07-14:

It's likely that you see it boot back into the enlistment environment because it's simply still on disk from a previous enlistment.

In particular, since you're testing with KVM virtual machines, during 1.8 development we made the decision to set the boot order to [network, local_disk] in the virtual BIOS (KVM virtual machine settings, in this case).

Since every BIOS behaves differently (virtual or non-virtual), this issue may manifest itself similarly (or very differently) on other types of systems.

At the time, I asked if we wanted to recommend, require, or enforce that MAAS-managed machines *only* PXE boot after they become managed nodes. We decided that the safest bet was to set them to fall back to a local disk, because we didn't want deployed nodes to fail to boot due to the MAAS DHCP server being unavailable. (Clearly, there are other corner cases where this fallback cannot happen, and yet more corner cases where we have very little control over the boot order.)

So, in no particular order, I think the possible fixes are:

(1) Check whether we have the boot image the user is requesting before trying to boot
(2) Enforce boot order in a more refined manner (such as, when managing virtual machines, only include local disks after the machine has been deployed. *However*, we may not always have such fine-grained control.)

You might also try replacing "releases" with "daily" in your boot images path; there may be additional unreleased images you can try. (though I haven't checked if that's true for this particular case; Scott probably knows better.)

Andres Rodriguez (andreserl) on 2015-07-31

Changed in maas:
assignee:	nobody → Lee Trager (ltrager)
status:	New → Triaged
status:	Triaged → In Progress

Lee Trager (ltrager) wrote on 2015-08-02:

The hwe_backend branch validates that the specified kernel is available for the OS being deployed. In the reproduction case that Scott provided, MAAS will return the following error:

{"hwe_kernel": ["hwe-v is not avaliable for ubuntu/utopic on amd64"]}
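
For illustration, a check that returns an error of the same shape might look like the sketch below. It is not the code in the hwe_backend branch: `validate_hwe_kernel` and the `available` set are hypothetical, and the message wording is only approximate.

```python
def validate_hwe_kernel(hwe_kernel, osystem, release, arch, available):
    """Return a dict of field errors; an empty dict means the request is valid.

    `available` is a hypothetical set of (osystem, release, arch, kernel)
    tuples describing which kernels have been imported.
    """
    if (osystem, release, arch, hwe_kernel) in available:
        return {}
    message = "%s is not available for %s/%s on %s" % (
        hwe_kernel, osystem, release, arch)
    return {"hwe_kernel": [message]}


# Made-up import state: only hwe-t for trusty on amd64 is present.
available = {("ubuntu", "trusty", "amd64", "hwe-t")}
errors = validate_hwe_kernel("hwe-v", "ubuntu", "utopic", "amd64", available)
```

Rejecting the request at this point means the node never reaches PXE boot with a kernel MAAS cannot serve, so it cannot fall back into enlistment.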

MAAS Lander (maas-lander) on 2015-08-05

Changed in maas:
status:	In Progress → Fix Committed

Blake Rouse (blake-rouse) on 2015-08-25

no longer affects:

maas/1.8

Andres Rodriguez (andreserl) on 2016-01-05

Changed in maas:
status:	Fix Committed → Fix Released
