deploying node re-enlists. regiond.log shows 'Unable to determine purpose for node'

Bug #1473167 reported by Scott Moser
Affects   Status         Importance   Assigned to   Milestone
MAAS      Fix Released   Critical     Lee Trager

Bug Description

We upgraded yesterday:
python-maas-provisioningserver:amd64 (1.8.0~rc3+bzr4000-0ubuntu1~trusty1, 1.8.0+bzr4001-0ubuntu2~trusty1)

That upgrade is potentially the cause of this.

I have a node in MAAS. I deploy it via the UI or the command line and it goes into enlisting mode. The node enlists, seemingly successfully (although MAAS already knew about it). The enlistment process changes the IPMI password, so MAAS can no longer turn the node on or off.

/etc/maas/maas_cluster.conf has:
MAAS_URL="http://10.245.168.2/MAAS"
CLUSTER_UUID="9a4dbe50-1015-4fe1-92ab-d37c34052733"

/var/log/maas/clusterd.log shows:
2015-07-09 17:53:17+0000 [TFTP (UDP)] Datagram received from ('10.245.168.10', 25305): <RRQDatagram(filename=/grub/grub.cfg-ec:b1:d7:75:81:a0, mode=octet, options={'blksize': '1024', 'tsize': '0'})>
2015-07-09 17:53:17+0000 [HTTPPageGetter,client] Starting TFTP back-end failed.
        Traceback (most recent call last):
        Failure: twisted.web.error.Error: 500 INTERNAL SERVER ERROR

2015-07-09 17:53:17+0000 [TFTP (UDP)] Datagram received from ('10.245.168.10', 25306): <RRQDatagram(filename=/grub/grub.cfg-default-amd64, mode=octet, options={'blksize': '1024', 'tsize': '0'})>

/var/log/maas/maas-django.log shows:

ERROR 2015-07-09 17:53:17,798 maasserver Unable to determine purpose for node: 'horsea.dellstack'
ERROR 2015-07-09 17:53:17,800 maasserver ################################ Exception: (u"Unable to determine purpose for node: '%s'", u'horsea.dellstack') ################################
ERROR 2015-07-09 17:53:17,802 maasserver Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/pxeconfig.py", line 185, in pxeconfig
    if node is None or node.get_boot_purpose() == "commissioning":
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 1856, in get_boot_purpose
    preseed_type = get_deploying_preseed_type_for(self)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 379, in get_deploying_preseed_type_for
    purpose = get_available_purpose_for_node(purpose_order, node)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 348, in get_available_purpose_for_node
    "Unable to determine purpose for node: '%s'", node.fqdn)
PreseedError: (u"Unable to determine purpose for node: '%s'", u'horsea.dellstack')

regiond.log also has those errors.

2015-07-09 17:53:17 [maasserver] ERROR: Unable to determine purpose for node: 'horsea.dellstack'
2015-07-09 17:53:17 [maasserver] ERROR: ################################ Exception: (u"Unable to determine purpose for node: '%s'", u'horsea.dellstack') ################################
2015-07-09 17:53:17 [maasserver] ERROR: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/pxeconfig.py", line 185, in pxeconfig
    if node is None or node.get_boot_purpose() == "commissioning":
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 1856, in get_boot_purpose
    preseed_type = get_deploying_preseed_type_for(self)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 379, in get_deploying_preseed_type_for
    purpose = get_available_purpose_for_node(purpose_order, node)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 348, in get_available_purpose_for_node
    "Unable to determine purpose for node: '%s'", node.fqdn)
PreseedError: (u"Unable to determine purpose for node: '%s'", u'horsea.dellstack')


Ryan Beisner (1chb1n) wrote :

FYI, also see potentially-related bug https://bugs.launchpad.net/maas/+bug/1460097.

Scott Moser (smoser) wrote :

ok. I poked around a bit in maas shell

> from maasserver.models import Node as mnode
> horsea = [f for f in mnode.objects.all() if 'horsea' in f.__repr__()][0]
> horsea.architecture
u'amd64/hwe-w'
> horsea.get_distro_series()
trusty
> horsea.osystem
u'ubuntu'
> horsea.split_arch()
 (u'amd64', u'hwe-w')
> horsea.get_boot_purpose()
>
... stacktrace ...

To illustrate, we can make the underlying call directly:
get_boot_images_for(nodegroup=horsea.nodegroup, osystem=u'ubuntu', architecture=u'amd64', subarchitecture=u'hwe-w', series=u'vivid')

That also raises a stack trace.

The exception drops the node into the default boot mode, which is enlistment.
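
The fallback described above can be sketched as a small simulation. This is not MAAS or GRUB code: `BackendError`, `fake_backend` and `fetch_first_config` are hypothetical names, and the only facts taken from the logs are the two requested filenames and the 500 error on the first one.

```python
class BackendError(Exception):
    """Stands in for the '500 INTERNAL SERVER ERROR' seen in clusterd.log."""


def fake_backend(filename):
    # Hypothetical backend: the node-specific config fails because the
    # region raised PreseedError; the default config is always served.
    if filename == "/grub/grub.cfg-ec:b1:d7:75:81:a0":
        raise BackendError("500 INTERNAL SERVER ERROR")
    return "config for enlistment"


def fetch_first_config(candidates, backend):
    """Return (filename, config) for the first candidate the backend serves."""
    for filename in candidates:
        try:
            return filename, backend(filename)
        except BackendError:
            continue  # move on to the next, more generic file
    raise LookupError("no boot config available")


candidates = [
    "/grub/grub.cfg-ec:b1:d7:75:81:a0",  # node-specific request from the log
    "/grub/grub.cfg-default-amd64",      # default request from the log
]
served, config = fetch_first_config(candidates, fake_backend)
print(served)
```

Under these assumptions the node-specific request fails and the default config is what the node ends up booting, which matches the order of requests in clusterd.log.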

My system had 'amd64/hwe-w' set by my 'maas-deploy-node' script:
 http://bazaar.launchpad.net/~virtual-maasers/+junk/maas-libvirt-utils/view/head:/maas-deploy-node

That script works around the inability to declare the need for an hwe-X kernel by setting the node's arch before deployment (LP: #1459762).

So it was somewhat operator error that got me here.
However, a user can easily hit this problem. The steps to reproduce it are:
 * import Ubuntu images for trusty, utopic, vivid. This gives the user the ability to set 'hwe-v'.
 * set arch to 'amd64/hwe-v'
 * deploy utopic

Then the system boots into enlistment, which resets the system's IPMI password, breaking MAAS.

description: updated
Changed in maas:
importance: Undecided → Critical
milestone: none → 1.9.0
Scott Moser (smoser)
summary: - deploying node re-enlists regiond.log shows 'Unable to determine purpose
- for node'
+ deploying node re-enlists. regiond.log shows 'Unable to determine
+ purpose for node'
Lee Trager (ltrager) wrote :

So it looks like MAAS generates an invalid pxelinux.cfg file whenever you select a newer kernel than the base OS. If you inhibit the node from booting and grab pxelinux.cfg from MAAS by running

curl tftp://10.0.0.1/pxelinux.cfg/01-52-54-00-d9-e0-5e

You'll see

  KERNEL ubuntu/amd64/hwe-v/utopic/no-such-image/boot-kernel
  INITRD ubuntu/amd64/hwe-v/utopic/no-such-image/boot-initrd

I was able to reproduce this with hwe-u/trusty, hwe-v/trusty and hwe-v/utopic. The root cause of the incorrect path is that we do not have hwe kernels for Vivid, and Utopic doesn't have a hwe kernel for Trusty.
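
A minimal sketch of how a path like the one above could arise, assuming the image label is looked up per (OS, architecture, subarchitecture, series) and a placeholder is substituted when nothing matches. `boot_file_path`, the `available` mapping and the `release` label are hypothetical; only the path layout and the `no-such-image` placeholder come from the generated config.

```python
def boot_file_path(osystem, arch, subarch, series, filename, available):
    """Build a TFTP path; fall back to a placeholder label if no image exists.

    `available` is a hypothetical mapping from
    (osystem, arch, subarch, series) to an image label.
    """
    label = available.get((osystem, arch, subarch, series), "no-such-image")
    return "/".join([osystem, arch, subarch, series, label, filename])


# Hypothetical sample data: one image that exists, none for hwe-v/utopic.
available = {("ubuntu", "amd64", "generic", "trusty"): "release"}

good = boot_file_path("ubuntu", "amd64", "generic", "trusty", "boot-kernel", available)
bad = boot_file_path("ubuntu", "amd64", "hwe-v", "utopic", "boot-kernel", available)
print(good)
print(bad)
```

The point of the sketch is that a silent placeholder produces a syntactically valid but unusable path, so the failure only shows up when the node tries to fetch the kernel.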

After the node is given an invalid kernel and initrd, it continues on to boot from disk. If the disk is blank, nothing boots. I haven't figured out why the enlistment code is being rerun.

I'm going to look into patching MAAS to throw an error when the user instructs MAAS to boot something which MAAS does not have boot files for.

Mike Pontillo (mpontillo) wrote :

It's likely that you see it boot back into the enlistment environment because it's simply still on disk from a previous enlistment.

In particular, since you're testing with KVM virtual machines, during 1.8 development we made the decision to set the boot order to [network, local_disk] in the virtual BIOS (KVM virtual machine settings, in this case).

Since every BIOS behaves differently (virtual or non-virtual), this issue may manifest itself similarly (or very differently) on other types of systems.

At the time, I asked if we wanted to recommend, require, or enforce that MAAS-managed machines *only* PXE boot after they become managed nodes. We decided that the safest bet was to set them to fall back to a local disk, because we didn't want deployed nodes to fail to boot due to the MAAS DHCP server being unavailable. (Clearly, though, there are other corner cases where this fallback cannot happen, and yet more corner cases where we have very little control over the boot order.)

So, in no particular order, I think the possible fixes are:

(1) Check whether we have the boot image the user is requesting before trying to boot
(2) Enforce boot order in a more refined manner (such as, when managing virtual machines, only include local disks after the machine has been deployed. *However*, we may not always have such fine-grained control.)

You might also try replacing "releases" with "daily" in your boot images path; there may be additional unreleased images you can try. (though I haven't checked if that's true for this particular case; Scott probably knows better.)

Changed in maas:
assignee: nobody → Lee Trager (ltrager)
status: New → Triaged
status: Triaged → In Progress
Lee Trager (ltrager) wrote :

The hwe_backend branch validates that the specified kernel is available for the OS being deployed. In the reproduction case that Scott provided, MAAS will return the following error:

{"hwe_kernel": ["hwe-v is not avaliable for ubuntu/utopic on amd64"]}
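
The kind of check described here can be sketched as follows. This is not the code from the hwe_backend branch: `validate_hwe_kernel` and the `available` set are hypothetical, the sample data is made up, and the message wording is only modelled on the error shown above.

```python
def validate_hwe_kernel(hwe_kernel, osystem, series, arch, available):
    """Raise ValueError if the kernel has no boot image for this OS/series/arch.

    `available` is a hypothetical set of (osystem, series, arch, kernel) tuples.
    """
    if (osystem, series, arch, hwe_kernel) not in available:
        raise ValueError(
            "%s is not available for %s/%s on %s"
            % (hwe_kernel, osystem, series, arch)
        )
    return hwe_kernel


# Hypothetical sample data: only one kernel has been imported.
available = {("ubuntu", "trusty", "amd64", "hwe-t")}

try:
    validate_hwe_kernel("hwe-v", "ubuntu", "utopic", "amd64", available)
    message = None
except ValueError as error:
    message = str(error)
print(message)
```

Rejecting the request at this point means the node never reaches the PXE stage with a kernel path that cannot be served.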

Changed in maas:
status: In Progress → Fix Committed
no longer affects: maas/1.8
Changed in maas:
status: Fix Committed → Fix Released