pserv crashing while doing TFTP for HWE images

Bug #1382281 reported by Julian Edwards on 2014-10-17
Affects  Status  Importance  Assigned to  Milestone
MAAS             Critical    Blake Rouse
1.7              Critical    Blake Rouse

Bug Description

While booting over PXE:

2014-10-17 10:44:38+1000 [ClusterClient,client] Cluster '305014a2-21c7-48c5-90e8-25944a9bd355' registered (via maas:pid=14386).
2014-10-17 10:44:38+1000 [ClusterClient,client] Cluster '305014a2-21c7-48c5-90e8-25944a9bd355' registered (via maas:pid=14386).
2014-10-17 10:44:39+1000 [ClusterClient,client] Cluster '305014a2-21c7-48c5-90e8-25944a9bd355' registered (via maas:pid=14385).
2014-10-17 10:44:39+1000 [ClusterClient,client] Cluster '305014a2-21c7-48c5-90e8-25944a9bd355' registered (via maas:pid=14385).
2014-10-17 10:44:42+1000 [TFTP (UDP)] Datagram received from ('10.0.0.200', 2073): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'tsize': '0'})>
2014-10-17 10:44:42+1000 [TFTP (UDP)] Datagram received from ('10.0.0.200', 2073): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'tsize': '0'})>
2014-10-17 10:44:42+1000 [ClusterClient,client] Unhandled error in Deferred:
2014-10-17 10:44:42+1000 [ClusterClient,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.protocols.amp.UnknownRemoteError: Code<UNKNOWN>: Unknown Error

2014-10-17 10:44:42+1000 [ClusterClient,client] Logged OOPS id OOPS-13d2930ff8eeb73ead79e9ad8e702c12: No exception type: No exception value
2014-10-17 10:44:42+1000 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 55012) PEER:IPv4Address(TCP, u'127.0.0.1', 37733))
2014-10-17 10:44:42+1000 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 55012) PEER:IPv4Address(TCP, u'127.0.0.1', 37733))
2014-10-17 10:44:42+1000 [ClusterClient,client] Logged OOPS id OOPS-52fd50c37059cecb06c71d34415649fd: UnknownRemoteError: Code<UNKNOWN>: Unknown Error
2014-10-17 10:44:50+1000 [TFTP (UDP)] Datagram received from ('10.0.0.200', 2074): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'tsize': '0'})>
2014-10-17 10:44:50+1000 [TFTP (UDP)] Datagram received from ('10.0.0.200', 2074): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'tsize': '0'})>
2014-10-17 10:44:50+1000 [ClusterClient,client] Unhandled error in Deferred:
2014-10-17 10:44:50+1000 [ClusterClient,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.protocols.amp.UnknownRemoteError: Code<UNKNOWN>: Unknown Error

The oops has no more information in it than this either.

Julian Edwards (julian-edwards) wrote :

This is with a package built from the very latest trunk as of the date of this bug.

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 1.7.0
Julian Edwards (julian-edwards) wrote :

Ah here we go in maas-django.log:

ERROR 2014-10-17 10:46:51,167 twisted Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 196, in _worker
    result = context.call(ctx, function, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 143, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/utils/async.py", line 153, in call_within_transaction
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/rpc/nodes.py", line 159, in request_node_info_by_mac_address
    return (node, node.get_boot_purpose())
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 1530, in get_boot_purpose
    preseed_type = get_preseed_type_for(self)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 352, in get_preseed_type_for
    purpose = get_available_purpose_for_node(purpose_order, node)
  File "/usr/lib/python2.7/dist-packages/maasserver/preseed.py", line 330, in get_available_purpose_for_node
    "Unable to determine purpose for node: '%s'", node.fqdn)
maasserver.exceptions.PreseedError: (u"Unable to determine purpose for node: '%s'", u'nuc1.maas')
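
A side observation on the last line of that traceback: the message is shown as a tuple with a literal '%s', which suggests the format string and node.fqdn were passed to the exception as two separate arguments instead of being interpolated first. The snippet below is only a sketch of that Python behaviour, using a stand-in PreseedError class (an assumption, not the real maasserver.exceptions.PreseedError):

```python
class PreseedError(Exception):
    """Stand-in for maasserver.exceptions.PreseedError (assumption)."""


fqdn = "nuc1.maas"
template = "Unable to determine purpose for node: '%s'"

# Template and value passed as separate arguments: the exception stores
# both in .args and str() renders the whole tuple, '%s' included.
uninterpolated = PreseedError(template, fqdn)

# Template interpolated before constructing the exception: str() renders
# the finished sentence.
interpolated = PreseedError(template % fqdn)

print(str(uninterpolated))
print(str(interpolated))
```

The first form matches the shape of the message logged in maas-django.log; the second is what a reader of the log would probably expect to see.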

Julian Edwards (julian-edwards) wrote :

I think this is related to bug 1378527 because it only happens when I use the precise image that was downloaded. That bug says that maas is functional but I don't think it is.

Regardless of the underlying bug, the PreseedError has not been caught before AMP tried to serialise it, so that needs handling correctly too.
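
To illustrate what "handling correctly" could mean here: in Twisted's AMP, a command can declare the exceptions it expects in its errors mapping, and anything undeclared reaches the caller as UnknownRemoteError with code UNKNOWN, which is what pserv logged. The following is a Twisted-free sketch of that pattern only. DECLARED_ERRORS, respond, the PRESEED_ERROR code and both handler functions are hypothetical names for illustration, not MAAS or Twisted APIs:

```python
class PreseedError(Exception):
    """Stand-in for maasserver.exceptions.PreseedError (assumption)."""


# Hypothetical table of expected errors, playing the role that an AMP
# command's `errors` mapping plays in Twisted.
DECLARED_ERRORS = {PreseedError: "PRESEED_ERROR"}


def respond(handler, *args):
    """Run handler and return a serialisable (status, payload) pair."""
    try:
        return ("ok", handler(*args))
    except Exception as error:
        for exc_type, code in DECLARED_ERRORS.items():
            if isinstance(error, exc_type):
                # Declared error: the caller gets a specific code and text.
                return ("error", {"code": code, "description": str(error)})
        # Undeclared error: all detail is lost, as in the pserv log.
        return ("error", {"code": "UNKNOWN", "description": "Unknown Error"})


def get_boot_purpose(fqdn):
    """Hypothetical handler that fails the way the traceback shows."""
    raise PreseedError("Unable to determine purpose for node: '%s'" % fqdn)


def undeclared_failure(fqdn):
    """Hypothetical handler that raises something nobody declared."""
    raise KeyError(fqdn)


print(respond(get_boot_purpose, "nuc1.maas"))
print(respond(undeclared_failure, "nuc1.maas"))
```

With the error declared, the cluster side would at least see which node failed and why, instead of a bare Unknown Error.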

Christian Reis (kiko) wrote :

Agreed, but AIUI PreseedError would not happen if bug 1378527 were fixed. If that is the case, then this doesn't need to land in 1.7.

Christian Reis (kiko) on 2014-10-17
summary: - pserv crashing while doing TFTP
+ pserv crashing while doing TFTP for HWE images
Julian Edwards (julian-edwards) wrote :

I think we need to establish whether bug 1378527 is indeed the cause of this before making that assumption. Blake, can you comment please?

Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Blake Rouse (blake-rouse) wrote :

There were some serious issues with the BootImageMapping overriding the correct boot resource for HWE images. I am working on a fix for it at the moment, and am also looking into fixing an issue with using the correct kernel for HWE xinstall. Let's see if this resolves this issue.

I have a feeling it won't, as it should not affect determining the purpose for a node. But I will let you know once it has landed to give it a test.

Blake Rouse (blake-rouse) wrote :

Once this is merged into 1.7, give it another try to see if this fixes your issue.

https://code.launchpad.net/~blake-rouse/maas/fix-hwe-images-1.7/+merge/239003

Blake Rouse (blake-rouse) wrote :

It has been merged into 1.7, please give it a test to see if this issue occurs again.

Blake Rouse (blake-rouse) wrote :

I have tested deploying precise/generic, precise/hwe-p, precise/hwe-s, trusty/generic and have not had this issue.

Can anyone else check to see if they can reproduce this?

[testing hat] I'll give it a test shortly!
[release manager hat] Can you attach your branch to the bug please!

Confirmed working after I re-imported images (it failed if I didn't re-import).

Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
milestone: 1.7.0 → next
Changed in maas:
milestone: next → none
Changed in maas:
status: Fix Committed → Fix Released
Andres Rodriguez (andreserl) wrote :

This bug was filed against upstream MAAS, not Ubuntu. It was fixed as part of 1.7, and the fix has been verified to work in all Ubuntu releases. MAAS 1.7 is being SRU'd. Marking this as verification-done, as it seems to be blocking the SRU.

tags: added: verification-done