Power monitor service hits amp.TooLong errors with > ~600 nodes to a cluster

Bug #1389007 reported by Blake Rouse
This bug affects 1 person
Affects  Status        Importance  Assigned to    Milestone
MAAS     Fix Released  Critical    Gavin Panella
1.8      Fix Released  Critical    Gavin Panella

Bug Description

When setting up my MAAS demo of angular, I was creating nodes in batches of 50. Once my MAAS got somewhere above 600 nodes (I don't know the exact number, as I was not refreshing the page after every 50), the MAAS UI stopped working.

The only way to get it to work again was to disconnect the cluster from the region.

This error doesn't really help, because it does not give the actual call site. It looks like twisted is mangling that, but the region works when the cluster is disconnected, which should help narrow down the call.

ERROR 2014-11-03 09:05:11,673 twisted Amp server or network failure unhandled by client application. Dropping connection! To avoid, add errbacks to ALL remote commands!
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 797, in runUntilCurrent
    f(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 924, in _safeEmit
    aBox._sendTo(self.boxSender)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 577, in _sendTo
    proto.sendBox(self)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 2153, in sendBox
    self.transport.write(box.serialize())
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 555, in serialize
    raise TooLong(False, True, v, k)
twisted.protocols.amp.TooLong:

Tags: amp

Revision history for this message
Julian Edwards (julian-edwards) wrote :

I don't think this is too critical as we can scale to more clusters, but it is annoying for sure. We're aiming for up to ~2000 per cluster.

Changed in maas:
status: Confirmed → Triaged
Christian Reis (kiko)
Changed in maas:
milestone: 1.7.0 → 1.7.1
Revision history for this message
Graham Binns (gmb) wrote :

Grabbing this one as I'm intrigued; I have a couple of hypotheses, so I'll test them out and report back tomorrow.

Changed in maas:
assignee: nobody → Graham Binns (gmb)
Graham Binns (gmb)
Changed in maas:
status: Triaged → In Progress
Revision history for this message
Graham Binns (gmb) wrote :

So, things we can already tell before I start diving in:

> raise TooLong(False, True, v, k)

http://twistedmatrix.com/documents/8.2.0/api/twisted.protocols.amp.TooLong.html tells us that "One of the protocol's length limitations was violated." Looking at the definition of TooLong.__init__(), we can see that it was a value that was too long (the first parameter, isKey, is False, which means that the string was in a value position). No idea yet what the value actually was… going to insert some debugging code and see…
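
To make the argument order concrete, here is a minimal standalone sketch modelled on my reading of the length check in amp.py. The TooLong class and serialize() function below are simplified stand-ins written for illustration, not the Twisted code itself, and the b"nodes" key in the note after it is made up.

```python
import struct

MAX_KEY_LENGTH = 0xFF      # assumed limit for keys
MAX_VALUE_LENGTH = 0xFFFF  # largest length a 16-bit prefix can express


class TooLong(Exception):
    """Simplified stand-in for twisted.protocols.amp.TooLong."""

    def __init__(self, isKey, isLocal, value, keyName=None):
        Exception.__init__(self)
        self.isKey = isKey      # True if the oversized string was a key
        self.isLocal = isLocal  # True if it was encoded on this side
        self.value = value      # the string that was too long
        self.keyName = keyName  # for values: the key it belonged to


def serialize(box):
    """Encode a dict of bytes -> bytes as length-prefixed key/value pairs."""
    out = []
    for key, value in sorted(box.items()):
        if len(key) > MAX_KEY_LENGTH:
            raise TooLong(True, True, key, None)
        if len(value) > MAX_VALUE_LENGTH:
            # Same argument pattern as the line in the traceback.
            raise TooLong(False, True, value, key)
        for item in (key, value):
            out.append(struct.pack("!H", len(item)))  # 2-byte big-endian length
            out.append(item)
    out.append(struct.pack("!H", 0))  # a zero-length key terminates the box
    return b"".join(out)
```

With this sketch, serialize({b"nodes": b"x" * 90256}) raises TooLong with isKey False and keyName b"nodes", which is the same shape as the error in the traceback.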

Revision history for this message
Graham Binns (gmb) wrote :

So, I'm up to 805 nodes and everything's peachy. To be expected — I haven't tried to do anything with them that would require an RPC call… Blake, you don't specify whether you were doing anything with the nodes, but ISTM that you must have been.

Going to try a commissioning run (as much as can be tried on imaginary nodes, anyway) and see what happens.

Revision history for this message
Graham Binns (gmb) wrote :

Aha! I hit the error as soon as I start to accept enlistment of the new bogus nodes. Okay, time to dig deeper.

Revision history for this message
Graham Binns (gmb) wrote :

I got as far as identifying this as being a problem with the ListNodePowerParameters RPC call, which is made by the cluster's power monitor service to the region.

With Gavin's help I found that the result of that call (with ~800 nodes) is too big: 90256 characters. AMP, which currently uses 16-bit length prefixes, supports a maximum of 65535 bytes in a value.
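
As a rough cross-check of those numbers against the original report, the arithmetic can be sketched as below. It assumes every node contributes about the same number of bytes to the response, which is only an approximation, and estimate_node_limit is a name made up for this example.

```python
MAX_VALUE_LENGTH = 0xFFFF  # 65535, the largest length a 16-bit prefix can carry


def estimate_node_limit(total_bytes, node_count, limit=MAX_VALUE_LENGTH):
    """Estimate how many nodes fit in one value, assuming uniform node size."""
    bytes_per_node = total_bytes / float(node_count)
    return int(limit // bytes_per_node)


# 90256 bytes for ~800 nodes is about 113 bytes per node.
print(estimate_node_limit(90256, 800))  # prints 580
```

An estimate of roughly 580 nodes is in the same range as the point at which the original report says things broke.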

I spoke to Gavin about this and he's going to look into making AMP support 32-bit value strings, so I'm assigning this bug to him.

Changed in maas:
assignee: Graham Binns (gmb) → Gavin Panella (allenap)
summary: - cannot handle >600 nodes
+ Power monitor service hits amp.TooLong errors with > ~600 nodes to a
+ cluster
Revision history for this message
Andres Rodriguez (andreserl) wrote :

so that means we will be able to handle ~1600 nodes?

Revision history for this message
Graham Binns (gmb) wrote : Re: [Bug 1389007] Re: Power monitor service hits amp.TooLong errors with > ~600 nodes to a cluster

On 12 November 2014 12:58, Andres Rodriguez <email address hidden> wrote:
> so that means we will be able to handle ~1600 nodes?

This wins for most beautifully ironic bug comment this week.

I think we need to come up with a plan for chunking big blobs like
this. After all, the cluster doesn't know how many nodes it manages
until the region tells it, so there shouldn't be a bottleneck on
telling-the-cluster-what-its-nodes-are.

(Worth noting too that this is the *only* place (AFAIK) where we batch
up node power options like this; everywhere else we do power stuff one
node at a time.)
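
One possible shape for that chunking, as a sketch only: split the node list into JSON-encoded batches that each stay under the AMP value limit. The node fields (system_id, power_type, context) and the chunk_nodes name are hypothetical, and this is not necessarily how the fix ended up being implemented.

```python
import json

MAX_VALUE_LENGTH = 0xFFFF  # AMP value limit in bytes


def chunk_nodes(nodes, limit=MAX_VALUE_LENGTH):
    """Yield JSON-encoded lists of nodes, each at most `limit` bytes long."""
    batch, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for node in nodes:
        item = json.dumps(node, separators=(",", ":")).encode("utf-8")
        if len(item) + 2 > limit:
            raise ValueError("a single node exceeds the size limit")
        extra = len(item) + (1 if batch else 0)  # comma between items
        if size + extra > limit:
            # The current batch is full: emit it and start a new one.
            yield b"[" + b",".join(batch) + b"]"
            batch, size = [], 2
            extra = len(item)
        batch.append(item)
        size += extra
    if batch:
        yield b"[" + b",".join(batch) + b"]"
```

Each yielded chunk could then travel as the value of a separate response, so the number of nodes no longer determines whether a single message fits.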

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Wednesday 12 Nov 2014 13:09:26 you wrote:
> On 12 November 2014 12:58, Andres Rodriguez <email address hidden> wrote:
> > so that means we will be able to handle ~1600 nodes?
>
> This wins for most beautifully ironic bug comment this week.
>
> I think we need to come up with a plan for chunking big blobs like
> this. After all, the cluster doesn't know how many nodes it manages
> until the region tells it, so there shouldn't be a bottleneck on
> telling-the-cluster-what-its-nodes-are.
>
> (Worth noting too that this is the *only* place (AFAIK) where we batch
> up node power options like this; everywhere else we do power stuff one
> node at a time.)

I think that paging the responses is a good approach to take generally;
perhaps this particular message needs a change to do that.
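
From the caller's side, paging could look something like the sketch below. The fetch_page(offset, limit) callable is a hypothetical stand-in for whatever RPC call would return one page; nothing here is an existing MAAS or AMP API.

```python
def fetch_all(fetch_page, page_size=100):
    """Yield every item by requesting pages until an empty page comes back."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        for item in page:
            yield item
        offset += len(page)


def make_fake_region(nodes):
    """Build an in-memory stand-in for the remote side, for illustration."""
    def fetch_page(offset, limit):
        return nodes[offset:offset + limit]
    return fetch_page
```

For example, list(fetch_all(make_fake_region(nodes), page_size=100)) returns all of nodes in order. The trade-off is one round trip per page instead of a single large response.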

Changed in maas:
milestone: 1.7.1 → 1.7.2
Changed in maas:
milestone: 1.7.2 → 1.7.3
Changed in maas:
milestone: 1.7.3 → 1.9.0
Gavin Panella (allenap)
Changed in maas:
status: In Progress → Fix Committed
Gavin Panella (allenap)
tags: added: amp
Changed in maas:
status: Fix Committed → Fix Released