MAAS

Power monitor service hits amp.TooLong errors with > ~600 nodes to a cluster


Bug #1389007 reported by Blake Rouse on 2014-11-03

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
MAAS	Fix Released	Critical	Gavin Panella	MAAS 1.9.0
1.8	Fix Released	Critical	Gavin Panella	MAAS 1.8.1

Bug Description

While setting up my MAAS demo of Angular, I was creating nodes in batches of 50. Once my MAAS got somewhere above 600 nodes (I don't know the exact number, as I was not refreshing the page after every batch), the MAAS UI stopped working.

The only way to get it to work again was to disconnect the cluster from the region.

The error below doesn't really help, because it does not show the actual call site; it looks like Twisted is mangling that. However, the region works once the cluster is disconnected, which should help narrow down the call.

ERROR 2014-11-03 09:05:11,673 twisted Amp server or network failure unhandled by client application. Dropping connection! To avoid, add errbacks to ALL remote commands!
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 797, in runUntilCurrent
    f(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 924, in _safeEmit
    aBox._sendTo(self.boxSender)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 577, in _sendTo
    proto.sendBox(self)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 2153, in sendBox
    self.transport.write(box.serialize())
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 555, in serialize
    raise TooLong(False, True, v, k)
twisted.protocols.amp.TooLong:

Related branches

lp:~allenap/maas/more-better-things--bug-1389007

Rejected for merging into lp:~maas-committers/maas/trunk

Jeroen T. Vermeulen (community): Approve on 2014-11-12

Christian Reis (community): Needs Information on 2014-11-12

lp:~allenap/maas/power-poll-fewer--bug-1389007

Merged into lp:~maas-committers/maas/trunk at revision 4076

Gavin Panella (community): Approve on 2015-07-07

lp:~allenap/maas/power-poll-fewer--bug-1389007--1.8

Merged into lp:maas/1.8 at revision 4022

Gavin Panella (community): Approve on 2015-07-15

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2014-11-03:

#1

I don't think this is too critical as we can scale to more clusters, but it is annoying for sure. We're aiming for up to ~2000 per cluster.

Changed in maas:
status:	Confirmed → Triaged

Christian Reis (kiko) on 2014-11-05

Changed in maas:
milestone:	1.7.0 → 1.7.1

Graham Binns (gmb) wrote on 2014-11-10:

#2

Grabbing this one as I'm intrigued; I have a couple of hypotheses, so I'll test them out and report back tomorrow.

Changed in maas:
assignee:	nobody → Graham Binns (gmb)

Graham Binns (gmb) on 2014-11-11

Changed in maas:
status:	Triaged → In Progress

Graham Binns (gmb) wrote on 2014-11-11:

#3

So, things we can already tell before I start diving in:

> raise TooLong(False, True, v, k)

http://twistedmatrix.com/documents/8.2.0/api/twisted.protocols.amp.TooLong.html tells us that "One of the protocol's length limitations was violated.". Looking at the definition for TooLong.__init__(), we can see that this was a value that was too long (the first parameter is False, which means that the string was in a value position). No idea yet what the value actually was… going to insert some debugging code and see…
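For reference, the length check behind this error can be sketched roughly as follows. This is a simplified stand-alone model of AMP box serialization, not Twisted's own code: `serialize_box` and the local `TooLong` class are illustrative stand-ins, and the limits assumed here are 255 bytes for keys and 65535 bytes for values.

```python
import struct

MAX_KEY_LENGTH = 0xFF       # AMP keys: at most 255 bytes
MAX_VALUE_LENGTH = 0xFFFF   # AMP values: at most 65535 bytes (16-bit length prefix)


class TooLong(Exception):
    """Stand-in for twisted.protocols.amp.TooLong (same argument order)."""

    def __init__(self, is_key, is_local, value, key_name=None):
        super().__init__(is_key, is_local, key_name)
        self.is_key = is_key
        self.is_local = is_local
        self.value = value
        self.key_name = key_name


def serialize_box(box):
    """Encode a dict of bytes -> bytes as length-prefixed key/value pairs."""
    out = []
    for key, value in sorted(box.items()):
        if len(key) > MAX_KEY_LENGTH:
            raise TooLong(True, True, key, None)
        if len(value) > MAX_VALUE_LENGTH:
            # First argument False: the oversized string is a value, not a key.
            raise TooLong(False, True, value, key)
        for part in (key, value):
            out.append(struct.pack("!H", len(part)))
            out.append(part)
    out.append(struct.pack("!H", 0))  # an empty key terminates the box
    return b"".join(out)
```

In this model, any single value longer than 65535 bytes fails in the same way as the traceback above, with the first argument False and the offending key passed as the last argument.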

Graham Binns (gmb) wrote on 2014-11-11:

#4

So, I'm up to 805 nodes and everything's peachy. To be expected — I haven't tried to do anything with them that would require an RPC call… Blake, you don't specify whether you were doing anything with the nodes, but ISTM that you must have been.

Going to try a commissioning run (as much as can be tried on imaginary nodes, anyway) and see what happens.

Graham Binns (gmb) wrote on 2014-11-11:

#5

Aha! I hit the error as soon as I start to accept enlistment of the new bogus nodes. Okay, time to dig deeper.

Graham Binns (gmb) wrote on 2014-11-11:

#6

I got as far as identifying this as being a problem with the ListNodePowerParameters RPC call, which is made by the cluster's power monitor service to the region.

With Gavin's help I found that the result of that call (with ~800 nodes) is too big: 90256 characters. AMP, which currently uses 16-bit length prefixes, supports a maximum of 65535 bytes in a value.

I spoke to Gavin about this and he's going to look into making AMP support 32-bit value strings, so I'm assigning this bug to him.
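As a rough sanity check, the figures in this comment line up with the "~600 nodes" in the bug title. The estimate below assumes the response grows linearly with the node count and that each character is one byte; the node count of 800 is itself approximate.

```python
MAX_VALUE_LENGTH = 65535   # largest value a 16-bit length prefix allows
observed_size = 90256      # response size reported in this comment
observed_nodes = 800       # "~800 nodes", so this is approximate

# Average payload per node, then the node count at which the limit is hit.
bytes_per_node = observed_size / observed_nodes
max_nodes = int(MAX_VALUE_LENGTH // bytes_per_node)

print(f"{bytes_per_node:.1f} bytes per node, limit reached near {max_nodes} nodes")
# prints "112.8 bytes per node, limit reached near 580 nodes"
```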

Changed in maas:
assignee:	Graham Binns (gmb) → Gavin Panella (allenap)
summary:	- cannot handle >600 nodes
+ Power monitor service hits amp.TooLong errors with > ~600 nodes to a cluster

Andres Rodriguez (andreserl) wrote on 2014-11-12:

#7

so that means we will be able to handle ~1600 nodes?

Graham Binns (gmb) wrote on 2014-11-12: Re: [Bug 1389007] Re: Power monitor service hits amp.TooLong errors with > ~600 nodes to a cluster

#8

On 12 November 2014 12:58, Andres Rodriguez <email address hidden> wrote:
> so that means we will be able to handle ~1600 nodes?

This wins for most beautifully ironic bug comment this week.

I think we need to come up with a plan for chunking big blobs like
this. After all, the cluster doesn't know how many nodes it manages
until the region tells it, so there shouldn't be a bottleneck on
telling-the-cluster-what-its-nodes-are.

(Worth noting too that this is the *only* place (AFAIK) where we batch
up node power options like this; everywhere else we do power stuff one
node at a time.)
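One way the chunking idea could look, as a sketch only: split the node list greedily so that each JSON-encoded chunk fits in a single AMP value. The helper name `chunk_by_encoded_size`, the JSON encoding, and the node fields in the usage example are assumptions for illustration, not taken from MAAS.

```python
import json

MAX_VALUE_LENGTH = 65535  # assumed per-value limit, in bytes


def chunk_by_encoded_size(items, max_bytes=MAX_VALUE_LENGTH):
    """Yield lists of items whose JSON encoding fits within max_bytes each."""
    chunk, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for item in items:
        item_size = len(json.dumps(item).encode("utf-8"))
        if item_size + 2 > max_bytes:
            raise ValueError("a single item exceeds the size limit")
        # Each item after the first costs 2 extra bytes for the ", " separator.
        extra = item_size + (2 if chunk else 0)
        if size + extra > max_bytes:
            yield chunk
            chunk, size = [], 2
            extra = item_size
        chunk.append(item)
        size += extra
    if chunk:
        yield chunk
```

Hypothetical usage, with made-up node records:

```python
nodes = [{"system_id": "node-%04d" % i, "power_type": "virsh"} for i in range(800)]
for batch in chunk_by_encoded_size(nodes):
    payload = json.dumps(batch).encode("utf-8")  # each payload is <= 65535 bytes
```

The sender would then issue one call per batch instead of a single call carrying every node.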

Julian Edwards (julian-edwards) wrote on 2014-11-12:

#9

On Wednesday 12 Nov 2014 13:09:26 you wrote:
> On 12 November 2014 12:58, Andres Rodriguez <email address hidden> wrote:
> > so that means we will be able to handle ~1600 nodes?
>
> This wins for most beautifully ironic bug comment this week.
>
> I think we need to come up with a plan for chunking big blobs like
> this. After all, the cluster doesn't know how many nodes it manages
> until the region tells it, so there shouldn't be a bottleneck on
> telling-the-cluster-what-its-nodes-are.
>
> (Worth noting too that this is the *only* place (AFAIK) where we batch
> up node power options like this; everywhere else we do power stuff one
> node at a time.)

I think that paging the responses is a good approach to take generally,
perhaps this particular message needs a change to do that.

Andres Rodriguez (andreserl) on 2014-12-11

Changed in maas:
milestone:	1.7.1 → 1.7.2

Andres Rodriguez (andreserl) on 2015-03-03

Changed in maas:
milestone:	1.7.2 → 1.7.3

Andres Rodriguez (andreserl) on 2015-06-30

Changed in maas:
milestone:	1.7.3 → 1.9.0

Gavin Panella (allenap) on 2015-07-09

Changed in maas:
status:	In Progress → Fix Committed

Gavin Panella (allenap) on 2015-09-02

tags:

added: amp

Andres Rodriguez (andreserl) on 2016-01-05

Changed in maas:
status:	Fix Committed → Fix Released
