Can't start a node with its associated cluster interface configured as Unmanaged

Bug #1382108 reported by Jason Hobbs on 2014-10-16
Affects    Importance    Assigned to
MAAS       Critical      Julian Edwards
1.7        Critical      Graham Binns

Bug Description

I have my cluster interface configured as "Unmanaged" - no DNS or DHCP should be configured. When I go to start a node via the UI, I get an error:

"Multiple failures encountered. See /var/log/maas/maas-django.log on the region server for more information."

The cluster daemon also crashes and in the UI I see:
"One or more clusters are currently disconnected. Visit the clusters page for more information."

In maas.log I see:
Oct 16 23:42:11 trusty-maas7 maas.dhcp: [ERROR] Could not create host map for 74:d4:35:89:ba:b5 with address 192.168.10.102: Command `omshell` returned non-zero exit status 0:#012> > > > not connected.#012> no open object.#012> no open object.#012> no open object.#012> no open object.#012> not connected.#012>

In pserv.log:
2014-10-16 23:45:18+0800 [-] Unhandled Error
 Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop
     self.runUntilCurrent()
   File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 797, in runUntilCurrent
     f(*a, **kw)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 423, in errback
     self._startRunCallbacks(fail)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 1020, in checkKnownErrors
     desc = str(error.value)
 exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\u2192' in position 18: ordinal not in range(128)

This is with 1.7.0~beta6+bzr3260-0ubuntu1~trusty1 - this used to work.

summary: - Can't start a node with its cluster configured as Unmanaged
+ Can't start a node with its associated cluster interface configured as
+ Unmanaged
Gavin Panella (allenap) wrote :

This comes from p.rpc.dhcp.create_host_maps: the CannotCreateHostMap
exception contains the Unicode character for an arrow. It's more than a
bit rubbish that Twisted's AMP implementation doesn't cope with
non-ASCII errors, but we can work around it at least.
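
A minimal sketch of that kind of workaround, written for Python 3 rather than the Python 2.7 shown in the traceback. It reproduces the ASCII encoding failure explicitly and shows one way to make an error description ASCII-safe before it is handed to code that cannot cope with it. The helper names and the sample message are hypothetical; this is not the actual MAAS fix.

```python
def fails_as_ascii(message):
    """Return True if encoding `message` as ASCII raises UnicodeEncodeError."""
    try:
        message.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


def ascii_safe_description(message):
    """Return `message` with non-ASCII characters replaced by backslash escapes."""
    return message.encode("ascii", "backslashreplace").decode("ascii")


# Hypothetical error text containing U+2192 (rightwards arrow).
sample = "host map 74:d4:35:89:ba:b5 \u2192 192.168.10.102"

print(fails_as_ascii(sample))
print(ascii_safe_description(sample))
```

Escaping keeps the original code point visible in the log, whereas `errors="replace"` would collapse it to `?`.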

Fwiw:

    Command `omshell` returned non-zero exit status 0:#012> > > > not
    connected.#012> no open object.#012> no open object.#012> no open
    object.#012> no open object.#012> not connected.#012>

is an example of the error message that comes out of __str__() and
__unicode__() on ExternalProcessError. The #012 bit is probably meant to
be octal for newline, but I'm not sure what's introducing it (syslog
maybe?).
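
For reading such log lines, here is a small sketch that turns the `#012` sequences back into real characters, assuming they are three-digit octal escapes (octal 012 is decimal 10, the newline). The function name is hypothetical and the sample string is shortened from the message above.

```python
import re

# Matches "#" followed by exactly three octal digits, e.g. "#012".
_OCTAL_ESCAPE = re.compile(r"#([0-7]{3})")


def decode_octal_escapes(line):
    """Replace each #NNN octal escape in `line` with the character it encodes."""
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


escaped = "exit status 0:#012> > > > not connected.#012> no open object."
print(decode_octal_escapes(escaped))
```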

tags: added: trivial
Changed in maas:
status: New → Triaged
importance: Undecided → High
Jason Hobbs (jason-hobbs) wrote :

Why is DHCP/omshell being invoked when the cluster interface is unmanaged?

Gavin Panella (allenap) wrote :

Good point! That's the important bug.

I was fixated on the UnicodeEncodeError, which I've now filed separately as bug 1382237.
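
To illustrate the kind of guard the question implies, here is a rough sketch that skips host map creation for interfaces that do not manage DHCP. Every name in it (`ClusterInterface`, the management constants, `interfaces_needing_host_maps`) is a hypothetical stand-in, not MAAS's actual model or API.

```python
from dataclasses import dataclass

# Hypothetical management modes for a cluster interface.
UNMANAGED = "unmanaged"
DHCP = "dhcp"
DHCP_AND_DNS = "dhcp_and_dns"


@dataclass
class ClusterInterface:
    """Hypothetical stand-in for a cluster interface record."""
    name: str
    management: str


def manages_dhcp(interface):
    """Return True only if this interface is configured to manage DHCP."""
    return interface.management in (DHCP, DHCP_AND_DNS)


def interfaces_needing_host_maps(interfaces):
    """Keep only the interfaces for which DHCP host maps should be created."""
    return [iface for iface in interfaces if manages_dhcp(iface)]


interfaces = [
    ClusterInterface("eth0", UNMANAGED),
    ClusterInterface("eth1", DHCP_AND_DNS),
]
print([iface.name for iface in interfaces_needing_host_maps(interfaces)])
```

With a check like this in place, starting a node on an unmanaged interface would never reach omshell.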

> The cluster daemon also crashes ...

It disconnects because Twisted's AMP implementation does that after passing an unrecognised exception over the wire. However, it should reconnect within 30 seconds at most.

Gavin Panella (allenap) on 2014-10-16
tags: added: dhcp
removed: trivial
Julian Edwards (julian-edwards) wrote :

Making it critical because it's a crash.

Changed in maas:
milestone: none → 1.7.0
importance: High → Critical
Christian Reis (kiko) wrote :

Oh, this is interesting; look at the change that Newell did for bug 1376888:

   https://code.launchpad.net/~newell-jensen/maas/bug-1376888/+merge/237861

I wonder if we missed taking unmanaged interfaces into account there. Can you check?

Newell Jensen (newell-jensen) wrote :

Kiko,

My branch was only for deleting nodes. However, the error here is very similar to the one we were getting when deleting nodes that had been created on a managed interface and were then deleted after the interface had been changed to unmanaged.

Changed in maas:
assignee: nobody → Julian Edwards (julian-edwards)
status: Triaged → In Progress
Changed in maas:
milestone: 1.7.0 → none
Changed in maas:
status: In Progress → Fix Committed
Julian Edwards (julian-edwards) wrote :

I started backporting to 1.7 and was amazed to see the code divergence already. Graham, are you planning on backporting your start_nodes change to 1.7?

Christian Reis (kiko) wrote :

I'm pretty sure no, he isn't, because it would not be accepted through at this point.

Julian Edwards (julian-edwards) wrote :

I honestly think it's worthwhile because it adds a set of very valuable cleanups, error catching and robustness. If I were in charge, I'd approve it as a necessary change ...

Andres Rodriguez (andreserl) wrote :

I agree here. If the fixes above plus gmb's branch fix crashes that affect the integrity of the release, these should be backported to 1.7.

Raphael, any comments as one of the gate keepers?

On 21 October 2014 03:04, Julian Edwards <email address hidden> wrote:
> I started backporting to 1.7 and was amazed to see the code divergence
> already. Graham, are you planning on backporting your start_nodes change
> to 1.7?

Argh, we missed talking about this at the meeting yesterday. I wasn't
going to push hard for it to be backported *but* landing it would also
fix some of the problems seen in bug 1375942 (e.g. old static IPs
being left allocated after a failed deployment). So yes, I say let's
do it.

On Wednesday 22 Oct 2014 07:31:23 you wrote:
> On 21 October 2014 03:04, Julian Edwards <email address hidden> wrote:
> > I started backporting to 1.7 and was amazed to see the code divergence
> > already. Graham, are you planning on backporting your start_nodes change
> > to 1.7?
>
> Argh, we missed talking about this at the meeting yesterday. I wasn't
> going to push hard for it to be backported *but* landing it would also
> fix some of the problems seen in bug 1375942 (e.g. old static IPs
> being left allocated after a failed deployment). So yes, I say let's
> do it.

There's some chat on the bug, basically everyone is for backporting apart from
Kiko. We need to pressga^Wconvince him.

Christian Reis (kiko) wrote :

The patch in this bug is totally acceptable, fwiw, if it does fix the problem.

Graham Binns (gmb) wrote :

We've agreed to backport just this patch (pulling it out of Node.start() in trunk and putting it in NodeManager.start_nodes() in 1.7). I'm working on that now.

Changed in maas:
status: Fix Committed → Fix Released

Hello Jason, or anyone else affected,

Accepted maas into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/maas/1.7.5+bzr3369-0ubuntu1~14.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Andres Rodriguez (andreserl) wrote :

This issue has been verified to work both on upgrade and fresh install, and has been QA'd. Marking verification-done.

tags: added: verification-done
removed: verification-needed