Tried to allocate an IP to MAC but its cluster interface is not known

Bug #1378479 reported by Andreas Hasenack
This bug affects 3 people
Affects: MAAS
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

I upgraded to 1.7.0~beta5+bzr3199+3164+318~ppa0~ubuntu14.04.1 and now I can't bring up nodes successfully.

This pastebin has a console log of what happens when I click on "start node":

https://pastebin.canonical.com/118355/

Basically it comes up, downloads an image, installs it, reboots, and then the network doesn't work anymore.

In the maas logs, the only interesting thing that I could find was this:

==> maas.log <==
Oct 7 18:20:05 atlas maas.macaddress: [ERROR] mino: Tried to allocate an IP to MAC 2c:59:e5:3b:f0:fe but its cluster interface is not known
Oct 7 18:20:05 atlas maas.macaddress: [ERROR] mino: Tried to allocate an IP to MAC 2c:59:e5:3b:f0:ff but its cluster interface is not known

==> maas-django.log <==
ERROR 2014-10-07 18:23:53,800 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/local): cluster 'scapestack' does not have matching boot image.

"mino" has 4 nics, and two are connected to the maas-eth0 network which is fully managed by maas (dns + dhcp). The two MAC addresses above are from the two NICs that are connected elsewhere or not connected.
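
To check which NICs trigger the message, the log lines can be grouped by node and MAC. This is only a rough sketch: the regular expression is inferred from the two maas.log lines quoted above rather than from the MAAS source, and `unknown_interface_macs` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample maas.log lines in this report;
# it is an assumption about the format, not taken from the MAAS code.
PATTERN = re.compile(
    r"maas\.macaddress: \[ERROR\] (?P<node>\S+): "
    r"Tried to allocate an IP to MAC (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) "
    r"but its cluster interface is not known",
    re.IGNORECASE,
)


def unknown_interface_macs(lines):
    """Return {node name: sorted list of MACs named in the error}."""
    found = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found[match.group("node")].add(match.group("mac").lower())
    return {node: sorted(macs) for node, macs in found.items()}


if __name__ == "__main__":
    prefix = "Oct 7 18:20:05 atlas maas.macaddress: [ERROR] mino: "
    suffix = " but its cluster interface is not known"
    sample = [
        prefix + "Tried to allocate an IP to MAC 2c:59:e5:3b:f0:fe" + suffix,
        prefix + "Tried to allocate an IP to MAC 2c:59:e5:3b:f0:ff" + suffix,
    ]
    print(unknown_interface_macs(sample))
```

Comparing the result with the MACs of the NICs that are actually on the managed network shows whether the errors are limited to the unmanaged ones.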

This was a working cluster before the upgrade, and images are there in the "Images" tab.

Tags: landscape
summary: - Upgraded to 1.7.0~beta5+bzr3199+3164+318~ppa0~ubuntu14.04.1 can't reach nodes
+ Upgraded to 1.7.0~beta5+bzr3199+3164+318~ppa0~ubuntu14.04.1, can't reach nodes
Raphaël Badin (rvb) wrote : Re: Upgraded to 1.7.0~beta5+bzr3199+3164+318~ppa0~ubuntu14.04.1, can't reach nodes

Hang on, I'm not sure this is a valid bug. I assume this is a result of the testing you've done on a package built from https://code.launchpad.net/~rvb/maas/fix-etc-hosts-bug-1087183-2/+merge/236709, correct? (See the MP for details.)

Changed in maas:
status: New → Invalid
Andreas Hasenack (ahasenack) wrote :

This is also happening with 1.7.0~beta5+bzr3198-0ubuntu1~trusty1. I just installed fresh. After the OS is installed, the booted system doesn't get a network. I don't know what is going on. Could it be something broken with the images?

Cloud-init v. 0.7.5 running 'init' at Wed, 08 Oct 2014 01:06:59 +0000. Up 15.58 seconds.
ci-info: +++++++++++++++++++++++Net device info++++++++++++++++++++++++
ci-info: +--------+-------+-----------+-----------+-------------------+
ci-info: | Device |   Up  |  Address  |    Mask   |     Hw-Address    |
ci-info: +--------+-------+-----------+-----------+-------------------+
ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |         .         |
ci-info: |  eth3  | False |     .     |     .     | 2c:59:e5:3b:c0:e7 |
ci-info: |  eth2  | False |     .     |     .     | 2c:59:e5:3b:c0:e6 |
ci-info: |  eth1  | False |     .     |     .     | 2c:59:e5:3b:c0:e5 |
ci-info: |  eth0  | False |     .     |     .     | 2c:59:e5:3b:c0:e4 |
ci-info: +--------+-------+-----------+-----------+-------------------+
ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Andreas Hasenack (ahasenack) wrote :

Here is the whole console output from a node since the enlisting step:

https://pastebin.canonical.com/118383/

Changed in maas:
status: Invalid → New
Andreas Hasenack (ahasenack) wrote :

Marking it "new" again because it happened on a fresh install from the testing ppa, version 1.7.0~beta5+bzr3198-0ubuntu1~trusty1.

Christian Reis (kiko) wrote :

Andreas, you also posted this:

<andreas> I can't bring up nodes anymore
<andreas> the OS is installed, and the subsequent boot fails to bring up the network
<andreas> ci-info: | eth0 | False | . | . | 2c:59:e5:3b:c0:e4 |
<andreas> ci-info: +--------+-------+-----------+-----------+-------------------+
<andreas> ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Can you go back to a known-good version and confirm that, with no external changes (no changes to the images, for instance), MAAS works there but stopped working as of the current testing PPA beta or earlier?

Changed in maas:
status: New → Incomplete
Christian Reis (kiko) wrote :

The message you reported in maas-django.log is likely to be what Blake is addressing in bug 1376028, btw.

tags: added: landscape
summary: - Upgraded to 1.7.0~beta5+bzr3199+3164+318~ppa0~ubuntu14.04.1, can't reach
- nodes
+ Tried to allocate an IP to MAC but its cluster interface is not known
Adam Collard (adam-collard) wrote :

Seen again, with the same message:

Oct 8 09:33:39 atlas maas.macaddress: [ERROR] shawmut: Tried to allocate an IP to MAC 2c:59:e5:3b:c0:06 but its cluster interface is not known

Andreas Hasenack (ahasenack) wrote :

I found older maas packages and installed them. After I downgraded to 1.7.0~beta4+bzr3162-0ubuntu1~ppa1, it started working again.

Christian Reis (kiko)
Changed in maas:
milestone: none → 1.7.0
Changed in maas:
status: Incomplete → Confirmed
Julian Edwards (julian-edwards) wrote :

If you recommission, does it work afterwards?

Changed in maas:
status: Confirmed → Incomplete
Julian Edwards (julian-edwards) wrote : Re: [Bug 1378479] Re: Tried to allocate an IP to MAC but its cluster interface is not known

On Wednesday 08 Oct 2014 13:17:03 you wrote:
> I found older maas packages and installed them. After I downgraded to
> 1.7.0~beta4+bzr3162-0ubuntu1~ppa1, it started working again.

This may be a coincidence. The MAC to cluster interface link is made when an
interface DHCP boots for the first time, but this info is only uploaded when
the periodic dhcp lease scan is made.
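
As a rough illustration of that timing, here is a toy model. All names in it are hypothetical and it is not MAAS code; it only mirrors the two steps described above (the link is created at first DHCP boot, and becomes visible only after the periodic scan).

```python
class LeaseLinkModel:
    """Toy model of the MAC-to-cluster-interface link (hypothetical names)."""

    def __init__(self):
        # MAC -> interface, recorded when a NIC DHCP boots for the first time.
        self.pending_links = {}
        # MAC -> interface, as uploaded by the periodic lease scan.
        self.uploaded_links = {}

    def dhcp_boot(self, mac, interface):
        # The link is made here, but it has not been uploaded yet.
        self.pending_links[mac] = interface

    def lease_scan(self):
        # The periodic scan uploads everything seen since the last run.
        self.uploaded_links.update(self.pending_links)
        self.pending_links.clear()

    def cluster_interface_for(self, mac):
        # None is the case in which the "cluster interface is not known"
        # error from this bug would be logged.
        return self.uploaded_links.get(mac)


if __name__ == "__main__":
    model = LeaseLinkModel()
    model.dhcp_boot("52:54:00:3a:95:af", "eth0")
    print(model.cluster_interface_for("52:54:00:3a:95:af"))  # before the scan
    model.lease_scan()
    print(model.cluster_interface_for("52:54:00:3a:95:af"))  # after the scan
```

In this model a NIC that never DHCP boots on the managed network never gets a link at all, no matter how many scans run.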

Adam Collard (adam-collard) wrote :

It doesn't. Recommissioned several times.

On 9 October 2014 05:21, Julian Edwards <email address hidden> wrote:

> If you recommission, does it work afterwards?

Julian Edwards (julian-edwards) wrote :

On Thursday 09 Oct 2014 10:07:39 you wrote:
> It doesn't. Recommissioned several times.

Ok good to know. So if you upgrade again does it fail again?

(This is important - it means the code is broken rather than the data.)

Changed in maas:
importance: Undecided → Medium
Andreas Hasenack (ahasenack) wrote :

I'm reluctant to break our cluster again. Last time we spent more than 24h without a working maas, on two installations, and that impacted cloud installer work.

I might try it on a kvm somewhere.

Also, please keep in mind that we *think* this error in the log is related, but the *actual* problem is that after the OS is installed, the node has no IPs on any of its 4 NICs.

Andreas Hasenack (ahasenack) wrote :

Your QA can try it too on a kvm somewhere, btw. Install the working version, register a node, bring it up. Confirm it works. Then stop the node, upgrade to the problematic version, start the same node again and see if it gets a network. Oh, and make the node have 4 NICs, and connect two of them to the maas managed network. That's the scenario we have in production.

Andreas Hasenack (ahasenack) wrote :

I installed maas 1.7.0~beta4+bzr3162-0ubuntu1~ppa1 fresh on a trusty KVM. Registered one node with it that has two NICs, both on the same maas-managed network. So 2 NICs in total.

It enlisted, commissioned and started up just fine.

I then stopped it, and upgraded maas to 1.7.0~beta5+bzr3198-0ubuntu1~trusty1. Rebooted the machine afterwards.

I acquired the same node and started it. Got the same problem as reported in this bug. In the last bootup, both nics had no IPs and the node stayed in the "Deploying" state. I believe it will transition to "Failed" on its own after a while.

I also got this again in the logs:
Oct 9 14:59:36 maas maas.macaddress: [ERROR] terrific-popcorn.local: Tried to allocate an IP to MAC 52:54:00:3a:95:af but its cluster interface is not known

And, a new backtrace I haven't seen before:
==> /var/log/maas/maas-django.log <==
ERROR 2014-10-09 14:59:05,313 maasserver Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/pxeconfig.py", line 208, in pxeconfig
    event_log_pxe_request(node, purpose)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/pxeconfig.py", line 128, in event_log_pxe_request
    event_description=options[purpose])
  File "/usr/lib/python2.7/dist-packages/maasserver/models/event.py", line 62, in create_node_event
    event_description=event_description)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/event.py", line 49, in register_event_and_event_type
    level=type_level)
  File "/usr/lib/python2.7/dist-packages/django/db/models/manager.py", line 157, in create
    return self.get_queryset().create(**kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/models/query.py", line 319, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/cleansave.py", line 38, in save
    return super(CleanSave, self).save(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/timestampedmodel.py", line 55, in save
    return super(TimestampedModel, self).save(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 545, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 573, in save_base
    updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 654, in _save_table
    result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
  File "/usr/lib/python2.7/dist-packages/django/db/models/base.py", line 687, in _do_insert
    using=using, raw=raw)
  File "/usr/lib/python2.7/dist-packages/django/db/models/manager.py", line 232, in _insert
    return insert_query(self.model, objs, fields, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/models/query.py", line 1511, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/li...

Gavin Panella (allenap) wrote :

That last backtrace is a race in EventManager.register_event_and_event_type(). Filed as bug 1379401.
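
For readers unfamiliar with this class of bug: the usual shape is a check-then-insert race, where two requests both see that a row is missing and both try to create it. The sketch below shows one common way to tolerate that. It is an assumption about the general pattern only, not the actual MAAS fix (see bug 1379401 for that); the `event_type` table and the function name are made up for illustration, and sqlite3 stands in for the real database.

```python
import sqlite3

# Hypothetical schema: the UNIQUE constraint is what makes the second
# concurrent insert fail instead of creating a duplicate row.
SCHEMA = "CREATE TABLE event_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE)"

SELECT_ID = "SELECT id FROM event_type WHERE name = ?"


def get_or_create_event_type(conn, name):
    """Return the id of the row called `name`, creating it if needed."""
    row = conn.execute(SELECT_ID, (name,)).fetchone()
    if row is not None:
        return row[0]
    try:
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute(
                "INSERT INTO event_type (name) VALUES (?)", (name,)
            )
        return cursor.lastrowid
    except sqlite3.IntegrityError:
        # Another writer inserted the same name between our SELECT and
        # our INSERT; read back the row it created instead of failing.
        return conn.execute(SELECT_ID, (name,)).fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    first = get_or_create_event_type(conn, "NODE_PXE_REQUEST")
    second = get_or_create_event_type(conn, "NODE_PXE_REQUEST")
    print(first == second)
```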

Graham Binns (gmb) wrote :

On 9 October 2014 16:35, Gavin Panella <email address hidden> wrote:
> EventManager.register_event_and_event_type(). Filed as bug 1379401.

That's already fixed as of r3155… Or it *should* be, and yet this
occurred in +3162, which is odd indeed.

Graham Binns (gmb) wrote :

Ah, no, wait… I'm lying. I think my change got clobbered. Anyway, that
discussion is for the other bug, not here. Sorry for t'noise.

Julian Edwards (julian-edwards) wrote :

Here's my theory: Everything is working as expected and that log message is normal.

You said you have 4 NICs and that 2 are on the management network. This explains why there are two messages about the unknown cluster interface: those are the two NICs that are not connected to it.

As for why the networking stops - I think it's unrelated to this message and we need to delve deeper into recent changes. I rather suspect it's something to do with jtv's IPv6 stuff.

Changed in maas:
status: Incomplete → Confirmed
Julian Edwards (julian-edwards) wrote :

Setting confirmed but not triaged yet since it needs someone to recreate it.

Raphaël Badin (rvb) wrote :

We think this is a duplicate of bug 1379591; the error message was a side effect of bug 1379209 which has been fixed now.
