CI breakage: Deployed nodes don't get a static IP address

Bug #1366726 reported by Raphaël Badin
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Julian Edwards

Bug Description

The failure in http://d-jenkins.ubuntu-ci:8080/view/MAAS/job/utopic-adt-maas/612/ indicates that the deployed nodes didn't get a static IP address. They kept their dynamic IP address from the commissioning phase.

It seems that there is a race somewhere as it doesn't happen all the time.

Raphaël Badin (rvb)
description: updated

Raphaël Badin (rvb) wrote :

I think this is a real bug.

I put more debugging statements (https://code.launchpad.net/~rvb/maas/debug-level/+merge/233898) and ran the CI tests.

MAAS log: http://d-jenkins.ubuntu-ci:8080/view/MAAS/job/utopic-adt-maas-manual/204/artifact/results/artifacts/maas-logs/var/log/maas/maas.log
Console output: http://d-jenkins.ubuntu-ci:8080/view/MAAS/job/utopic-adt-maas-manual/204/console

You can see that only one of the 2 deployed nodes got a static IP address; the other one did not.

What I think happened is this:

- the nodes are booted up. They request an IP address from the DHCP server but are not yet enlisted.
- the leases parser kicks in: it reports the IP addresses *but* update_mac_cluster_interfaces discards the information (and thus doesn't create the network/cluster<->MAC connection) because the MACs are currently unknown (search for "Silently ignore MAC addresses that we don't know about" in src/provisioningserver/pserv_services/lease_upload_service.py).
- the nodes enlist; the MACs are now known.
- the leases parser kicks in again *but* it doesn't report the IP addresses back since it thinks that the leases haven't changed since last time (and it prints "No leases changed since last scan" in the log).

=> The result is that the MAC <-> cluster connection is never made and thus the node won't get a static IP address.
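
The sequence above can be reproduced with a toy model. This is only a sketch of the suspected race, not MAAS code: the `Region` and `LeaseUploader` classes, the cluster name, and the addresses are all made up for illustration. Only the two behaviours named in the analysis (unknown MACs are silently ignored, and unchanged leases are not re-uploaded) are modelled.

```python
"""Toy model of the suspected lease-upload race; not MAAS code."""


class Region:
    """Hypothetical stand-in for the region's view of MACs."""

    def __init__(self):
        self.known_macs = set()
        self.mac_to_cluster = {}

    def update_leases(self, cluster, leases):
        # Models "silently ignore MAC addresses that we don't know about".
        for mac in leases.values():
            if mac in self.known_macs:
                self.mac_to_cluster[mac] = cluster


class LeaseUploader:
    """Hypothetical stand-in for the cluster-side periodic leases scan."""

    def __init__(self, region, cluster):
        self.region = region
        self.cluster = cluster
        self.last_uploaded = None

    def scan(self, leases):
        # Models the "upload only when something changed" check.
        if leases == self.last_uploaded:
            return "No leases changed since last scan"
        self.last_uploaded = dict(leases)
        self.region.update_leases(self.cluster, leases)
        return "uploaded"


region = Region()
uploader = LeaseUploader(region, "cluster-1")
leases = {"192.0.2.10": "aa:bb:cc:dd:ee:01"}  # ip -> mac

first = uploader.scan(leases)                 # node booted, MAC not enlisted yet
region.known_macs.add("aa:bb:cc:dd:ee:01")    # node enlists
second = uploader.scan(leases)                # leases are unchanged, nothing is sent

print(first, "|", second, "|", region.mac_to_cluster)
```

In this model the MAC never gets linked to a cluster, because the only upload that carried its lease happened before the MAC was known.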

Julian Edwards (julian-edwards) wrote : Re: [Bug 1366726] Re: CI breakage: Deployed nodes don't get a static IP address

Excellent analysis and I concur, thank you rvb.

Julian Edwards (julian-edwards) wrote :

I think we need to get rid of the upload change check and just always upload everything.

Changed in maas:
assignee: nobody → Julian Edwards (julian-edwards)
status: Triaged → In Progress

Julian Edwards (julian-edwards) wrote :

Well, I'm not doing that any more; instead I'm re-scanning the leases every time a MAC is added.
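
As a rough illustration of that approach (the actual change lives in the MAAS branch and is not shown in this thread), the sketch below keeps the change check for periodic scans but forces a re-scan whenever a MAC is added. The `Cluster` and `Region` classes, the `force` parameter, and the addresses are hypothetical.

```python
"""Toy model of 're-scan the leases whenever a MAC is added'; not MAAS code."""


class Cluster:
    def __init__(self, name, leases):
        self.name = name
        self.leases = leases          # ip -> mac, as parsed from the leases file
        self.last_uploaded = None

    def scan(self, region, force=False):
        # The change check still applies to periodic scans, but it is
        # bypassed when a re-scan is explicitly requested.
        if not force and self.leases == self.last_uploaded:
            return False
        self.last_uploaded = dict(self.leases)
        region.update_leases(self.name, self.leases)
        return True


class Region:
    def __init__(self, clusters):
        self.clusters = clusters
        self.known_macs = set()
        self.mac_to_cluster = {}

    def update_leases(self, cluster_name, leases):
        for mac in leases.values():
            if mac in self.known_macs:   # unknown MACs are still ignored
                self.mac_to_cluster[mac] = cluster_name

    def add_mac(self, mac):
        # Adding a MAC triggers a forced re-scan on every cluster.
        self.known_macs.add(mac)
        for cluster in self.clusters:
            cluster.scan(self, force=True)


mac = "aa:bb:cc:dd:ee:01"
cluster = Cluster("cluster-1", {"192.0.2.10": mac})
region = Region([cluster])

cluster.scan(region)   # MAC unknown: the upload happens but no link is recorded
region.add_mac(mac)    # enlistment forces a re-scan, so the link is recorded

print(region.mac_to_cluster)
```

Compared with always uploading everything, this keeps the periodic scans cheap and only pays for a full upload at the moment new information could change the outcome.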

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: none → 1.7.0

Larry Michel (lmic) wrote :

I am seeing this in our environment with the phelps and mokoi servers, which deployed with the dynamic IP associated with their MAC address (as shown by the dhcpd.lease file) rather than the static IP address shown on each node's page in the MAAS web UI.

On the MAAS server, their DNS names resolved to the correct static IP address. However, that static address was not present in the dhcpd.lease file and I could only access the servers through the dynamic IP address.

These servers were in that same state (deployed with dynamic IP) at the time the attached logs were collected.

Julian Edwards (julian-edwards) wrote :

Which version of MAAS are you using?

On Friday 31 October 2014 11:13:47 you wrote:
> I am seeing this in our environment with phelps and mokoi servers which
> deployed with dynamic IP associated with mac address as shown by the
> dhcpd.lease file rather than the static IP address shown on each node's
> page in the maas Web UI. .
>
> On the maas server, their DNS names resolved with the correct static IP
> address. However, that static address was not present in the dhcpd.lease
> file and I could only access the server through the dynamic IP address.

Was there a host {} block for the static IP for the node(s)? If not, the help
depends on the version of MAAS you're using.

Larry Michel (lmic) wrote :

Can you please clarify what you mean by host {} block?

We are not doing anything specific for any of the hosts, so all the
hosts in the environment should behave the same AFAICT. In the past, both
of these hosts deployed OK with an IP address allocated at
deploy time rather than the dynamic IP. Mokoi is also now deploying OK
with a different static IP.

MAAS version is 1.7.0~rc1+bzr3295-0ubuntu1~trusty1.

Julian Edwards (julian-edwards) wrote :

On Monday 03 November 2014 13:58:12 you wrote:
> Can you please clarify what you mean by host {} block.

In the leases file, we write a host{} entry which forces an IP for a MAC.
However it only does this if it knows on which cluster interface the MAC
resides.
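
For readers unfamiliar with the term, here is a minimal sketch of what such an entry looks like and how one might check a leases file for it. It assumes the standard ISC dhcpd `host` declaration syntax (`hardware ethernet`, `fixed-address`); the exact text MAAS writes is not shown in this thread, and the function names, host name, and addresses below are made up for illustration.

```python
import re


def render_host_block(name, mac, ip):
    """Render an ISC dhcpd-style host declaration pinning `ip` to `mac`."""
    return (
        f"host {name} {{\n"
        f"    hardware ethernet {mac};\n"
        f"    fixed-address {ip};\n"
        f"}}\n"
    )


# Matches "host <name> { ... }" and captures the body between the braces.
HOST_RE = re.compile(r"\bhost\s+\S+\s*\{(?P<body>[^}]*)\}")
HW_RE = re.compile(r"hardware\s+ethernet\s+([0-9a-fA-F:]+)\s*;")


def macs_with_host_block(leases_text):
    """Return the set of MACs that have a host declaration in the text."""
    macs = set()
    for match in HOST_RE.finditer(leases_text):
        hw = HW_RE.search(match.group("body"))
        if hw:
            macs.add(hw.group(1).lower())
    return macs


# A simplified leases file: one dynamic lease and one host declaration.
sample = (
    "lease 192.0.2.50 {\n"
    "  hardware ethernet aa:bb:cc:dd:ee:02;\n"
    "}\n"
    + render_host_block("node-1", "aa:bb:cc:dd:ee:01", "192.0.2.10")
)

print(macs_with_host_block(sample))
```

A MAC that appears only in a `lease` block, like the second one in the sample, has a dynamic lease but no fixed address.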

> We are not not doing anything specific for any of the host. So all the
> hosts in the environment should behave the same AFAICT. In the past, both
> of these hosts had previously deployed OK with an IP address allocated at
> deploy times rather than the dynamic IP. Mokoi is also now deploying OK
> with a different static IP.

Did you delete the node(s) and re-add them at all? It sounds like MAAS had not
picked up the cluster interface link at that stage, and by the time you
redeployed it had worked it out. It relies on scanning the dynamic leases to
do this.

>
> Maas version is 1.7.0~rc1+bzr3295-0ubuntu1~trusty1.

Ok, thanks.

Larry Michel (lmic) wrote :

In the lease file, I did not see a host{} entry for that IP/MAC address
combination, so the answer would be no. There was only an entry for the
dynamic IP.

To answer your question about the delete/re-add, the answer is yes for
phelps node (we were moving network cards between servers) and a no for
mokoi node.

There was an issue last week that I had reported in bug 1387859: there were
too many unreleased dynamic leases and a file could not be parsed. So
perhaps the behaviour I described in this bug was related to these
servers' MAC addresses not having been fully synced to the cluster, as you
indicated. I would have to monitor for a recreate since we don't think we're
running into 1387859.

About the cluster not knowing about the MAC addresses, I am seeing a flood
of the log messages below and I am wondering whether they are related. We are
seeing failed deployments where the server cannot be accessed on the DNS
IP address, but not all servers with this ERROR have failed to come up on
the static IP. I will attach the latest MAAS logs and lease files as well.

...
/var/log/maas/maas.log:Nov 5 14:52:54 maas-trusty-back-may22 maas.macaddress: [ERROR] tucker.local: Tried to allocate an IP to MAC 2c:59:e5:41:a8:6d but its cluster interface is not known
/var/log/maas/maas.log:Nov 5 14:52:54 maas-trusty-back-may22 maas.macaddress: [ERROR] tucker.local: Tried to allocate an IP to MAC 2c:59:e5:41:a8:6e but its cluster interface is not known
/var/log/maas/maas.log:Nov 5 14:52:54 maas-trusty-back-may22 maas.macaddress: [ERROR] tucker.local: Tried to allocate an IP to MAC 2c:59:e5:41:a8:6f but its cluster interface is not known
/var/log/maas/maas.log:Nov 5 14:52:54 maas-trusty-back-may22 maas.macaddress: [ERROR] muncie.local: Tried to allocate an IP to MAC 2c:59:e5:3a:47:e5 but its cluster interface is not known
/var/log/maas/maas.log:Nov 5 14:52:54 maas-trusty-back-may22 maas.macaddress: [ERROR] muncie.local: Tried to allocate an IP to MAC 2c:59:e5:3a:47:e6 but its cluster interface is not known
/var/log/maas/maas.log:Nov 5 14:52:54 maas-trusty-back-may22 maas.macaddress: [ERROR] muncie.local: Tried to allocate an IP to MAC 2c:59:e5:3a:47:e7 but its cluster interface is not known
/var/log/maas/maas.log:Nov 5 14:53:04 maas-trusty-back-may22 maas.macaddress: [ERROR] sunset.local: Tried to allocate an IP to MAC 2c:59:e5:3a:65:55 but its cluster interface is not known
/var/log/maas/maas.log:Nov 5 14:53:04 maas-trusty-back-may22 maas.macaddress: [ERROR] sunset.local: Tried to allocate an IP to MAC 2c:59:e5:3a:65:56 but its cluster interface is not known
/var/log/maas/maas.log:Nov 5 14:53:04 maas-trusty-back-may22 maas.macaddress: [ERROR] sunset.local: Tried to allocate an IP to MAC 2c:59:e5:3a:65:57 but its cluster interface is not known
ubuntu@maas-trusty-back-may22:/var/lib/maas/dhcp$
...

Larry Michel (lmic) wrote :

Latest log file for our maas server per my previous comment:

Julian Edwards (julian-edwards) wrote :

> maas.macaddress: [ERROR] sunset.local: Tried to allocate an IP to MAC 2c:59:e5:3a:65:55 but its cluster interface is not known

How often is it putting that in the log? (never cut the timestamps out when pasting logs!) It should only do it when the node is getting started up.

If you are still seeing this behaviour on any of the nodes, can you please try re-commissioning them from the web UI and it *should* clear the fault.
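
To answer the "how often" question without losing timestamps, the error lines can be grouped per node. This is only a sketch: the regular expression assumes the line format of the entries pasted earlier in this report (including the `grep` filename prefix), and `group_errors` is a hypothetical helper, not a MAAS tool.

```python
import re
from collections import defaultdict

# Assumed format, taken from the lines pasted in this report:
# "<Mon> <day> <HH:MM:SS> <host> maas.macaddress: [ERROR] <node>: Tried to ..."
LINE_RE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"maas\.macaddress: \[ERROR\] (?P<node>\S+): "
    r"Tried to allocate an IP to MAC (?P<mac>[0-9a-f:]{17}) "
    r"but its cluster interface is not known"
)


def group_errors(lines):
    """Map each node name to a list of (timestamp, mac) pairs."""
    by_node = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            by_node[match.group("node")].append(
                (match.group("stamp"), match.group("mac"))
            )
    return dict(by_node)


# Three of the lines quoted in this report, one entry per line.
lines = [
    "/var/log/maas/maas.log:Nov 5 14:52:54 maas-trusty-back-may22 maas.macaddress: [ERROR] tucker.local: Tried to allocate an IP to MAC 2c:59:e5:41:a8:6d but its cluster interface is not known",
    "/var/log/maas/maas.log:Nov 5 14:52:54 maas-trusty-back-may22 maas.macaddress: [ERROR] tucker.local: Tried to allocate an IP to MAC 2c:59:e5:41:a8:6e but its cluster interface is not known",
    "/var/log/maas/maas.log:Nov 5 14:53:04 maas-trusty-back-may22 maas.macaddress: [ERROR] sunset.local: Tried to allocate an IP to MAC 2c:59:e5:3a:65:55 but its cluster interface is not known",
]

result = group_errors(lines)
for node, hits in result.items():
    print(node, len(hits), hits[0][0], hits[-1][0])
```

Comparing the first and last timestamp per node against the node's start time shows whether the errors are confined to start-up or keep recurring.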

Larry Michel (lmic) wrote :

I have not looked at specific log files to retrieve the log entries. I am only grepping the entire /var/log/maas/*.log looking for the error messages. So, what I pasted here is the result of the grep.

Yep, it does seem to occur at the beginning (within first couple of minutes) based on grep results and timestamps from the Web UI's node event log...

I have already tried to recommission the servers and it does not get rid of the messages. I still see them for the servers I recommissioned. The commission time for all the servers should be 11/5/2014 at around 12:10 AM CST in the logs that I attached this morning (bug1366726-2.tar.gz).

Julian Edwards (julian-edwards) wrote :

On Thursday 06 November 2014 02:37:14 you wrote:
> I have not looked at specific log files to retrieve the log entries. I
> am only grepping the entire /var/log/maas/*.log looking for the error
> messages. So, what I pasted here is the result of the grep.
>
> Yep, it does seem to occur at the beginning (within first couple of
> minutes) based on grep results and timestamps from the Web UI's node
> event log...
>
> I have already tried to recommission the servers and it does not get rid
> of the messages. I still see them for the servers I recommissioned. The
> commission time for all the servers should say something like 11/5/2014
> around ~12:10 AM CST in the Logs that I attached this morning
> (bug1366726-2.tar.gz).

OK, I think the messages are going to be bug 1381609.

I am fixing that now and it'll get released in 1.7.1. Please ignore them;
they are harmless.

Changed in maas:
status: Fix Committed → Fix Released