[1.9] NIC previously discovered through commissioning no longer connected to Maas network

Bug #1575815 reported by Larry Michel
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
MAAS     Invalid  Undecided   Unassigned

Bug Description

Our servers have 2 NICs connected to the MAAS network, eth0 and eth1, both of which should show in the UI.

Since the upgrade to 1.9.1, I have been seeing systems suddenly show the connection as missing.

In the case of eth1, the servers do not show the connection after commissioning.

In the case of eth0, it is the PXE NIC that loses the connection, but this happens after the system has been commissioned. In that case, the system fails to deploy and remains in the Allocated state. I have observed this on R720XD (screenshot attached) and SM15K servers.

The screen capture of this latest scenario shows the system failing to deploy and remaining in the Ready state. If the system is allocated first, it remains in the Allocated state.

The error is "Node failed to be deployed, because of the following error: {"network": ["Node must be configured to use a network"]}"

From the event log, here's the window where the issue seems to start:

Node changed status - From 'Releasing' to 'Ready' Tue, 26 Apr. 2016 00:56:09
Node changed status - From 'Allocated' to 'Releasing' Tue, 26 Apr. 2016 00:56:07
User releasing node - (oil-slave-9) Tue, 26 Apr. 2016 00:56:07
Node changed status - From 'Ready' to 'Allocated' (to oil-slave-9) Tue, 26 Apr. 2016 00:36:52
User acquiring node - (oil-slave-9) Tue, 26 Apr. 2016 00:36:52
Node powered off Mon, 25 Apr. 2016 21:18:56
Node changed status - From 'Releasing' to 'Ready' Mon, 25 Apr. 2016 21:18:55
Powering node off Mon, 25 Apr. 2016 21:18:45
Node changed status - From 'Failed deployment' to 'Releasing' Mon, 25 Apr. 2016 21:18:37
User releasing node - (oil-slave-13) Mon, 25 Apr. 2016 21:18:37
Node changed status - From 'Deploying' to 'Failed deployment' Mon, 25 Apr. 2016 20:42:49
TFTP Request - chain.c32 Mon, 25 Apr. 2016 20:12:08
PXE Request - local boot Mon, 25 Apr. 2016 20:12:08
TFTP Request - pxelinux.cfg/01-bc-30-5b-ee-aa-40 Mon, 25 Apr. 2016 20:12:08
TFTP Request - pxelinux.cfg/44454c4c-5800-1054-8042-b4c04f515631 Mon, 25 Apr. 2016 20:12:08
TFTP Request - pxelinux.0 Mon, 25 Apr. 2016 20:12:08
TFTP Request - pxelinux.0 Mon, 25 Apr. 2016 20:12:08

There's a failure at local boot at 20:12:08; the first failure to deploy then happens at 00:36:52.
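To make the gaps between events in the log above easier to read, here is a minimal sketch that parses timestamps in the form the event log shows them and computes the elapsed time between two events. The helper name `parse_event_time` and the month table are illustrative assumptions; they are not part of MAAS.

```python
from datetime import datetime

# Only the month that appears in the excerpt above; extend as needed (assumption).
MONTHS = {"Apr.": 4}


def parse_event_time(stamp):
    """Parse a stamp such as 'Mon, 25 Apr. 2016 20:12:08' into a datetime."""
    _weekday, day, month, year, clock = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


# Two events copied from the log above.
local_boot = parse_event_time("Mon, 25 Apr. 2016 20:12:08")
failed_deployment = parse_event_time("Mon, 25 Apr. 2016 20:42:49")

# Time between the local-boot PXE request and the 'Failed deployment' status.
gap = failed_deployment - local_boot
print(gap)
```

The same helper can be applied to the 'User acquiring node' and 'User releasing node' entries to see how long each allocation lasted.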

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================================-================================-============-===============================================================================
ii maas 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server provisioning libraries

Revision history for this message
Larry Michel (lmic) wrote :

The maas logs (clusterd and regiond, filtered for "2016-04-25 2[0|1|2|3]" to "2016-04-26 00:[0|1|2|3]")

Andres Rodriguez (andreserl) wrote : Re: [Bug 1575815] [NEW] NIC previously discovered through commissioning no longer connected to Maas network

Hi Larry,

Can you post the commissioning output for the network/interfaces? You can
get it from the UI.

Also, can you log into the commissioning environment, run dhclient on
those interfaces, and report what the result is?

On Wednesday, April 27, 2016, Larry Michel <email address hidden>
wrote:

> The maas logs (clusterd and regiond filtered for "2016-04-25 2[0|1|2|3]
> to "2016-04-26 00:[0|1|2|3]")
>
> ** Attachment added: "logs.tar.gz"
>
> https://bugs.launchpad.net/maas/+bug/1575815/+attachment/4649317/+files/logs.tar.gz

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

Larry Michel (lmic) wrote : Re: NIC previously discovered through commissioning no longer connected to Maas network

Hi Andres,
Here's lshw from one of the systems. This is from a re-creation on a 1.9.2 system.

Larry Michel (lmic) wrote :

I didn't run into any issue with dhclient. I did the following with the failing interface:

root@saiph:~# dhclient eth0
RTNETLINK answers: File exists
root@saiph:~# ifdown eth0
ifdown: interface eth0 not configured
root@saiph:~# dhclient eth0
RTNETLINK answers: File exists
root@saiph:~# ifup eth0
Internet Systems Consortium DHCP Client 4.2.4
Copyright 2004-2012 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on LPF/eth0/60:eb:69:dc:3e:19
Sending on LPF/eth0/60:eb:69:dc:3e:19
Sending on Socket/fallback
DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3 (xid=0xbe2a1700)
DHCPREQUEST of 10.244.192.169 on eth0 to 255.255.255.255 port 67 (xid=0x172abe)
DHCPOFFER of 10.244.192.169 from 10.244.192.10
DHCPACK of 10.244.192.169 from 10.244.192.10
RTNETLINK answers: File exists
bound to 10.244.192.169 -- renewal in 17523 seconds.
root@saiph:~# ifdown eth0
Internet Systems Consortium DHCP Client 4.2.4
Copyright 2004-2012 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on LPF/eth0/60:eb:69:dc:3e:19
Sending on LPF/eth0/60:eb:69:dc:3e:19
Sending on Socket/fallback
DHCPRELEASE on eth0 to 10.244.192.10 port 67 (xid=0x5f632f4c)
root@saiph:~# dhclient eth0
root@saiph:~# ifconfig
eth0 Link encap:Ethernet HWaddr 60:eb:69:dc:3e:19
          inet addr:10.244.192.169 Bcast:10.244.255.255 Mask:255.255.192.0
          inet6 addr: fe80::62eb:69ff:fedc:3e19/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:57 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5257 (5.2 KB) TX bytes:1026 (1.0 KB)

eth1 Link encap:Ethernet HWaddr 60:eb:69:dc:3e:1a
          inet addr:10.244.240.12 Bcast:10.244.255.255 Mask:255.255.192.0
          inet6 addr: fe80::62eb:69ff:fedc:3e1a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:3960 errors:0 dropped:0 overruns:0 frame:0
          TX packets:379 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:306067 (306.0 KB) TX bytes:51176 (51.1 KB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

Larry Michel (lmic) wrote :

After talking to Andres, it looks like the upgrade may be what's dropping the connection. This system was recently upgraded from 1.9.1 to 1.9.2, and the original bug appeared shortly after upgrading from 1.9.0 to 1.9.1.

Larry Michel (lmic) wrote :

To answer the previous question: dhclient works for eth1, which is connected. The other NICs are not connected to the network.

Mike Pontillo (mpontillo) wrote :

I have a feeling this might be related to the following bug:

https://bugs.launchpad.net/maas/+bug/1552923

Can you describe your network topology, and how it is modeled in MAAS? Are any tagged VLANs in use? Are the VLAN tags on the rack different from the tags on the nodes?

Andres Rodriguez (andreserl) wrote :

No tagged VLANs; it is a flat network with 2 NICs connected to the same subnet.
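As a quick sanity check of the "same subnet" statement, the sketch below uses Python's standard `ipaddress` module with the eth0 and eth1 addresses and netmask from the ifconfig output earlier in this report. The function name `same_subnet` is an illustrative assumption.

```python
from ipaddress import ip_interface

# Addresses and netmask taken from the ifconfig output in this report.
eth0 = ip_interface("10.244.192.169/255.255.192.0")
eth1 = ip_interface("10.244.240.12/255.255.192.0")


def same_subnet(a, b):
    """Return True when both interfaces resolve to the same IP network."""
    return a.network == b.network


print(eth0.network)                    # network derived from address + mask
print(eth0.network.broadcast_address)  # should match the Bcast field
print(same_subnet(eth0, eth1))
```

Both interfaces resolve to the same /18 network, which is consistent with a flat, untagged topology.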

Changed in maas:
status: New → Incomplete
milestone: none → 1.9.3
Changed in maas:
milestone: 1.9.3 → 1.9.4
Larry Michel (lmic) wrote :

I have seen this recreated without an upgrade, during the course of normal operation. 2 systems, one SM15K and one HP system, are now showing the PXE NIC as disconnected.

Changed in maas:
status: Incomplete → New
Andres Rodriguez (andreserl) wrote : Re: [Bug 1575815] [NEW] NIC previously discovered through commissioning no longer connected to Maas network

Hi Larry,

Can you please provide the exact steps to reproduce? Also:

Did you simply upgrade, and the machines then showed the PXE interface
as disconnected?
Or
Did you upgrade, see correct information, then recommission, after which
the information went away?

On Wednesday, May 18, 2016, Larry Michel <email address hidden> wrote:

> I have seen this recreated without an upgrade during the course of
> normal operation. 2 systems, one sm15k and one HP systems are now
> showing the pxe NIC as disconnected.
>
>
> ** Attachment added: "logs.tar.gz"
>
> https://bugs.launchpad.net/maas/+bug/1575815/+attachment/4665424/+files/logs.tar.gz
>
> ** Changed in: maas
> Status: Incomplete => New

Larry Michel (lmic) wrote : Re: NIC previously discovered through commissioning no longer connected to Maas network

Hi Andres,
It's just normal operation: no upgrade, no commissioning. It happened on a Saturday, and no one would have been accessing the maas server at that time AFAICT.

I don't know of any sequence to recreate it, since this happened during automated deployments, but I narrowed down where the issue possibly surfaced for the krastin server to the time window in this section of the maas log: https://pastebin.canonical.com/156722/

The warning message that's rather strange is this one:
May 14 12:08:42 maas-integration-september maas.interface: [WARNING] Auto IP address (10.244.192.169) on krastin.oilstaging was deleted because it was handed out by the MAAS DHCP server from the dynamic range.

The problem is that 10.244.192.169 is from the static range.

DHCP dynamic IP range low value: 10.244.224.0
  Lowest IP number of the range for dynamic IPs, used for enlistment, commissioning and unknown devices.

DHCP dynamic IP range high value: 10.244.240.255
  Highest IP number of the range for dynamic IPs, used for enlistment, commissioning and unknown devices.

Static IP range low value: 10.244.192.152
  Lowest IP number of the range for IPs given to allocated nodes, must be in same network as dynamic range.

Static IP range high value: 10.244.196.255
  Highest IP number of the range for IPs given to allocated nodes, must be in same network as dynamic range.
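The mismatch between the warning and the configured ranges can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module and the range values listed above; the `classify` helper is hypothetical and is not how MAAS itself decides which range an address belongs to.

```python
from ipaddress import IPv4Address

# Range boundaries copied from the cluster interface settings above.
DYNAMIC_RANGE = (IPv4Address("10.244.224.0"), IPv4Address("10.244.240.255"))
STATIC_RANGE = (IPv4Address("10.244.192.152"), IPv4Address("10.244.196.255"))


def classify(ip):
    """Report which configured range (if any) contains the given address."""
    addr = IPv4Address(ip)
    if DYNAMIC_RANGE[0] <= addr <= DYNAMIC_RANGE[1]:
        return "dynamic"
    if STATIC_RANGE[0] <= addr <= STATIC_RANGE[1]:
        return "static"
    return "outside both ranges"


# The address named in the warning message.
print(classify("10.244.192.169"))
```

By these settings the address in the warning falls in the static range, not the dynamic one, which is why the log message looks wrong.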

Christian Reis (kiko)
tags: added: cdo-qa-blocker
summary: - NIC previously discovered through commissioning no longer connected to
- Maas network
+ [2.0] NIC previously discovered through commissioning no longer
+ connected to Maas network
summary: - [2.0] NIC previously discovered through commissioning no longer
+ [1.9] NIC previously discovered through commissioning no longer
connected to Maas network
Changed in maas:
milestone: 1.9.4 → 1.9.5
Changed in maas:
status: New → Incomplete
Jason Hobbs (jason-hobbs) wrote :

Marked as invalid since it is old/incomplete.

Changed in maas:
status: Incomplete → Invalid