[2.1.3] A node enlistment fails to contact metadata service

Bug #1666719 reported by OpenStack on 2017-02-21
This bug affects 1 person
Affects  Status  Importance  Assigned to    Milestone
MAAS             High        Mike Pontillo
2.1              High        Mike Pontillo

Bug Description

This one is probably related to https://bugs.launchpad.net/maas/+bug/1665459

A node enlistment fails to contact metadata service

The node receives DHCP/PXE responses on the MAAS subnet 172.16.1.x, fails to connect to the metadata service, and finishes the boot process, but never shows up in the MAAS device discovery list.

The server:

Ubuntu 16.04.2 LTS

Linux cs-srv-233 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Network:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether d4:85:64:51:16:c8 brd ff:ff:ff:ff:ff:ff
    inet 120.263.220.233/24 brd 120.263.220.255 scope global enp3s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::d685:64ff:fe51:16c8/64 scope link
       valid_lft forever preferred_lft forever
3: enp3s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether d4:85:64:51:16:ca brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.100/24 brd 172.16.1.255 scope global enp3s0f1
       valid_lft forever preferred_lft forever
    inet6 fe80::d685:64ff:fe51:16ca/64 scope link
       valid_lft forever preferred_lft forever

Network configuration:

auto lo
iface lo inet loopback

# The primary network interface
auto enp3s0f0
iface enp3s0f0 inet static
 address 120.263.220.233
 netmask 255.255.255.0
 network 120.263.220.0
 broadcast 120.263.220.255
 gateway 120.263.220.254
 # dns-* options are implemented by the resolvconf package, if installed
 dns-nameservers 172.16.1.100 120.263.55.2 120.263.5.3 8.8.8.8
 dns-search cs.du.edu

auto enp3s0f1
iface enp3s0f1 inet static
        address 172.16.1.100
        netmask 255.255.255.0
        network 172.16.1.0
        broadcast 172.16.1.255

Region and Rack controllers:

dpkg -s maas-region-controller
Package: maas-region-controller
Status: install ok installed
Priority: optional
Section: net
Installed-Size: 45
Maintainer: Ubuntu Developers <email address hidden>
Architecture: all
Source: maas
Version: 2.1.3+bzr5573-0ubuntu1~16.04.1
Depends: avahi-utils, dbconfig-pgsql, iputils-ping, maas-dns (= 2.1.3+bzr5573-0ubuntu1~16.04.1), maas-region-api (= 2.1.3+bzr5573-0ubuntu1~16.04.1), postgresql (>= 9.1), tcpdump, debconf (>= 0.5) | debconf-2.0
Recommends: openssh-server
Suggests: nmap
Description: Region Controller for MAAS

dpkg -s maas-rack-controller
Package: maas-rack-controller
Status: install ok installed
Priority: optional
Section: net
Installed-Size: 96
Maintainer: Ubuntu Developers <email address hidden>
Architecture: all
Source: maas
Version: 2.1.3+bzr5573-0ubuntu1~16.04.1
Replaces: maas-cluster-controller, python-maas-provisioningserver
Depends: authbind, avahi-utils, bind9utils, distro-info, freeipmi-tools, grub-common, iputils-ping, maas-cli (= 2.1.3+bzr5573-0ubuntu1~16.04.1), maas-common (= 2.1.3+bzr5573-0ubuntu1~16.04.1), maas-dhcp (= 2.1.3+bzr5573-0ubuntu1~16.04.1), ntp, pxelinux | syslinux-common (<< 3:6.00~pre4+dfsg-5), python3-httplib2, python3-maas-provisioningserver (= 2.1.3+bzr5573-0ubuntu1~16.04.1), python3-netaddr, python3-tempita, python3-twisted, python3-zope.interface, syslinux-common, tcpdump, tgt, uuid-runtime, wget, debconf (>= 0.5) | debconf-2.0, init-system-helpers (>= 1.18~), python3:any (>= 3.5~)
Suggests: amtterm, ipmitool, libvirt-bin, nmap, wsmancli
Breaks: maas-cluster-controller, python-maas-provisioningserver
Conflicts: tftpd-hpa
Conffiles:
 /etc/logrotate.d/maas-rack-controller 22fcb01a80fe77c722ab6ca9c78de11d
 /etc/sudoers.d/99-maas-sudoers c79448e89b644bf67f3dd5d430392f85
Description: Rack Controller for MAAS

MAAS packages installed:

maas-cli 2.1.3+bzr5573-0ubuntu1~16.04.1
maas-common 2.1.3+bzr5573-0ubuntu1~16.04.1
maas-dhcp 2.1.3+bzr5573-0ubuntu1~16.04.1
maas-dns 2.1.3+bzr5573-0ubuntu1~16.04.1
maas-proxy 2.1.3+bzr5573-0ubuntu1~16.04.1
maas-rack-controller 2.1.3+bzr5573-0ubuntu1~16.04.1
maas-region-api 2.1.3+bzr5573-0ubuntu1~16.04.1
maas-region-controller 2.1.3+bzr5573-0ubuntu1~16.04.1
python3-django-maas 2.1.3+bzr5573-0ubuntu1~16.04.1
python3-maas-client 2.1.3+bzr5573-0ubuntu1~16.04.1
python3-maas-provisioningserver 2.1.3+bzr5573-0ubuntu1~16.04.1

cat /etc/maas/rackd.conf
cluster_uuid: c2278486-1453-47fa-8857-6c313c395153
maas_url: http://172.16.1.100:5240/MAAS

http://120.263.220.233:5240/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed
#cloud-config
datasource:
  MAAS:
    timeout : 50
    max_wait : 120
    # there are no default values for metadata_url or oauth credentials
    # If no credentials are present, non-authed attempts will be made.
    metadata_url: http://120.263.220.233:5240/MAAS/metadata/enlist

output: {all: '| tee -a /var/log/cloud-init-output.log'}

Patch http://paste.ubuntu.com/24010744/ is applied.

MAAS region info:

maas-region shell
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from maasserver.preseed import get_preseed_context
>>> from pprint import pprint
>>> from maasserver.models import RackController
>>> pprint([get_preseed_context(rack_controller=rack) for rack in RackController.objects.all()])
[{'metadata_enlist_url': 'http://172.16.1.100:5240/metadata/enlist',
  'osystem': '',
  'release': '',
  'server_host': '172.16.1.100',
  'server_url': 'http://172.16.1.100:5240/api/2.0/machines/',
  'syslog_host_port': '172.16.1.100:514'}]

The error I see on the node PXE boot:

[285.792403] cloud-init[1190]: 2017-02-21 21:07:02,027 - url_helper.py
Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id'
request error [HTTPConnectionPool(host='169.254.169.254', port=80)
Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError
('<requests.packages.urllib3.connection.HTTPConnection objects 0x7f35cefb00>: Failed to establish a new connection: [Errno 113] No route to host'.)

Please, let me know if more information is needed.

Thanks,

OpenStack (andy1723) wrote :
Andres Rodriguez (andreserl) wrote :

Could you please post the kernel parameters sent during the PXE process? That should tell us which IP address is being used to download the metadata (based on your bug report, however, it seems to be the correct one).

The error here:

[285.792403] cloud-init[1190]: 2017-02-21 21:07:02,027 - url_helper.py
Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id'
request error [HTTPConnectionPool(host='169.254.169.254', port=80)

occurs because cloud-init was unable to contact the MAAS metadata service, so it falls back to the 169.254.169.254 address.
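
One way to recognize this fallback in a console log is to check whether the host cloud-init is calling is a link-local address rather than the MAAS server. The following is a minimal sketch using only the Python standard library. The function name `called_host_is_link_local` is hypothetical (it is not part of cloud-init or MAAS), and it assumes the log line contains the URL in plain form.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Matches the first http(s) URL in a log line, stopping at quotes or whitespace.
URL_RE = re.compile(r"https?://[^\s'\"]+")


def called_host_is_link_local(log_line):
    """Return True if the first URL in log_line points at a link-local IP.

    Returns None when no URL with a literal IP address is found.
    """
    match = URL_RE.search(log_line)
    if match is None:
        return None
    host = urlparse(match.group(0)).hostname
    try:
        return ipaddress.ip_address(host).is_link_local
    except ValueError:
        # Host is a DNS name (or not a valid IP), so it cannot be classified.
        return None


line = "Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id'"
print(called_host_is_link_local(line))  # True: 169.254.0.0/16 is link-local
```

A `True` result only says that cloud-init has already given up on the MAAS datasource; the cause still has to be found in the earlier requests.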

Changed in maas:
status: New → Incomplete
Andres Rodriguez (andreserl) wrote :

Also, please attach regiond.conf

Mike Pontillo (mpontillo) wrote :

I see that in this case, the call to `get_preseed_context(rack_controller=rack)` is correct. From the details in the bug report, everything looks like it should be working properly.

Now here's the wildcard: is your MAAS server doing any kind of NAT? If so, what do your NAT rules look like?

I ask because when the machine on the 172 network commissions, what might be happening is this (and there's a clue about this in your metadata_url, but I don't know from which host you requested that):

(1) An iptables rule on the MAAS server is matched, and traffic incoming on enp3s0f1 masquerades via 120.263.220.233.
(2) HTTP packets from 172.16.1.x are rewritten as HTTP packets from 120.263.220.233.
(3) When MAAS receives traffic from the host that matches the NAT network, it attempts to look up the best interface on the region for communicating with the rack. Lacking the information that the PXE booting node is on the 172 network, it might hand out the 120 address as the region IP address of choice.
(4) End result: your node times out when attempting to reach the 120 address from the 172 address, either due to strict source routing, iptables rules (or lack thereof), or the lack of a configured gateway address on the nodes that would allow it to hit 120.263.220.233 *and* be able to route back to the 172 network.
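
The failure mode in steps (1) through (4) comes down to whether the address MAAS hands out lies on a subnet the node can reach directly. Below is a minimal sketch of that check, assuming the node has no default gateway and can therefore only reach on-link hosts. The function name `url_is_on_link` and the node address are illustrative assumptions. Because the public address quoted in this report (120.263.220.233) is not a valid IPv4 address, the documentation address 192.0.2.233 is used as a stand-in.

```python
import ipaddress
from urllib.parse import urlparse


def url_is_on_link(url, node_interface):
    """Return True if the URL's host lies inside the node's own subnet.

    node_interface is the node's address in CIDR form, e.g. "172.16.1.192/24".
    A node without a default gateway can only reach on-link hosts.
    """
    host = ipaddress.ip_address(urlparse(url).hostname)
    network = ipaddress.ip_interface(node_interface).network
    return host in network


node = "172.16.1.192/24"  # assumed example node address on the MAAS subnet
good = "http://172.16.1.100:5240/MAAS/metadata/enlist"
bad = "http://192.0.2.233:5240/MAAS/metadata/enlist"  # stand-in public address

print(url_is_on_link(good, node))  # True: same /24 as the node
print(url_is_on_link(bad, node))   # False: would need a gateway
```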

To solve this one, I would hit the preseed URL from a node on the same network as the nodes you are trying to PXE boot and check the result (where did you run it from in the output inside the bug description?):

    http://172.16.1.100:5240/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed

It might also be instructive to capture packets on the MAAS server, and/or look at the counters presented in "iptables -t nat -L -n -v" and "iptables -L -n -v".
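
The preseed check can also be scripted so it is easy to repeat from a host on the 172 network. This is a rough sketch using only the standard library: it pulls `metadata_url` out of the cloud-config text with a simple line match instead of a real YAML parser, and the function names are hypothetical. The sample text is the preseed quoted in the bug description; `fetch_preseed` is defined but not called here, since it needs network access to the MAAS server.

```python
import re
import urllib.request
from urllib.parse import urlparse

PRESEED_URL = (
    "http://172.16.1.100:5240/MAAS/metadata/latest/enlist-preseed/"
    "?op=get_enlist_preseed"
)


def extract_metadata_url(preseed_text):
    """Return the metadata_url value from cloud-config text, or None."""
    match = re.search(r"^\s*metadata_url\s*:\s*(\S+)", preseed_text, re.MULTILINE)
    return match.group(1) if match else None


def fetch_preseed(url=PRESEED_URL, timeout=10):
    """Download the enlistment preseed; run this from a node-side host."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def metadata_host_matches(preseed_text, expected_host):
    """Return True if the preseed points cloud-init at expected_host."""
    metadata_url = extract_metadata_url(preseed_text)
    if metadata_url is None:
        return False
    return urlparse(metadata_url).hostname == expected_host


# Preseed as quoted in the bug description (fetched via the 120.x address).
SAMPLE = """\
#cloud-config
datasource:
  MAAS:
    timeout : 50
    max_wait : 120
    # there are no default values for metadata_url or oauth credentials
    metadata_url: http://120.263.220.233:5240/MAAS/metadata/enlist
"""

print(extract_metadata_url(SAMPLE))
print(metadata_host_matches(SAMPLE, "172.16.1.100"))  # False: wrong network
```

On a live system, `metadata_host_matches(fetch_preseed(), "172.16.1.100")` run from the node's subnet would show whether MAAS hands out an address the node can actually reach.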

Mike Pontillo (mpontillo) wrote :

I just checked the log files; the HTTP request from the PXE booting node is as follows:

62038 2017-02-21 14:03:44 twisted.python.log: [info] ::ffff:172.16.1.192 - - [21/Feb/2017:21:03:44 +0000] "GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed HTTP/1.1" 200 247 "-" "Cloud-Init/0.7.8"

That looks okay; the metadata service doesn't seem to be contacted by an IP address that has been mangled by MAAS. Here are all the relevant requests in the regiond.log:

   http://paste.ubuntu.com/24044852/
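
Checking the source address of each request by eye gets tedious with a long regiond.log. The sketch below extracts the client address from a log line in the format shown above, unwraps the IPv4-mapped IPv6 form (`::ffff:a.b.c.d`), and tests it against the node subnet. The function name `client_address` is hypothetical, and the regular expression assumes the address is the first token after `[info]`.

```python
import ipaddress
import re

# Client address is the token right after "[info] " in the regiond.log line.
CLIENT_RE = re.compile(r"\[info\]\s+(\S+)\s")


def client_address(log_line):
    """Return the requesting client's IP, unwrapping ::ffff:a.b.c.d form.

    Returns None if the line does not match or the token is not an IP.
    """
    match = CLIENT_RE.search(log_line)
    if match is None:
        return None
    try:
        addr = ipaddress.ip_address(match.group(1))
    except ValueError:
        return None
    # IPv4-mapped IPv6 addresses carry the real IPv4 address inside.
    mapped = getattr(addr, "ipv4_mapped", None)
    return mapped if mapped is not None else addr


line = (
    '62038 2017-02-21 14:03:44 twisted.python.log: [info] '
    '::ffff:172.16.1.192 - - [21/Feb/2017:21:03:44 +0000] '
    '"GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed '
    'HTTP/1.1" 200 247 "-" "Cloud-Init/0.7.8"'
)
addr = client_address(line)
print(addr)                                          # 172.16.1.192
print(addr in ipaddress.ip_network("172.16.1.0/24"))  # True: not NATed
```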

One thing I noticed was that the MAAS URL returned (such as what you see in the metadata_enlist_url) is missing the /MAAS suffix. But after investigating, I think this is a red herring. (There must be some code to insert the correct path.)

So this bug seems to take the behavior seen in bug #1665459 a step further: MAAS can now contact the metadata service initially, but the subsequent URL is still incorrect. (How frustrating.)

The good news is, I seem to have reproduced the issue. I was able to hit the URL to get the enlistment preseed and observe MAAS return a metadata_url on a completely different network. So there is still some work to be done here to make multiple-interface MAAS servers work correctly 100% of the time.

I'll investigate how to fix this.

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → High
assignee: nobody → Mike Pontillo (mpontillo)
milestone: none → 2.2.0
OpenStack (andy1723) wrote :

Thanks, Mike. Please, let me know if any information or further testing is needed.

Changed in maas:
status: Triaged → Incomplete
Andres Rodriguez (andreserl) wrote :

Can you please share your:

/etc/maas/regiond.conf
/etc/maas/rackd.conf

OpenStack (andy1723) wrote :

This issue seems to be fixed in 2.1.4+bzr5591. I can't reproduce the issue anymore.
Thanks, everyone.

Changed in maas:
status: Incomplete → Invalid