Failed download of squashfs while enlisting a node

Bug #1807252 reported by Adam Seishas
This bug affects 1 person
Affects: MAAS
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Ubuntu version: 18.04.1 LTS
MAAS version: 2.5.0~rc2 (7433-gea48d302e-0ubuntu1~18.04.1)

Enlisting any new node hangs at:

root=squash:http://10.203.0.5:5248/images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs
mount_squash downloading http://10.203.0.5:5248/images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs to root.tmp.img
Connecting to 10.203.0.2:5248 (10.203.0.2:5248)

On the server side, strace indicates that the connection is made and the file is partially transferred before the transfer hangs and eventually times out (see attached). We've tried toggling the nginx options sendfile and tcp_nopush, but the changes have no effect.
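For reference, the hang can also be triggered with a plain HTTP client from a test host on the same VLAN; a rough sketch of the kind of request we've been using (the path is the one from the boot output above, and 10.203.0.2:5248 is the endpoint the client connects to):

$ curl -v -o /dev/null http://10.203.0.2:5248/images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs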

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-==========================================-============-=================================================
ii maas 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all MAAS server common files
ii maas-dhcp 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all MAAS DHCP server
un maas-dns <none> <none> (no description available)
ii maas-proxy 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all Rack Controller for MAAS
ii maas-region-api 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.5.0~rc2-7433-gea48d302e-0ubuntu1~18.04.1 all MAAS server provisioning libraries (Python 3)

Revision history for this message
Adam Seishas (aseishas) wrote :
Revision history for this message
Adam Seishas (aseishas) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Adam,

Thank you for your bug report. Could you please provide a bit more information on your environment?

1. How is this MAAS configured? Is this a single region/rack? One region with multiple racks? How was this installed? Was this an upgrade?

2. Who is 10.203.0.5? Who is 10.203.0.2?

3. Which client machine is failing? Do you happen to have the MAC/IP to investigate in the logs?

4. Can you provide a console log of the booting process?

5. Do all machines fail? Is it just one?

6. Lastly, is there any specific MTU configuration on the underlying VLAN? Anything relevant with regard to networking? For example, could the underlying network be the blocker?

And from looking at the logs, the only weird thing I see is this:

2018-12-06 18:30:23 provisioningserver.rackdservices.tftp: [info] ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by 10.203.0.218
2018-12-06 18:30:24 provisioningserver.rackdservices.tftp: [info] ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by 10.203.0.218
2018-12-06 18:30:25 twisted.internet.defer: [critical] Unhandled error in Deferred:
2018-12-06 18:30:25 twisted.internet.defer: [critical]
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py", line 267, in run
     self._asyncioEventloop.run_forever()
   File "/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py", line 290, in run
     f(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 459, in callback
     self._startRunCallbacks(result)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 567, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/internet/task.py", line 865, in <lambda>
     d.addCallback(lambda ignored: callable(*args, **kw))
   File "/usr/lib/python3/dist-packages/tftp/session.py", line 274, in sendData
     self.transport.write(bytes)
   File "/usr/lib/python3/dist-packages/twisted/internet/udp.py", line 269, in write
     return self.socket.send(datagram)
 builtins.BlockingIOError: [Errno 11] Resource temporarily unavailable

2018-12-06 18:30:48 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs requested by 10.203.0.218
[log truncated here by a long run of NUL bytes]
2018-12-06 18:34:49 twisted.scripts: [info] twistd 17.9.0 (/usr/bin/python3 3.6.7) starting up.
2018-12-06 18:34:49 twisted.scripts: [info] reactor class: twisted.internet.asyncioreactor.AsyncioSelectorReactor.
2018-12-06 18:34:49 provisioningserver.utils.services: [info] Starting beaconing for interfaces: {'ens7'}

Which would apparently show that something crashed and then the rackd was restarted...


Changed in maas:
status: New → Incomplete
milestone: none → 2.5.1
Revision history for this message
Adam Seishas (aseishas) wrote :

This is a single (virtual) server, regiond+rackd. The installation was originally 2.5, not an upgrade from a previous stable release, but it has been updated from rc1 to rc2.

10.203.0.2 is the MAAS server, and 10.203.0.5 (3c:fd:fe:57:d3:18) is the client. In the case of this particular log entry the client request was made using curl from a test system, but there should be other instances in the logs of real enlistment attempts (e.g., from 10.203.0.218/3c:fd:fe:57:5d:38 and 10.203.0.219/3c:fd:fe:58:19:d8), and the strace documents one such attempt. Right now, all the systems we've tried are failing to enlist.

We've investigated the network configuration and haven't been able to find anything obviously wrong so far. MTU is 9000 on this VLAN, but it seems to be set correctly everywhere, including in the MAAS configuration.
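One way to double-check the path MTU end to end is a non-fragmenting ping at the jumbo payload size. A quick sketch (8972 bytes of ICMP payload corresponds to a 9000-byte MTU once the 28 bytes of IP and ICMP headers are added; the target address is just one of the nodes from the logs):

$ ping -M do -s 8972 -c 3 10.203.0.218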

I don't have a full console log, but I'll see about getting one! Also, we've been focused on the MAAS nginx instance, and I don't think we've tried monitoring rackd itself directly during a failure. It's certainly possible that it's crashing and triggering an nginx restart. We'll take a look.

Revision history for this message
Adam Seishas (aseishas) wrote :

It doesn't look like rackd is crashing, but I'll continue to investigate. I've attached a console log, but only kernel messages are being reported; I'll see if I can get a more complete log, and in the meantime I've also attached a screenshot of the console in its hung state.

Revision history for this message
Adam Seishas (aseishas) wrote :
Revision history for this message
Adam Seishas (aseishas) wrote :

After further testing, this does appear to be a network issue: if the MTU for this VLAN is set to 1500 in the MAAS configuration the transfer succeeds and the new node is enlisted successfully; if the MTU is set to 9000, then transfer and enlistment fail. However, I've reverified that the MTU is set correctly for jumbo frames on both the host and guest (MAAS VM) interfaces, on the host's switch port, and on the switch port for the node being enlisted...
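For reference, the per-VLAN MTU setting mentioned above can also be changed through the MAAS CLI; a rough sketch, assuming a CLI profile named "admin" and using placeholder fabric ID and VID that would need to match the VLAN in question:

$ maas admin vlan update 0 0 mtu=1500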

Revision history for this message
Mike Pontillo (mpontillo) wrote :

I wonder if there is a race condition with regard to the initramfs setting up the MTU before or after the transfer starts. I don't know off the top of my head how to troubleshoot that, though. (Except that a full packet capture could tell us much more.)

I think it's odd that the kernel logs an MTU change from 1500 to 9000 ~20 seconds after boot. It makes me wonder what it was doing for the previous 20 seconds, when the transfer actually started, and how that correlates with a (presumed) DHCP request inside the initramfs at the time of the MTU change...
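A minimal sketch of such a capture, run on the rack controller for the duration of an enlistment attempt (the interface name and node address are taken from the logs above and would need adjusting):

$ tcpdump -i ens7 -s 0 -w enlist.pcap host 10.203.0.218 and tcp port 5248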

Changed in maas:
milestone: 2.5.1 → 2.5.2
Revision history for this message
A. Karl Kornel (akkornel) wrote :

I'm closing this bug out because it turns out the problem _was_ actually the MTU.

We had a jumbo-frame MTU (of 9216) set on all of the switch ports, and an MTU of 9000 set on the OS side. What was missing was that a couple of our port-channels did _not_ have the MTU set. The problem port-channel was on one side of an inter-rack link. If the rack controller and the server being provisioned were in the same rack, things would be fine (because all of the regular ports had the MTU set correctly). But if the server being provisioned was in the other rack, the traffic would go through the inter-rack port-channel and be dropped.

(For the record, we are in the process of spinning up a second rack controller, so each rack will have its own.)

So, for anyone who comes across this problem in the future: if you are using any form of link aggregation, make sure the MTU is set on the port-channel interfaces!
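The same kind of non-fragmenting ping shown earlier is a quick way to catch this: run it between hosts that sit in different racks, so the probe has to cross the port-channel (the target below is a placeholder):

$ ping -M do -s 8972 -c 3 <host-in-the-other-rack>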

Changed in maas:
status: Incomplete → Invalid
Revision history for this message
A. Karl Kornel (akkornel) wrote :

(Sorry, I forgot to note, Adam is my co-worker!)

Changed in maas:
milestone: 2.5.2 → none