pxe boot from maas fails due to time out

Bug #1246236 reported by ZhengPeng Hou
58
This bug affects 10 people
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Critical
Unassigned
maas (Ubuntu)
Invalid
Undecided
Unassigned

Bug Description

1 create a new vm node on a 13.10 server, attach the network to a existing openvswitch, use virtio as nic.
2 install 13.10 onto the vm
3 login to the vm via ssh, and apt-get install maas from archive.
4 modify /etc/maas/import_pxe_files to only download pxe files for precise and saucy
5 run maas-import-pxe-files
6 create another node by using virt-manager from my laptop, and make it boot from pxe.
7 run dhcpdum on maas node, we can see dhcp works as expect, and from the console of the pxe boot node, it shows ip address has been assigned, trying to get pxelinux.0 from tftp, after couple of seconds, it fails, due to timeout.

=============================================pserv log(run pserv manually=========================================

ubuntu@maas:~$ cat /tmp/pserv.log
2013-10-30 09:58:07+0800 [-] Log opened.
2013-10-30 09:58:07+0800 [-] twistd 13.0.0 (/usr/bin/python 2.7.5) starting up.
2013-10-30 09:58:07+0800 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2013-10-30 09:58:07+0800 [-] TFTP starting on 69
2013-10-30 09:58:07+0800 [-] Starting protocol <tftp.protocol.TFTP instance at 0x2752200>
2013-10-30 09:58:07+0800 [-] TFTP Listener started at 127.0.0.1:69
2013-10-30 09:58:07+0800 [-] TFTP starting on 69
2013-10-30 09:58:07+0800 [-] Starting protocol <tftp.protocol.TFTP instance at 0x27522d8>
2013-10-30 09:58:07+0800 [-] TFTP Listener started at 192.168.1.5:69
2013-10-30 09:58:07+0800 [-] set uid/gid 109/116
2013-10-30 09:58:45+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:45+0800 [-] RemoteOriginReadSession starting on 42340
2013-10-30 09:58:45+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752878>
2013-10-30 09:58:45+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:45+0800 [-] RemoteOriginReadSession starting on 56340
2013-10-30 09:58:45+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752f38>
2013-10-30 09:58:46+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:46+0800 [-] RemoteOriginReadSession starting on 40332
2013-10-30 09:58:46+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756320>
2013-10-30 09:58:48+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:48+0800 [-] RemoteOriginReadSession starting on 43806
2013-10-30 09:58:48+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x27566c8>
2013-10-30 09:58:51+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:51+0800 [-] RemoteOriginReadSession starting on 46409
2013-10-30 09:58:51+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756b00>
2013-10-30 09:58:51+0800 [-] Timed during option negotiation process
2013-10-30 09:58:51+0800 [-] (UDP Port 42340 Closed)
2013-10-30 09:58:51+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752878>
2013-10-30 09:58:51+0800 [-] Timed during option negotiation process
2013-10-30 09:58:51+0800 [-] (UDP Port 56340 Closed)
2013-10-30 09:58:51+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752f38>
2013-10-30 09:58:52+0800 [-] Timed during option negotiation process
2013-10-30 09:58:52+0800 [-] (UDP Port 40332 Closed)
2013-10-30 09:58:52+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756320>
2013-10-30 09:58:54+0800 [-] Timed during option negotiation process
2013-10-30 09:58:54+0800 [-] (UDP Port 43806 Closed)
2013-10-30 09:58:54+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x27566c8>
2013-10-30 09:58:57+0800 [-] Timed during option negotiation process
2013-10-30 09:58:57+0800 [-] (UDP Port 46409 Closed)
2013-10-30 09:58:57+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756b00>
2013-10-30 12:55:38+0800 [-] Received SIGINT, shutting down.
2013-10-30 12:55:38+0800 [-] (UDP Port 69 Closed)
2013-10-30 12:55:38+0800 [-] Stopping protocol <tftp.protocol.TFTP instance at 0x27522d8>
2013-10-30 12:55:38+0800 [-] (UDP Port 69 Closed)
2013-10-30 12:55:38+0800 [-] Stopping protocol <tftp.protocol.TFTP instance at 0x2752200>
2013-10-30 12:55:38+0800 [-] Main loop terminated.
2013-10-30 12:55:38+0800 [-] Warning: No permission to delete pid file
2013-10-30 12:55:38+0800 [-] Server Shut Down.
=======================================================End of pserv log===============================================

========================================tcpdump ===============================================================
ubuntu@maas:~$ sudo tcpdump -vvnnXS -i eth0 port 69
[sudo] password for ubuntu:
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
17:33:16.950957 IP (tos 0x0, ttl 64, id 1177, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
 0x0000: 4500 0044 0499 0000 4011 f2a1 c0a8 0119 E..D....@.......
 0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
 0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
 0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
 0x0040: 6500 3000 e.0.
17:33:17.329158 IP (tos 0x0, ttl 64, id 1435, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
 0x0000: 4500 0044 059b 0000 4011 f19f c0a8 0119 E..D....@.......
 0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
 0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
 0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
 0x0040: 6500 3000 e.0.
17:33:18.098064 IP (tos 0x0, ttl 64, id 1692, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
 0x0000: 4500 0044 069c 0000 4011 f09e c0a8 0119 E..D....@.......
 0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
 0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
 0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
 0x0040: 6500 3000 e.0.
17:33:19.635949 IP (tos 0x0, ttl 64, id 1966, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
 0x0000: 4500 0044 07ae 0000 4011 ef8c c0a8 0119 E..D....@.......
 0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
 0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
 0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
 0x0040: 6500 3000 e.0.
17:33:22.711767 IP (tos 0x0, ttl 64, id 2241, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
 0x0000: 4500 0044 08c1 0000 4011 ee79 c0a8 0119 E..D....@..y....
 0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
 0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
 0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
 0x0040: 6500 3000
======================================================End of tcpdump===============================================

In the meanwhile, set up another vm, installed tftpd, run tftp <ip> -c get file, the file can be downloaded correctly. Another thing might be helpful is /var/log/maas/maas.log is empty.

Tags: cts landscape
Revision history for this message
ZhengPeng Hou (zhengpeng-hou) wrote :
summary: - pxe boot from maas due to time out
+ pxe boot from maas fails due to time out
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in maas (Ubuntu):
status: New → Confirmed
Revision history for this message
Julian Edwards (julian-edwards) wrote :

I've had a few people report this, and seen it myself on one occasion when I mixed 100Mb/1Gb switches on my network.

We probably just need a bigger timeout in the TFTP server code. Gavin?

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 14.04
Revision history for this message
Gavin Panella (allenap) wrote :

> We probably just need a bigger timeout in the TFTP server code. Gavin?

Let's try it!

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Nobody has reported seeing this for a while. I looked at the code and it has a whole second timeout per packet. You would think that this is more than enough over a LAN segment.

I also discovered that the tftp client can set its own timeouts (http://tools.ietf.org/html/rfc2349). Given that you see it working from one client and not another, I am suspicious although there's nothing in the tcpdump that shows a timeout option getting set.

Are you still seeing this problem?

Changed in maas:
status: Triaged → Incomplete
tags: added: landscape
Revision history for this message
Adam Collard (adam-collard) wrote :

> Are you still seeing this problem?

I'm seeing this problem now. For me it only started occurring after upgrading MAAS from 1.5+bzr1977-0ubuntu3 to 1.5+bzr1977-0ubuntu4. MAAS is no longer able to PXE boot.

To rule out any upgrade/configuration concerns I purged and reinstalled MAAS (wiping out the database, all configuration files, downloaded ephemerals etc. and essentially going again from scratch). This landed me in exactly the same situation.

Changed in maas:
status: Incomplete → Confirmed
Revision history for this message
Adam Collard (adam-collard) wrote :

I was adviced to try rebooting the MAAS server (region and cluster controller on one box) which fixed the problem. But a subsequent reboot brought it back :/

Revision history for this message
Adam Collard (adam-collard) wrote :

So in my case, the problem was PEBKAC and networking. Specifically I had two interfaces on my MAAS server that were connected to the same network, both had equally weighted routes to the gateway. Only one of them was configured in MAAS for the cluster controller. It was (seemingly) by chance that the second NIC was the one that the other nodes were trying to TFTP off of.

Revision history for this message
Graham Binns (gmb) wrote : Re: [Bug 1246236] Re: pxe boot from maas fails due to time out

On 14 March 2014 14:43, Adam Collard <email address hidden> wrote:
> So in my case, the problem was PEBKAC and networking. Specifically I had
> two interfaces on my MAAS server that were connected to the same
> network, both had equally weighted routes to the gateway. Only one of
> them was configured in MAAS for the cluster controller. It was
> (seemingly) by chance that the second NIC was the one that the other
> nodes were trying to TFTP off of.

It'd be lovely if MAAS would report this kind of stuff... "Hey, I can
ping myself... That's not good..."

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Friday 14 Mar 2014 09:37:18 you wrote:
> I was adviced to try rebooting the MAAS server (region and cluster
> controller on one box) which fixed the problem. But a subsequent reboot
> brought it back :/

I am pretty sure there's something wrong with some folks' networking setups
but I'm not sure what it is.

The timeouts on the TFTP server are very generous, it uses a backoff algorithm
and ends up as a few seconds. This should be more than enough for a local
network!

As a data point. I have seen this locally once when I mixed different types of
switches together (100MB and Gigabit) and went away as soon as I moved back to
a single speed network.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Friday 14 Mar 2014 15:06:17 Graham Binns wrote:
> On 14 March 2014 14:43, Adam Collard <email address hidden> wrote:
> > So in my case, the problem was PEBKAC and networking. Specifically I had
> > two interfaces on my MAAS server that were connected to the same
> > network, both had equally weighted routes to the gateway. Only one of
> > them was configured in MAAS for the cluster controller. It was
> > (seemingly) by chance that the second NIC was the one that the other
> > nodes were trying to TFTP off of.
>
> It'd be lovely if MAAS would report this kind of stuff... "Hey, I can
> ping myself... That's not good..."

This should not be a problem any more as the TFTP server should listen on all
interfaces in the next release.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Going incomplete again since there's still no reliable way to reproduce any bug in maas.

Changed in maas:
status: Confirmed → Incomplete
Changed in maas (Ubuntu):
status: Confirmed → Incomplete
Changed in maas:
milestone: 14.04 → none
Revision history for this message
Adam Collard (adam-collard) wrote :

On 17 March 2014 00:39, Julian Edwards <email address hidden> wrote:

> On Friday 14 Mar 2014 09:37:18 you wrote:
> > I was adviced to try rebooting the MAAS server (region and cluster
> > controller on one box) which fixed the problem. But a subsequent reboot
> > brought it back :/
>
> The timeouts on the TFTP server are very generous, it uses a backoff
> algorithm
> and ends up as a few seconds. This should be more than enough for a local
> network!

This may be true for upstream MAAS (actually python-tx-tftp) but in the
packaged version all of the timeouts are 1s - see
https://bugs.launchpad.net/maas/+bug/1155556 so it's not (1, 3, 7) but (1,
1, 1, 1, 1, 1)

The actual values of the timeouts aren't relevant for this bug, since the
issue (at least for me) was faulty networking configuration.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Huh, that timeout must have changed since I last looked at the code. Anyway,
7 seconds should be enough per datagram.

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

I'm seeing this error in the Garage MAAS today. As it happens, also have two interfaces on the network on that machine. Will try reducing to one and reboot.

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Was not able to reproduce after a reboot.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for maas (Ubuntu) because there has been no activity for 60 days.]

Changed in maas (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
Ante Karamatić (ivoks)
tags: added: cts
Changed in maas:
status: Expired → New
Changed in maas (Ubuntu):
status: Expired → New
Changed in maas:
status: New → Incomplete
Christian Reis (kiko)
Changed in maas:
milestone: none → 1.7.1
Revision history for this message
Christian Reis (kiko) wrote :

See also bug 1398159

Revision history for this message
Dave Walker (davewalker) wrote :

Might want to also check it isn't bug 1155556.

Changed in maas (Ubuntu):
status: New → Incomplete
Changed in maas:
milestone: 1.7.1 → 1.7.2
zack (jnrflatwork)
Changed in maas:
assignee: nobody → zack (jnrflatwork)
Christian Reis (kiko)
Changed in maas:
assignee: zack (jnrflatwork) → nobody
Revision history for this message
JJ Asghar (d-jj) wrote :

This effects me :(

Revision history for this message
zack (jnrflatwork) wrote :
Download full text (9.1 KiB)

Um ok
On Feb 2, 2015 2:25 PM, "Christian Reis" <email address hidden> wrote:

> ** Changed in: maas
> Assignee: zack (jnrflatwork) => (unassigned)
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1246236
>
> Title:
> pxe boot from maas fails due to time out
>
> Status in MAAS:
> Incomplete
> Status in maas package in Ubuntu:
> Incomplete
>
> Bug description:
> 1 create a new vm node on a 13.10 server, attach the network to a
> existing openvswitch, use virtio as nic.
> 2 install 13.10 onto the vm
> 3 login to the vm via ssh, and apt-get install maas from archive.
> 4 modify /etc/maas/import_pxe_files to only download pxe files for
> precise and saucy
> 5 run maas-import-pxe-files
> 6 create another node by using virt-manager from my laptop, and make it
> boot from pxe.
> 7 run dhcpdum on maas node, we can see dhcp works as expect, and from
> the console of the pxe boot node, it shows ip address has been assigned,
> trying to get pxelinux.0 from tftp, after couple of seconds, it fails, due
> to timeout.
>
>
> =============================================pserv log(run pserv
> manually=========================================
>
> ubuntu@maas:~$ cat /tmp/pserv.log
> 2013-10-30 09:58:07+0800 [-] Log opened.
> 2013-10-30 09:58:07+0800 [-] twistd 13.0.0 (/usr/bin/python 2.7.5)
> starting up.
> 2013-10-30 09:58:07+0800 [-] reactor class:
> twisted.internet.epollreactor.EPollReactor.
> 2013-10-30 09:58:07+0800 [-] TFTP starting on 69
> 2013-10-30 09:58:07+0800 [-] Starting protocol <tftp.protocol.TFTP
> instance at 0x2752200>
> 2013-10-30 09:58:07+0800 [-] TFTP Listener started at 127.0.0.1:69
> 2013-10-30 09:58:07+0800 [-] TFTP starting on 69
> 2013-10-30 09:58:07+0800 [-] Starting protocol <tftp.protocol.TFTP
> instance at 0x27522d8>
> 2013-10-30 09:58:07+0800 [-] TFTP Listener started at 192.168.1.5:69
> 2013-10-30 09:58:07+0800 [-] set uid/gid 109/116
> 2013-10-30 09:58:45+0800 [TFTP (UDP)] Datagram received from
> ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet,
> options={'blksize': '1432', 'tsize': '0'})>
> 2013-10-30 09:58:45+0800 [-] RemoteOriginReadSession starting on 42340
> 2013-10-30 09:58:45+0800 [-] Starting protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752878>
> 2013-10-30 09:58:45+0800 [TFTP (UDP)] Datagram received from
> ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet,
> options={'blksize': '1432', 'tsize': '0'})>
> 2013-10-30 09:58:45+0800 [-] RemoteOriginReadSession starting on 56340
> 2013-10-30 09:58:45+0800 [-] Starting protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752f38>
> 2013-10-30 09:58:46+0800 [TFTP (UDP)] Datagram received from
> ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet,
> options={'blksize': '1432', 'tsize': '0'})>
> 2013-10-30 09:58:46+0800 [-] RemoteOriginReadSession starting on 40332
> 2013-10-30 09:58:46+0800 [-] Starting protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756320>
> 2013-10-30 09:58:48+0800 [TFTP (UDP)] Datagram received from
> ('192.168.1.25', 1024): <...

Read more...

Revision history for this message
JJ Asghar (d-jj) wrote :

Here's my tcpdump :)

Revision history for this message
Christian Reis (kiko) wrote :

When we start transferring the kernel over, the machine on the other side stops responding:

15:59:23.563014 IP 10.0.0.12.49164 > 10.0.0.1.tftp: 77 RRQ "ubuntu/amd64/generic/trusty/release/boot-kernel" octet tsize 0 blksize 1408
15:59:23.578439 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 29
15:59:23.578751 IP 10.0.0.12.49164 > 10.0.0.1.59570: UDP, length 4
15:59:23.579230 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:23.599678 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:24.579356 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:24.599841 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:25.580228 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:25.600033 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:26.581397 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:26.600222 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:27.582615 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:27.600402 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:28.583782 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:28.600571 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26

It won't respond to pings either, and ARP requests at that point fail:

16:02:58.089006 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5945, seq 1, length 64
16:02:59.088958 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5945, seq 2, length 64
16:03:00.088961 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5945, seq 3, length 64
16:03:03.104829 ARP, Request who-has 10.0.0.12 tell 10.0.0.1, length 28
16:03:04.104824 ARP, Request who-has 10.0.0.12 tell 10.0.0.1, length 28
16:03:04.953050 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5952, seq 1, length 64
16:03:05.104832 ARP, Request who-has 10.0.0.12 tell 10.0.0.1, length 28
16:03:05.952926 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5952, seq 2, length 64

(I'm not sure pxelinux responds to pings, but the ARP request should work I assume)

Revision history for this message
Raphaël Badin (rvb) wrote :

> When we start transferring the kernel over, the machine on the other side stops responding

This is indeed what seems to be causing the problem. Note that we're seeing (a worrying number of) retries in some of the earlier (and successful) transfers.

What kind of bridge is being used here? (virbr0 -the bridge created by virsh-, br0, ?). Can you please paste the characteristics of the bridge (by running something like "cd /sys/class/net/virbr0/bridge; for f in `ls *`; do echo -n "$f "; cat $f; done).

As described on https://bugs.launchpad.net/maas/+bug/1155556, the default timeout behavior might be the problem here. Can you try changing the RemoteOriginReadSession class in /usr/share/pyshared/tftp/bootstrap.py: change the timeout = (1, 1, 1, 1, 1, 1) to timeout = (1, 3, 7). and then restart the TFTP server?

Revision history for this message
Christian Reis (kiko) wrote :

We're not on a bridged network (nor are there VMs) so we may be seeing a different bug, or perhaps both situations trigger the same symptom. I'll try playing with the timeout.

Revision history for this message
Christian Reis (kiko) wrote :

I installed a newer version of pxelinux to see if that is the culprit. It doesn't work, but /some/ traffic between the nodes now happens when transferring the kernel:

11:54:03.181388 IP 10.0.0.12.49164 > 10.0.0.1.tftp: 77 RRQ "ubuntu/amd64/generic/trusty/release/boot-kernel" octet tsize 0 blksize 1408
11:54:03.196192 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 29
11:54:03.196470 IP 10.0.0.12.49164 > 10.0.0.1.56150: UDP, length 4
11:54:03.196888 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 1404
11:54:03.237564 IP 10.0.0.1.52379 > 10.0.0.12.49162: UDP, length 26
11:54:04.197035 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 1404
11:54:04.237737 IP 10.0.0.1.52379 > 10.0.0.12.49162: UDP, length 26
11:54:07.197394 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 1404
11:54:07.237986 IP 10.0.0.1.52379 > 10.0.0.12.49162: UDP, length 26
11:54:12.201806 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 1404
11:54:14.240251 IP 10.0.0.1.52379 > 10.0.0.12.49162: UDP, length 26

and then it stops.

Revision history for this message
Christian Reis (kiko) wrote :

Oh, I'm mistaken -- it's exactly the same as before. I hadn't noticed that the machine had replied once before going offline.

Gavin looked up the ARP issue and it turns out that pxelinux doesn't respond to ARP when in the keyboard read loop and when in the tftp send/recv loop, so that's why there's no ARP response. But I wonder, could the issue here be that the ARP cache is being blown away for some reason? While the transfer hung, I looked at the arp cache and it had an <incomplete> entry.

Revision history for this message
Christian Reis (kiko) wrote :

Static ARP did not help.

Setting the MTU to 1200 caused pxelinux.0 to fail transferring with similar, but different behaviour:

12:12:50.699971 IP 10.0.0.12.2071 > 10.0.0.1.tftp: 32 RRQ "pxelinux.0" octet blksize 1456
12:12:50.773393 IP 10.0.0.1.59263 > 10.0.0.12.2071: UDP, length 15
12:12:50.773775 IP 10.0.0.12.2071 > 10.0.0.1.59263: UDP, length 4
12:12:50.774211 IP 10.0.0.1.59263 > 10.0.0.12.2071: UDP, length 1404
12:12:50.774226 IP 10.0.0.1 > 10.0.0.12: udp
12:12:51.774349 IP 10.0.0.1.59263 > 10.0.0.12.2071: UDP, length 1404
12:12:51.774387 IP 10.0.0.1 > 10.0.0.12: udp
12:12:54.775554 IP 10.0.0.1.59263 > 10.0.0.12.2071: UDP, length 1404
12:12:54.775592 IP 10.0.0.1 > 10.0.0.12: udp

I'm curious as to why the blksize is set at 1456 by pxelinux.

Revision history for this message
Christian Reis (kiko) wrote :

Okay, we did a lot of debugging and tracing attempting to find the problem. In the end, it was hardware.

I upgraded manually to a newer version of syslinux and pxelinux which /did/ appear to be working, albeit VERY slowly. It seems that the 6.x syslinux network stack is more tolerant of errors, as it seemed to keep limping along.

I tested tftp from a MacOS and Ubuntu 14.04 host on the network, and it was able to download the image fine.

It turns out that the issue was actually a bad USB NIC that was generating a high amount of RX errors; the ifconfig output showed:

           RX packets:1380 errors:2758 dropped:1379 overruns:0 frame:1379

We had swapped out the switch, cable and tried other client nodes before realizing it.

Moral of the story is that pxelinux's network stack tends to be more sensitive to networking problems, and the symptom is that tftp fails or works very slowly. I'll add that to my collection of release notes.

Changed in maas:
milestone: 1.7.2 → 1.7.3
Gema Gomez (gema)
Changed in maas:
status: Incomplete → Confirmed
Revision history for this message
Mike Pontillo (mpontillo) wrote :

I hit this problem today, and it turned out to be a non-MAAS bug.

I was trying to configure a Trusty host with VMware Workstation *and* LXCs sharing the same network.

It turns out that the combination of VMware plus Linux bridge interfaces have a bug: if I set up the VM network to be bridged to br0, I can ping hosts on the local network, and I can ping the Linux hypervisor, and I can ping any of the LXCs. I can even DHCP from the LXCs. But I *can't* establish any TCP connections to the LXCs from the VMware guest.

My current theory is that somehow the bridge interface is causing packets sentto the LXC from the VMware guest to be discarded at a low level, rather than sent up the stack. I couldn't find any IP Because when I configure VMware to bridge to a physical interface rather than a virtual bridge, it works. Until I disconnect the physical interface.

So the bottom line for me was:
(1) Have VMware bridge to the physical interface (in my case, it was eth0.100, a VLAN interface) if you want to use bridged mode
(2) Use a host-only network (without DHCP, so MAAS can do its thing!) if you want to develop in isolation, without sharing the network with the local LAN
 - Do NOT bridge VMware to a bridge created with 'brctl' or similar!

I added the VMware bridge interface (vmnet1) to the br0 interface I'm using for my MAAS LXC, which may get me the best of both. I'll continue to test this configuration.

Revision history for this message
Christian Reis (kiko) wrote :

For reference, this is the libvirt page which covers bridge timeouts affecting PXE booting: http://wiki.libvirt.org/page/PXE_boot_%28or_dhcp%29_on_guest_failed#Long_Forward_Delay_Time_on_Bridge

Changed in maas (Ubuntu):
status: Incomplete → Invalid
Changed in maas:
status: Confirmed → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.