MAAS

pxe boot from maas fails due to time out

Bug #1246236 reported by ZhengPeng Hou on 2013-10-30

This bug affects 10 people

Affects		Status	Importance	Assigned to	Milestone
	MAAS	Invalid	Critical	Unassigned	MAAS 1.7.3
	maas (Ubuntu)	Invalid	Undecided	Unassigned

Bug Description

1 create a new vm node on a 13.10 server, attach the network to a existing openvswitch, use virtio as nic.
2 install 13.10 onto the vm
3 login to the vm via ssh, and apt-get install maas from archive.
4 modify /etc/maas/import_pxe_files to only download pxe files for precise and saucy
5 run maas-import-pxe-files
6 create another node by using virt-manager from my laptop, and make it boot from pxe.
7 run dhcpdum on maas node, we can see dhcp works as expect, and from the console of the pxe boot node, it shows ip address has been assigned, trying to get pxelinux.0 from tftp, after couple of seconds, it fails, due to timeout.

=============================================pserv log(run pserv manually=========================================

ubuntu@maas:~$ cat /tmp/pserv.log
2013-10-30 09:58:07+0800 [-] Log opened.
2013-10-30 09:58:07+0800 [-] twistd 13.0.0 (/usr/bin/python 2.7.5) starting up.
2013-10-30 09:58:07+0800 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2013-10-30 09:58:07+0800 [-] TFTP starting on 69
2013-10-30 09:58:07+0800 [-] Starting protocol <tftp.protocol.TFTP instance at 0x2752200>
2013-10-30 09:58:07+0800 [-] TFTP Listener started at 127.0.0.1:69
2013-10-30 09:58:07+0800 [-] TFTP starting on 69
2013-10-30 09:58:07+0800 [-] Starting protocol <tftp.protocol.TFTP instance at 0x27522d8>
2013-10-30 09:58:07+0800 [-] TFTP Listener started at 192.168.1.5:69
2013-10-30 09:58:07+0800 [-] set uid/gid 109/116
2013-10-30 09:58:45+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:45+0800 [-] RemoteOriginReadSession starting on 42340
2013-10-30 09:58:45+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752878>
2013-10-30 09:58:45+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:45+0800 [-] RemoteOriginReadSession starting on 56340
2013-10-30 09:58:45+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752f38>
2013-10-30 09:58:46+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:46+0800 [-] RemoteOriginReadSession starting on 40332
2013-10-30 09:58:46+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756320>
2013-10-30 09:58:48+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:48+0800 [-] RemoteOriginReadSession starting on 43806
2013-10-30 09:58:48+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x27566c8>
2013-10-30 09:58:51+0800 [TFTP (UDP)] Datagram received from ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>
2013-10-30 09:58:51+0800 [-] RemoteOriginReadSession starting on 46409
2013-10-30 09:58:51+0800 [-] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756b00>
2013-10-30 09:58:51+0800 [-] Timed during option negotiation process
2013-10-30 09:58:51+0800 [-] (UDP Port 42340 Closed)
2013-10-30 09:58:51+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752878>
2013-10-30 09:58:51+0800 [-] Timed during option negotiation process
2013-10-30 09:58:51+0800 [-] (UDP Port 56340 Closed)
2013-10-30 09:58:51+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752f38>
2013-10-30 09:58:52+0800 [-] Timed during option negotiation process
2013-10-30 09:58:52+0800 [-] (UDP Port 40332 Closed)
2013-10-30 09:58:52+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756320>
2013-10-30 09:58:54+0800 [-] Timed during option negotiation process
2013-10-30 09:58:54+0800 [-] (UDP Port 43806 Closed)
2013-10-30 09:58:54+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x27566c8>
2013-10-30 09:58:57+0800 [-] Timed during option negotiation process
2013-10-30 09:58:57+0800 [-] (UDP Port 46409 Closed)
2013-10-30 09:58:57+0800 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756b00>
2013-10-30 12:55:38+0800 [-] Received SIGINT, shutting down.
2013-10-30 12:55:38+0800 [-] (UDP Port 69 Closed)
2013-10-30 12:55:38+0800 [-] Stopping protocol <tftp.protocol.TFTP instance at 0x27522d8>
2013-10-30 12:55:38+0800 [-] (UDP Port 69 Closed)
2013-10-30 12:55:38+0800 [-] Stopping protocol <tftp.protocol.TFTP instance at 0x2752200>
2013-10-30 12:55:38+0800 [-] Main loop terminated.
2013-10-30 12:55:38+0800 [-] Warning: No permission to delete pid file
2013-10-30 12:55:38+0800 [-] Server Shut Down.
=======================================================End of pserv log===============================================

========================================tcpdump ===============================================================
ubuntu@maas:~$ sudo tcpdump -vvnnXS -i eth0 port 69
[sudo] password for ubuntu:
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
17:33:16.950957 IP (tos 0x0, ttl 64, id 1177, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
0x0000: 4500 0044 0499 0000 4011 f2a1 c0a8 0119 E..D....@.......
0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
0x0040: 6500 3000 e.0.
17:33:17.329158 IP (tos 0x0, ttl 64, id 1435, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
0x0000: 4500 0044 059b 0000 4011 f19f c0a8 0119 E..D....@.......
0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
0x0040: 6500 3000 e.0.
17:33:18.098064 IP (tos 0x0, ttl 64, id 1692, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
0x0000: 4500 0044 069c 0000 4011 f09e c0a8 0119 E..D....@.......
0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
0x0040: 6500 3000 e.0.
17:33:19.635949 IP (tos 0x0, ttl 64, id 1966, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
0x0000: 4500 0044 07ae 0000 4011 ef8c c0a8 0119 E..D....@.......
0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
0x0040: 6500 3000 e.0.
17:33:22.711767 IP (tos 0x0, ttl 64, id 2241, offset 0, flags [none], proto UDP (17), length 68)
    192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok] 40 RRQ "pxelinux.0" octet blksize 1432 tsize 0
0x0000: 4500 0044 08c1 0000 4011 ee79 c0a8 0119 E..D....@..y....
0x0010: c0a8 0105 0400 0045 0030 9796 0001 7078 .......E.0....px
0x0020: 656c 696e 7578 2e30 006f 6374 6574 0062 elinux.0.octet.b
0x0030: 6c6b 7369 7a65 0031 3433 3200 7473 697a lksize.1432.tsiz
0x0040: 6500 3000
======================================================End of tcpdump===============================================

In the meanwhile, set up another vm, installed tftpd, run tftp <ip> -c get file, the file can be downloaded correctly. Another thing might be helpful is /var/log/maas/maas.log is empty.

Tags:

Revision history for this message

ZhengPeng Hou (zhengpeng-hou) wrote on 2013-10-30:

pserv.log from /var/log/maas/ Edit (35.2 KiB, text/plain)

summary:

- pxe boot from maas due to time out
+ pxe boot from maas fails due to time out

Revision history for this message

Launchpad Janitor (janitor) wrote on 2013-10-30:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in maas (Ubuntu):
status:	New → Confirmed

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2013-11-01:

I've had a few people report this, and seen it myself on one occasion when I mixed 100Mb/1Gb switches on my network.

We probably just need a bigger timeout in the TFTP server code. Gavin?

Changed in maas:
status:	New → Triaged
importance:	Undecided → Critical
milestone:	none → 14.04

Revision history for this message

Gavin Panella (allenap) wrote on 2013-11-13:

> We probably just need a bigger timeout in the TFTP server code. Gavin?

Let's try it!

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2014-02-10:

Nobody has reported seeing this for a while. I looked at the code and it has a whole second timeout per packet. You would think that this is more than enough over a LAN segment.

I also discovered that the tftp client can set its own timeouts (http://tools.ietf.org/html/rfc2349). Given that you see it working from one client and not another, I am suspicious although there's nothing in the tcpdump that shows a timeout option getting set.

Are you still seeing this problem?

Changed in maas:
status:	Triaged → Incomplete

Adam Collard (adam-collard) on 2014-03-14

tags:

added: landscape

Revision history for this message

Adam Collard (adam-collard) wrote on 2014-03-14:

> Are you still seeing this problem?

I'm seeing this problem now. For me it only started occurring after upgrading MAAS from 1.5+bzr1977-0ubuntu3 to 1.5+bzr1977-0ubuntu4. MAAS is no longer able to PXE boot.

To rule out any upgrade/configuration concerns I purged and reinstalled MAAS (wiping out the database, all configuration files, downloaded ephemerals etc. and essentially going again from scratch). This landed me in exactly the same situation.

Changed in maas:
status:	Incomplete → Confirmed

Revision history for this message

Adam Collard (adam-collard) wrote on 2014-03-14:

I was adviced to try rebooting the MAAS server (region and cluster controller on one box) which fixed the problem. But a subsequent reboot brought it back :/

Revision history for this message

Adam Collard (adam-collard) wrote on 2014-03-14:

So in my case, the problem was PEBKAC and networking. Specifically I had two interfaces on my MAAS server that were connected to the same network, both had equally weighted routes to the gateway. Only one of them was configured in MAAS for the cluster controller. It was (seemingly) by chance that the second NIC was the one that the other nodes were trying to TFTP off of.

Revision history for this message

Graham Binns (gmb) wrote on 2014-03-14: Re: [Bug 1246236] Re: pxe boot from maas fails due to time out

On 14 March 2014 14:43, Adam Collard <email address hidden> wrote:
> So in my case, the problem was PEBKAC and networking. Specifically I had
> two interfaces on my MAAS server that were connected to the same
> network, both had equally weighted routes to the gateway. Only one of
> them was configured in MAAS for the cluster controller. It was
> (seemingly) by chance that the second NIC was the one that the other
> nodes were trying to TFTP off of.

It'd be lovely if MAAS would report this kind of stuff... "Hey, I can
ping myself... That's not good..."

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2014-03-16:

#10

On Friday 14 Mar 2014 09:37:18 you wrote:
> I was adviced to try rebooting the MAAS server (region and cluster
> controller on one box) which fixed the problem. But a subsequent reboot
> brought it back :/

I am pretty sure there's something wrong with some folks' networking setups
but I'm not sure what it is.

The timeouts on the TFTP server are very generous, it uses a backoff algorithm
and ends up as a few seconds. This should be more than enough for a local
network!

As a data point. I have seen this locally once when I mixed different types of
switches together (100MB and Gigabit) and went away as soon as I moved back to
a single speed network.

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2014-03-16:

#11

On Friday 14 Mar 2014 15:06:17 Graham Binns wrote:
> On 14 March 2014 14:43, Adam Collard <email address hidden> wrote:
> > So in my case, the problem was PEBKAC and networking. Specifically I had
> > two interfaces on my MAAS server that were connected to the same
> > network, both had equally weighted routes to the gateway. Only one of
> > them was configured in MAAS for the cluster controller. It was
> > (seemingly) by chance that the second NIC was the one that the other
> > nodes were trying to TFTP off of.
>
> It'd be lovely if MAAS would report this kind of stuff... "Hey, I can
> ping myself... That's not good..."

This should not be a problem any more as the TFTP server should listen on all
interfaces in the next release.

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2014-03-17:

#12

Going incomplete again since there's still no reliable way to reproduce any bug in maas.

Changed in maas:
status:	Confirmed → Incomplete
Changed in maas (Ubuntu):
status:	Confirmed → Incomplete
Changed in maas:
milestone:	14.04 → none

Revision history for this message

Adam Collard (adam-collard) wrote on 2014-03-17:

#13

On 17 March 2014 00:39, Julian Edwards <email address hidden> wrote:

> On Friday 14 Mar 2014 09:37:18 you wrote:
> > I was adviced to try rebooting the MAAS server (region and cluster
> > controller on one box) which fixed the problem. But a subsequent reboot
> > brought it back :/
>
> The timeouts on the TFTP server are very generous, it uses a backoff
> algorithm
> and ends up as a few seconds. This should be more than enough for a local
> network!

This may be true for upstream MAAS (actually python-tx-tftp) but in the
packaged version all of the timeouts are 1s - see
https://bugs.launchpad.net/maas/+bug/1155556 so it's not (1, 3, 7) but (1,
1, 1, 1, 1, 1)

The actual values of the timeouts aren't relevant for this bug, since the
issue (at least for me) was faulty networking configuration.

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2014-03-17:

#14

Huh, that timeout must have changed since I last looked at the code. Anyway,
7 seconds should be enough per datagram.

Revision history for this message

Mark Shuttleworth (sabdfl) wrote on 2014-04-04:

#15

I'm seeing this error in the Garage MAAS today. As it happens, also have two interfaces on the network on that machine. Will try reducing to one and reboot.

Revision history for this message

Mark Shuttleworth (sabdfl) wrote on 2014-04-04:

#16

Was not able to reproduce after a reboot.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2014-06-04:

#17

[Expired for maas (Ubuntu) because there has been no activity for 60 days.]

Changed in maas (Ubuntu):
status:	Incomplete → Expired

Revision history for this message

Launchpad Janitor (janitor) wrote on 2014-06-04:

#18

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status:	Incomplete → Expired

Ante Karamatić (ivoks) on 2014-10-17

tags:

added: cts

Janghoon-Paul Sim (janghoon) on 2014-10-17

Changed in maas:
status:	Expired → New
Changed in maas (Ubuntu):
status:	Expired → New

Julian Edwards (julian-edwards) on 2014-10-19

Changed in maas:
status:	New → Incomplete

Christian Reis (kiko) on 2014-11-12

Changed in maas:
milestone:	none → 1.7.1

Revision history for this message

Christian Reis (kiko) wrote on 2014-12-02:

#19

Um ok
On Feb 2, 2015 2:25 PM, "Christian Reis" <kiko@async.com.br> wrote:

> ** Changed in: maas
>      Assignee: zack (jnrflatwork) => (unassigned)
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1246236
>
> Title:
>   pxe boot from maas fails due to time out
>
> Status in MAAS:
>   Incomplete
> Status in maas package in Ubuntu:
>   Incomplete
>
> Bug description:
>   1 create a new vm node on a 13.10 server, attach the network to a
> existing openvswitch, use virtio as nic.
>   2 install 13.10 onto the vm
>   3 login to the vm via ssh, and apt-get install maas from archive.
>   4 modify /etc/maas/import_pxe_files to only download pxe files for
> precise and saucy
>   5 run maas-import-pxe-files
>   6 create another node by using virt-manager from my laptop, and make it
> boot from pxe.
>   7 run dhcpdum on maas node, we can see dhcp works as expect, and from
> the console of the pxe boot node, it shows ip address has been assigned,
> trying to get pxelinux.0 from tftp, after couple of seconds, it fails, due
> to timeout.
>
>
>   =============================================pserv log(run pserv
> manually=========================================
>
>   ubuntu@maas:~$ cat /tmp/pserv.log
>   2013-10-30 09:58:07+0800 [-] Log opened.
>   2013-10-30 09:58:07+0800 [-] twistd 13.0.0 (/usr/bin/python 2.7.5)
> starting up.
>   2013-10-30 09:58:07+0800 [-] reactor class:
> twisted.internet.epollreactor.EPollReactor.
>   2013-10-30 09:58:07+0800 [-] TFTP starting on 69
>   2013-10-30 09:58:07+0800 [-] Starting protocol <tftp.protocol.TFTP
> instance at 0x2752200>
>   2013-10-30 09:58:07+0800 [-] TFTP Listener started at 127.0.0.1:69
>   2013-10-30 09:58:07+0800 [-] TFTP starting on 69
>   2013-10-30 09:58:07+0800 [-] Starting protocol <tftp.protocol.TFTP
> instance at 0x27522d8>
>   2013-10-30 09:58:07+0800 [-] TFTP Listener started at 192.168.1.5:69
>   2013-10-30 09:58:07+0800 [-] set uid/gid 109/116
>   2013-10-30 09:58:45+0800 [TFTP (UDP)] Datagram received from
> ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet,
> options={'blksize': '1432', 'tsize': '0'})>
>   2013-10-30 09:58:45+0800 [-] RemoteOriginReadSession starting on 42340
>   2013-10-30 09:58:45+0800 [-] Starting protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752878>
>   2013-10-30 09:58:45+0800 [TFTP (UDP)] Datagram received from
> ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet,
> options={'blksize': '1432', 'tsize': '0'})>
>   2013-10-30 09:58:45+0800 [-] RemoteOriginReadSession starting on 56340
>   2013-10-30 09:58:45+0800 [-] Starting protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752f38>
>   2013-10-30 09:58:46+0800 [TFTP (UDP)] Datagram received from
> ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet,
> options={'blksize': '1432', 'tsize': '0'})>
>   2013-10-30 09:58:46+0800 [-] RemoteOriginReadSession starting on 40332
>   2013-10-30 09:58:46+0800 [-] Starting protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756320>
>   2013-10-30 09:58:48+0800 [TFTP (UDP)] Datagram received from
> ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet,
> options={'blksize': '1432', 'tsize': '0'})>
>   2013-10-30 09:58:48+0800 [-] RemoteOriginReadSession starting on 43806
>   2013-10-30 09:58:48+0800 [-] Starting protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x27566c8>
>   2013-10-30 09:58:51+0800 [TFTP (UDP)] Datagram received from
> ('192.168.1.25', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet,
> options={'blksize': '1432', 'tsize': '0'})>
>   2013-10-30 09:58:51+0800 [-] RemoteOriginReadSession starting on 46409
>   2013-10-30 09:58:51+0800 [-] Starting protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756b00>
>   2013-10-30 09:58:51+0800 [-] Timed during option negotiation process
>   2013-10-30 09:58:51+0800 [-] (UDP Port 42340 Closed)
>   2013-10-30 09:58:51+0800 [-] Stopping protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752878>
>   2013-10-30 09:58:51+0800 [-] Timed during option negotiation process
>   2013-10-30 09:58:51+0800 [-] (UDP Port 56340 Closed)
>   2013-10-30 09:58:51+0800 [-] Stopping protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2752f38>
>   2013-10-30 09:58:52+0800 [-] Timed during option negotiation process
>   2013-10-30 09:58:52+0800 [-] (UDP Port 40332 Closed)
>   2013-10-30 09:58:52+0800 [-] Stopping protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756320>
>   2013-10-30 09:58:54+0800 [-] Timed during option negotiation process
>   2013-10-30 09:58:54+0800 [-] (UDP Port 43806 Closed)
>   2013-10-30 09:58:54+0800 [-] Stopping protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x27566c8>
>   2013-10-30 09:58:57+0800 [-] Timed during option negotiation process
>   2013-10-30 09:58:57+0800 [-] (UDP Port 46409 Closed)
>   2013-10-30 09:58:57+0800 [-] Stopping protocol
> <tftp.bootstrap.RemoteOriginReadSession instance at 0x2756b00>
>   2013-10-30 12:55:38+0800 [-] Received SIGINT, shutting down.
>   2013-10-30 12:55:38+0800 [-] (UDP Port 69 Closed)
>   2013-10-30 12:55:38+0800 [-] Stopping protocol <tftp.protocol.TFTP
> instance at 0x27522d8>
>   2013-10-30 12:55:38+0800 [-] (UDP Port 69 Closed)
>   2013-10-30 12:55:38+0800 [-] Stopping protocol <tftp.protocol.TFTP
> instance at 0x2752200>
>   2013-10-30 12:55:38+0800 [-] Main loop terminated.
>   2013-10-30 12:55:38+0800 [-] Warning: No permission to delete pid file
>   2013-10-30 12:55:38+0800 [-] Server Shut Down.
>   =======================================================End of pserv
> log===============================================
>
>   ========================================tcpdump
> ===============================================================
>   ubuntu@maas:~$ sudo tcpdump -vvnnXS -i eth0 port 69
>   [sudo] password for ubuntu:
>   tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size
> 65535 bytes
>   17:33:16.950957 IP (tos 0x0, ttl 64, id 1177, offset 0, flags [none],
> proto UDP (17), length 68)
>       192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok]  40 RRQ
> "pxelinux.0" octet blksize 1432 tsize 0
>         0x0000:  4500 0044 0499 0000 4011 f2a1 c0a8 0119  E..D....@.......
>         0x0010:  c0a8 0105 0400 0045 0030 9796 0001 7078  .......E.0....px
>         0x0020:  656c 696e 7578 2e30 006f 6374 6574 0062  elinux.0.octet.b
>         0x0030:  6c6b 7369 7a65 0031 3433 3200 7473 697a  lksize.1432.tsiz
>         0x0040:  6500 3000                                e.0.
>   17:33:17.329158 IP (tos 0x0, ttl 64, id 1435, offset 0, flags [none],
> proto UDP (17), length 68)
>       192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok]  40 RRQ
> "pxelinux.0" octet blksize 1432 tsize 0
>         0x0000:  4500 0044 059b 0000 4011 f19f c0a8 0119  E..D....@.......
>         0x0010:  c0a8 0105 0400 0045 0030 9796 0001 7078  .......E.0....px
>         0x0020:  656c 696e 7578 2e30 006f 6374 6574 0062  elinux.0.octet.b
>         0x0030:  6c6b 7369 7a65 0031 3433 3200 7473 697a  lksize.1432.tsiz
>         0x0040:  6500 3000                                e.0.
>   17:33:18.098064 IP (tos 0x0, ttl 64, id 1692, offset 0, flags [none],
> proto UDP (17), length 68)
>       192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok]  40 RRQ
> "pxelinux.0" octet blksize 1432 tsize 0
>         0x0000:  4500 0044 069c 0000 4011 f09e c0a8 0119  E..D....@.......
>         0x0010:  c0a8 0105 0400 0045 0030 9796 0001 7078  .......E.0....px
>         0x0020:  656c 696e 7578 2e30 006f 6374 6574 0062  elinux.0.octet.b
>         0x0030:  6c6b 7369 7a65 0031 3433 3200 7473 697a  lksize.1432.tsiz
>         0x0040:  6500 3000                                e.0.
>   17:33:19.635949 IP (tos 0x0, ttl 64, id 1966, offset 0, flags [none],
> proto UDP (17), length 68)
>       192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok]  40 RRQ
> "pxelinux.0" octet blksize 1432 tsize 0
>         0x0000:  4500 0044 07ae 0000 4011 ef8c c0a8 0119  E..D....@.......
>         0x0010:  c0a8 0105 0400 0045 0030 9796 0001 7078  .......E.0....px
>         0x0020:  656c 696e 7578 2e30 006f 6374 6574 0062  elinux.0.octet.b
>         0x0030:  6c6b 7369 7a65 0031 3433 3200 7473 697a  lksize.1432.tsiz
>         0x0040:  6500 3000                                e.0.
>   17:33:22.711767 IP (tos 0x0, ttl 64, id 2241, offset 0, flags [none],
> proto UDP (17), length 68)
>       192.168.1.25.1024 > 192.168.1.5.69: [udp sum ok]  40 RRQ
> "pxelinux.0" octet blksize 1432 tsize 0
>         0x0000:  4500 0044 08c1 0000 4011 ee79 c0a8 0119  E..D....@..y....
>         0x0010:  c0a8 0105 0400 0045 0030 9796 0001 7078  .......E.0....px
>         0x0020:  656c 696e 7578 2e30 006f 6374 6574 0062  elinux.0.octet.b
>         0x0030:  6c6b 7369 7a65 0031 3433 3200 7473 697a  lksize.1432.tsiz
>         0x0040:  6500 3000
>   ======================================================End of
> tcpdump===============================================
>
>   In the meanwhile, set up another vm, installed tftpd, run tftp <ip> -c
>   get file, the file can be downloaded correctly.  Another thing might
>   be helpful is /var/log/maas/maas.log is empty.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1246236/+subscriptions
>

Revision history for this message

JJ Asghar (d-jj) wrote on 2015-02-02:

#23

tftp.cap Edit (2.5 KiB, application/cap)

Here's my tcpdump :)

Revision history for this message

Christian Reis (kiko) wrote on 2015-02-02:

#24

Capture of all traffic to the MAC of the machine trying to boot Edit (56.7 KiB, application/cap)

When we start transferring the kernel over, the machine on the other side stops responding:

15:59:23.563014 IP 10.0.0.12.49164 > 10.0.0.1.tftp: 77 RRQ "ubuntu/amd64/generic/trusty/release/boot-kernel" octet tsize 0 blksize 1408
15:59:23.578439 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 29
15:59:23.578751 IP 10.0.0.12.49164 > 10.0.0.1.59570: UDP, length 4
15:59:23.579230 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:23.599678 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:24.579356 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:24.599841 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:25.580228 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:25.600033 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:26.581397 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:26.600222 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:27.582615 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:27.600402 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:28.583782 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:28.600571 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26

It won't respond to pings either, and ARP requests at that point fail:

16:02:58.089006 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5945, seq 1, length 64
16:02:59.088958 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5945, seq 2, length 64
16:03:00.088961 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5945, seq 3, length 64
16:03:03.104829 ARP, Request who-has 10.0.0.12 tell 10.0.0.1, length 28
16:03:04.104824 ARP, Request who-has 10.0.0.12 tell 10.0.0.1, length 28
16:03:04.953050 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5952, seq 1, length 64
16:03:05.104832 ARP, Request who-has 10.0.0.12 tell 10.0.0.1, length 28
16:03:05.952926 IP 10.0.0.1 > 10.0.0.12: ICMP echo request, id 5952, seq 2, length 64

(I'm not sure pxelinux responds to pings, but the ARP request should work I assume)

When we start transferring the kernel over, the machine on the other side stops responding:

15:59:23.563014 IP 10.0.0.12.49164 > 10.0.0.1.tftp:  77 RRQ "ubuntu/amd64/generic/trusty/release/boot-kernel" octet tsize 0 blksize 1408 
15:59:23.578439 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 29
15:59:23.578751 IP 10.0.0.12.49164 > 10.0.0.1.59570: UDP, length 4
15:59:23.579230 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:23.599678 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:24.579356 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:24.599841 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:25.580228 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:25.600033 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:26.581397 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:26.600222 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:27.582615 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:27.600402 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26
15:59:28.583782 IP 10.0.0.1.59570 > 10.0.0.12.49164: UDP, length 1404
15:59:28.600571 IP 10.0.0.1.44700 > 10.0.0.12.49162: UDP, length 26

It won't respond to pings either, and ARP requests at that point fail:

(I'm not sure pxelinux responds to pings, but the ARP request should work I assume)

Revision history for this message

Raphaël Badin (rvb) wrote on 2015-02-03:

#25

> When we start transferring the kernel over, the machine on the other side stops responding

This is indeed what seems to be causing the problem. Note that we're seeing (a worrying number of) retries in some of the earlier (and successful) transfers.

What kind of bridge is being used here? (virbr0 -the bridge created by virsh-, br0, ?). Can you please paste the characteristics of the bridge (by running something like "cd /sys/class/net/virbr0/bridge; for f in `ls *`; do echo -n "$f "; cat $f; done).

As described on https://bugs.launchpad.net/maas/+bug/1155556, the default timeout behavior might be the problem here. Can you try changing the RemoteOriginReadSession class in /usr/share/pyshared/tftp/bootstrap.py: change the timeout = (1, 1, 1, 1, 1, 1) to timeout = (1, 3, 7). and then restart the TFTP server?

Revision history for this message

Christian Reis (kiko) wrote on 2015-02-03:

#26

We're not on a bridged network (nor are there VMs) so we may be seeing a different bug, or perhaps both situations trigger the same symptom. I'll try playing with the timeout.

Revision history for this message

Christian Reis (kiko) wrote on 2015-02-03:

#27

I installed a newer version of pxelinux to see if that is the culprit. It doesn't work, but /some/ traffic between the nodes now happens when transferring the kernel:

11:54:03.181388 IP 10.0.0.12.49164 > 10.0.0.1.tftp: 77 RRQ "ubuntu/amd64/generic/trusty/release/boot-kernel" octet tsize 0 blksize 1408
11:54:03.196192 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 29
11:54:03.196470 IP 10.0.0.12.49164 > 10.0.0.1.56150: UDP, length 4
11:54:03.196888 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 1404
11:54:03.237564 IP 10.0.0.1.52379 > 10.0.0.12.49162: UDP, length 26
11:54:04.197035 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 1404
11:54:04.237737 IP 10.0.0.1.52379 > 10.0.0.12.49162: UDP, length 26
11:54:07.197394 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 1404
11:54:07.237986 IP 10.0.0.1.52379 > 10.0.0.12.49162: UDP, length 26
11:54:12.201806 IP 10.0.0.1.56150 > 10.0.0.12.49164: UDP, length 1404
11:54:14.240251 IP 10.0.0.1.52379 > 10.0.0.12.49162: UDP, length 26

and then it stops.

Revision history for this message

Christian Reis (kiko) wrote on 2015-02-03:

#28

Oh, I'm mistaken -- it's exactly the same as before. I hadn't noticed that the machine had replied once before going offline.

Gavin looked up the ARP issue and it turns out that pxelinux doesn't respond to ARP when in the keyboard read loop and when in the tftp send/recv loop, so that's why there's no ARP response. But I wonder, could the issue here be that the ARP cache is being blown away for some reason? While the transfer hung, I looked at the arp cache and it had an <incomplete> entry.

Revision history for this message

Christian Reis (kiko) wrote on 2015-02-03:

#29

Static ARP did not help.

Setting the MTU to 1200 caused pxelinux.0 to fail transferring with similar, but different behaviour:

12:12:50.699971 IP 10.0.0.12.2071 > 10.0.0.1.tftp: 32 RRQ "pxelinux.0" octet blksize 1456
12:12:50.773393 IP 10.0.0.1.59263 > 10.0.0.12.2071: UDP, length 15
12:12:50.773775 IP 10.0.0.12.2071 > 10.0.0.1.59263: UDP, length 4
12:12:50.774211 IP 10.0.0.1.59263 > 10.0.0.12.2071: UDP, length 1404
12:12:50.774226 IP 10.0.0.1 > 10.0.0.12: udp
12:12:51.774349 IP 10.0.0.1.59263 > 10.0.0.12.2071: UDP, length 1404
12:12:51.774387 IP 10.0.0.1 > 10.0.0.12: udp
12:12:54.775554 IP 10.0.0.1.59263 > 10.0.0.12.2071: UDP, length 1404
12:12:54.775592 IP 10.0.0.1 > 10.0.0.12: udp

I'm curious as to why the blksize is set at 1456 by pxelinux.

Revision history for this message

Christian Reis (kiko) wrote on 2015-02-03:

#30

Okay, we did a lot of debugging and tracing attempting to find the problem. In the end, it was hardware.

I upgraded manually to a newer version of syslinux and pxelinux which /did/ appear to be working, albeit VERY slowly. It seems that the 6.x syslinux network stack is more tolerant of errors, as it seemed to keep limping along.

I tested tftp from a MacOS and Ubuntu 14.04 host on the network, and it was able to download the image fine.

It turns out that the issue was actually a bad USB NIC that was generating a high amount of RX errors; the ifconfig output showed:

RX packets:1380 errors:2758 dropped:1379 overruns:0 frame:1379

We had swapped out the switch, cable and tried other client nodes before realizing it.

Moral of the story is that pxelinux's network stack tends to be more sensitive to networking problems, and the symptom is that tftp fails or works very slowly. I'll add that to my collection of release notes.

Andres Rodriguez (andreserl) on 2015-03-03

Changed in maas:
milestone:	1.7.2 → 1.7.3

Gema Gomez (gema) on 2015-03-26

Changed in maas:
status:	Incomplete → Confirmed

Revision history for this message

Mike Pontillo (mpontillo) wrote on 2015-04-08:

#31

I hit this problem today, and it turned out to be a non-MAAS bug.

I was trying to configure a Trusty host with VMware Workstation *and* LXCs sharing the same network.

It turns out that the combination of VMware plus Linux bridge interfaces have a bug: if I set up the VM network to be bridged to br0, I can ping hosts on the local network, and I can ping the Linux hypervisor, and I can ping any of the LXCs. I can even DHCP from the LXCs. But I *can't* establish any TCP connections to the LXCs from the VMware guest.

My current theory is that somehow the bridge interface is causing packets sentto the LXC from the VMware guest to be discarded at a low level, rather than sent up the stack. I couldn't find any IP Because when I configure VMware to bridge to a physical interface rather than a virtual bridge, it works. Until I disconnect the physical interface.

So the bottom line for me was:
(1) Have VMware bridge to the physical interface (in my case, it was eth0.100, a VLAN interface) if you want to use bridged mode
(2) Use a host-only network (without DHCP, so MAAS can do its thing!) if you want to develop in isolation, without sharing the network with the local LAN
- Do NOT bridge VMware to a bridge created with 'brctl' or similar!

I added the VMware bridge interface (vmnet1) to the br0 interface I'm using for my MAAS LXC, which may get me the best of both. I'll continue to test this configuration.

Revision history for this message

Christian Reis (kiko) wrote on 2015-06-03:

#32

For reference, this is the libvirt page which covers bridge timeouts affecting PXE booting: http://wiki.libvirt.org/page/PXE_boot_%28or_dhcp%29_on_guest_failed#Long_Forward_Delay_Time_on_Bridge