HP ProLiant DL380 G7 tftps kernel, but initrd tracebacks in tftp server. DL380 G6 succeeds.

Bug #1155556 reported by Nick Moffitt
This bug affects 3 people
Affects                  Status        Importance  Assigned to       Milestone
MAAS                     Invalid       High        Unassigned
python-tx-tftp (Ubuntu)  Fix Released  Undecided   Andres Rodriguez

Bug Description

I've had a MAAS running with the Precise SRU PPA for a short while now, and it successfully controlled an HP ProLiant DL380 G6 consistently and repeatedly. Unfortunately, when I sought to replace it with a DL380 G7 that has more resources, I found the commissioning process hangs in precisely the same spot nearly every time.

I'm running my own DHCP server, as well as DNS.

The G7 has successfully booted the commissioning instance a grand total of twice in the past 48 hours, and I've left it to boot-loop for extended periods (though not the whole 48 hours, as that's a bit rough on the hardware). It nearly always successfully TFTPs the kernel, but the request for the initrd times out, and pxelinux reboots with the "Boot failed: press a key to retry, or wait for reset..." message.

I have tried this with precise and quantal commissioning images, with no noticeable difference.

Running tcpdump shows that the kernel RRQ proceeds fine, although tcpdump is convinced that the checksums are mostly wrong on the sent packets. The tftp ACK-like packets make it back, though, the next chunk gets sent, and pxelinux seems to be fine with it.

But when the initrd is read, we get this:

10:23:27.195148 IP (tos 0x0, ttl 64, id 3701, offset 0, flags [none], proto UDP (17), length 103)
    10.55.200.2.49156 > 10.55.200.1.69: [udp sum ok] 75 RRQ "amd64/generic/quantal/commissioning/initrd.gz" octet tsize 0 blksize 1408
10:23:27.196704 IP (tos 0x0, ttl 64, id 12499, offset 0, flags [DF], proto UDP (17), length 58)
    10.55.200.1.60373 > 10.55.200.2.49156: [bad udp cksum 0xa4a9 -> 0xd48a!] UDP, length 30
10:23:28.196744 IP (tos 0x0, ttl 64, id 12500, offset 0, flags [DF], proto UDP (17), length 58)
    10.55.200.1.60373 > 10.55.200.2.49156: [bad udp cksum 0xa4a9 -> 0xd48a!] UDP, length 30
10:23:31.198995 IP (tos 0x0, ttl 64, id 12501, offset 0, flags [DF], proto UDP (17), length 58)
    10.55.200.1.60373 > 10.55.200.2.49156: [bad udp cksum 0xa4a9 -> 0xd48a!] UDP, length 30
10:23:31.971823 IP (tos 0x0, ttl 64, id 3702, offset 0, flags [none], proto UDP (17), length 32)
    10.55.200.2.49156 > 10.55.200.1.60373: [udp sum ok] UDP, length 4

This is pretty much always what happens. My node requests the initrd (quantal, in this case), and instead of the usual full-frame 1404-byte packet we get a few 30-byte messages with the wrong checksum (the identical checksum values show these are identical payloads), and then the whole thing times out.

This of course corresponds with an OOPS:

-rw-r--r-- 1 maas maas 1574 Mar 15 10:23 OOPS-a085f6f73404bd8d66b7029e10e42081

Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 69, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 586, in _doReadOrWrite
    why = selectable.doRead()
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/udp.py", line 131, in doRead
    self.protocol.datagramReceived(data, addr)
  File "/usr/lib/python2.7/dist-packages/tftp/bootstrap.py", line 174, in datagramReceived
    return self._datagramReceived(datagram)
  File "/usr/lib/python2.7/dist-packages/tftp/bootstrap.py", line 391, in _datagramReceived
    return self.tftp_ACK(datagram)
  File "/usr/lib/python2.7/dist-packages/tftp/bootstrap.py", line 404, in tftp_ACK
    self.timeout_watchdog.cancel()
  File "/usr/lib/python2.7/dist-packages/tftp/util.py", line 117, in cancel
    raise Spent("This SequentialCall has already timed out")
tftp.util.Spent: This SequentialCall has already timed out

(apologies if any important binary data is lost here. I can attach it if the format is significant. I figured the traceback alone ought to be enough)

As I watch the logs continue now, I notice that tcpdump is unhappy with the checksums as a whole. Here's the request for the cpu-arch testing program and the start of the kernel download:

10:35:22.521068 IP (tos 0x0, ttl 64, id 31, offset 0, flags [none], proto UDP (17), length 69)
    10.55.200.2.49154 > 10.55.200.1.69: [udp sum ok] 41 RRQ "ifcpu64.c32" octet tsize 0 blksize 1408
10:35:22.522721 IP (tos 0x0, ttl 64, id 60258, offset 0, flags [DF], proto UDP (17), length 54)
    10.55.200.1.43147 > 10.55.200.2.49154: [bad udp cksum 0xa4a5 -> 0x8e52!] UDP, length 26
10:35:22.522822 IP (tos 0x0, ttl 64, id 32, offset 0, flags [none], proto UDP (17), length 32)
    10.55.200.2.49154 > 10.55.200.1.43147: [udp sum ok] UDP, length 4
10:35:22.523083 IP (tos 0x0, ttl 64, id 60259, offset 0, flags [DF], proto UDP (17), length 1344)
    10.55.200.1.43147 > 10.55.200.2.49154: [bad udp cksum 0xa9af -> 0x811d!] UDP, length 1316
10:35:22.523235 IP (tos 0x0, ttl 64, id 33, offset 0, flags [none], proto UDP (17), length 32)
    10.55.200.2.49154 > 10.55.200.1.43147: [udp sum ok] UDP, length 4
10:35:22.523253 IP (tos 0x0, ttl 64, id 34, offset 0, flags [none], proto UDP (17), length 99)
    10.55.200.2.49155 > 10.55.200.1.69: [udp sum ok] 71 RRQ "amd64/generic/quantal/commissioning/linux" octet tsize 0 blksize 1408
10:35:22.525739 IP (tos 0x0, ttl 64, id 60259, offset 0, flags [DF], proto UDP (17), length 57)
    10.55.200.1.43809 > 10.55.200.2.49155: [bad udp cksum 0xa4a8 -> 0xe7ba!] UDP, length 29
10:35:22.525812 IP (tos 0x0, ttl 64, id 35, offset 0, flags [none], proto UDP (17), length 32)
    10.55.200.2.49155 > 10.55.200.1.43809: [udp sum ok] UDP, length 4
10:35:22.526046 IP (tos 0x0, ttl 64, id 60260, offset 0, flags [DF], proto UDP (17), length 1432)
    10.55.200.1.43809 > 10.55.200.2.49155: [bad udp cksum 0xaa07 -> 0x96b9!] UDP, length 1404
10:35:22.526203 IP (tos 0x0, ttl 64, id 36, offset 0, flags [none], proto UDP (17), length 32)
    10.55.200.2.49155 > 10.55.200.1.43809: [udp sum ok] UDP, length 4

Is it possible that the TFTP server in the new maas is doing something funny with the checksums, but that most of the time nobody verifies them?

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

TIL: tcpdumping TSO-enabled hardware will always show bogus checksums, because the real checksumming is done in hardware in fun zero-copy ways.

The OOPS and the 30-byte replies from the TFTP server still seem significant, however.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

Right, so after a day spent with Daviey and a bunch of 30MB pcap files, we think we've figured this out.

The crucial exchange, the one that failed, happens here:

 7418 112.051626 10.55.200.99 10.55.200.1 TFTP Read Request, File: amd64/generic/quantal/commissioning/initrd.gz, Transfer type: octet, tsize\000=0\000, blksize\000=1408\000
 7419 112.053444 10.55.200.1 10.55.200.99 TFTP Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
 7420 113.053489 10.55.200.1 10.55.200.99 TFTP Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
 7423 116.053542 10.55.200.1 10.55.200.99 TFTP Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
 7425 116.832761 10.55.200.99 10.55.200.1 TFTP Acknowledgement, Block: 0

The client requests the initrd, but something in the firmware or pxelinux itself hangs for almost five seconds. During that time, the maas tftpd sends three ACKs (option acknowledgements, specifically) and times out. By the time the client sends the block-0 ACK to start the data transfer, the session state has been discarded, and the tftpd just logs the exception as an OOPS and waits for the next session to start.
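Those repeated 30-byte replies in the earlier tcpdump output are these same OACKs being retransmitted. Per RFC 2347, an OACK is a 2-byte opcode (6) followed by NUL-terminated option/value string pairs; here's a quick decode sketch (the example bytes are reconstructed from the tsize/blksize values above, not pulled from the capture):

    import struct

    # Decode a TFTP OACK (RFC 2347): 2-byte opcode 6, then NUL-terminated
    # option/value string pairs. The example payload is rebuilt from the
    # values in the listing above, so it is illustrative, not captured.
    def decode_oack(payload):
        opcode, = struct.unpack('!H', payload[:2])
        assert opcode == 6, 'not an OACK'
        parts = payload[2:].split(b'\x00')[:-1]   # drop trailing empty chunk
        return dict(zip(parts[::2], parts[1::2]))

    example = b'\x00\x06tsize\x0018988167\x00blksize\x001400\x00'
    print(decode_oack(example))
    # {b'tsize': b'18988167', b'blksize': b'1400'} -- and len(example) is 30,
    # matching the "length 30" datagrams in the tcpdump above.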

Incidentally, we spent a lot of time correlating requested and actual block sizes between this tftpd and the HPA tftpd. That turned out to be a red herring, of course, but it seemed like a compelling lead for a while. The solution did come from a comparison to tftpd-hpa, though.

In a few places in tftp/bootstrap.py and tftp/session.py there are timeout tuples set to (1, 3, 7). The iterable is consumed by the watchdog code every time a packet is sent out, and once the iterable is empty the watchdog tells the state machine to give up on the request. We never dug too far into the units or into where in the conversation these values are read, but the fact that the tuple holds three values and the daemon gave up after three ACKs is a compelling coïncidence.
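To make the failure mode concrete, here's a simplified sketch of that watchdog pattern. This is not the actual tx-tftp code; the names just mirror the traceback above (Spent, cancel), but it shows how a late ACK trips exactly the exception in the OOPS:

    from twisted.internet import reactor

    class Spent(Exception):
        """The timeout iterable has been fully consumed."""

    class Watchdog(object):
        def __init__(self, timeouts, resend, give_up):
            self._timeouts = iter(timeouts)   # e.g. (1, 3, 7)
            self._resend = resend             # retransmit the last packet
            self._give_up = give_up           # tear down the session state
            self._call = None
            self._spent = False

        def start(self):
            self._arm()

        def _arm(self):
            try:
                delay = next(self._timeouts)
            except StopIteration:
                self._spent = True            # iterable empty: abandon session
                self._give_up()
                return
            self._call = reactor.callLater(delay, self._fire)

        def _fire(self):
            self._resend()                    # one resend per consumed value
            self._arm()

        def cancel(self):
            # Called when the client's ACK finally arrives. If the session
            # already timed out, this is the error from the OOPS above.
            if self._spent:
                raise Spent("This SequentialCall has already timed out")
            if self._call is not None and self._call.active():
                self._call.cancel()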

The tftpd-hpa code tries six times, waiting one second each:

    <Daviey> Spads: #define TIMEOUT 1000000 /* Default timeout (us) */
    <Daviey> #define TRIES 6 /* Number of attempts to send each packet */
    <Daviey> #define TIMEOUT_LIMIT ((1 << TRIES)-1)

Extending the tuple at line 346 of bootstrap.py solved this for us, and the maas tftpd then succeeded just as tftpd-hpa does. In the end we settled on:

    class RemoteOriginReadSession(TFTPBootstrap):
        """Bootstraps a L{ReadSession}, that was started remotely, - we've received
        a RRQ.

        """
        timeout = (1, 1, 1, 1, 1, 1)

...as this more closely mimics what Daviey found in the tftpd-hpa source.

This timeout tuple appears in a few places, so any adjustments to this code should probably be made to all of the timeout iterables in bootstrap.py and session.py.
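
If rebuilding the package isn't convenient, a deployment-side workaround could widen the iterables on all of those classes at startup. This is an untested sketch, and the exact class names (beyond RemoteOriginReadSession, which is quoted above) are assumptions about the tx-tftp source:

    # Untested sketch: override the timeout iterables on the tx-tftp session
    # classes at import time rather than patching the installed files. The
    # session-class names here are assumptions; the hasattr() guard skips
    # anything that doesn't actually carry a timeout attribute.
    from tftp import bootstrap, session

    LONGER_TIMEOUT = (1, 1, 1, 1, 1, 1)   # tftpd-hpa-ish: six 1-second tries

    for cls in (bootstrap.RemoteOriginReadSession,
                bootstrap.RemoteOriginWriteSession,
                session.ReadSession,
                session.WriteSession):
        if hasattr(cls, 'timeout'):
            cls.timeout = LONGER_TIMEOUT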

Finally, while it's true that this seems to be a workaround for a fault on the client side (whether the fault is in firmware or in pxelinux.0 I can't say), I believe it is also a regression against the precise maas, which used cobbler.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in python-tx-tftp (Ubuntu):
status: New → Confirmed
Changed in maas:
status: New → Triaged
importance: Undecided → High
Changed in python-tx-tftp (Ubuntu):
assignee: nobody → Andres Rodriguez (andreserl)
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-tx-tftp - 0.1~bzr31-0ubuntu8

---------------
python-tx-tftp (0.1~bzr31-0ubuntu8) raring; urgency=low

  * debian/patches/03-longer-timeouts.patch: Increase the timeout to not
    discard the session state because some clients are less tolerant to
    the provided timeouts. This solution closer matches that of tftpd-hpa.
    Thanks to Spads and Daviey for investigating and coming up with the
    solution. (LP: #1155556)
 -- Andres Rodriguez <email address hidden> Mon, 01 Apr 2013 15:19:30 -0400

Changed in python-tx-tftp (Ubuntu):
status: Confirmed → Fix Released
Changed in maas:
status: Triaged → Invalid
Revision history for this message
Liam Young (gnuoy) wrote :

I hit this today on precise. It would be great to get the fix ported to precise.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

How do we mark that this is still broken in the LTS? I can't figure out how to do that through the drop-downs.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

The fixed package should be in the cloud archive.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote : Re: [Bug 1155556] Re: HP ProLiant DL380 G7 tftps kernel, but initrd tracebacks in tftp server. DL380 G6 succeeds.

Julian Edwards:
> The fixed package should be in the cloud archive.

Great! When will it make it to the LTS?

--
Nick Moffitt

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Wednesday 09 Oct 2013 09:49:21 you wrote:
> Julian Edwards:
> > The fixed package should be in the cloud archive.
>
> Great! When will it make it to the LTS?

No idea, I'm afraid; I'm acting as upstream for MAAS. Andres didn't say
anything about uploading it to the LTS, but if there's demand it could happen.
