intermittent tftp issues using physical hardware

Bug #1325762 reported by Adam Gandelman on 2014-06-02
This bug affects 1 person
Affects: tripleo
Importance: Medium
Assigned to: Unassigned

Bug Description

We're hitting an intermittent TFTP issue on our testing racks. I'm not sure whether this is hardware or software related, but it is blocking tripleo end-to-end deployments for us.

All OSes are Ubuntu 14.04 across the board (seed VM host, seed VM, undercloud, etc.).

The Mellanox mlx4_en driver is used across all machines. All distro kernel updates have been applied and the card's firmware updated on problematic nodes.

Symptoms:

While provisioning an undercloud node, the system netboots and fetches its kernel and ramdisk with TFTP GETs. In the best case, the kernel transfers okay, but the ramdisk transfer either freezes or runs extremely slowly. In the worst case, the ramdisk transfer simply stops and the boot will eventually time out. After managing to get a node booted, I am able to reproduce the issue manually from the provisioned node.

Manually fetching the files from the seed node via TFTP runs at the expected speed only a fraction of the time:

root@baddy:~# for i in `seq 1 10` ; do time busybox tftp -l /tmp/out -r /tftpboot/cd3f256f-fab4-45b3-a12b-ebb93de18d2b/deploy_ramdisk -g 10.22.157.150 | grep real; done
real 0m58.484s
real 8m32.232s
real 0m58.389s
real 8m25.488s
real 0m58.117s
real 6m33.109s
real 8m17.880s
real 7m7.308s
real 0m58.479s
real 6m41.177s
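
To put the spread above into numbers, the `real` values can be converted to seconds and compared. This is only a sketch: the helper names `real_to_seconds` and `slowdown` are made up for illustration and are not part of any tool mentioned in this report.

```shell
#!/usr/bin/env bash
# Convert a `time`-style value such as "8m32.232s" into seconds.
real_to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, parts, "m")          # parts[1] = minutes, parts[2] = "32.232s"
        sub(/s$/, "", parts[2])       # drop the trailing "s"
        printf "%.3f\n", parts[1] * 60 + parts[2]
    }'
}

# Ratio of a slow run to a fast run, rounded to one decimal place.
slowdown() {
    awk -v slow="$(real_to_seconds "$1")" -v fast="$(real_to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", slow / fast }'
}

# Slowest run versus the first fast run from the list above.
slowdown 8m32.232s 0m58.484s
```

With the figures reported here, the slow transfers take roughly 7 to 9 times as long as the fast ones.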

Running the command manually in rapid succession eventually causes transfers to halt, with syslog reporting:

Jun 2 22:30:15 ubuntu in.tftpd[22610]: RRQ from 10.22.157.152 filename /tftpboot/cd3f256f-fab4-45b3-a12b-ebb93de18d2b/deploy_kernel
Jun 2 22:30:17 ubuntu in.tftpd[22559]: tftpd: read(ack): Connection refused

I've checked UDP in both directions using iperf and all seems fine. Reversing the transfer (running a TFTP server on the problematic client and the TFTP client on the seed) appears to work fine for files of similar size.

Tinkering with the TFTP blocksize on both the server side and the client side did not seem to have any impact.

The problem occurs with different Ubuntu versions, kernel versions and mlx4_en driver versions. Others saw the issue, but an upgrade from Saucy to Trusty fixed it. In my case, the same upgrade appeared to introduce it.

Adam Gandelman (gandelman-a) wrote :

I've eliminated the seed VM from the picture and have confirmed the issue affects any tftp traffic from the seed host. Leaning toward something card / switch / etc related.

Gregory Haynes (greghaynes) wrote :

Seeing dropped packets in tcpdump: http://paste.ubuntu.com/7577980/ (notice the packet sent at 01:33:29.956558 is never received). As a super temporary fix, if there's some way to lower the retry interval for the sender to something extremely low (we should have sub-1ms latency on our rack), it might be enough to limp by.
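
One possible way to try this, assuming the sender is tftp-hpa's in.tftpd (which the syslog lines above suggest): its `--retransmit` option sets the default retransmission timeout in microseconds, with a default of 1000000 (1 second). The value below and the `RETRANSMIT_US` variable name are illustrative only, and how the option reaches the daemon (xinetd, an init script, a defaults file) depends on how the seed starts it, which this report does not say.

```shell
# Hypothetical value: 10000 microseconds (10 ms) instead of the 1 second default.
RETRANSMIT_US=10000

# Option string to add to whatever options in.tftpd is already started with.
TFTP_OPTIONS="--retransmit ${RETRANSMIT_US}"

echo "in.tftpd options: ${TFTP_OPTIONS}"
```

Note that a client can still override this timeout if it negotiates the TFTP timeout option, so this would be a workaround to limp by rather than a fix for the dropped packets.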

Chris Jones (cmsj) wrote :

Note: current hypothesis is that this is actually related to firmware/driver issues with Mellanox hardware.

Ghe Rivero (ghe.rivero) wrote :

The last run, using another node as seed, worked properly. No more Mellanox issues. Will update the firmware in the rack and try again.

Stephen Pearson (stephen-hp) wrote :

My results on a ProLiant SL390s G7 with P69 BIOS. This is running Precise.

# ethtool -i eth2
driver: mlx4_en
version: 2.0 (Dec 2011)
firmware-version: 2.7.9294
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no

Testing using a 111M file and tftp-hpa:

$ for i in `seq 1 10`; do time tftp holly-test.cicd.useast.hpcloud.net -c get /sstk/initrd_sstkv2.img; done

real 0m17.306s
user 0m2.332s
sys 0m4.528s

real 0m18.370s
user 0m2.432s
sys 0m4.244s

real 0m18.424s
user 0m2.364s
sys 0m4.344s

real 0m18.801s
user 0m2.348s
sys 0m4.432s

real 0m17.197s
user 0m2.416s
sys 0m4.368s

real 0m19.072s
user 0m2.380s
sys 0m4.472s

real 0m17.224s
user 0m2.496s
sys 0m4.384s

real 0m18.568s
user 0m2.348s
sys 0m4.176s

real 0m17.311s
user 0m2.528s
sys 0m4.308s

real 0m18.643s
user 0m2.392s
sys 0m4.384s

A bit sluggish: 6 to 7 MB/s, but consistently so. Doesn't sound as bad as what Adam reported.
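
For reference, the throughput can be derived from the file size and the `real` times. A rough sketch, where `rate_mb_s` is a made-up helper name and the 111M file size is taken as given in the comment:

```shell
#!/usr/bin/env bash
# Print throughput in MB/s for a file of SIZE_MB megabytes transferred in a
# `time`-style duration such as "0m17.306s".
rate_mb_s() {
    local size_mb=$1 real=$2
    awk -v size="$size_mb" -v t="$real" 'BEGIN {
        split(t, p, "m")              # p[1] = minutes, p[2] = "17.306s"
        sub(/s$/, "", p[2])           # drop the trailing "s"
        printf "%.2f\n", size / (p[1] * 60 + p[2])
    }'
}

rate_mb_s 111 0m17.306s   # one of the fastest runs above
rate_mb_s 111 0m19.072s   # the slowest run above
```

Applied to the ten runs above, this gives roughly 5.8 to 6.4 MB/s.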

damianos (damian-linux) wrote :

Firmware 2.9.1530 should be much better. There are known UDP issues with some Mellanox firmware/driver combinations.

Stephen Pearson (stephen-hp) wrote :

Repeated above test using similar hardware but with much newer firmware and running Trusty. Actually runs slightly slower than before (approx 25s each, or only about 4.5MB/s).

# ethtool -i eth3
driver: mlx4_en
version: 2.2-1 (Feb 2014)
firmware-version: 2.9.1530
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

ProLiant SL390s G7 with P69 BIOS:

Results, transferring same 111MB file over TFTP:

real 0m25.008s
real 0m25.005s
real 0m25.005s

There's clearly something wrong here. The same file transferred over HTTP takes only 0.4s.
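
Part of the gap to HTTP is inherent to the protocol: TFTP sends one block and waits for its ACK before sending the next, so every block costs a full round trip. The sketch below estimates the time spent per block. It assumes the default 512-byte block size (whether a larger one was negotiated in these tests is not stated) and treats 111MB as MiB; `per_block_us` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Estimate microseconds spent per TFTP block, given the file size in MiB,
# the transfer time in seconds, and the block size in bytes.
per_block_us() {
    awk -v mib="$1" -v secs="$2" -v blk="$3" 'BEGIN {
        blocks = mib * 1048576 / blk          # number of data blocks sent
        printf "%.0f\n", secs * 1000000 / blocks
    }'
}

# 111 MiB in 25.005 s with 512-byte blocks (assumed block size).
per_block_us 111 25.005 512
```

Under those assumptions each block, including its ACK, takes on the order of 100 microseconds, so any extra per-packet cost on either host multiplies across a couple of hundred thousand round trips.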

Stephen Pearson (stephen-hp) wrote :

Removing all the netfilter and conntrack modules boosts the tftp transfer rate to 10MB/s. It might be worth checking whether the seed VM has those modules loaded.
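
A quick way to check a host for such modules is to filter the `lsmod` output. This is a sketch only: the comment does not list which modules were removed, so the name prefixes below are an assumption covering the usual netfilter, conntrack and iptables modules, and `netfilter_modules` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Read lsmod-style output on stdin and print the names of modules that look
# like netfilter, conntrack or iptables modules (assumed prefix list).
netfilter_modules() {
    awk '$1 ~ /^(nf_|nfnetlink|xt_|x_tables|ip_tables|ip6_tables|iptable_|ip6table_)/ { print $1 }'
}

# Usage on a live system:
#   lsmod | netfilter_modules
```

An empty result suggests none of these modules are loaded; otherwise the listed modules are candidates for the kind of removal described above.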

James Polley (tchaypo) wrote :

Assigning to ghe to document and close.

Changed in tripleo:
assignee: nobody → Ghe Rivero (ghe.rivero)
Ghe Rivero (ghe.rivero) on 2014-06-11
Changed in tripleo:
importance: Undecided → Medium
status: New → Confirmed
Brent Eagles (beagles) wrote :

The last reported activity on this issue was some time ago and, if I am reading it correctly, is specific to a particular hardware configuration - and possibly even a firmware version. c#9 also indicates that the intent was to document and close. I'm marking as incomplete.

Changed in tripleo:
status: Confirmed → Incomplete
Emilien Macchi (emilienm) wrote :

This bug is > 365 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in tripleo:
assignee: Ghe Rivero (ghe.rivero) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for tripleo because there has been no activity for 60 days.]

Changed in tripleo:
status: Incomplete → Expired