I have an Ubuntu 14.04 host that I am using as both a keepalived/ipvs loadbalancer and a dnsmasq server for PXE-booting servers.
After updating linux-image from 3.13.0-30.55 to 3.13.0-32.57 I noticed that dnsmasq-tftp stopped working. PXE boot clients hang on the "Loading ..../linux" TFTP transfer, which stalls roughly 1000 blocks in:
10:30:51.011728 IP 10.1.1.2.43540 > 10.1.12.1.49165: UDP, length 1412
10:30:51.011924 IP 10.1.12.1.49165 > 10.1.1.2.43540: UDP, length 4
10:30:51.012012 IP 10.1.1.2.43540 > 10.1.12.1.49165: UDP, length 1412
10:30:51.012183 IP 10.1.12.1.49165 > 10.1.1.2.43540: UDP, length 4
Stracing dnsmasq, I noticed something very odd: sendto() on the socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) suddenly starts returning EPERM mid-transfer and keeps doing so, even as dnsmasq periodically retries:
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 1 (in [17], left {0, 249834})
recvfrom(17, "\0\4\3\352", 4096, 0, NULL, NULL) = 4
lseek(16, 1410816, SEEK_SET) = 1410816
read(16, "\25\306\345f\2{\r\4)W\276\32\336q\252_\230q\213\341U\354\25\374k7\243\32\221X+\v"..., 1408) = 1408
sendto(17, "\0\3\3\353\25\306\345f\2{\r\4)W\276\32\336q\252_\230q\213\341U\354\25\374k7\243\32"..., 1412, 0, {sa_family=AF_INET, sin_port=htons(49165), sin_addr=inet_addr("10.1.11.3")}, 16) = 1412
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 1 (in [17], left {0, 249839})
recvfrom(17, "\0\4\3\353", 4096, 0, NULL, NULL) = 4
lseek(16, 1412224, SEEK_SET) = 1412224
read(16, "*\360 <C\363l\320:\256~\307\236\26P\323\274%\260\362\341&\232\r\243\370\224\277\221\\\307\372"..., 1408) = 1408
sendto(17, "\0\3\3\354*\360 <C\363l\320:\256~\307\236\26P\323\274%\260\362\341&\232\r\243\370\224\277"..., 1412, 0, {sa_family=AF_INET, sin_port=htons(49165), sin_addr=inet_addr("10.1.11.3")}, 16) = -1 EPERM (Operation not permitted)
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 0 (Timeout)
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 0 (Timeout)
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 0 (Timeout)
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 0 (Timeout)
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 0 (Timeout)
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 0 (Timeout)
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 0 (Timeout)
select(18, [4 5 6 7 8 9 10 11 12 15 17], [], [], {0, 250000}) = 0 (Timeout)
lseek(16, 1412224, SEEK_SET) = 1412224
read(16, "*\360 <C\363l\320:\256~\307\236\26P\323\274%\260\362\341&\232\r\243\370\224\277\221\\\307\372"..., 1408) = 1408
sendto(17, "\0\3\3\354*\360 <C\363l\320:\256~\307\236\26P\323\274%\260\362\341&\232\r\243\370\224\277"..., 1412, 0, {sa_family=AF_INET, sin_port=htons(49165), sin_addr=inet_addr("10.1.11.3")}, 16) = -1 EPERM (Operation not permitted)
This was with all iptables rules unloaded (so no OUTPUT -j DROP or REJECT rule was in play) and apparmor profiles torn down.
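For anyone reading the strace output, the 4-byte TFTP headers can be decoded to confirm where the transfer stops. The following is only a sketch: the opcode table comes from RFC 1350, the 1408-byte block size is inferred from the read() sizes in the trace (not from a captured option negotiation), and the helper names are made up for illustration.

```python
import struct

# TFTP opcodes as defined in RFC 1350.
OPCODES = {1: "RRQ", 2: "WRQ", 3: "DATA", 4: "ACK", 5: "ERROR"}


def decode_tftp_header(packet: bytes):
    """Return (opcode name, block number) for a DATA or ACK packet."""
    opcode, block = struct.unpack("!HH", packet[:4])
    return OPCODES.get(opcode, "UNKNOWN"), block


def file_offset(block: int, blksize: int = 1408) -> int:
    """File offset of the first byte in DATA block `block` (blocks start at 1).

    The default blksize of 1408 is an assumption taken from the read() calls.
    """
    return (block - 1) * blksize


# Octal escapes copied from the recvfrom()/sendto() lines in the trace.
print(decode_tftp_header(b"\0\4\3\352"))  # ('ACK', 1002)
print(decode_tftp_header(b"\0\4\3\353"))  # ('ACK', 1003)
print(decode_tftp_header(b"\0\3\3\354"))  # ('DATA', 1004)
print(file_offset(1004))                  # 1412224, matches the lseek()
```

Read this way, the client acknowledges block 1003 and the failing sendto() is DATA block 1004, which fits the "roughly 1000 blocks" observation.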
I also noticed the following dmesg message appearing at roughly the same times as the tftp transfers getting stuck (although not coinciding exactly with the stall):
[70325.516724] IPv6 header not found
The error pointed to ipvs (which I am using on the same host as an IPv4 NAT loadbalancer):
http://archive.linuxvirtualserver.org/html/lvs-devel/2012-08/msg00018.html
http://comments.gmane.org/gmane.comp.linux.lvs.devel/3614
I then tore down the ipvs rules (service keepalived stop) and unloaded the modules (rmmod ip_vs_rr ip_vs), and the issue resolved itself: the stalled dnsmasq-tftp transfer resumed!
This seems to be reproducible, i.e. modprobing ip_vs and starting keepalived causes dnsmasq-tftp to stall again, and stopping keepalived and unloading the modules lets the transfer resume.
This seems to happen reproducibly on boot with 3.13.0-32 and 3.13.0-30. It does NOT seem to happen with 3.13.0-29, which I was using up until now.
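To check for the EPERM condition without involving dnsmasq, a small UDP probe along these lines might help. It is a rough sketch only: the function names are hypothetical, the 1412-byte payload size just mirrors the datagram size seen in the trace, and because it does not wait for ACKs the way a TFTP server does, it may well not trigger the same behaviour.

```python
import errno
import socket


def first_eperm(send, payload: bytes, attempts: int):
    """Call send(payload) up to `attempts` times.

    Returns the zero-based index of the first call that fails with EPERM,
    or None if every call succeeds. Any other OSError is re-raised.
    """
    for i in range(attempts):
        try:
            send(payload)
        except OSError as exc:
            if exc.errno == errno.EPERM:
                return i
            raise
    return None


def probe(host: str, port: int, attempts: int = 2000, size: int = 1412):
    """Send UDP datagrams to host:port and report the first EPERM, if any."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return first_eperm(
            lambda data: sock.sendto(data, (host, port)),
            b"\0" * size,
            attempts,
        )
```

Running something like probe("10.1.12.1", 49165) once with ip_vs loaded and once after rmmod would show whether plain UDP sends from the host are affected, or only the dnsmasq-tftp traffic pattern.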