Xen buildd-slaves sometimes ignore SYN requests
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Launchpad itself | Fix Released | High | Julian Edwards | 10.12
launchpad-buildd | Invalid | High | Unassigned |
Bug Description
Sometimes, virtual build slaves will take >30 seconds to accept a new connection from the buildd-manager. Since the buildd-manager now uses Twisted code which has a default connection timeout of 30 seconds, the builder gets ignored for that scan. It also slows each scan cycle down.
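The timeout behaviour described here can be illustrated with plain sockets. This is a minimal sketch, not Launchpad code: `timed_connect` is a hypothetical helper, and it simply shows how a connect attempt against a peer that never ACKs the SYN blocks until the configured timeout fires.

```python
import socket
import time

def timed_connect(host, port, timeout=30.0):
    """Attempt one TCP connect with a hard timeout.

    Returns (connected, elapsed_seconds).  When the peer drops the
    SYN, connect() blocks for the full `timeout` before failing --
    which is how one slow slave can eat a whole scan cycle.
    """
    s = socket.socket()
    s.settimeout(timeout)  # mirrors a 30-second connection timeout
    start = time.time()
    try:
        s.connect((host, port))
        connected = True
    except (socket.timeout, OSError):
        connected = False
    finally:
        s.close()
    return connected, time.time() - start
```

Against a healthy builder this returns almost immediately; against one ignoring SYNs it returns only after the full timeout has elapsed.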
Related branches
- Jonathan Lange (community): Approve on 2010-11-25
- Diff: 100 lines (+34/-29), 3 files modified:
  - lib/lp/buildmaster/model/builder.py (+16/-28)
  - lib/lp/buildmaster/tests/test_builder.py (+13/-0)
  - lib/lp/soyuz/tests/test_binarypackagebuildbehavior.py (+5/-1)
Changed in launchpad-buildd:
status: New → Confirmed

Julian Edwards (julian-edwards) wrote: #2
Not really, it can delay dispatches of builds by a minute or more.
Changed in launchpad-buildd:
importance: Low → High

Julian Edwards (julian-edwards) wrote: #3
What's happening is that the reset script does not complete within 30 seconds, which causes a connection timeout to fire on the buildmaster side. Lamont, any more ideas on this? I know I keep bugging you about it :)
Julian Edwards (julian-edwards) wrote: #4
It's now become apparent that this is not the reset script; it's a failure of the slave to return an ACK to the SYN packet. The kernel has a timeout of 30 seconds on this, which cannot be adjusted.
See also bug 677069
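An unanswered SYN is retransmitted with exponential backoff, which is why a single dropped packet can stall a connect for tens of seconds. A back-of-the-envelope sketch follows; the initial RTO and retry count used are illustrative assumptions, not measured kernel values (the real behaviour depends on settings like net.ipv4.tcp_syn_retries).

```python
def syn_retransmit_schedule(initial_rto=3.0, retries=3):
    """List the intervals between SYN (re)transmissions.

    initial_rto and retries are assumptions for illustration only;
    real values are kernel-dependent.
    """
    return [initial_rto * (2 ** i) for i in range(retries + 1)]

def total_stall(initial_rto=3.0, retries=3):
    # Total wait before the connect attempt gives up entirely.
    return sum(syn_retransmit_schedule(initial_rto, retries))
```

With these illustrative numbers, total_stall() is 3 + 6 + 12 + 24 = 45 seconds, comfortably beyond a 30-second connection timeout.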
Changed in launchpad-buildd:
status: Confirmed → Triaged

Julian Edwards (julian-edwards) wrote: #5
This script will grep the log and show you a sorted count of the builders that fail with these timeouts.
grep -A2 TCPTimedOut buildd-
Right now it's:
36 dubnium
35 actinium
28 berkelium
21 radium
17 plutonium
17 americium
15 rosehip
14 lawrencium
14 einsteinium
12 lychee
11 meitnerium
11 doubah
10 nobelium
10 nannyberry
9 radon
9 pluot
9 hassium
9 fermium
8 uranium
8 papaya
8 muntries
8 kutjera
5 seaborgium
4 shipova
3 louvi
3 lansones
2 sandpaperfig
2 rutherfordium
2 litembilla
2 lakoocha
1 titanium
1 terranova
1 rhenium
1 promethium
1 platinum
1 osmium
1 gold
1 charichuelo
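The grep command above is truncated, but the tally it produces can be sketched in Python. `count_timeouts` is a hypothetical helper, and the assumption that each matching log line ends with the builder's name is mine, not the real buildd-manager log format.

```python
from collections import Counter

def count_timeouts(log_lines, marker="TCPTimedOut"):
    """Tally timeout occurrences per builder, most frequent first.

    Assumes (hypothetically) that each matching log line ends with
    the builder's name; the real log format may differ.
    """
    counts = Counter()
    for line in log_lines:
        if marker in line:
            counts[line.rsplit(None, 1)[-1]] += 1
    return counts.most_common()
```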
Jonathan Lange (jml) wrote: #6
It would probably be useful to have a script that reproduces the problem without involving lp.buildmaster.
Something like:

import errno
import socket
import sys

def connect_until_timeout(host, port, max_tries=10000):
    # The function name, the ETIMEDOUT branch body, and the port
    # below are reconstructions; those lines were truncated.
    for i in range(max_tries):
        s = socket.socket()
        result = s.connect_ex((host, port))
        if result == errno.ETIMEDOUT:
            return i
        if result == 0:
            s.close()
            raise RuntimeError("Did not timeout after %s tries" % (i,))

if __name__ == '__main__':
    server = sys.argv[1]
    connect_until_timeout(server, 8221)
Note I haven't actually run that code.
Julian Edwards (julian-edwards) wrote: #7
We're currently running this simpler script against dubnium in production (it's been made manual temporarily):

#!/bin/sh
GUESTNAME=$1
# The method name, port, and loop body are reconstructions; the
# original lines were truncated.  "echo" is assumed to be the
# slave's no-op XML-RPC call.
RPCDATA="<?xml version='1.0'?><methodCall><methodName>echo</methodName><params/></methodCall>"
while wget -q -O - --post-data="$RPCDATA" "http://$GUESTNAME:8221/"; do
    :
done
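The shell loop above hand-rolls an XML-RPC POST; the same probe can be sketched with Python's xmlrpc client. This is written for Python 3 as an illustration, and both the `echo` method name and port 8221 are assumptions about the slave's interface.

```python
import xmlrpc.client

def probe_slave(host, port=8221, payload="ping"):
    """Round-trip one XML-RPC call and report whether it succeeded.

    Assumes (hypothetically) that the slave exposes an `echo`
    method that returns its arguments; both the method name and
    the port are assumptions here.
    """
    proxy = xmlrpc.client.ServerProxy("http://%s:%d/" % (host, port))
    try:
        return payload in proxy.echo(payload)
    except OSError:
        return False
```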
Julian Edwards (julian-edwards) wrote: #8
Appending memory (in Kb) and architecture:
36 dubnium 1931416 i386
35 actinium 1932040 i386
28 berkelium 1931360 i386
21 radium 1931808 i386
17 plutonium 1932336 amd64
17 americium 1931356 amd64
15 rosehip 1931800 i386
14 lawrencium 1931324 amd64
14 einsteinium 1931492 i386
12 lychee 3988692 amd64
11 meitnerium 1928476 i386
11 doubah 3987216 i386
10 nobelium 1932100 amd64
10 nannyberry 3987208 i386
9 radon 1928476 amd64
9 pluot 1932872 amd64
9 hassium 1931800 i386
9 fermium 1932692 amd64
8 uranium 1932872 i386
8 papaya 1932040 amd64
8 muntries 3987188 i386
8 kutjera 3989776 i386
5 seaborgium 1928476 i386
4 shipova 1932736 i386
3 louvi 6051948 amd64
3 lansones 3984124 i386
2 sandpaperfig 1931860 i386
2 rutherfordium 1931564 i386
2 litembilla 6051948 amd64
2 lakoocha 3987188 i386
1 titanium 3830320 amd64
1 terranova 4025576 i386
1 rhenium 1931328 lpia
1 promethium 3987188 amd64
1 platinum 1931328 i386
1 osmium 1931328 amd64
1 gold 1932692 lpia
1 charichuelo 6051948 amd64
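The table invites a quick aggregation by architecture. A sketch over a few of the rows above; `failures_by_arch` is a hypothetical helper, not production analysis.

```python
from collections import defaultdict

def failures_by_arch(rows):
    """Sum timeout counts per architecture.

    Each row mirrors the columns above:
    (timeout_count, builder, memory_kb, arch).
    """
    totals = defaultdict(int)
    for count, _builder, _memory_kb, arch in rows:
        totals[arch] += count
    return dict(totals)
```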
Jonathan Lange (jml) wrote: #9
In a sense, the script in comment 7 is more complicated: it uses HTTP, but the evidence so far points to the bug being at a lower level. Also, do you know how wget handles ETIMEDOUT on socket.connect?
Julian Edwards (julian-edwards) wrote: #10
Here's a version that works:

#!/usr/bin/env python
import errno
from socket import socket
import sys

def connect_until_timeout(host, port, max_tries=10000):
    # The function name and port are reconstructions; those lines
    # were truncated.
    for i in range(max_tries):
        s = socket()
        result = s.connect_ex((host, port))
        if result == errno.ETIMEDOUT:
            return i
        if result == 0:
            s.close()
            raise RuntimeError("Did not timeout after %s tries" % (i,))

if __name__ == '__main__':
    server = sys.argv[1]
    print connect_until_timeout(server, 8221)
Julian Edwards (julian-edwards) wrote: #11
So it looks like Xen guests sometimes ignore the first network packet. If we run the script below right after the builder is resumed it randomly prints the error message. We can work around this in the buildd-manager by sending a fake message right after the builder is resumed.
#!/usr/bin/env python
import errno
from socket import socket
import sys

def connect_until_timeout(host, port, max_tries=100000):
    # Truncated lines reconstructed: report a timeout on the very
    # first try (the bug), keep looping on successful connects.
    for i in range(max_tries):
        if i % 1000 == 0:
            print i
        s = socket()
        result = s.connect_ex((host, port))
        if result == errno.ETIMEDOUT:
            if i == 0:
                print "ERROR: first connection attempt timed out"
            else:
                return i
        if result == 0:
            s.close()

if __name__ == '__main__':
    server = sys.argv[1]
    print connect_until_timeout(server, 8221)
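The workaround described in this comment (send a fake message right after the builder is resumed, so the dropped first SYN is absorbed before the real dispatch) can be sketched as follows. `prime_connection` is a hypothetical helper, not the actual buildd-manager change.

```python
import socket

def prime_connection(host, port, timeout=30.0):
    """Fire one disposable connect at a freshly resumed guest.

    If the guest ignores the first SYN, this sacrificial attempt
    soaks up the timeout so that the subsequent real connection is
    not the one that stalls.  Returns True if the probe connected.
    """
    s = socket.socket()
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return True
    except (socket.timeout, OSError):
        return False
    finally:
        s.close()
```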
Changed in soyuz:
status: New → Triaged
importance: Undecided → High
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)

Fixed in stable r11986 <http://

Changed in soyuz:
milestone: none → 10.12
tags: added: qa-needstesting

Changed in soyuz:
status: In Progress → Fix Committed
tags: added: buildd-manager qa-ok; removed: qa-needstesting

Changed in soyuz:
status: Fix Committed → Fix Released
Robert Collins (lifeless) wrote: #13
FWIW it sounds to me like the Xen startup scripts are giving us an "it's back" message too early. I'm going to close the buildd-slave task because I don't think there is anything to change there.
summary: Virtual builders are sometimes very slow to accept connections → Xen buildd-slaves sometimes ignore SYN requests

Changed in launchpad-buildd:
status: Triaged → Invalid

Julian Edwards (julian-edwards) wrote: #14
Right, it's a bug at the Xen layer. I'm also not convinced that the Xen startup scripts are giving us an "it's back" message too early, because it's consistently only the first packet that's a problem; it's not time-based.
Seems like this will be less crucial with your parallelisation fixes?