Xen buildd-slaves sometimes ignore SYN requests

Bug #586359 reported by Julian Edwards on 2010-05-27
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Launchpad itself
High
Julian Edwards
launchpad-buildd
High
Unassigned

Bug Description

Sometimes, virtual build slaves will take >30 seconds to accept a new connection from the buildd-manager. Since the buildd-manager now uses Twisted code which has a default connection timeout of 30 seconds, the builder gets ignored for that scan. It also slows each scan cycle down.

Related branches

Changed in launchpad-buildd:
status: New → Confirmed
Robert Collins (lifeless) wrote :

Seems like this will be less crucial with your parallelisation fixes?

Changed in launchpad-buildd:
importance: Undecided → Low
Julian Edwards (julian-edwards) wrote :

Not really, it can delay dispatches of builds by a minute or more.

Changed in launchpad-buildd:
importance: Low → High
Julian Edwards (julian-edwards) wrote :

What's happening is that the reset script does not complete within 30 seconds which causes a connection timeout to fire on buildmaster side. Lamont, any more ideas on this? I know I keep bugging you about it :)

Julian Edwards (julian-edwards) wrote :

It's now become more apparent that this is not the reset script, it's a failure of the slave to return an ACK to the SYN package. The kernel has a timeout of 30 seconds on this which cannot be adjusted.

See also bug 677069

Changed in launchpad-buildd:
status: Confirmed → Triaged
Julian Edwards (julian-edwards) wrote :

This script will grep the log and show you a sorted count of the builders that fail with these timeouts.

grep -A2 TCPTimedOut buildd-manager.log|grep builder|awk '{print $5}'|sort|uniq -c|sort -nr

Right now it's:
     36 dubnium
     35 actinium
     28 berkelium
     21 radium
     17 plutonium
     17 americium
     15 rosehip
     14 lawrencium
     14 einsteinium
     12 lychee
     11 meitnerium
     11 doubah
     10 nobelium
     10 nannyberry
      9 radon
      9 pluot
      9 hassium
      9 fermium
      8 uranium
      8 papaya
      8 muntries
      8 kutjera
      5 seaborgium
      4 shipova
      3 louvi
      3 lansones
      2 sandpaperfig
      2 rutherfordium
      2 litembilla
      2 lakoocha
      1 titanium
      1 terranova
      1 rhenium
      1 promethium
      1 platinum
      1 osmium
      1 gold
      1 charichuelo

Jonathan Lange (jml) wrote :

It would probably be useful to have a script that reproduces the problem without involving lp.buildmaster.

Something like:

  import errno
  import socket

  def connect_to_server_until_timeout(host, port, max_tries=10000):
      for i in range(max_tries):
          s = socket.socket()
          result = socket.connect_ex((host, port))
          if result == errno.ETIMEDOUT:
              return i
          if result == 0:
              s.close()
      raise RuntimeError("Did not timeout after %s tries" % (i,))

  if __name__ == '__main__':
      server = sys.argv[1]
      connect_to_server_until_timeout(server, 8221)

Note I haven't actually run that code.

Julian Edwards (julian-edwards) wrote :

We're currently running this simpler script against dubnium in production (it's been made manual temporarily):

#!/bin/sh
GUESTNAME=$1
RPCDATA="<?xml version='1.0'?> <methodCall> <methodName>status</methodName> <params> </params> </methodCall>"
while wget -q -O - --post-data="$RPCDATA" http://${GUESTNAME}.ppa:8221/rpc; do :; done

Julian Edwards (julian-edwards) wrote :

Appending memory (in Kb) and architecture:

     36 dubnium 1931416 i386
     35 actinium 1932040 i386
     28 berkelium 1931360 i386
     21 radium 1931808 i386
     17 plutonium 1932336 amd64
     17 americium 1931356 amd64
     15 rosehip 1931800 i386
     14 lawrencium 1931324 amd64
     14 einsteinium 1931492 i386
     12 lychee 3988692 amd64
     11 meitnerium 1928476 i386
     11 doubah 3987216 i386
     10 nobelium 1932100 amd64
     10 nannyberry 3987208 i386
      9 radon 1928476 amd64
      9 pluot 1932872 amd64
      9 hassium 1931800 i386
      9 fermium 1932692 amd64
      8 uranium 1932872 i386
      8 papaya 1932040 amd64
      8 muntries 3987188 i386
      8 kutjera 3989776 i386
      5 seaborgium 1928476 i386
      4 shipova 1932736 i386
      3 louvi 6051948 amd64
      3 lansones 3984124 i386
      2 sandpaperfig 1931860 i386
      2 rutherfordium 1931564 i386
      2 litembilla 6051948 amd64
      2 lakoocha 3987188 i386
      1 titanium 3830320 amd64
      1 terranova 4025576 i386
      1 rhenium 1931328 lpia
      1 promethium 3987188 amd64
      1 platinum 1931328 i386
      1 osmium 1931328 amd64
      1 gold 1932692 lpia
      1 charichuelo 6051948 amd64

Jonathan Lange (jml) wrote :

In a sense, the script in comment 7 is in a sense more complicated: it uses HTTP, but the evidence so far points to the bug being at a lower level. Also, do you know how wget handles ETIMEDOUT on socket.connect?

Here's a version that works:

#!/usr/bin/env python
import errno
from socket import socket
import sys

def connect_to_server_until_timeout(host, port, max_tries=10000):
    for i in range(max_tries):
        s = socket()
        result = s.connect_ex((host, port))
        if result == errno.ETIMEDOUT:
            return i
        if result == 0:
            s.close()
    raise RuntimeError("Did not timeout after %s tries" % (i,))

if __name__ == '__main__':
    server = sys.argv[1]
    print connect_to_server_until_timeout(server, 8221)

So it looks like Xen guests sometimes ignore the first network packet. If we run the script below right after the builder is resumed it randomly prints the error message. We can work around this in the buildd-manager by sending a fake message right after the builder is resumed.

#!/usr/bin/env python
import errno
from socket import socket
import sys

def connect_to_server_until_timeout(host, port, max_tries=9999):
    for i in range(max_tries):
        if i % 1000 == 0:
            print i
        s = socket()
        result = s.connect_ex((host, port))
        if result == errno.ETIMEDOUT:
            if i == 0:
                print "First packet time out."
            else:
                raise RuntimeError("YAY, Timeout after %s tries" % (i,))
        if result == 0:
            s.close()

if __name__ == '__main__':
    server = sys.argv[1]
    print connect_to_server_until_timeout(server, 8221)

Changed in soyuz:
status: New → Triaged
importance: Undecided → High
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
Changed in soyuz:
milestone: none → 10.12
tags: added: qa-needstesting
Changed in soyuz:
status: In Progress → Fix Committed
tags: added: buildd-manager qa-ok
removed: qa-needstesting
Changed in soyuz:
status: Fix Committed → Fix Released
Robert Collins (lifeless) wrote :

FWIW it sounds to me like the xen startup scripts are giving us an 'its back' message too early. I'm going to close the buildd slave task because I don't think there is anything to change there.

summary: - Virtual builders are sometimes very slow to accept connections
+ Xen buildd-slaves sometimes ignore SYN requests
Changed in launchpad-buildd:
status: Triaged → Invalid

Right, it's a bug at the Xen layer. I'm also not convinced that "the xen startup scripts are giving us an 'its back' message too early" because it's consistently only the first packet that's a problem, it's not time-based.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers