Bug #419408 “Buildbot: over time, buildbot creates zombie proces...” : Bugs : Launchpad itself

Tom Haddon (mthaddon) on 2009-08-26

summary:	- Buildbot: over time, buildbot seems to create zombie processes + Buildbot: over time, buildbot creates zombie processes
Changed in launchpad-foundations:
importance:	Undecided → Low
status:	New → Confirmed

Revision history for this message

Tom Haddon (mthaddon) wrote on 2009-08-26:

#1

For reference, here's what it looks like just after a restart: https://pastebin.canonical.com/21523/

Revision history for this message

Tom Haddon (mthaddon) wrote on 2009-10-09:

#2

20 zombie processes at the moment...

Changed in launchpad-foundations:
importance:	Low → Medium

Revision history for this message

Gavin Panella (allenap) wrote on 2009-10-19:

#3

This /might/ be a problem of the transport not getting disconnected
properly.

bzrbuildbot.poller.BzrPoller.getRawChanges() is run periodically in a
thread. It used bzrlib.branch.Branch.open_containing() to get the
details of the branches being watched. This opens an SSH connection to
the server holding the branches (i.e. bazaar.launchpad.net), but there
is no explicit clean-up done.

The implicit clean-up is something like:

* branch is gc'ed.

* branch._transport is gc'ed.

  * The connection for the branch - obtaininable with a call like
    transport.get_smart_medium() - is gc'ed. If this is
    SmartSSHClientMedium (which it is in my tests), then it declares a
    __del__ method:

        def __del__(self):
            ...
            self.disconnect()

* self.disconnect tries to close everything down:

        def disconnect(self):
            """See SmartClientMedium.disconnect()."""
            if not self._connected:
                return
            osutils.until_no_eintr(self._read_from.close)
            osutils.until_no_eintr(self._write_to.close)
            self._ssh_connection.close()
            self._connected = False

* self._ssh_connection.close is where the process gets reaped:

        def close(self):
            self.proc.stdin.close()
            self.proc.stdout.close()
            self.proc.wait()

There's a lot that could go wrong before the wait() call, and errors
during __del__ do not cause the interpreter to exit, so are easily
ignored. This is not to mention that the SmartSSHClientMedium instance
might not be eligible for garbage collection anyway because of a
possible circular reference, which the interpreter is unable to do
anything about:

  Circular references which are garbage are detected when the option
  cycle detector is enabled (it's on by default), but can only be
  cleaned up if there are no Python-level __del__() methods involved.

(from http://www.python.org/doc/2.5.2/ref/customization.html)

I think it's worth explicitly disconnecting in getRawChanges to see if
that resolves this problem.

This /might/ be a problem of the transport not getting disconnected
properly.

bzrbuildbot.poller.BzrPoller.getRawChanges() is run periodically in a
thread. It used bzrlib.branch.Branch.open_containing() to get the
details of the branches being watched. This opens an SSH connection to
the server holding the branches (i.e. bazaar.launchpad.net), but there
is no explicit clean-up done.

The implicit clean-up is something like:

* branch is gc'ed.

* branch._transport is gc'ed.

* The connection for the branch - obtaininable with a call like
    transport.get_smart_medium() - is gc'ed. If this is
    SmartSSHClientMedium (which it is in my tests), then it declares a
    __del__ method:

def __del__(self):
            ...
            self.disconnect()

* self.disconnect tries to close everything down:

def disconnect(self):
            """See SmartClientMedium.disconnect()."""
            if not self._connected:
                return
            osutils.until_no_eintr(self._read_from.close)
            osutils.until_no_eintr(self._write_to.close)
            self._ssh_connection.close()
            self._connected = False

* self._ssh_connection.close is where the process gets reaped:

def close(self):
            self.proc.stdin.close()
            self.proc.stdout.close()
            self.proc.wait()

There's a lot that could go wrong before the wait() call, and errors
during __del__ do not cause the interpreter to exit, so are easily
ignored. This is not to mention that the SmartSSHClientMedium instance
might not be eligible for garbage collection anyway because of a
possible circular reference, which the interpreter is unable to do
anything about:

Circular references which are garbage are detected when the option
  cycle detector is enabled (it's on by default), but can only be
  cleaned up if there are no Python-level __del__() methods involved.

(from http://www.python.org/doc/2.5.2/ref/customization.html)

I think it's worth explicitly disconnecting in getRawChanges to see if
that resolves this problem.

Gavin Panella (allenap) on 2009-10-19

Changed in launchpad-foundations:
assignee:	nobody → Gavin Panella (allenap)
milestone:	none → 3.1.10
status:	Confirmed → In Progress

Curtis Hovey (sinzui) on 2009-11-11

Changed in launchpad-foundations:
milestone:	3.1.10 → 3.1.11

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2009-11-17:

#4

FWIW, Gavin's diagnosis seems to make sense to me. __del__ methods are evil!

tags:

added: build-infrastructure

Revision history for this message

Tom Haddon (mthaddon) wrote on 2009-12-08:

#5

We get bitten by this during every LP rollout, since when codehosting goes down we end up with tons of zombied processed presumably from trying to connect.

Revision history for this message

Gavin Panella (allenap) wrote on 2009-12-08:

#6

Tom, the two branches linked to this bug are ready to merge. One of them is for staging, and rolls in several other fixes that are already in production, and the other is for production and is simply a fix for this specific bug. I can't guarantee they'll fix the issue, but we'll find out next week if you can apply them before then.

Gavin Panella (allenap) on 2009-12-11

Changed in launchpad-foundations:
milestone:	3.1.11 → none

Diogo Matsubara (matsubara) on 2009-12-11

Changed in launchpad-foundations:
status:	In Progress → Fix Committed

Revision history for this message

Gavin Panella (allenap) wrote on 2010-01-29:

#7

The only folk who can progress this bug now are the LOSAs. It's tricky, because buildbot is basically always needed, but this needs to be released. The branch for staging can be tried out first though, to give some comfort that it's not going to radically break things.

Tom Haddon (mthaddon) on 2010-05-28

tags:

added: canonical-losa-lp

Curtis Hovey (sinzui) on 2010-06-17

Changed in launchpad-foundations:
status:	Fix Committed → Triaged
assignee:	Gavin Panella (allenap) → nobody

Revision history for this message

Haw Loeung (hloeung) wrote on 2011-05-13:

#8

Seems this has been resolved.

Changed in launchpad:
status:	Triaged → Fix Released

Revision history for this message

Tom Haddon (mthaddon) wrote on 2011-05-13:

#9

Marking back to fix committed as this still needs confirmation that we have rolled out the new code - being tracked in RT#37664

Changed in launchpad:
status:	Fix Released → Fix Committed

Revision history for this message

Haw Loeung (hloeung) wrote on 2011-06-09:

#10

Still happening

Changed in launchpad:
status:	Fix Committed → Confirmed

Ursula Junque (ursinha) on 2011-06-13

Changed in launchpad:
status:	Confirmed → Triaged

Revision history for this message

Robert Collins (lifeless) wrote on 2011-10-01:

#11

The underlying problem here is that the test environment isn't reliable (and can't be trusted to be so - corner cases will always exist). We should move to a vm based environment, and have RT tickets open to do just that with a dc hosted jenkins talking to the dc hosted lp test cloud. As such, more time spent fiddling on making buildbot work is waste.

Changed in launchpad:
importance:	Medium → Low
status:	Triaged → Won't Fix

Launchpad itself

Buildbot: over time, buildbot creates zombie processes

Bug Description

Other bug subscribers

Remote bug watches