Buildbot: over time, buildbot creates zombie processes

Bug #419408 reported by Tom Haddon
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Won't Fix
Low
Unassigned

Bug Description

As seen on the buildbot master:

https://pastebin.canonical.com/21519/

This is typically because the child process exits and the parent doesn't collect the exit status.

Tom Haddon (mthaddon)
summary: - Buildbot: over time, buildbot seems to create zombie processes
+ Buildbot: over time, buildbot creates zombie processes
Changed in launchpad-foundations:
importance: Undecided → Low
status: New → Confirmed
Revision history for this message
Tom Haddon (mthaddon) wrote :

For reference, here's what it looks like just after a restart: https://pastebin.canonical.com/21523/

Revision history for this message
Tom Haddon (mthaddon) wrote :

20 zombie processes at the moment...

Changed in launchpad-foundations:
importance: Low → Medium
Revision history for this message
Gavin Panella (allenap) wrote :

This /might/ be a problem of the transport not getting disconnected
properly.

bzrbuildbot.poller.BzrPoller.getRawChanges() is run periodically in a
thread. It used bzrlib.branch.Branch.open_containing() to get the
details of the branches being watched. This opens an SSH connection to
the server holding the branches (i.e. bazaar.launchpad.net), but there
is no explicit clean-up done.

The implicit clean-up is something like:

  * branch is gc'ed.

  * branch._transport is gc'ed.

  * The connection for the branch - obtaininable with a call like
    transport.get_smart_medium() - is gc'ed. If this is
    SmartSSHClientMedium (which it is in my tests), then it declares a
    __del__ method:

        def __del__(self):
            ...
            self.disconnect()

  * self.disconnect tries to close everything down:

        def disconnect(self):
            """See SmartClientMedium.disconnect()."""
            if not self._connected:
                return
            osutils.until_no_eintr(self._read_from.close)
            osutils.until_no_eintr(self._write_to.close)
            self._ssh_connection.close()
            self._connected = False

  * self._ssh_connection.close is where the process gets reaped:

        def close(self):
            self.proc.stdin.close()
            self.proc.stdout.close()
            self.proc.wait()

There's a lot that could go wrong before the wait() call, and errors
during __del__ do not cause the interpreter to exit, so are easily
ignored. This is not to mention that the SmartSSHClientMedium instance
might not be eligible for garbage collection anyway because of a
possible circular reference, which the interpreter is unable to do
anything about:

  Circular references which are garbage are detected when the option
  cycle detector is enabled (it's on by default), but can only be
  cleaned up if there are no Python-level __del__() methods involved.

    (from http://www.python.org/doc/2.5.2/ref/customization.html)

I think it's worth explicitly disconnecting in getRawChanges to see if
that resolves this problem.

Gavin Panella (allenap)
Changed in launchpad-foundations:
assignee: nobody → Gavin Panella (allenap)
milestone: none → 3.1.10
status: Confirmed → In Progress
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
milestone: 3.1.10 → 3.1.11
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

FWIW, Gavin's diagnosis seems to make sense to me. __del__ methods are evil!

tags: added: build-infrastructure
Revision history for this message
Tom Haddon (mthaddon) wrote :

We get bitten by this during every LP rollout, since when codehosting goes down we end up with tons of zombied processed presumably from trying to connect.

Revision history for this message
Gavin Panella (allenap) wrote :

Tom, the two branches linked to this bug are ready to merge. One of them is for staging, and rolls in several other fixes that are already in production, and the other is for production and is simply a fix for this specific bug. I can't guarantee they'll fix the issue, but we'll find out next week if you can apply them before then.

Gavin Panella (allenap)
Changed in launchpad-foundations:
milestone: 3.1.11 → none
Changed in launchpad-foundations:
status: In Progress → Fix Committed
Revision history for this message
Gavin Panella (allenap) wrote :

The only folk who can progress this bug now are the LOSAs. It's tricky, because buildbot is basically always needed, but this needs to be released. The branch for staging can be tried out first though, to give some comfort that it's not going to radically break things.

Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: Fix Committed → Triaged
assignee: Gavin Panella (allenap) → nobody
Revision history for this message
Haw Loeung (hloeung) wrote :

Seems this has been resolved.

Changed in launchpad:
status: Triaged → Fix Released
Revision history for this message
Tom Haddon (mthaddon) wrote :

Marking back to fix committed as this still needs confirmation that we have rolled out the new code - being tracked in RT#37664

Changed in launchpad:
status: Fix Released → Fix Committed
Revision history for this message
Haw Loeung (hloeung) wrote :

Still happening

Changed in launchpad:
status: Fix Committed → Confirmed
Ursula Junque (ursinha)
Changed in launchpad:
status: Confirmed → Triaged
Revision history for this message
Robert Collins (lifeless) wrote :

The underlying problem here is that the test environment isn't reliable (and can't be trusted to be so - corner cases will always exist). We should move to a vm based environment, and have RT tickets open to do just that with a dc hosted jenkins talking to the dc hosted lp test cloud. As such, more time spent fiddling on making buildbot work is waste.

Changed in launchpad:
importance: Medium → Low
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.