subprocess.Popen() crashed with TypeError in _cleanup(): an integer is required

Bug #711225 reported by gojkok on 2011-02-01
82
This bug affects 8 people
Affects Status Importance Assigned to Milestone
the python apt bindings
Fix Released
Undecided
Barry Warsaw
python-apt (Ubuntu)
Medium
Barry Warsaw
Natty
Medium
Barry Warsaw

Bug Description

Crashed on startup!

ProblemType: Crash
DistroRelease: Ubuntu 11.04
Package: jockey-gtk 0.8-0ubuntu5
ProcVersionSignature: Ubuntu 2.6.38-1.28-generic-pae 2.6.38-rc2
Uname: Linux 2.6.38-1-generic-pae i686
Architecture: i386
Date: Tue Feb 1 15:53:10 2011
ExecutablePath: /usr/bin/jockey-gtk
InstallationMedia: Ubuntu 10.04.1 LTS "Lucid Lynx" - Release i386 (20100816.1)
InterpreterPath: /usr/bin/python2.7
MachineType: LENOVO INVALID
PackageArchitecture: all
ProcCmdline: /usr/bin/python /usr/bin/jockey-gtk
ProcEnviron:
 LANGUAGE=en_US:en
 LANG=en_US.UTF-8
 LC_MESSAGES=en_US.utf8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.38-1-generic-pae root=UUID=028cbcf1-8f2a-478f-82e2-7b1a2dac7337 ro quiet splash vt.handoff=7
PythonArgs: ['/usr/bin/jockey-gtk']
SourcePackage: jockey
Title: jockey-gtk crashed with TypeError in _cleanup(): an integer is required
UserGroups: adm admin cdrom dialout lpadmin plugdev sambashare
XorgConf:
 Section "Device"
          Identifier "Configured Video Device"
          Driver "intel"
 EndSection
dmi.bios.date: 07/08/2010
dmi.bios.vendor: LENOVO
dmi.bios.version: 29CN29WW(V2.06)
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: Base Board Product Name
dmi.board.vendor: LENOVO
dmi.board.version: Base Board Version
dmi.chassis.asset.tag: Chassis Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: Chassis Manufacturer
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnLENOVO:bvr29CN29WW(V2.06):bd07/08/2010:svnLENOVO:pnINVALID:pvrINVALID:rvnLENOVO:rnBaseBoardProductName:rvrBaseBoardVersion:cvnChassisManufacturer:ct10:cvrChassisVersion:
dmi.product.name: INVALID
dmi.product.version: INVALID
dmi.sys.vendor: LENOVO

Related branches

gojkok (gojkok) wrote :
tags: removed: need-duplicate-check
visibility: private → public
Sebastien Bacher (seb128) wrote :

Confirmed there on a fresh natty alpha2 installation

Changed in jockey (Ubuntu):
status: New → Confirmed
Martin Pitt (pitti) on 2011-02-02
Changed in jockey (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
Martin Pitt (pitti) wrote :

Jockey code just does:

        apt_policy = subprocess.Popen(['apt-cache', 'policy', 'dkms'],
                stdout=subprocess.PIPE, stderr=subprocess.PIPE)

which is a valid usage of subprocess as far as I can see. So this looks like a bug in Python's subprocess module?

affects: jockey (Ubuntu) → python2.7 (Ubuntu)
Changed in python2.7 (Ubuntu):
assignee: Martin Pitt (pitti) → nobody
summary: - jockey-gtk crashed with TypeError in _cleanup(): an integer is required
+ subprocess.Popen() crashed with TypeError in _cleanup(): an integer is
+ required
tags: added: iso-testing
tags: added: python27
Changed in python2.7 (Ubuntu Natty):
milestone: none → natty-alpha-3
Changed in python2.7 (Ubuntu Natty):
importance: Undecided → High
Martin Pitt (pitti) wrote :

Barry, do you have an idea what happens here?

Barry Warsaw (barry) wrote :

The subprocess module appears to be working just fine for me:

% python
Python 2.7.1+ (r271:86832, Dec 24 2010, 10:04:43)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from subprocess import *
>>> Popen(['apt-cache', 'policy', 'dkms'], stdout=PIPE, stderr=PIPE).communicate()
('dkms:\n Installed: (none)\n Candidate: 2.1.1.2-5ubuntu1\n Version table:\n 2.1.1.2-5ubuntu1 0\n 500 http://us.archive.ubuntu.com/ubuntu/ natty/main amd64 Packages\n', '')
>>>

Also, when I start jockey-gtk in a Terminal in a VM, I get the "searching for drivers" window, then the Additional Drivers window that says "No proprietary drivers are in use on this system."

I guess in summary: wfm :)

Is there something else I should try?

Martin Pitt (pitti) wrote :

@gojkok: Can you reproduce the crash on your system?

@Barry: It doesn't happen for me either, it does look like a weird corner case. Apparently the type error comes from trying to copy the _active_ list with _active[:], but why would this ever fail, given that the start/end indexes are implicit?

Changed in python2.7 (Ubuntu Natty):
importance: High → Medium
milestone: natty-alpha-3 → none
status: Confirmed → Incomplete
Martin Pitt (pitti) wrote :

Taking off the natty radar, this does look like a weird corner case.

Changed in python2.7 (Ubuntu Natty):
status: Incomplete → Won't Fix
Barry Warsaw (barry) wrote :

@Martin, the only way I can think that that would fail is if _active isn't a list but some list-like object that overrode some magic methods. That's probably not the case here, so I don't see how it *could* fail in that way.

Mathieu Pellerin (nirvn-asia) wrote :

I'm facing this issue on a brand new installation of latest Natty on my Dell Vostro 3700. It turns out to be a big problem as it prevents me from activating both the NVIDIA and the wireless broadcom drivers.

Martin Pitt (pitti) wrote :

Ah, so you can reproduce it? This is a complete mystery to me, so can you download attached file into /tmp/, and then run these commands in a Terminal (please copy & paste to keep them exactly like that:)

  sudo killall jockey-backend

  sudo PYTHONPATH=/tmp/ /usr/share/jockey/jockey-backend --debug 2>&1 | tee /tmp/log

Then start Jockey from the menu as usual, and repeat the steps that lead to the crash. At some point, the jockey-backend program in the terminal should then crash with the TypeError. Once that happens, please attach /tmp/log here.

Thanks!

Mathieu Pellerin (nirvn-asia) wrote :

I'll do as soon as I go home.

Just to clarify the situation, the only step I need to do for it to crash is to start Jockey (either from terminal or from the menu). It crashes before showing the window that displays the list of proprietary drivers to activate/deactivate.

I have also be experiencing the same crash when trying to start jockey-gtk. I reran the backend using the file you provided, Martin, and received the following output:

2011-02-28 22:11:28,583 DEBUG: Updating repository indexes...
Exception AttributeError: "'Popen' object has no attribute '_child_created'" in <bound method Popen.__del__ of <subprocess.Popen object at 0x975a5cc>> ignored

Of note, is that the crash occurred after the "Searching..." dialog was visible for a few seconds, whereas without the file you provided it crashed immediately. More interesting, after killing the backend running with the Python path including /tmp/, I reran jockey-gtk and it loaded to the main window. However, attempting to activate the Broadcom STA driver then resulted in a jockey-gtk crash (though the main window remained open and the driver became enabled and active).

Mathieu Pellerin (nirvn-asia) wrote :

Martin, apologies for the long delay.

I unfortunately was in a hurry to get my computer to a working stag. I apparently did something that fixed the issue and therefore can't provide debugging output.

Martin Pitt (pitti) wrote :

That actually wasn't the kind of debugging I was hoping for; I added these three prints

    print '*** subprocess _cleanup'
    print '*** _active type:', type(_active)
    print '*** _active:', _active

which ideally would appear in the output somewhere. Your output is interesting as well, as it does point to a bug in subprocess. I'm not quite sure why __del__() would be called at a time when the Popen object doesn't have a _child_created attribute, that could only happen if __del__ was invoked by the _cleanup() method. If that happens, that would mean that the object gets destroyed while it hasn't even finished the constructor yet.

Changed in python2.7 (Ubuntu Natty):
status: Won't Fix → Confirmed
milestone: none → ubuntu-11.04-beta-1
Martin Pitt (pitti) wrote :

This only happens if jockey tries to update its repository index, i. e. when the machine has never been online. I got this to happen in both kvm and a real machine. You can reproduce this by starting a machine/VM without network, and then ensuring that apt doesn't know about dkms (which jockey uses as a litmus test whether external package sources are available):

  sudo rm /var/lib/apt/lists/*_main*_Packages /var/cache/apt/*.bin
  jockey-text -l

Note you can run the backend in a terminal with this:

  sudo killall jockey-backend
  sudo /usr/share/jockey/jockey-backend --debug

and then start jockey-gtk/jockey-text etc.

Some observations:

 - This only seems to happen when running those subprocess commands in the D-BUS backend. I can't get it to crash when I run it with --no-dbus. But as it also works when running the backend in a terminal, it's not due to the special environment that D-BUS uses for auto-activated backends (i. e. no $PATH, etc.)

 - I added some debugging print statements to _cleanup(), including a try/except. The exception really happens in the "for inst in _active[:]', it's not just being messed up by passing it through dbus.

 - printing _active, type(_active), len(_active) works and gives [], <type 'list'>, 0; i. e. there don't seem actual elements to iterate over

 - I've seen it to crash on a "print _active[:]" or "print list(_active)". At this point I don't think it makes much sense to continue to try searching for the error in the Python code; something in the interpreter seems to get royally confused here..

Changed in python2.7 (Ubuntu Natty):
assignee: nobody → Canonical Foundations Team (canonical-foundations)
Barry Warsaw (barry) on 2011-03-25
Changed in python2.7 (Ubuntu Natty):
assignee: Canonical Foundations Team (canonical-foundations) → Barry Warsaw (barry)
Barry Warsaw (barry) wrote :

Well, this is definitely going to be a fun one. Here's some status:

I know that the exception is getting raised in PyInt_AsLong(). See intobject.c:161 in the Python 2.7 source. I had to rebuild a Python 2.7 with all the relevant TypeErrors containing an extended message string to tell me exactly which one of about a dozen was getting tripped. This was fun because I also had to disable all the tests in the package build because they just took so darn long.

I tried various combinations of Python level changes to subprocess.py, but I do not have a workaround at the Python level. E.g. things like this still broke:

   active = _active[:]
or
   active = list(_active)

and so on. Those all worked fine, but as soon as you iterated over the new list with the for statement, boom! I even tried hardcoding things like _active[0:len(_active)] and _active[0:0] but those still fail as soon as you begin for-loop iterating.

I tried to change /usr/share/jockey/jockey-backend to use python-dbg but too many 64bit libraries are missing so that doesn't work. I'm going to rebuild a 32 bit VM to see if I can get attached to a debug version of Python to further inspect the object getting passed to PyInt_AsLong() right at the failure point.

This is almost certainly a bug in Python, but searching the upstream tracker hasn't yielded anything directly related, though there are several issues (mostly closed) about _cleanup() and it's problem in the face of threads. It may be fruitful to work around it by explicitly closing the popen objects rather than letting gc and __del__ put them in _active, but I'll have to look at the jockey code more closely to know that. Still, I'd like to spend a bit more time investigating the issue at the CPython layer before moving on to workaround or hacks.

OTOH, it seems unlikely that this is some kind of race condition since it's 100% reproducible using Martin's recipe (thanks for that!). Anyway, that's it for now.

Barry Warsaw (barry) wrote :

As predicted, this was fun, but I believe I've nailed it. It's actually a bug in python-apt. Stay tuned for a merge proposal.

Changed in python-apt:
status: New → Confirmed
assignee: nobody → Barry Warsaw (barry)
Barry Warsaw (barry) wrote :

So, here's the problem. In python-apt, progress.cc, PyFetchProgress::Pulse(), you have the following code at about line 398:

   if((result == NULL) || (!PyArg_Parse(result, "b", &res)))
   {
      // most of the time the user who subclasses the pulse()
      // method forgot to add a return {True,False} so we just
      // assume he wants a True. There may be a Python exception on the stack
      // that must be cleared.
      PyCbObj_BEGIN_ALLOW_THREADS
      return true;
   }

What this is essentially doing is saying, whatever comes back from the pulse() callback should be either the Python bool value True or False. The PyArg_Parse() call has the effect of trying to turn whatever the callback returns into a short integer. So, if True or False were returned, you'd get 1 and 0. But the pulse() callback can actually return anything, so if it returned something that could be coerced into an integer (deep in Python, by PyInt_AsLong() via the 'b' flag), it would still work (fsvo "work").

The problem happens when pulse() returns something that cannot be coerced to an int, and in the jockey case it's actually getting None returned. python-apt clearly seems this is a common possibility based on the comment in the code. Probably the most likely error situation in client code is falling off the end without returning True or False, and when Python falls off the end of a function, an implicit None is returned. This is exactly what's happening in the jockey case. 'result' is None, and that cannot be coerced to an int, so PyInt_AsLong sets a TypeError, which gets dutifully propagated up the call stack until you get to line 398 in progress.cc.

So, PyFetchProgress::Pulse() wants to ignore a bogus return, but when it enters the 'if' clause at line 400, it's returning a non-error value 'true' *while an exception is pending on Python's stack*. The effect of this is that the next time Python checks for an outstanding exception, it'll see the TypeError and throw it, but as you can tell, it's far away from the *actual* error location. The behavior witnessed should have been a clue that something like this was happening (it did eventually dawn on me :).

The fix is simple. When ignoring any possible exception from PyArg_Parse(), you *also* need to clear the exception, which is what the attached branch does. Note that I don't particularly like ignoring bogus return values here, but that's a yak to shave on some other day.

I've verified the branch fixes the bug using Martin's same reproducible recipe.

tags: removed: python27
Barry Warsaw (barry) wrote :

Note too this has nothing to do with Python 2.7 specifically, it's just a bug in python-apt's use of the Python C API.

On 25.03.2011 23:17, Barry Warsaw wrote:
> was fun because I also had to disable all the tests in the package build
> because they just took so darn long.

DEB_BUILD_OPTIONS="nocheck nobench" ...

Barry Warsaw (barry) wrote :

On Mar 28, 2011, at 10:41 PM, Matthias Klose wrote:

>On 25.03.2011 23:17, Barry Warsaw wrote:
>> was fun because I also had to disable all the tests in the package build
>> because they just took so darn long.
>
>DEB_BUILD_OPTIONS="nocheck nobench" ...

Yes, but sadly I could not get that to work with sbuild and it was quicker
just to comment out a bunch of stuff. I'd shaved enough yaks already.

Colin Watson (cjwatson) on 2011-03-29
affects: python2.7 (Ubuntu Natty) → python-apt (Ubuntu Natty)
Barry Warsaw (barry) on 2011-03-29
Changed in python-apt (Ubuntu Natty):
status: Confirmed → In Progress
Changed in python-apt:
status: Confirmed → In Progress
Martin Pitt (pitti) wrote :

Barry, you are a hero!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-apt - 0.7.100.2ubuntu2

---------------
python-apt (0.7.100.2ubuntu2) natty; urgency=low

  [ Barry Warsaw ]
  * PyFetchProgress::Pulse(): When ignoring a false return value from
    PyArg_Parse() after running the simple callback pulse(), there can be
    an exception on the stack, which must be cleared. (LP: #711225)
 -- Barry Warsaw <email address hidden> Mon, 28 Mar 2011 18:17:11 -0400

Changed in python-apt (Ubuntu Natty):
status: In Progress → Fix Released
Barry Warsaw (barry) on 2011-03-29
Changed in python-apt:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers