Crash when using lxml and libxml2 Python bindings together
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | Triaged | Medium | Unassigned | | |
Bug Description
This line of Python causes a segfault related to lxml:
import lxml.html, feedparser; feedparser.parse("http://
This has been reproduced on one unrelated Linux setup by someone on IRC, who also provided the following gdb trace:
---
Program received signal SIGSEGV, Segmentation fault.
0x0000000500000001 in ?? ()
(gdb) bt
#0 0x0000000500000001 in ?? ()
#1 0x00007ffff6db6362 in __pyx_f_
__pyx_
#2 0x0000003bac66265f in xmlLoadExternal
at xmlIO.c:3945
#3 0x0000003bac6f7d60 in xmlSAX2ResolveE
at SAX2.c:512
#4 0x0000003bac6f7e19 in xmlSAX2External
SystemID=
#5 0x0000003bac64cbfc in xmlParseTryOrFinish (ctxt=0xd2f9c0, terminate=0) at parser.c:11299
#6 0x0000003bac64d55f in xmlParseChunk_
chunk=0xd2b2d4 "l version='1.0' encoding=
ct.dtd\">\n<html xmlns=\"http://
#7 0x0000003bac6df7be in xmlTextReaderPu
#8 0x0000003bac6e2564 in xmlTextReaderRe
#9 0x00007ffff4bd217b in libxml_
#10 0x0000003ba36e1560 in call_function (oparg=<optimized out>, pp_stack=
#11 PyEval_EvalFrameEx (f=<optimized out>, throwflag=
#12 0x0000003ba36e18b5 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=
at Python/ceval.c:4099
#13 call_function (oparg=<optimized out>, pp_stack=
#14 PyEval_EvalFrameEx (f=<optimized out>, throwflag=
#15 0x0000003ba36e18b5 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=
at Python/ceval.c:4099
#16 call_function (oparg=<optimized out>, pp_stack=
#17 PyEval_EvalFrameEx (f=<optimized out>, throwflag=
#18 0x0000003ba36e2ea9 in PyEval_EvalCodeEx (co=0x947eb0, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kws=0x6cbd08,
kwcount=0, defs=0xc3c0d8, defcount=7, closure=0x0) at Python/ceval.c:3253
#19 0x0000003ba36e171e in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=
#20 call_function (oparg=<optimized out>, pp_stack=
#21 PyEval_EvalFrameEx (f=<optimized out>, throwflag=
#22 0x0000003ba36e2ea9 in PyEval_EvalCodeEx (co=0x7ffff7f0a9b0, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0,
kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#23 0x0000003ba36e2fd2 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:667
#24 0x0000003ba36fc9f1 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x640fd0, locals=0x640fd0, flags=<optimized out>,
arena=
#25 0x0000003ba36fd589 in PyRun_StringFlags (
str=0x602010 "import lxml.html, feedparser; feedparser.parse(\"http://
globals=
(gdb) p __pyx_v_
$8 = (void *) 0xd2d680
(gdb) p *__pyx_
$6 = {ob_refcnt = 9665248, ob_type = 0xd2f9c0, __pyx_vtab = 0xa7b890, _resolvers = <unknown at remote 0x100000001>, _default_resolver = 0xce0b60}
---
Required information from my computer:
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
---
Some related IRC discussion of the segfault, by 'marienz' in #python on irc.freenode.net:
"It's somewhat plausible importing lxml sets an lxml-internal callback on libxml's "default entity loader", which is also being used by something else here, which violates some assumption lxml makes when it's actually called"
"yeah, I bet that's it. I'm getting into libxml2 via its own python bindings (/usr/lib64/
"note it's trying to getattr "_resolvers" and the _resolvers field it got via c_context->_private is broken. As lxml appears on the stack above libxml2-py.c code I'm pretty sure the two bindings are getting mixed up."
"add to that parser.pxi calling xmlparser.
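The mechanism suspected in the discussion above can be illustrated with a small pure-Python model. This is only a sketch under assumptions: it does not call lxml or libxml2, and every name in it (`ParserContext`, `LxmlLikeContext`, `lxml_like_loader`, `other_binding_parse`) is a hypothetical stand-in. It shows a process-wide loader hook installed by one binding being invoked for a parser context that a different binding created, whose `_private` slot holds an unrelated object.

```python
# Pure-Python model of the suspected conflict; no real lxml/libxml2 calls.
# All names here are hypothetical stand-ins for illustration.

class ParserContext:
    """Stands in for a parser context with a `_private` slot."""
    def __init__(self, private=None):
        self._private = private


class LxmlLikeContext:
    """Stands in for the object one binding stores in `_private`."""
    def __init__(self):
        self._resolvers = ["default-resolver"]


# Process-wide hook, analogous to a single global "default entity loader".
_global_entity_loader = None


def set_global_entity_loader(loader):
    global _global_entity_loader
    _global_entity_loader = loader


def lxml_like_loader(url, ctxt):
    # Assumes every context it sees was created by the same binding.
    # In C this would be an unchecked pointer cast; in this model it
    # surfaces as an AttributeError instead of a segfault.
    return ctxt._private._resolvers


def other_binding_parse(url):
    # A second binding creates its own context and stores unrelated data.
    ctxt = ParserContext(private=object())
    return _global_entity_loader(url, ctxt)


# "Importing" the first binding installs its loader globally.
set_global_entity_loader(lxml_like_loader)

# The loader works for a context created by its own binding ...
own_result = lxml_like_loader("own.dtd", ParserContext(LxmlLikeContext()))

# ... but misreads a context created by the other binding.
try:
    other_binding_parse("example.dtd")
    outcome = "ok"
except AttributeError:
    outcome = "foreign context misread"
```

In the real crash the equivalent misread happens in C, which would be consistent with the `_resolvers` field in the gdb output above showing a garbage pointer.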
summary:
- python segfaults when feedparser using lxml parses a certain url
+ Crash when using lxml and libxml2 Python bindings together
I agree that lxml and the libxml2 Python bindings should get along with each other a bit better. It's just not always trivial and/or portable, e.g. due to a dependency on new API functions in libxml2 that allow for a local instead of a global setup.
I will certainly accept patches that improve the situation.
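To make the "local instead of a global setup" idea concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not libxml2's actual API: `ParserContext`, `load_external`, `default_loader` and `binding_a_loader` are hypothetical names. Each context carries its own loader, so a binding's callback only ever runs for contexts that the binding itself created.

```python
# Illustrative model only; not libxml2's or lxml's actual API.

def default_loader(url, ctxt):
    # Shared fallback that makes no assumptions about `_private`.
    return ("default", url)


class ParserContext:
    def __init__(self, private=None, loader=None):
        self._private = private
        # Per-context hook; falls back to the shared default when unset.
        self.loader = loader or default_loader


def load_external(url, ctxt):
    # Dispatch through the context instead of through a global variable.
    return ctxt.loader(url, ctxt)


def binding_a_loader(url, ctxt):
    # Safe here: only installed on contexts that binding A created,
    # so `_private` is known to have the expected shape.
    return ("binding-a", ctxt._private["resolvers"], url)


ctxt_a = ParserContext(private={"resolvers": ["r1"]}, loader=binding_a_loader)
ctxt_b = ParserContext(private=object())  # created by another binding

result_a = load_external("a.dtd", ctxt_a)
result_b = load_external("b.dtd", ctxt_b)
```

With this layout the second binding's context never reaches `binding_a_loader`, so there is nothing for it to misinterpret.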