Crash when using lxml and libxml2 Python bindings together

Bug #911039 reported by Tuomas Tonteri
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
lxml | Triaged | Medium | Unassigned |

Bug Description

This Python one-liner causes a segfault related to lxml:

import lxml.html, feedparser; feedparser.parse("http://www.radionetherlands.nl/news/zijlijn/rss-feed")

This has been reproduced on an unrelated Linux setup by another user on IRC, who also provided the following gdb trace:

---

Program received signal SIGSEGV, Segmentation fault.
0x0000000500000001 in ?? ()
(gdb) bt
#0 0x0000000500000001 in ?? ()
#1 0x00007ffff6db6362 in __pyx_f_4lxml_5etree__local_resolver (__pyx_v_c_url=0xca0fc0 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
    __pyx_v_c_pubid=0xbba1e0 "-//W3C//DTD XHTML 1.0 Strict//EN", __pyx_v_c_context=0xd2f9c0) at src/lxml/lxml.etree.c:73319
#2 0x0000003bac66265f in xmlLoadExternalEntity__internal_alias (URL=<optimized out>, ID=0xbba1e0 "-//W3C//DTD XHTML 1.0 Strict//EN", ctxt=0xd2f9c0)
    at xmlIO.c:3945
#3 0x0000003bac6f7d60 in xmlSAX2ResolveEntity__internal_alias (ctx=0xd2f9c0, publicId=0xbba1e0 "-//W3C//DTD XHTML 1.0 Strict//EN", systemId=<optimized out>)
    at SAX2.c:512
#4 0x0000003bac6f7e19 in xmlSAX2ExternalSubset__internal_alias (ctx=0xd2f9c0, name=0xd319ff "html", ExternalID=0xbba1e0 "-//W3C//DTD XHTML 1.0 Strict//EN",
    SystemID=0xca9da0 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd") at SAX2.c:402
#5 0x0000003bac64cbfc in xmlParseTryOrFinish (ctxt=0xd2f9c0, terminate=0) at parser.c:11299
#6 0x0000003bac64d55f in xmlParseChunk__internal_alias (ctxt=0xd2f9c0,
    chunk=0xd2b2d4 "l version='1.0' encoding='utf-8'?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en"..., size=512, terminate=0) at parser.c:11739
#7 0x0000003bac6df7be in xmlTextReaderPushData (reader=0xd2d680) at xmlreader.c:853
#8 0x0000003bac6e2564 in xmlTextReaderRead__internal_alias (reader=0xd2d680) at xmlreader.c:1280
#9 0x00007ffff4bd217b in libxml_xmlTextReaderRead (self=<optimized out>, args=<optimized out>) at libxml2-py.c:8521
#10 0x0000003ba36e1560 in call_function (oparg=<optimized out>, pp_stack=0x7fffffffd4b8) at Python/ceval.c:4013
#11 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#12 0x0000003ba36e18b5 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7fffffffd618, func=0xbf77d0)
    at Python/ceval.c:4099
#13 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd618) at Python/ceval.c:4034
#14 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#15 0x0000003ba36e18b5 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7fffffffd778, func=0xc30230)
    at Python/ceval.c:4099
#16 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd778) at Python/ceval.c:4034
#17 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#18 0x0000003ba36e2ea9 in PyEval_EvalCodeEx (co=0x947eb0, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kws=0x6cbd08,
    kwcount=0, defs=0xc3c0d8, defcount=7, closure=0x0) at Python/ceval.c:3253
#19 0x0000003ba36e171e in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7fffffffd988, func=0xc5fa28) at Python/ceval.c:4109
#20 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd988) at Python/ceval.c:4034
#21 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#22 0x0000003ba36e2ea9 in PyEval_EvalCodeEx (co=0x7ffff7f0a9b0, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0,
    kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#23 0x0000003ba36e2fd2 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:667
#24 0x0000003ba36fc9f1 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x640fd0, locals=0x640fd0, flags=<optimized out>,
    arena=<optimized out>) at Python/pythonrun.c:1346
#25 0x0000003ba36fd589 in PyRun_StringFlags (
    str=0x602010 "import lxml.html, feedparser; feedparser.parse(\"http://www.radionetherlands.nl/news/zijlijn/rss-feed\")\n", start=<optimized out>,
    globals=0x640fd0, locals=0x640fd0, flags=0x7fffffffdc50) at Python/pythonrun.c:1309

(gdb) p __pyx_v_c_context->_private
$8 = (void *) 0xd2d680

(gdb) p *__pyx_v_context->_resolvers
$6 = {ob_refcnt = 9665248, ob_type = 0xd2f9c0, __pyx_vtab = 0xa7b890, _resolvers = <unknown at remote 0x100000001>, _default_resolver = 0xce0b60}

---

Required information from my computer:

Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
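
For anyone who wants to attach the same details to a report, a snippet along these lines prints them. It is a minimal sketch: the helper name collect_versions is made up for illustration, and the lxml entries are simply skipped when lxml is not installed.

```python
import sys


def collect_versions():
    """Return the version details in the same order as the list above."""
    info = {"Python": tuple(sys.version_info)}
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed; only the interpreter version is reported.
        return info
    info["lxml.etree"] = etree.LXML_VERSION
    info["libxml used"] = etree.LIBXML_VERSION
    info["libxml compiled"] = etree.LIBXML_COMPILED_VERSION
    info["libxslt used"] = etree.LIBXSLT_VERSION
    info["libxslt compiled"] = etree.LIBXSLT_COMPILED_VERSION
    return info


for name, value in collect_versions().items():
    print("%-20s: %s" % (name, value))
```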

---

Related IRC discussion of the segfault by 'marienz' in #python on irc.freenode.net:

"It's somewhat plausible importing lxml sets an lxml-internal callback on libxml's "default entity loader", which is also being used by something else here, which violates some assumption lxml makes when it's actually called"

"yeah, I bet that's it. I'm getting into libxml2 via its own python bindings (/usr/lib64/python2.7/site-packages/libxml2.pyc on this system)."

"note it's trying to getattr "_resolvers" and the _resolvers field it got via c_context->_private is broken. As lxml appears on the stack above libxml2-py.c code I'm pretty sure the two bindings are getting mixed up."

"add to that parser.pxi calling xmlparser.xmlSetExternalEntityLoader(_local_resolver) presumably setting some global callback"

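The suspected mechanism from the IRC discussion can be modelled in plain Python. This is only a toy sketch: ParserContext, ResolverContext, TextReader and the functions below are invented stand-ins, not lxml or libxml2 code. In pure Python the wrong assumption surfaces as an AttributeError; in C the same mistake means reading through a pointer of the wrong type, which matches the broken _resolvers field in the trace.

```python
# Toy model only: plain Python objects stand in for libxml2's C structures.

class ParserContext:
    """Stand-in for a parser context with an untyped _private slot."""
    def __init__(self, private):
        self._private = private


class ResolverContext:
    """What the hypothetical 'binding A' stores in _private."""
    def __init__(self):
        self._resolvers = ["resolver-a"]


class TextReader:
    """What the hypothetical 'binding B' stores in _private instead."""


# A single process-wide slot, shared by everything in the process.
_global_entity_loader = None


def set_external_entity_loader(loader):
    global _global_entity_loader
    _global_entity_loader = loader


def load_external_entity(url, public_id, ctxt):
    # Every parser in the process goes through the same global callback.
    return _global_entity_loader(url, public_id, ctxt)


def binding_a_resolver(url, public_id, ctxt):
    # Binding A assumes every context it sees is one of its own.
    return ctxt._private._resolvers


# Installing the callback is a global side effect, e.g. at import time.
set_external_entity_loader(binding_a_resolver)

own = ParserContext(ResolverContext())
foreign = ParserContext(TextReader())

print(load_external_entity("x.dtd", None, own))
try:
    load_external_entity("x.dtd", None, foreign)
except AttributeError as exc:
    print("foreign context broke the assumption:", exc)
```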
scoder (scoder) wrote :

I agree that lxml and the libxml2 Python bindings should get along with each other a bit better. It's just not always trivial and/or portable, e.g. due to a dependency on new API functions in libxml2 that allow for a local instead of a global setup.
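
As a rough illustration of the "local instead of global" idea, the toy sketch below lets each parser context carry its own loader and fall back to a default otherwise. Everything here is hypothetical: the entity_loader attribute and the function names are invented for the example and do not correspond to any specific libxml2 or lxml API.

```python
# Toy model only: a per-context loader slot instead of one global callback.

class ParserContext:
    def __init__(self, private=None, entity_loader=None):
        self._private = private
        # Hypothetical per-context slot; None means "use the default".
        self.entity_loader = entity_loader


def default_loader(url, public_id, ctxt):
    return "default:" + url


def binding_a_loader(url, public_id, ctxt):
    return "binding-a:" + url


def load_external_entity(url, public_id, ctxt):
    # Only contexts that opted in ever reach binding A's loader.
    loader = ctxt.entity_loader or default_loader
    return loader(url, public_id, ctxt)


own = ParserContext(entity_loader=binding_a_loader)
foreign = ParserContext()

print(load_external_entity("x.dtd", None, own))
print(load_external_entity("x.dtd", None, foreign))
```

With this layout a context created by another binding never reaches a callback that would misread its private data.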

I certainly take patches that improve the situation.

Changed in lxml:
importance: Undecided → Medium
status: New → Triaged
jiamo (life-130815) wrote :

I get this crash at random:

#0 __pyx_f_4lxml_5etree__collectText (__pyx_v_c_node=<optimized out>)
    at src/lxml/lxml.etree.c:16497
#1 0x00007f34ae920086 in __pyx_pf_4lxml_5etree_8_Element_4text___get__ (__pyx_v_self=
    <lxml.etree._Element at remote 0x5406910>) at src/lxml/lxml.etree.c:37022
#2 __pyx_getprop_4lxml_5etree_8_Element_text (o=<lxml.etree._Element at remote 0x5406910>,
    x=<optimized out>) at src/lxml/lxml.etree.c:6083
#3 0x00000000004d5138 in getset_get ()

python 2.7.3
ubuntu 12.04
python-lxml 2.3.2-1ubuntu0.2
libxml2 2.7.8.dfsg-5.1ubuntu4.9

scoder (scoder)
summary: - python segfaults when feedparser using lxml parses a certain url
+ Crash when using lxml and libxml2 Python bindings together