Python segfaults when feedparser parses a certain URL with lxml imported

Bug #911039 reported by Tuomas Tonteri on 2012-01-02
This bug affects 2 people
Affects: lxml
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

This Python one-liner causes a segfault that appears to be related to lxml:

import lxml.html, feedparser; feedparser.parse("http://www.radionetherlands.nl/news/zijlijn/rss-feed")
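
For anyone trying to reproduce this without losing their interactive session, the one-liner can be run in a child interpreter and the exit status checked for SIGSEGV. This is only a sketch: the helper name `crashed_with_segfault` is made up for illustration, it assumes a POSIX system, and it uses `subprocess.run`, which needs Python 3.5 or later (the original report was on Python 2.6). The feed URL may also no longer serve the same content.

```python
import signal
import subprocess
import sys

# The one-liner from this report, kept as a string so it runs in a child process.
REPRO = ('import lxml.html, feedparser; '
         'feedparser.parse("http://www.radionetherlands.nl/news/zijlijn/rss-feed")')


def crashed_with_segfault(code, timeout=60):
    """Run `code` in a fresh interpreter and report whether SIGSEGV killed it."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    # On POSIX, a negative return code is the number of the signal
    # that terminated the child process.
    return proc.returncode == -signal.SIGSEGV
```

Calling `crashed_with_segfault(REPRO)` returns True only if the child died from a segmentation fault. An ImportError or a network failure in the child gives False, so a False result alone does not prove the bug is gone.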

This has been reproduced on an unrelated Linux setup by someone on IRC, who also provided the following gdb trace:

---

Program received signal SIGSEGV, Segmentation fault.
0x0000000500000001 in ?? ()
(gdb) bt
#0 0x0000000500000001 in ?? ()
#1 0x00007ffff6db6362 in __pyx_f_4lxml_5etree__local_resolver (__pyx_v_c_url=0xca0fc0 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
    __pyx_v_c_pubid=0xbba1e0 "-//W3C//DTD XHTML 1.0 Strict//EN", __pyx_v_c_context=0xd2f9c0) at src/lxml/lxml.etree.c:73319
#2 0x0000003bac66265f in xmlLoadExternalEntity__internal_alias (URL=<optimized out>, ID=0xbba1e0 "-//W3C//DTD XHTML 1.0 Strict//EN", ctxt=0xd2f9c0)
    at xmlIO.c:3945
#3 0x0000003bac6f7d60 in xmlSAX2ResolveEntity__internal_alias (ctx=0xd2f9c0, publicId=0xbba1e0 "-//W3C//DTD XHTML 1.0 Strict//EN", systemId=<optimized out>)
    at SAX2.c:512
#4 0x0000003bac6f7e19 in xmlSAX2ExternalSubset__internal_alias (ctx=0xd2f9c0, name=0xd319ff "html", ExternalID=0xbba1e0 "-//W3C//DTD XHTML 1.0 Strict//EN",
    SystemID=0xca9da0 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd") at SAX2.c:402
#5 0x0000003bac64cbfc in xmlParseTryOrFinish (ctxt=0xd2f9c0, terminate=0) at parser.c:11299
#6 0x0000003bac64d55f in xmlParseChunk__internal_alias (ctxt=0xd2f9c0,
    chunk=0xd2b2d4 "l version='1.0' encoding='utf-8'?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en"..., size=512, terminate=0) at parser.c:11739
#7 0x0000003bac6df7be in xmlTextReaderPushData (reader=0xd2d680) at xmlreader.c:853
#8 0x0000003bac6e2564 in xmlTextReaderRead__internal_alias (reader=0xd2d680) at xmlreader.c:1280
#9 0x00007ffff4bd217b in libxml_xmlTextReaderRead (self=<optimized out>, args=<optimized out>) at libxml2-py.c:8521
#10 0x0000003ba36e1560 in call_function (oparg=<optimized out>, pp_stack=0x7fffffffd4b8) at Python/ceval.c:4013
#11 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#12 0x0000003ba36e18b5 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7fffffffd618, func=0xbf77d0)
    at Python/ceval.c:4099
#13 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd618) at Python/ceval.c:4034
#14 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#15 0x0000003ba36e18b5 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7fffffffd778, func=0xc30230)
    at Python/ceval.c:4099
#16 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd778) at Python/ceval.c:4034
#17 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#18 0x0000003ba36e2ea9 in PyEval_EvalCodeEx (co=0x947eb0, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kws=0x6cbd08,
    kwcount=0, defs=0xc3c0d8, defcount=7, closure=0x0) at Python/ceval.c:3253
#19 0x0000003ba36e171e in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7fffffffd988, func=0xc5fa28) at Python/ceval.c:4109
#20 call_function (oparg=<optimized out>, pp_stack=0x7fffffffd988) at Python/ceval.c:4034
#21 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#22 0x0000003ba36e2ea9 in PyEval_EvalCodeEx (co=0x7ffff7f0a9b0, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0,
    kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#23 0x0000003ba36e2fd2 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:667
#24 0x0000003ba36fc9f1 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x640fd0, locals=0x640fd0, flags=<optimized out>,
    arena=<optimized out>) at Python/pythonrun.c:1346
#25 0x0000003ba36fd589 in PyRun_StringFlags (
    str=0x602010 "import lxml.html, feedparser; feedparser.parse(\"http://www.radionetherlands.nl/news/zijlijn/rss-feed\")\n", start=<optimized out>,
    globals=0x640fd0, locals=0x640fd0, flags=0x7fffffffdc50) at Python/pythonrun.c:1309

(gdb) p __pyx_v_c_context->_private
$8 = (void *) 0xd2d680

(gdb) p *__pyx_v_context->_resolvers
$6 = {ob_refcnt = 9665248, ob_type = 0xd2f9c0, __pyx_vtab = 0xa7b890, _resolvers = <unknown at remote 0x100000001>, _default_resolver = 0xce0b60}

---

Required information from my computer:

Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
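
The version table above can be regenerated with a few lines using lxml's version attributes (`LXML_VERSION`, `LIBXML_VERSION`, and so on). A small sketch follows; the function name `collect_versions` is only illustrative, and the import is guarded so the script still reports the Python version when lxml is not installed.

```python
import sys


def collect_versions():
    """Return (label, value) pairs like the version table in this report."""
    rows = [("Python", tuple(sys.version_info))]
    try:
        from lxml import etree
    except ImportError:
        # lxml is missing: report that instead of failing.
        rows.append(("lxml.etree", "not installed"))
        return rows
    rows.extend([
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ])
    return rows


for label, value in collect_versions():
    print("%-16s: %s" % (label, value))
```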

---

Some related IRC discussion about the segfault, by 'marienz' in #python on irc.freenode.net:

"It's somewhat plausible importing lxml sets an lxml-internal callback on libxml's "default entity loader", which is also being used by something else here, which violates some assumption lxml makes when it's actually called"

"yeah, I bet that's it. I'm getting into libxml2 via its own python bindings (/usr/lib64/python2.7/site-packages/libxml2.pyc on this system)."

"note it's trying to getattr "_resolvers" and the _resolvers field it got via c_context->_private is broken. As lxml appears on the stack above libxml2-py.c code I'm pretty sure the two bindings are getting mixed up."

"add to that parser.pxi calling xmlparser.xmlSetExternalEntityLoader(_local_resolver) presumably setting some global callback"
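
The theory in the discussion above can be illustrated with a pure-Python simulation. This is not lxml or libxml2 code: every class and function name below is hypothetical. It only models the idea that one process-wide entity loader hook is shared by two bindings, each of which stores a different kind of object in the parser context's `_private` slot. In Python the wrong assumption surfaces as an AttributeError; in C the same mistake dereferences an unrelated pointer, which would match the broken `_resolvers` value in the gdb output. The guarded variant is one possible approach (check ownership, otherwise defer to the previously installed loader), not a description of any actual fix.

```python
class ParserContext:
    """Stand-in for a libxml2 parser context with an opaque _private slot."""
    def __init__(self, private):
        self._private = private


class LxmlLikeContext:
    """Hypothetical object that an lxml-like binding stores in _private."""
    def __init__(self):
        self._resolvers = ["default resolver"]


class ReaderLikeObject:
    """Hypothetical object that a different binding stores in _private."""


def default_loader(url, ctxt):
    return "default loader fetched " + url


# One process-wide hook, modelling the global set by xmlSetExternalEntityLoader.
_entity_loader = default_loader


def set_entity_loader(func):
    """Install a new global loader and return the previous one."""
    global _entity_loader
    previous = _entity_loader
    _entity_loader = func
    return previous


def load_entity(url, ctxt):
    """Every parser in the process goes through the same global hook."""
    return _entity_loader(url, ctxt)


def naive_resolver(url, ctxt):
    # Assumes every context was created by this binding.
    return "resolved %s via %d resolver(s)" % (url, len(ctxt._private._resolvers))


def make_guarded_resolver(fallback):
    """Build a resolver that defers to `fallback` for contexts it does not own."""
    def guarded_resolver(url, ctxt):
        if not isinstance(ctxt._private, LxmlLikeContext):
            return fallback(url, ctxt)
        return naive_resolver(url, ctxt)
    return guarded_resolver


DTD = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
foreign = ParserContext(ReaderLikeObject())

set_entity_loader(naive_resolver)
try:
    load_entity(DTD, foreign)
except AttributeError as exc:
    print("naive resolver failed on a foreign context:", exc)

set_entity_loader(make_guarded_resolver(default_loader))
print(load_entity(DTD, foreign))
```

Note that real C code cannot use `isinstance` on a `void *`, so an actual guard would need some other way to recognise the contexts it created.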

scoder (scoder) wrote :

I agree that lxml and the libxml2 Python bindings should get along with each other a bit better. It's just not always trivial and/or portable, e.g. due to a dependency on new API functions in libxml2 that allow for a local instead of a global setup.

I certainly take patches that improve the situation.

Changed in lxml:
importance: Undecided → Medium
status: New → Triaged

jiamo (life-130815) wrote :

I get this crash at random:

#0 __pyx_f_4lxml_5etree__collectText (__pyx_v_c_node=<optimized out>)
    at src/lxml/lxml.etree.c:16497
#1 0x00007f34ae920086 in __pyx_pf_4lxml_5etree_8_Element_4text___get__ (__pyx_v_self=
    <lxml.etree._Element at remote 0x5406910>) at src/lxml/lxml.etree.c:37022
#2 __pyx_getprop_4lxml_5etree_8_Element_text (o=<lxml.etree._Element at remote 0x5406910>,
    x=<optimized out>) at src/lxml/lxml.etree.c:6083
#3 0x00000000004d5138 in getset_get ()

python 2.7.3
ubuntu 12.04
python-lxml 2.3.2-1ubuntu0.2
libxml2 2.7.8.dfsg-5.1ubuntu4.9
