parsing from network fails after etree.fromstring()

Bug #673205 reported by Ben Ranker on 2010-11-09
This bug affects 2 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder
Milestone: 3.2

Bug Description

Calling etree.fromstring() without a parser uses the global default parser and appears to set that parser to no-network mode. Calling etree.parse() without a parser uses that same global default parser, now in no-network mode. Trying to use it to parse an HTTP URL thus fails after a call to etree.fromstring(). Here's the output I get from the attached script:

Traceback (most recent call last):
  File "lxml-error.py", line 5, in <module>
    etree.parse('http://www.w3.org/2001/xml.xsd') # this fails
  File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49945)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71784)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72067)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71162)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68160)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64244)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65165)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64508)
lxml.etree.XMLSyntaxError: Attempt to load network entity http://www.w3.org/2001/xml.xsd

%%
Python : (2, 6, 5, 'final', 0)
lxml.etree : (2, 2, 7, 0)
libxml used : (2, 7, 6)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
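
The attached script is not shown in this report. A rough reconstruction of the sequence described above, based only on the description and on line 5 of the traceback, might look like the following. The function name `reproduce` and the `<root/>` input are assumptions for illustration; only the URL and the `etree.parse()` / `etree.fromstring()` calls come from the report.

```python
from lxml import etree


def reproduce(source="http://www.w3.org/2001/xml.xsd"):
    """Run the call sequence from the bug description (hypothetical helper).

    With the default argument this accesses the network.
    """
    # First parse: per the report, this goes through the global default parser.
    etree.parse(source)

    # Per the report, this call appears to leave the default parser
    # in no-network mode.
    etree.fromstring("<root/>")

    # Second parse: reported to raise XMLSyntaxError on affected versions
    # when `source` is an http URL.
    return etree.parse(source)
```

On a release that contains the fix (3.2.0 according to this thread), the second `etree.parse()` call would be expected to behave like the first.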

scoder (scoder) wrote :

I can reproduce this with the latest trunk.

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed
scoder (scoder) on 2011-03-15
summary: - etree.parse() fails after etree.fromstring()
+ parsing from network fails after etree.fromstring()
Søren Bech Christensen (sbc-x) wrote :

I have this same issue on my Windows/cygwin environment using the following construct to parse an xhtml document:

parser = etree.XMLParser(load_dtd = True, dtd_validation = True, remove_blank_text=True, attribute_defaults = True)
html = etree.parse(inputhtmlfile,parser)

returns:

Python : (2, 6, 5, 'final', 0)
lxml.etree : (2, 2, 6, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Traceback (most recent call last):
  File "generatehtml.py", line 61, in <module>
    html = etree.parse(inputhtmlfile,parser)
  File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
lxml.etree.XMLSyntaxError: Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

Whereas on my Debian environment, with an older installation, there is no problem doing the same:

Python : (2, 5, 2, 'final', 0)
lxml.etree : (2, 1, 1, 0)
libxml used : (2, 6, 32)
libxml compiled : (2, 6, 32)
libxslt used : (1, 1, 24)
libxslt compiled : (1, 1, 24)
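
A possible workaround, not confirmed anywhere in this thread, is to stop relying on the parser's default network setting and pass a dedicated parser created with `no_network=False`, which is a documented keyword of `etree.XMLParser`. The helper name `parse_allowing_network` is made up for this sketch.

```python
from lxml import etree


def parse_allowing_network(source, **parser_options):
    """Parse `source` with a private parser that may access the network.

    Hypothetical helper: it builds its own XMLParser, so it does not
    depend on the state of lxml's global default parser.
    """
    parser = etree.XMLParser(no_network=False, **parser_options)
    return etree.parse(source, parser)
```

With the options from the comment above, the call would be `parse_allowing_network(inputhtmlfile, load_dtd=True, dtd_validation=True, remove_blank_text=True, attribute_defaults=True)`. Whether this avoids the error on the affected versions is an assumption; it also allows the parser to fetch remote DTDs, which may not be desirable for untrusted input.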

Hervé (herve-menager) wrote :

I have exactly the same issue on
- version 2.3.0 of lxml
- with python 2.7.2
- and libxml 2.07.08
- and libxslt 1.01.26

Dan Muresan (danmbox) wrote :

This bug is very annoying in interactive sessions. It must be "very difficult" to fix since it hasn't been touched for such a long time. Perhaps the documentation needs to be updated, as tracking down this strange problem has been (and will be) a time sink for multiple users.

scoder (scoder) wrote :

I don't think it's "very difficult" to fix (not sure why you used quotes). The problem is that it's not obvious to me why this fails, so before fixing it (which might actually be trivial), someone needs to investigate it first to find out what exactly goes wrong at what point in the code. This takes time, and unless someone (maybe you?) investigates this time, there is no fix.

I don't think changing the documentation makes any sense.

scoder (scoder) wrote :
Changed in lxml:
assignee: nobody → scoder (scoder)
status: Confirmed → Fix Committed
scoder (scoder) wrote :

Fixed in lxml 3.2.0.

Changed in lxml:
status: Fix Committed → Fix Released
scoder (scoder) on 2013-04-28
Changed in lxml:
milestone: none → 3.2