parsing from network fails after etree.fromstring()

Bug #673205 reported by Ben Ranker
This bug affects 2 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder

Bug Description

Calling etree.fromstring() without a parser uses the global default parser and appears to set that parser to no-network mode. Calling etree.parse() without a parser uses that same global default parser, now in no-network mode. Trying to use it to parse an HTTP URL thus fails after a call to etree.fromstring(). Here's the output I get from the attached script:

Traceback (most recent call last):
  File "lxml-error.py", line 5, in <module>
    etree.parse('http://www.w3.org/2001/xml.xsd') # this fails
  File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49945)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71784)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72067)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71162)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68160)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64244)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65165)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64508)
lxml.etree.XMLSyntaxError: Attempt to load network entity http://www.w3.org/2001/xml.xsd

%%
Python : (2, 6, 5, 'final', 0)
lxml.etree : (2, 2, 7, 0)
libxml used : (2, 7, 6)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
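
The attached script is not reproduced in this thread, so the following is only a sketch of the sequence described above, plus one possible workaround. The function names are hypothetical, and the idea that a dedicated parser created with no_network=False avoids the problem is an assumption that the thread does not confirm. Both functions need network access, so they are defined here but not called.

```python
from lxml import etree

# URL taken from the traceback above.
URL = "http://www.w3.org/2001/xml.xsd"


def reproduce(url=URL):
    """Sequence described in the report (hypothetical helper, needs network)."""
    etree.fromstring("<root/>")  # no parser given: uses the global default parser
    return etree.parse(url)      # reported to fail afterwards on affected versions


def parse_from_network(url=URL):
    """Assumed workaround: give etree.parse() its own network-enabled parser."""
    parser = etree.XMLParser(no_network=False)  # default is no_network=True
    return etree.parse(url, parser)


# Offline sanity check: an explicit parser leaves the default parser out of it.
root = etree.fromstring("<root/>", etree.XMLParser())
print(root.tag)
```

Keeping a separate parser object for network documents means the global default parser is never involved in the etree.parse() call, whatever state it is in.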

Ben Ranker (ben-lateralfricative) wrote :
scoder (scoder) wrote :

I can reproduce this with the latest trunk.

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed
scoder (scoder)
summary: - etree.parse() fails after etree.fromstring()
+ parsing from network fails after etree.fromstring()
Søren Bech Christensen (sbc-x) wrote :

I have this same issue on my Windows/cygwin environment using the following construct to parse an xhtml document:

parser = etree.XMLParser(load_dtd=True, dtd_validation=True, remove_blank_text=True, attribute_defaults=True)
html = etree.parse(inputhtmlfile, parser)

returns:

Python : (2, 6, 5, 'final', 0)
lxml.etree : (2, 2, 6, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Traceback (most recent call last):
  File "generatehtml.py", line 61, in <module>
    html = etree.parse(inputhtmlfile,parser)
  File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
lxml.etree.XMLSyntaxError: Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

Whereas on my Debian environment, with an older installation, there is no problem doing the same:

Python : (2, 5, 2, 'final', 0)
lxml.etree : (2, 1, 1, 0)
libxml used : (2, 6, 32)
libxml compiled : (2, 6, 32)
libxslt used : (1, 1, 24)
libxslt compiled : (1, 1, 24)
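
For the explicit-parser case in this comment, a minimal sketch of a parser that is allowed to fetch an external DTD might look like the following. The no_network=False argument is an addition that does not appear in the comment, and whether it resolves the failure in that environment is an assumption. The document used here carries an internal DTD subset so the example runs without any network access.

```python
from lxml import etree

# Options copied from the comment above, plus no_network=False (assumption).
parser = etree.XMLParser(
    load_dtd=True,
    dtd_validation=True,
    remove_blank_text=True,
    attribute_defaults=True,
    no_network=False,
)

# Offline check: the DTD is declared inline, so nothing has to be downloaded.
doc = b"<!DOCTYPE root [<!ELEMENT root EMPTY>]><root/>"
root = etree.fromstring(doc, parser)
print(root.tag)
```

With a real XHTML file, the same parser would be passed as the second argument to etree.parse(), as in the construct quoted above.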

Hervé (herve-menager) wrote :

I have exactly the same issue on
- version 2.3.0 of lxml
- with python 2.7.2
- and libxml 2.07.08
- and libxslt 1.01.26

danmb (danmbox) wrote :

This bug is very annoying in interactive sessions. It must be "very difficult" to fix since it hasn't been touched for such a long time. Perhaps the documentation needs to be updated, as tracking down this strange problem has been (and will be) a time sink for multiple users.

scoder (scoder) wrote :

I don't think it's "very difficult" to fix (not sure why you used quotes). The problem is that it's not obvious to me why this fails, so before fixing it (which might actually be trivial), someone needs to investigate it first to find out what exactly goes wrong at what point in the code. This takes time, and unless someone (maybe you?) invests this time, there is no fix.

I don't think changing the documentation makes any sense.

scoder (scoder) wrote :
Changed in lxml:
assignee: nobody → scoder (scoder)
status: Confirmed → Fix Committed
scoder (scoder) wrote :

Fixed in lxml 3.2.0.

Changed in lxml:
status: Fix Committed → Fix Released
scoder (scoder)
Changed in lxml:
milestone: none → 3.2