lxml.html.parse does not recognize "https"

Bug #599533 reported by Martin Mai
This bug affects 1 person
Affects        Status   Importance  Assigned to  Milestone
lxml (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

Passing an https URI to "parse" from the lxml.html module always raises an IOError. The same call with an http URI works.

>>> from lxml.html import parse
>>> parse("http://www.google.de")
<lxml.etree._ElementTree object at 0x7fa204f2eea8>
>>> parse("https://www.google.de")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/dist-packages/lxml/html/__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64493)
IOError: Error reading file 'https://www.google.de': failed to load external entity "https://www.google.de"

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: python-lxml 2.2.6-1
ProcVersionSignature: Ubuntu 2.6.35-6.8-generic 2.6.35-rc3
Uname: Linux 2.6.35-6-generic x86_64
NonfreeKernelModules: nvidia
Architecture: amd64
Date: Mon Jun 28 22:10:27 2010
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Alpha amd64 (20100602.2)
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.utf8
 SHELL=/bin/bash
SourcePackage: lxml

Martin Mai (mrkanister-deactivatedaccount-deactivatedaccount) wrote :

Forget the "de" domain of Google: its https URI redirects to "http://www.google.com". But "https://www.google.com" really exists, and it fails too:

>>> parse("https://www.google.com")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/dist-packages/lxml/html/__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64493)
IOError: Error reading file 'https://www.google.com': failed to load external entity "https://www.google.com"

scoder (scoder) wrote :

SSL/TLS is not supported by libxml2. Use Python's urllib2 instead.
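
A minimal sketch of the suggested workaround: fetch the page over HTTPS with Python's own URL library and hand the open response to lxml.html.parse, so libxml2 never has to open the URL itself. The helper name parse_https and its opener and timeout parameters are hypothetical, introduced only for this illustration. On Python 3 the urllib2 module is named urllib.request, so the import tries both.

```python
try:
    from urllib.request import urlopen  # Python 3 name
except ImportError:
    from urllib2 import urlopen  # Python 2 name, as suggested in the comment

from lxml import html


def parse_https(url, opener=urlopen, timeout=10):
    """Hypothetical helper: download `url` in Python, then parse it with lxml.

    `opener` and `timeout` are assumptions added for illustration; they are
    not part of lxml's API.
    """
    response = opener(url, timeout=timeout)
    try:
        # lxml.html.parse accepts a file-like object; base_url tells lxml
        # where the document came from so relative links can be resolved.
        return html.parse(response, base_url=url)
    finally:
        response.close()
```

Calling parse_https("https://www.google.com") should then return an ElementTree in the same way the plain http call does, provided the host is reachable.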

Changed in lxml (Ubuntu):
status: New → Invalid