lxml.html.parse does not recognize "https"
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml (Ubuntu) | Invalid | Undecided | Unassigned | | |
Bug Description
It is not possible to pass an https URI to "parse" from the lxml.html module: the call always raises an IOError. Passing http URIs works.
>>> parse("http://
<lxml.etree.
>>> parse("https:/
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/
return etree.parse(
File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/
File "parser.pxi", line 1500, in lxml.etree.
File "parser.pxi", line 1529, in lxml.etree.
File "parser.pxi", line 1429, in lxml.etree.
File "parser.pxi", line 975, in lxml.etree.
File "parser.pxi", line 539, in lxml.etree.
File "parser.pxi", line 625, in lxml.etree.
File "parser.pxi", line 563, in lxml.etree.
IOError: Error reading file 'https:/
ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: python-lxml 2.2.6-1
ProcVersionSign
Uname: Linux 2.6.35-6-generic x86_64
NonfreeKernelMo
Architecture: amd64
Date: Mon Jun 28 22:10:27 2010
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Alpha amd64 (20100602.2)
ProcEnviron:
PATH=(custom, user)
LANG=en_US.utf8
SHELL=/bin/bash
SourcePackage: lxml
Forget the "de" domain of Google: its https URI redirects to "http://www.google.com". But "https://www.google.com" really exists and fails:
>>> parse("https://www.google.com")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/dist-packages/lxml/html/__init__.py", line 661, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64493)
IOError: Error reading file 'https://www.google.com': failed to load external entity "https://www.google.com"
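A possible workaround for the failure reported above, sketched under some assumptions: the report does not state a cause, but the "failed to load external entity" message suggests the URL is loaded by the underlying parser library rather than by Python, and that loader presumably does not handle https. The sketch downloads https resources in Python first and hands lxml a file-like object. It targets Python 3 (the report used Python 2.6), and the helper names `needs_prefetch` and `parse_html` are hypothetical, not part of lxml.

```python
import io
import urllib.request
from urllib.parse import urlsplit


def needs_prefetch(url):
    """Return True for URLs that we download ourselves (here: only https)."""
    return urlsplit(url).scheme.lower() == "https"


def parse_html(url, opener=urllib.request.urlopen):
    """Parse an HTML document from url, fetching https resources in Python.

    Hypothetical helper: `opener` is any callable that returns a
    context-managed response object with a read() method.
    """
    # Imported lazily so needs_prefetch() stays usable without lxml installed.
    import lxml.html

    if not needs_prefetch(url):
        # http URLs worked in the report, so pass them straight through.
        return lxml.html.parse(url)

    with opener(url) as response:
        data = response.read()

    # parse() accepts file-like objects; base_url keeps relative links resolvable.
    return lxml.html.parse(io.BytesIO(data), base_url=url)
```

Calling `parse_html("https://www.google.com")` should then return an element tree like the one `parse` returns for http URLs, provided the machine has network access and lxml is installed.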