make_links_absolute fails on bad links
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Low | Unassigned |
Bug Description
make_links_
Two examples:
<html>
<body>
<div>
<h1>Some test</h1>
<!-- Links like this make lxml panic -->
<a href="http://
</div>
</body>
</html>
This will throw
ret["
File "/home/
self.
File "/home/
new_link = link_repl_
File "/home/
return urljoin(base_url, href)
File "/usr/lib/
urlparse(url, bscheme, allow_fragments)
File "/usr/lib/
tuple = urlsplit(url, scheme, allow_fragments)
File "/usr/lib/
raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
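The failure at the bottom of this traceback can probably be reproduced without lxml, since the error comes from the standard library's urljoin. The following is a minimal sketch, not the reporter's exact case: the href in the report is truncated, so `bad_href` and `base_url` here are hypothetical values chosen to contain an unmatched `[`. It uses Python 3's `urllib.parse`, whereas the traceback above comes from the Python 2 `urlparse` module.

```python
from urllib.parse import urljoin

base_url = "http://example.com/"   # hypothetical base URL, not from the report
bad_href = "http://[broken"        # hypothetical href: '[' with no closing ']'

try:
    urljoin(base_url, bad_href)
except ValueError as exc:
    # urlsplit() rejects a netloc with unbalanced brackets
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

Because the exception is raised while joining a single link, one malformed href is enough to abort processing of the whole document.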
Other example:
"""
<html>
<body>
<div>
<h1>Some a\x12b\
<a href="/
</div>
</body>
</html>"""
This will throw
ret["
File "/home/
self.
File "/home/
el.
File "lxml.etree.pyx", line 2222, in lxml.etree.
File "apihelpers.pxi", line 520, in lxml.etree.
File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
It would be great to just ignore these malformed links or at least have an option to do so, because for now my only choice is to disable make_links_
Python : sys.version_
lxml.etree : (3, 2, 4, 0)
libxml used : (2, 7, 2)
libxml compiled : (2, 7, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
Thanks a lot!
Changed in lxml:
status: New → Confirmed
importance: Undecided → Low
The first part is fixed by adding a new option "handle_failures" here:
https://github.com/lxml/lxml/commit/ab497930d74c7bcf4b725809508a1fefef453faa
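To illustrate what such an option might do, here is a standalone sketch that mimics the behaviour with the standard library only. It is not the lxml implementation: `resolve_links` is a hypothetical helper, and the `"ignore"` and `"discard"` values are my understanding of the option, so check the lxml documentation for the exact accepted values.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs, handle_failures=None):
    """Join each href against base_url (hypothetical helper, not lxml API).

    handle_failures=None       re-raise the ValueError (the reported behaviour)
    handle_failures="ignore"   keep the malformed href unchanged
    handle_failures="discard"  drop the malformed href from the result
    """
    if handle_failures not in (None, "ignore", "discard"):
        raise ValueError("unexpected handle_failures value: %r" % (handle_failures,))
    resolved = []
    for href in hrefs:
        try:
            resolved.append(urljoin(base_url, href))
        except ValueError:
            if handle_failures is None:
                raise
            if handle_failures == "ignore":
                resolved.append(href)
            # "discard": the malformed link is simply skipped
    return resolved


hrefs = ["/about", "http://[broken"]  # the second href is deliberately malformed
kept = resolve_links("http://example.com/", hrefs, handle_failures="ignore")
dropped = resolve_links("http://example.com/", hrefs, handle_failures="discard")
print(kept)
print(dropped)
```

With the default of `None` the sketch still raises, which matches the behaviour described in the report.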
The second part is trickier. The right fix would be to handle encoding problems in parsed, broken HTML trees better in general.
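Until that is addressed, one possible workaround is to remove the offending characters from a string before handing it to lxml. The sketch below is an assumption about what a caller could do, not the fix described above: `strip_control_chars` is a hypothetical helper that drops the C0 control characters XML 1.0 forbids while keeping tab, newline and carriage return.

```python
import re

# C0 control characters that XML 1.0 does not allow
# (tab \x09, newline \x0a and carriage return \x0d are permitted).
_XML_INCOMPATIBLE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text):
    """Return text with XML-incompatible control characters removed."""
    return _XML_INCOMPATIBLE.sub("", text)


cleaned = strip_control_chars("Some a\x12b")
print(repr(cleaned))
```

Note that this silently changes the content, so it only suits cases where losing those characters is acceptable.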