make_links_absolute fails on bad links

Bug #1250557 reported by sylvain zimmer on 2013-11-12
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

make_links_absolute() is not robust enough when receiving bad input.

Two examples. The first:

<html>
<body>
  <div>
    <h1>Some test</h1>

    <!-- Links like this make lxml panic -->
    <a href="http://faceracebase.com]Buy">test2</a>
  </div>
</body>
</html>

This raises:

  ret["doc"].make_links_absolute(base_href)
  File "/home/worker/code/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 316, in make_links_absolute
    self.rewrite_links(link_repl)
  File "/home/worker/code/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 437, in rewrite_links
    new_link = link_repl_func(link.strip())
  File "/home/worker/code/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 315, in link_repl
    return urljoin(base_url, href)
  File "/usr/lib/python2.7/urlparse.py", line 260, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 142, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 190, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
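
The exception comes from the standard library's URL parser rather than from lxml itself, so the failure can be reproduced without lxml: a `]` in the host part with no matching `[` is rejected as a malformed IPv6 literal. A minimal sketch, assuming Python 3 (`urllib.parse`, where the traceback above uses the Python 2 `urlparse` module). The base URL and the `try_join` helper are placeholders for illustration, not part of the report:

```python
from urllib.parse import urljoin

BASE_URL = "http://example.com/"          # placeholder base, not from the report
BAD_HREF = "http://faceracebase.com]Buy"  # href from the first example


def try_join(base, href):
    """Return (joined_url, None) on success or (None, error_message) on failure."""
    try:
        return urljoin(base, href), None
    except ValueError as exc:
        # urljoin() parses the href first, so a malformed host fails here.
        return None, str(exc)


joined, error = try_join(BASE_URL, BAD_HREF)
print(joined, error)
```

A well-formed relative link such as `/ok` goes through the same helper without an error.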

Second example:

"""
<html>
<body>
  <div>
    <h1>Some a\x12b\x13c\x14d\x15e test</h1>
    <a href="/bad-link-a\x12b\x13c\x14d\x15e">test</a>
  </div>
</body>
</html>"""

This raises:

  ret["doc"].make_links_absolute(base_href)
  File "/home/worker/code/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 316, in make_links_absolute
    self.rewrite_links(link_repl)
  File "/home/worker/code/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 454, in rewrite_links
    el.attrib[attrib] = new_link
  File "lxml.etree.pyx", line 2222, in lxml.etree._Attrib.__setitem__ (src/lxml/lxml.etree.c:54583)
  File "apihelpers.pxi", line 520, in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:17678)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
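
Note that this second error is raised when the rewritten link is assigned back to the attribute, not while the URL is being joined. One possible way to spot such links up front is a control-character check like the sketch below. The helper name `has_control_chars` is hypothetical, and the character set it tests (code points below U+0020 other than tab, newline and carriage return) is an assumption about what lxml rejects, based only on the error message:

```python
import re

# Code points below U+0020, excluding tab (\x09), newline (\x0a)
# and carriage return (\x0d). Assumed set, see the note above.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def has_control_chars(text):
    """Return True if text contains a NULL byte or another control character."""
    return _CONTROL_CHARS.search(text) is not None


bad_href = "/bad-link-a\x12b\x13c\x14d\x15e"  # href from the second example
print(has_control_chars(bad_href))
print(has_control_chars("/good-link"))
```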

It would be great if these malformed links were simply ignored, or if there were at least an option to do so. For now, my only choices are to disable make_links_absolute() entirely or to pre-process the links myself, which would make the function pointless in the first place.

Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 2, 4, 0)
libxml used : (2, 7, 2)
libxml compiled : (2, 7, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

Thanks a lot!

scoder (scoder) wrote :

The first part is fixed by adding a new option "handle_failures" here:

https://github.com/lxml/lxml/commit/ab497930d74c7bcf4b725809508a1fefef453faa
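
As a rough illustration of what an option like this could do (this is not the code from the commit), the sketch below wraps `urljoin` in a callback that either re-raises, keeps the original link, or drops it. The helper name `make_link_repl` is hypothetical, and the mode names `"ignore"` and `"discard"` are assumptions here; check the commit for the actual accepted values:

```python
from urllib.parse import urljoin


def make_link_repl(base_url, handle_failures=None):
    """Build a link-rewriting callback (hypothetical helper, not lxml's code).

    handle_failures=None      -> let ValueError propagate
    handle_failures="ignore"  -> keep the original href unchanged
    handle_failures="discard" -> return None so the caller can drop the link
    """
    if handle_failures not in (None, "ignore", "discard"):
        raise ValueError("unexpected handle_failures: %r" % (handle_failures,))

    def link_repl(href):
        try:
            return urljoin(base_url, href)
        except ValueError:
            if handle_failures is None:
                raise
            return href if handle_failures == "ignore" else None

    return link_repl


repl = make_link_repl("http://example.com/", handle_failures="ignore")
print(repl("http://faceracebase.com]Buy"))  # left as-is instead of raising
```

With `"ignore"` the broken link stays in the document untouched, which matches the "just ignore these malformed links" request in the report. `"discard"` is the stricter variant for callers who would rather lose the link than keep an unparseable one.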

The second part is trickier. The right fix would be to handle encoding problems in trees parsed from broken HTML better in general.

Thanks a lot for the first part of the fix :)

scoder (scoder) on 2013-11-15
Changed in lxml:
status: New → Confirmed
importance: Undecided → Low