Cleaner removes all <link>s when cleaning javascript regardless of host_whitelist

Bug #715687 reported by Mohammad Taha Jahangir
Affects: lxml
Status: Fix Released
Importance: Undecided
Assigned to: Christine Koppelt
Milestone: 3.2

Bug Description

When cleaning HTML with lxml.html.clean.Cleaner with these options set, all <link> elements are removed regardless of host_whitelist:
links = False
page_structure = False
javascript = True
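A minimal reproduction sketch of the configuration above (the markup and URL are illustrative; in recent lxml releases the cleaner lives in the separate lxml_html_clean package, so the import is guarded):

```python
try:
    from lxml.html.clean import Cleaner   # older lxml releases
except ImportError:
    from lxml_html_clean import Cleaner   # lxml >= 5.2 split-out package

html = (
    '<html><head>'
    '<link rel="stylesheet" href="http://example.com/style.css">'
    '</head><body><p>Hello</p></body></html>'
)

cleaner = Cleaner(
    links=False,           # do NOT strip <link> tags
    page_structure=False,  # keep <html>, <head>, <body>
    javascript=True,       # strip JavaScript (triggers the code path quoted below)
    host_whitelist=['example.com'],
)

# In affected versions (before the 3.2 fix), the stylesheet <link> is
# removed here even though its host is whitelisted.
cleaned = cleaner.clean_html(html)
print(cleaned)
```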

This is caused by the following code in clean.py:
        # line 311
        elif self.style or self.javascript:
            # We must get rid of included stylesheets if Javascript is not
            # allowed, as you can put Javascript in them
            for el in list(doc.iter('link')):
                if 'stylesheet' in el.get('rel', '').lower():
                    # Note this kills alternate stylesheets as well
                    el.drop_tree()

All matching <link> elements are removed unconditionally by drop_tree(); it seems they should instead be removed via kill_tags.add('link'), so that host_whitelist is taken into account.
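To illustrate the difference: drop_tree() removes an element outright, whereas elements routed through kill_tags pass through Cleaner.allow_element(), which is where host_whitelist is consulted. A small standalone sketch of the unconditional drop_tree() behaviour (hypothetical markup, using only core lxml.html):

```python
import lxml.html

doc = lxml.html.fromstring(
    '<html><head>'
    '<link rel="stylesheet" href="http://example.com/style.css">'
    '</head><body><p>Hi</p></body></html>'
)

# Same loop shape as the clean.py snippet above: every stylesheet <link>
# is dropped, with no whitelist check anywhere on this path.
for el in list(doc.iter('link')):
    if 'stylesheet' in el.get('rel', '').lower():
        el.drop_tree()

assert doc.find('.//link') is None  # the link is gone unconditionally
```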

Version info:
Python : sys.version_info(major=3, minor=1, micro=2, releaselevel='final', serial=0)
lxml.etree : (2, 3, -99, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Revision history for this message
Christine Koppelt (ch-ko123) wrote :

See pull request 115 on github (https://github.com/lxml/lxml/pull/115)

Changed in lxml:
assignee: nobody → Christine Koppelt (ch-ko123)
Changed in lxml:
status: New → In Progress
Changed in lxml:
status: In Progress → Fix Committed
Revision history for this message
scoder (scoder) wrote :

Fixed in lxml 3.2.0.

Changed in lxml:
status: Fix Committed → Fix Released
scoder (scoder)
Changed in lxml:
milestone: none → 3.2