Cleaner() removes comments no matter what

Bug #1882606 reported by mlissner on 2020-06-08
This bug affects 1 person

Affects: lxml | Importance: Low | Assigned to: Unassigned

Bug Description

Cleaner() has an argument `comments`, which should control whether comments are stripped from HTML trees. In version 4.5.1, the argument doesn't seem to work. No matter if it's True or False, comments are always stripped:

    Cleaner(comments=False).clean_html('<!-- test --><a href="/asdf" onclick="foo">test</a>')

    '<a href="/asdf" onclick="foo">test</a>'

For good measure, I tried flipping all the possible attributes on Cleaner() to False. Same outcome:

    In [36]: Cleaner(javascript=False, safe_attrs_only=False, scripts=False, comments=False, style=False, inline_style=False, links=False, meta=False, page_structure=False, processing_instructions=False, embedded=False, frames=False, forms=False, annoying_tags=False, remove_unknown_tags=False, ).clean_html(s)
    Out[36]: '<a href="/asdf" onclick="foo">test</a>'

I'm a bit surprised by this, since I'd imagine a unit test would catch it, but perhaps there's something I don't understand about how Cleaner() or lxml works.

Thanks for the amazing package. I've been using it for years and it's really quite impressive.

Please let me know if there is anything else I can provide. Here's my version info:

```
In [39]: print("%-20s: %s" % ('Python', sys.version_info))
Python              : sys.version_info(major=2, minor=7, micro=17, releaselevel='final', serial=0)

In [40]: print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree          : (4, 5, 1, 0)

In [41]: print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used         : (2, 9, 10)

In [42]: print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled     : (2, 9, 10)

In [43]: print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used        : (1, 1, 34)

In [44]: print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled    : (1, 1, 34)
```

scoder (scoder) wrote :

Cleaner.clean_html() uses the default parser internally, which strips comments and PIs before even passing the result through the cleaner. I admit that may appear counter-intuitive. A PR to configure the parser based on the relevant Cleaner options would be welcome.
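For illustration, the parser-level stripping can be seen directly with `etree.HTMLParser`, whose `remove_comments` option controls this at parse time, before any cleaning happens (a minimal sketch, independent of the Cleaner):

```python
from lxml import etree, html

s = '<div><!-- note --><a href="/x">x</a></div>'

# remove_comments is a parser option; it takes effect during parsing,
# before the cleaner ever sees the tree
keep = etree.HTMLParser(remove_comments=False)
drop = etree.HTMLParser(remove_comments=True)

print(etree.tostring(html.fromstring(s, parser=keep)))
# comment survives: b'<div><!-- note --><a href="/x">x</a></div>'
print(etree.tostring(html.fromstring(s, parser=drop)))
# comment gone:     b'<div><a href="/x">x</a></div>'
```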

scoder (scoder) wrote :

Note that passing a parsed tree into the cleaner does not suffer from this issue.

Changed in lxml:
importance: Undecided → Low
status: New → Confirmed

Hm, I'd expect that setting the default parser would fix this then?

I just tried this, but didn't get the fix I was hoping for:

    etree.set_default_parser(etree.HTMLParser())
    html = '<!-- comment--><a href="asdf">test</a>'
    Cleaner(
        javascript=False,
        safe_attrs_only=False,
        scripts=False,
        comments=False,
        style=False,
        inline_style=False,
        links=False,
        meta=False,
        page_structure=False,
        processing_instructions=False,
        embedded=False,
        frames=False,
        forms=False,
        annoying_tags=False,
        remove_unknown_tags=False
    ).clean_html(html)

    '<a href="asdf">test</a>'  # the comment is still stripped

Am I missing something here?

> Note that passing a parsed tree into the cleaner does not suffer from this issue.

I don't find that to be true either, but perhaps I'm misunderstanding. I have this code:

def clean_a_tree(tree):
    assert isinstance(tree, lxml.html.HtmlElement), (
        "`tree` must be of type HtmlElement, but is of type %s. Cleaner() can "
        "work with strs and unicode, but it does bad things to encodings if "
        "given the chance."
        % type(tree)
    )
    cleaner = Cleaner(
        javascript=False,
        safe_attrs_only=False,
        forms=False,
        comments=False,
        processing_instructions=False,
        scripts=True,
        style=True,
        links=True,
        embedded=True,
        frames=True,
    )
    return cleaner.clean_html(tree)

So it asserts that it's getting a tree, but this still suffers from the issue.

Thanks for the help. I'm pretty lost and I admit I'm frustrated with this issue. I'm guessing it's just a documentation issue, but I haven't been able to sort it out yet.

Thank you again,

Mike

scoder (scoder) wrote :

That should be

    etree.set_default_parser(etree.HTMLParser(remove_comments=False))

and I'd generally recommend against changing the default parser, unless you control all usages of lxml in your app (including your dependencies).

But yes, that doesn't work either. The reason is that there is an additional Cleaner step at the end that removes any "unknown tags", and since comments and PIs are not in the list of known tags, they are removed as well. :-/

That bug didn't show up as clearly before because comments are removed by default by the parser, and most use cases are happy with that. You only run into this issue when you want to keep the comments in place while cleaning the rest.
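The "unknown tag" classification is visible if you inspect node tags directly: a comment node's `.tag` is the function `lxml.etree.Comment` rather than a string, so it can never match a list of known HTML tag names (a quick illustration, not lxml internals):

```python
from lxml import etree, html

# parse with comments kept, so the comment node is actually in the tree
parser = etree.HTMLParser(remove_comments=False)
root = html.fromstring('<div><!-- hi --><p>x</p></div>', parser=parser)

for el in root.iter():
    print(el.tag)  # 'div', then the etree.Comment function, then 'p'

# the comment's tag is a function, so any string-based whitelist misses it
assert any(el.tag is etree.Comment for el in root.iter())
```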

This isn't entirely easy to fix since I'd like to avoid breaking (too much) code that relies on the current behaviour – and there are tons of options that change that behaviour. I'll see what I can do.

Oof, yeah, we are working with a government website that actually uses HTML comments in a good way. The HTML doesn't have headers between sections, but it does say things like:

<!-- Next section: Documents -->

So those comments are the best way to parse the page into sections.
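Those marker comments can be pulled out by iterating over comment nodes, since `iter()` accepts `etree.Comment` as a tag filter. A minimal sketch (the markup here is a made-up stand-in for the site):

```python
from lxml import etree, html

raw = ('<div><!-- Next section: Documents --><p>a.pdf</p>'
       '<!-- Next section: Parties --><p>Smith</p></div>')

# keep comments at parse time so there is something to iterate over
parser = etree.HTMLParser(remove_comments=False)
doc = html.fromstring(raw, parser=parser)

# iter() with etree.Comment yields only comment nodes
markers = [c.text.strip() for c in doc.iter(etree.Comment)]
print(markers)  # ['Next section: Documents', 'Next section: Parties']
```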

This seems like a tricky problem to solve in a backward-compatible way. I don't have any good ideas aside from a major version bump, but I really appreciate that you're looking into it.

scoder (scoder) wrote :

If you need a work-around, you can pass the "allow_tags" yourself as

    allow_tags = set(lxml.html.defs.tags) | {etree.Comment}

and/or set "remove_unknown_tags=False". That will prevent the comments from being discarded as "unknown".

scoder (scoder) wrote :

Here's a fix:
https://github.com/lxml/lxml/commit/dd2d80a416e0aa5e177a723bcd571acf83a4c06a

I think it's not a backward compatibility issue. If users requested comments/PIs *not* to be removed, then they'll now have them preserved. That's it. Given that removal was the default, users explicitly had to express their intent to keep them.

Changed in lxml:
milestone: none → 4.5.2
status: Confirmed → Fix Committed

Thanks for that fix. A couple responses. First, it seems like your workaround doesn't work either:

  In [14]: cleaner = Cleaner(
      safe_attrs_only=False,
      comments=False,
      processing_instructions=False,
      remove_unknown_tags=False,
      allow_tags=set(lxml.html.defs.tags) | {lxml.etree.Comment},
  )
  In [15]: cleaner.clean_html('<!-- comment --><a href="asdf">test</a>')
  Out[15]: '<a href="asdf">test</a>' # <-- Should have kept the comment, right?

Also, and sort of relatedly, at first I didn't include the remove_unknown_tags attribute above, and when I ran clean_html it complained:

  "It does not make sense to pass in both allow_tags and remove_unknown_tags"

It's a bit surprising because I never set remove_unknown_tags to True; it's the default. So whenever you pass the allow_tags parameter, you also have to pass remove_unknown_tags=False, and you only learn this *after* you initialize the class, when you call clean_html.

Does it make sense to put this check in the __init__() method as well (or instead), maybe as an assertion? Or, if you want to go wild, maybe allow_tags could take precedence over the default remove_unknown_tags? Not sure there. It'd be nice if this didn't crash:

 Cleaner(allow_tags=something).clean_html(some_html)

I also hate to ask because you've been so helpful fixing this, but any idea when 4.5.2 will be released to PyPI? I'm trying to decide if I should wait for the release or use the workaround (assuming it can be made to work).

Thank you again,

Mike

scoder (scoder) wrote :

> it seems like your workaround doesn't work either

It works if you have a single element instead of a comment and an element:

>>> cleaner.clean_html('<div><!-- comment --><a href="asdf">test</a></div>')
'<div><!-- comment --><a href="asdf">test</a></div>'

So, comments that precede the element that it cleans are still discarded. I'm not sure right now what the best way would be to make that work for you as well. It really feels like no-one ever wanted comments to come out of the Cleaner. :)
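Putting the pieces of this thread together, the workaround does hold up when the comment sits inside a single root element. A sketch (the try/except import is an assumption about the installed version: in newer releases the cleaner moved to the separate lxml_html_clean package):

```python
try:
    from lxml.html.clean import Cleaner   # older lxml releases
except ImportError:
    from lxml_html_clean import Cleaner   # split-out package in newer releases

import lxml.html.defs
from lxml import etree

cleaner = Cleaner(
    comments=False,
    processing_instructions=False,
    remove_unknown_tags=False,
    # whitelist all known HTML tags plus the Comment node type
    allow_tags=set(lxml.html.defs.tags) | {etree.Comment},
)

# single root element, so the comment is not a preceding sibling of the
# element being cleaned
out = cleaner.clean_html('<div><!-- note --><a href="/x">x</a></div>')
print(out)
```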

> It's a bit surprising because it's not like *I* set remove_unknown_tags to True, it's the default

Right, I agree that that's surprising. I think changing the default of "remove_unknown_tags" to a sentinel value and checking if it was passed at all would be better here. If users don't pass it, take a decision based on whether "allow_tags" was passed or not. It's good that passing both is currently an error, that makes it safe to change the combination. And yes, we should do that in the constructor, not later.
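The sentinel-default idea can be sketched like this (hypothetical, simplified code for illustration, not lxml's actual implementation):

```python
# module-level sentinel: distinguishes "argument not passed" from any real value
_unset = object()

class CleanerSketch:
    def __init__(self, allow_tags=None, remove_unknown_tags=_unset):
        if allow_tags is not None and remove_unknown_tags is _unset:
            # an explicit whitelist implies the caller controls what survives
            remove_unknown_tags = False
        elif remove_unknown_tags is _unset:
            remove_unknown_tags = True  # historical default
        if allow_tags is not None and remove_unknown_tags:
            # fail fast in the constructor, not later in clean_html()
            raise ValueError(
                "It does not make sense to pass in both "
                "allow_tags and remove_unknown_tags")
        self.allow_tags = allow_tags
        self.remove_unknown_tags = remove_unknown_tags
```

With this, `CleanerSketch(allow_tags=something)` works without extra arguments, while explicitly passing a truthy remove_unknown_tags alongside allow_tags still fails, now at construction time.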

Care to provide a PR?

scoder (scoder) wrote :

> any idea when 4.5.2 will be released to PyPI

I was waiting for Cython 0.29.20 to be released for better Py3.9 support. Now that that's done, I think fixing (or improving on) this issue for the release would be nice, and then I'd push it out.

I rarely have time these days to do PRs, but since you've been so super helpful, I took a stab at it:

https://github.com/lxml/lxml/pull/301

I was also finally able to get that workaround going today, so thank you once again for that. I guess we can close this issue, if you want to just focus on the PR. I assume that you've got the other fix in place, though I haven't verified it.

scoder (scoder) on 2020-08-05
Changed in lxml:
status: Fix Committed → Fix Released