Comment 8 for bug 1882606

Revision history for this message
mlissner (mlissner-michaeljaylissner) wrote :

Thanks for that fix. A couple responses. First, it seems like your workaround doesn't work either:

  In [14]: cleaner = Cleaner(
      safe_attrs_only=False,
      comments=False,
      processing_instructions=False,
      remove_unknown_tags=False,
      allow_tags= set(lxml.html.defs.tags) | {lxml.etree.Comment},
  )
  In [15]: cleaner.clean_html('<!-- comment --><a href="asdf">test</a>')
  Out[15]: '<a href="asdf">test</a>' # <-- Should have kept the comment, right?

Also, and sort of relatedly, at first I didn't include the remove_unknown_tags attribute above, and it complained when I ran clean_html that:

  "It does not make sense to pass in both allow_tags and remove_unknown_tags")

It's a bit surprising because it's not like *I* set remove_unknown_tags to True, it's the default, so whenever you add the allow_tags parameter, you have to also add the remove_unknown_tags=False parameter, and you only learn this *after* you initialize the class, when you call clean_html.

Does it makes sense to put this in the __init__() method also/instead, maybe as an assertion? Or if you want to go wild, maybe when allow_tags is there, it takes precedence over a default remove_unknown_tags parameter? Not sure there. It'd be nice if this didn't crash:

 Cleaner(allow_tags=something).clean_html(some_html)

I also hate to ask because you've been so helpful fixing this, but any idea when 4.5.2 will be released to PyPi? I'm trying to decide if I should wait for the release or use the workaround (assuming it can be made to work).

Thank you again,

Mike