Thanks for that fix. A couple responses. First, it seems like your workaround doesn't work either:
In [14]: cleaner = Cleaner(
safe_attrs_only=False,
comments=False,
processing_instructions=False,
remove_unknown_tags=False,
allow_tags= set(lxml.html.defs.tags) | {lxml.etree.Comment},
)
In [15]: cleaner.clean_html('<!-- comment --><a href="asdf">test</a>')
Out[15]: '<a href="asdf">test</a>' # <-- Should have kept the comment, right?
Also, and sort of relatedly, at first I didn't include the remove_unknown_tags attribute above, and it complained when I ran clean_html that:
"It does not make sense to pass in both allow_tags and remove_unknown_tags")
It's a bit surprising because it's not like *I* set remove_unknown_tags to True, it's the default, so whenever you add the allow_tags parameter, you have to also add the remove_unknown_tags=False parameter, and you only learn this *after* you initialize the class, when you call clean_html.
Does it makes sense to put this in the __init__() method also/instead, maybe as an assertion? Or if you want to go wild, maybe when allow_tags is there, it takes precedence over a default remove_unknown_tags parameter? Not sure there. It'd be nice if this didn't crash:
I also hate to ask because you've been so helpful fixing this, but any idea when 4.5.2 will be released to PyPi? I'm trying to decide if I should wait for the release or use the workaround (assuming it can be made to work).
Thanks for that fix. A couple responses. First, it seems like your workaround doesn't work either:
In [14]: cleaner = Cleaner( attrs_only= False, False, g_instructions= False, unknown_ tags=False, html.defs. tags) | {lxml.etree. Comment} , clean_html( '<!-- comment --><a href="asdf" >test</ a>') >test</ a>' # <-- Should have kept the comment, right?
safe_
comments=
processin
remove_
allow_tags= set(lxml.
)
In [15]: cleaner.
Out[15]: '<a href="asdf"
Also, and sort of relatedly, at first I didn't include the remove_unknown_tags attribute above, and it complained when I ran clean_html that:
"It does not make sense to pass in both allow_tags and remove_ unknown_ tags")
It's a bit surprising because it's not like *I* set remove_unknown_tags to True, it's the default, so whenever you add the allow_tags parameter, you have to also add the remove_ unknown_ tags=False parameter, and you only learn this *after* you initialize the class, when you call clean_html.
Does it makes sense to put this in the __init__() method also/instead, maybe as an assertion? Or if you want to go wild, maybe when allow_tags is there, it takes precedence over a default remove_unknown_tags parameter? Not sure there. It'd be nice if this didn't crash:
Cleaner( allow_tags= something) .clean_ html(some_ html)
I also hate to ask because you've been so helpful fixing this, but any idea when 4.5.2 will be released to PyPi? I'm trying to decide if I should wait for the release or use the workaround (assuming it can be made to work).
Thank you again,
Mike