Cleaner() removes comments no matter what
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Fix Released | Low | Unassigned |
Bug Description
Cleaner() has an argument `comments`, which should control whether comments are stripped from HTML trees. In version 4.5.1, the argument doesn't seem to work. No matter if it's True or False, comments are always stripped:
Cleaner(
'<a href="/asdf" onclick=
For good measure, I tried flipping all the possible attributes on Cleaner() to False. Same outcome:
In [36]: Cleaner(
Out[36]: '<a href="/asdf" onclick=
I'm a bit surprised by this, since I'd expect a unit test to catch it, but perhaps there's something I don't understand about how Cleaner() or lxml works.
Thanks for the amazing package. I've been using it for years and it's really quite impressive.
Please let me know if there is anything else I can provide. Here's my version info:
```
In [39]: print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_
In [40]: print("%-20s: %s" % ('lxml.etree', etree.LXML_
lxml.etree : (4, 5, 1, 0)
In [41]: print("%-20s: %s" % ('libxml used', etree.LIBXML_
libxml used : (2, 9, 10)
In [42]: print("%-20s: %s" % ('libxml compiled', etree.LIBXML_
libxml compiled : (2, 9, 10)
In [43]: print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_
libxslt used : (1, 1, 34)
In [44]: print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_
libxslt compiled : (1, 1, 34)
```
Changed in lxml:
status: Fix Committed → Fix Released
Cleaner.clean_html() uses the default parser internally, which strips comments and PIs before even passing the result through the cleaner. I admit that this may appear counter-intuitive. PR welcome to configure the parser based on the relevant Cleaner options.
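
For anyone following along, here is a minimal sketch of the parser-side control the comment above refers to. It only shows that lxml's HTML parser accepts `remove_comments` and `remove_pis` options and that they decide whether comments survive parsing. It does not involve Cleaner and is not the change that was eventually released. `SAMPLE`, `parse_fragment` and `serialize` are made-up names for illustration.

```python
import lxml.html

# Hypothetical sample input, not the (truncated) markup from the report.
SAMPLE = '<div><!-- keep me --><p>text</p></div>'


def parse_fragment(html, keep_comments):
    """Parse HTML with a parser that keeps or drops comments and PIs.

    lxml.html.HTMLParser forwards keyword options such as
    remove_comments and remove_pis to etree.HTMLParser.
    """
    parser = lxml.html.HTMLParser(
        remove_comments=not keep_comments,
        remove_pis=not keep_comments,
    )
    return lxml.html.fromstring(html, parser=parser)


def serialize(element):
    """Serialize an element back to a text string."""
    return lxml.html.tostring(element, encoding='unicode')


kept = serialize(parse_fragment(SAMPLE, keep_comments=True))
dropped = serialize(parse_fragment(SAMPLE, keep_comments=False))

print(kept)
print(dropped)
```

A tree parsed this way can be handed to a Cleaner instance directly instead of going through clean_html() with a string, so the parser choice stays in the caller's hands. Whether the comments then survive the cleaning step depends on the Cleaner options and the lxml version in use.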