lxml

Bug #1882606
Comment #3

mlissner (mlissner-michaeljaylissner) wrote on 2020-06-13:

Hm, I'd expect that setting the default parser would fix this then?

I just tried this, but didn't get the fix I was hoping for:

from lxml import etree
from lxml.html.clean import Cleaner

etree.set_default_parser(etree.HTMLParser())
html = '<a href="asdf">test</a>'
Cleaner(
  javascript=False,
  safe_attrs_only=False,
  scripts=False,
  comments=False,
  style=False,
  inline_style=False,
  links=False,
  meta=False,
  page_structure=False,
  processing_instructions=False,
  embedded=False,
  frames=False,
  forms=False,
  annoying_tags=False,
  remove_unknown_tags=False
).clean_html(html)

'<a href="asdf">test</a>'  # The comment is still stripped

Am I missing something here?

> Note that passing a parsed tree into the cleaner does not suffer from this issue.

I don't find that to be true either, but perhaps I'm misunderstanding. I have this code:

import lxml.html
from lxml.html.clean import Cleaner

def clean_a_tree(tree):
    assert isinstance(tree, lxml.html.HtmlElement), (
        "`tree` must be of type HtmlElement, but is of type %s. Cleaner() can "
        "work with strs and unicode, but it does bad things to encodings if "
        "given the chance."
        % type(tree)
    )
    cleaner = Cleaner(
        javascript=False,
        safe_attrs_only=False,
        forms=False,
        comments=False,
        processing_instructions=False,
        scripts=True,
        style=True,
        links=True,
        embedded=True,
        frames=True,
    )
    return cleaner.clean_html(tree)

So it asserts that it's getting a tree, but this still suffers from the issue.
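
To make this easier to reproduce, here is a minimal sketch of how the comment count could be compared before and after cleaning, both for a parsed tree and for a plain string. The `count_comments` helper and the sample markup are made up for illustration and are not part of lxml. The fallback import assumes that newer lxml releases ship the cleaner as the separate `lxml_html_clean` package. I'm deliberately not stating what it prints, since that may depend on the lxml version:

```python
import lxml.html
from lxml import etree

try:
    # Assumption: newer lxml releases provide the cleaner as a separate package.
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner


def count_comments(tree):
    """Count the comment nodes in a parsed tree (hypothetical helper)."""
    return sum(1 for _ in tree.iter(etree.Comment))


# Made-up sample input with a single comment inside the element.
SAMPLE = '<div><a href="asdf">test</a><!-- keep me --></div>'

# Ask the cleaner to leave comments and processing instructions alone;
# every other option keeps its default value.
cleaner = Cleaner(comments=False, processing_instructions=False)

# Case 1: pass an already parsed tree.
tree = lxml.html.fromstring(SAMPLE)
before = count_comments(tree)
after = count_comments(cleaner.clean_html(tree))
print("comments before cleaning:", before)
print("comments after cleaning a parsed tree:", after)

# Case 2: pass the raw string and let the cleaner parse it.
cleaned_string = cleaner.clean_html(SAMPLE)
print("comment marker in cleaned string:", "<!--" in cleaned_string)
```

If the two cases disagree, that would point at the string parsing step rather than at the cleaner options.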

Thanks for the help. I'm pretty lost and I admit I'm frustrated with this issue. I'm guessing it's just a documentation issue, but I haven't been able to sort it out yet.

Thank you again,

Mike