Cleaning an HTML with more than 254 depth levels

Bug #1903325 reported by Yajo
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
lxml | Triaged | Undecided | Unassigned |

Bug Description

I use Odoo, which uses lxml's HTML cleaner [1] to clean up untrusted incoming HTML emails [2].

A recent email contained simple, innocent markup wrapped in 657 levels of nested <span> elements. I have no idea how the sender ended up producing that many nested <span>s, but there they were.

After sanitizing, the mail was wrongly empty (what remained was just 254 empty <span> elements). This similar question [3] gives easy steps to reproduce the issue, which seems to lie deeper in the stack than clean_html() itself.

Is there any way to raise that limit so that documents nested more than 254 levels deep can be processed?

[1]: https://lxml.de/lxmlhtml.html#cleaning-up-html
[2]: https://github.com/odoo/odoo/blob/a1dc9c6b753f65884ec6d2b501424372b37cbce2/odoo/tools/mail.py#L237
[3]: https://stackoverflow.com/q/51034706/1468388

scoder (scoder) wrote :

This might be due to the safety limits that libxml2's default parser applies in order to defeat DoS attacks with large document content. You could try creating your own self-configured "lxml.html.HTMLParser" for parsing the document that has the "huge_tree=True" option set.

Obviously, disabling the parser limitations opens up your code to DoS attacks, but it's worth a try to see if that's the issue here.
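
The suggestion above can be tried with a short script. This is only a sketch: the 657-level input is synthetic (modelled on the email described in the report), `tree_depth` is a hypothetical helper, and the depth each parser actually reaches depends on the installed lxml and libxml2 versions, so no output is shown.

```python
import lxml.html


def tree_depth(root):
    """Return the deepest element nesting at or below root (iterative, no recursion)."""
    deepest = 0
    stack = [(root, 1)]
    while stack:
        element, depth = stack.pop()
        deepest = max(deepest, depth)
        # Iterating an lxml element yields its direct children.
        stack.extend((child, depth + 1) for child in element)
    return deepest


# Synthetic input: 657 nested <span> elements around a short text.
LEVELS = 657
deep_html = "<span>" * LEVELS + "hello" + "</span>" * LEVELS

# Default parser: libxml2's safety limits apply.
default_root = lxml.html.fromstring(deep_html)

# Self-configured parser with the limits disabled.
# Only do this for input you are prepared to parse without DoS protection.
huge_parser = lxml.html.HTMLParser(huge_tree=True)
huge_root = lxml.html.fromstring(deep_html, parser=huge_parser)

print("default parser depth:  ", tree_depth(default_root))
print("huge_tree parser depth:", tree_depth(huge_root))
```

If the second number is larger than the first, the parser limit is what truncates the document.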

Changed in lxml:
status: New → Triaged
Yajo (yajo) wrote :

It seems that's the issue, yes.

Is there no way to specify a custom protection level? I understand 254 is enough for well-formed documents, but when parsing, e.g., emails, which can come from anywhere and be formatted in any way, the protection is still necessary yet needs to be a little more permissive. According to my tests, lxml is able to parse 15000 nested nodes in 1 second, which should be a reasonable cap for that kind of use.
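
The thread does not name a configurable depth setting, so one possible workaround is to enforce a custom cap before the document ever reaches the unrestricted parser. The sketch below uses only the standard library's `html.parser`; `max_nesting_depth`, `is_within_depth_cap` and the default cap of 15000 (taken from the timing above) are hypothetical names and values, not part of lxml.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not add nesting depth.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


class _DepthCounter(HTMLParser):
    """Track the current and maximum number of open (non-void) elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        # Stray end tags must not push the counter below zero.
        self.depth = max(0, self.depth - 1)


def max_nesting_depth(html_text):
    """Estimate the deepest tag nesting in html_text without building a tree."""
    counter = _DepthCounter()
    counter.feed(html_text)
    counter.close()
    return counter.max_depth


def is_within_depth_cap(html_text, cap=15000):
    """Return True if the estimated nesting depth does not exceed cap."""
    return max_nesting_depth(html_text) <= cap
```

Unclosed tags are counted as still open, so the estimate errs on the high side, which is the safe direction for a limit. A caller could parse with a `huge_tree=True` parser only when `is_within_depth_cap()` returns True and reject the message otherwise.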
