Move lxml.html.clean into external project
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Fix Released | Medium | scoder |
lxml (Ubuntu) | Fix Released | Undecided | Unassigned |
Bug Description
Hi,
Recently at Red Hat, we had to fix (by backporting the changes) multiple lxml clean_html() security issues in the lxml versions that we maintain in Fedora and RHEL. It's a "whack-a-mole" game because the implementation is based on a block list.
Would it be possible to deprecate, or even consider removing, the clean_html() function and suggest that developers use the bleach project instead? The bleach project is based on an allow list and so is safer.
Bleach project: https:/
"Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes"
Bleach seems quite popular: https:/
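To illustrate why an allow list is the safer design, here is a minimal sketch that uses only the Python standard library. It is not how bleach or lxml is implemented; the class name, the `sanitize()` helper, and the tag and attribute sets are all hypothetical choices made for this example. Anything not explicitly listed is dropped, so a newly discovered dangerous tag or attribute is rejected by default instead of needing a new block-list entry.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow list for illustration; not bleach's or lxml's defaults.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")
# Elements whose text content is discarded together with the tag.
DROP_CONTENT = {"script", "style"}


class AllowListSanitizer(HTMLParser):
    """Re-emit only allowed tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag itself, keep its text
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            # Only keep links whose scheme is explicitly allowed.
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self._skip_depth:
                self._skip_depth -= 1
            return
        if not self._skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

This toy does not balance tags or handle every parsing edge case, so it is not a substitute for a maintained sanitizer; it only shows the default-deny behaviour that a block list lacks.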
--
In the last 15 months, 3 vulnerabilities have been found in the lxml clean_html() function:
* 2021-12-12, CVE-2021-43818 (SVG):
https:/
* 2021-03-21, CVE-2021-28957 (HTML action attribute):
https:/
* 2020-11-27, CVE-2020-27783 (lxml 4.6.2):
https:/
--
I ran a code search on the top 5000 PyPI projects (as of 2021-12-01).
I found the following 10 projects that use the lxml clean_html() method:
* requests-lxml: find() and xpath() use lxml clean_html() if their clean parameter is true (default: clean=False)
* html-telegraph-
* newspaper3k: OutputFormatter
* readability-lxml: Document._parse() uses lxml clean_html()
* jusText: jusText.
* htmldate: htmldate.
* trafilatura: tree_cleaning() uses lxml clean_html()
* html_text: _cleaned_
* item: HTMLField uses lxml clean_html()
* extruct: LxmlMicrodataEx
The "clean_html" code search also found projects which don't use lxml to clean HTML:
* nltk.util.
* textblock.
* django.
* The django-
* yt_dlp.
* recommender-xblock uses bleach.clean()
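The report does not say how the code search was done. As a rough sketch of one possible approach, the hypothetical helper below (`find_clean_html_uses()` is a name invented here) uses Python's `ast` module to list the lines where a `clean_html` name is imported, called, or accessed as an attribute. Like the search above, it matches on the name only, so it also reports unrelated functions that happen to be called `clean_html`; those hits still need manual review.

```python
import ast


def find_clean_html_uses(source, name="clean_html"):
    """Return sorted line numbers where `name` is imported or referenced."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id == name:
            hits.add(node.lineno)  # bare reference, e.g. clean_html(doc)
        elif isinstance(node, ast.Attribute) and node.attr == name:
            hits.add(node.lineno)  # attribute access, e.g. clean.clean_html(doc)
        elif isinstance(node, ast.ImportFrom):
            if any(alias.name == name for alias in node.names):
                hits.add(node.lineno)  # from some.module import clean_html
    return sorted(hits)
```

Confirming that a hit really refers to lxml requires checking which module the import comes from.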
I feel your pain. I'd happily deprecate the HTML cleaner and send current users to … something else.
An obvious issue with bleach is that it uses html5lib, a different parser that is known to be quite slow. There is support for exchanging data between html5lib and lxml, but that's not very efficient. Apart from that, users should go for whatever they want, especially if cleaning HTML in a secure way is something they need.
I think adding a note to the docs and eventually issuing a deprecation warning is ok. But there should be a reasonable migration path. Telling users to "go and figure out a way to rewrite your code somehow" isn't really cool.
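For reference, a deprecation warning that also points at a migration path could look roughly like the sketch below. This is not lxml's actual code: the `deprecated()` decorator, the `legacy_clean_html()` stand-in, and the message text are all hypothetical. It relies only on the standard `warnings` module.

```python
import functools
import warnings


def deprecated(message):
    """Decorator that emits a DeprecationWarning on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line.
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated(
    "legacy_clean_html() is based on a block list; "
    "consider an allow-list-based sanitizer instead."
)
def legacy_clean_html(html):
    # Placeholder body standing in for the real cleaner.
    return html
```

Python hides most DeprecationWarnings by default outside of `__main__` and test runners, so a note in the documentation is still needed alongside the warning.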
I added a note to the docs (of a future release) for now.
https://github.com/lxml/lxml/commit/ac829d561c0bf71fb8cc704305ffc18bd26c6abb