Comment 0 for bug 1958539

Revision history for this message
Victor Stinner (vstinner) wrote : Consider deprecating/removing clean_html() in favor of bleach?

Hi,

Recently at Red Hat, we had to fix (backport the changes for) multiple lxml clean_html() security issues in the lxml versions that we are maintaining in Fedora and RHEL. It's a "whack-a-mole" game since the implementation is based on a block list.

Would it be possible to deprecate, or even consider removing, the clean_html() function and suggest that developers use the bleach project instead? The bleach project is based on an allow list and so is safer.

Bleach project: https://github.com/mozilla/bleach

"Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes"

Bleach seems quite popular: https://libraries.io/pypi/bleach says that 11.7K repositories and 586 packages depend on it.
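
For comparison, here is a minimal allow-list sketch with bleach (the sample HTML and the tag/attribute lists are only illustrative):

    import bleach

    dirty = '<p onclick="steal()">Hi <script>alert(1)</script><a href="https://example.com/">link</a></p>'

    # Only the listed tags and attributes survive; anything not on the
    # allow list is escaped or stripped, so new dangerous markup does not
    # need to be anticipated in advance.
    print(bleach.clean(
        dirty,
        tags=["p", "a", "em", "strong"],
        attributes={"a": ["href", "title"]},
        strip=True,
    ))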

--

In the last 15 months, 3 vulnerabilities have been found in the lxml clean_html() function:

* 2021-12-12, CVE-2021-43818 (SVG):
  https://github.com/lxml/lxml/security/advisories/GHSA-55x5-fj6c-h6m8
* 2021-03-21, CVE-2021-28957 (HTML action attribute):
  https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28957
* 2020-11-27, CVE-2020-27783 (lxml 4.6.2):
  https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27783

--

I ran a code search on the PyPI top 5000 projects (as of 2021-12-01).

I found the following 10 projects which use the lxml clean_html() function (a minimal usage sketch follows the list):

* requests-lxml: find() and xpath() use lxml clean_html() if their clean parameter is true (default: clean=False)
* html-telegraph-poster: html_telegraph_poster.converter.clean_article_html() uses lxml clean_html()
* newspaper3k: OutputFormatter.convert_to_html() always calls Parser.clean_article_html() which uses lxml clean_html()
* readability-lxml: Document._parse() uses lxml clean_html()
* jusText: jusText.core.preprocessor() uses lxml clean_html()
* htmldate: htmldate.core.find_date() uses lxml clean_html() with the comment "# clean before string search".
* trafilatura: tree_cleaning() uses lxml clean_html()
* html_text: _cleaned_html_tree() uses lxml clean_html(); that function is called by cleaned_selector() and extract_text()
* item: HTMLField uses lxml clean_html()
* extruct: LxmlMicrodataExtractor._extract_textContent() uses lxml clean_html()
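
For context, the usage pattern in those projects is essentially a call to the module-level helper, or to a Cleaner instance whose options all name things to remove (a minimal sketch, assuming an lxml build that still ships lxml.html.clean; the sample HTML is illustrative):

    from lxml.html.clean import Cleaner, clean_html

    html = '<p onclick="x()">Hello <script>alert(1)</script></p>'

    # Module-level helper using the default block list.
    print(clean_html(html))

    # Or a configured Cleaner; note that every option names markup to
    # remove, which is why new attack vectors keep slipping through.
    cleaner = Cleaner(scripts=True, javascript=True, style=True, forms=True)
    print(cleaner.clean_html(html))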

The "clean_html" code search also found projects which don't use lxml to clean HTML:

* nltk.util.clean_html() raises NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
* the clean_html parameter of textblob.blob.BaseBlob (default: clean_html=False) raises an exception if set to true: NotImplementedError("clean_html has been deprecated. To remove HTML markup, use BeautifulSoup's get_text() function") (see the get_text() sketch after this list)
* the undocumented django.utils.html.clean_html() function was removed in Django 1.8. See https://docs.djangoproject.com/en/dev/releases/1.7/ for details (it announces the deprecation).
* The django-html_sanitizer project is based on bleach.
* yt_dlp.utils.clean_html() uses 3 regex replacements and calls its unescapeHTML() function to replace HTML entities using a 4th regex
* recommender-xblock uses bleach.clean()
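
For completeness, the BeautifulSoup get_text() alternative that the nltk and textblob error messages point to only extracts plain text, it does not produce sanitized HTML (a minimal sketch; the sample HTML is illustrative):

    from bs4 import BeautifulSoup

    html = '<p>Hello <b>world</b>!</p>'

    # get_text() drops all markup and returns only the text content,
    # so it is a text extractor rather than an HTML sanitizer.
    print(BeautifulSoup(html, "html.parser").get_text())
    # Hello world!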