Move lxml.html.clean into external project

Bug #1958539 reported by Victor Stinner
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
lxml
Fix Released
Medium
scoder
lxml (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

Hi,

Recently at Red Hat, we had to fix (backport the changes for) multiple lxml clean_html() security issues in the lxml versions that we are maintaing in Fedora and RHEL. It's a "whack-a-mole" game since the implementation is based on a block list.

Would it be possible to deprecate, or even consider removing, the clean_html() function and suggest developers to use the bleach project instead? The bleach project is based on an allow list and so is safer.

Bleach project: https://github.com/mozilla/bleach

"Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes"

Bleach seems quite popular: https://libraries.io/pypi/bleach says 11.7K repositories depend on it and 586 packages depend on it.

--

In the last 15 months, 3 vulnerabilities have been found in the lxml clean_html() function:

* 2021-12-12, CVE-2021-43818 (SVG):
  https://github.com/lxml/lxml/security/advisories/GHSA-55x5-fj6c-h6m8
* 2021-03-21, CVE-2021-28957 (HTML action attribute):
  https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28957
* 2020-11-27, CVE-2020-27783 (lxml 4.6.2):
  https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27783

--

I ran a code search on PyPI top 5000 projects (at 2021-12-01).

I found the following 10 projects which uses the lxml clean_html() method:

* requests-lxml: find() and xpath() use lxml clean_html() if their clean parameter is true (default: clean=False)
* html-telegraph-poster: html_telegraph_poster.converter.clean_article_html() uses lxml clean_html()
* newspaper3k: OutputFormatter.convert_to_html() always calls Parser.clean_article_html() which uses lxml clean_html()
* readability-lxml: Document._parse() uses lxml clean_html()
* jusText: jusText.core.preprocessor() uses lxml clean_html()
* htmldate: htmldate.core.find_date() uses lxml clean_html() with the comment "# clean before string search".
* trafilatura: tree_cleaning() uses lxml clean_html()
* html_text: _cleaned_html_tree() uses lxml clean_html(), function called by cleaned_selector() and extract_text()
* item: HTMLField uses lxml clean_html()
* extruct: LxmlMicrodataExtractor._extract_textContent() uses lxml clean_html()

The "clean_html" code search also found projects which don't use lxml to clean HTML:

* nltk.util.clean_html() raises NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
* textblock.blob.BaseBlob(clean_html=False) parameters raises an exception if it's true: NotImplementedError("clean_html has been deprecated. To remove HTML markup, use BeautifulSoup's get_text() function")
* django.utils.html.clean_html() undocumented function was removed in Django 1.8. See https://docs.djangoproject.com/en/dev/releases/1.7/ for details (it announces the deprecation).
* The django-html_sanitizer project is based on bleach.
* yt_dlp.utils.clean_html() uses 3 regex replacements and calls its unescapeHTML() function to replace HTML entities using a 4th regex
* recommender-xblock uses bleach.clean()

description: updated
description: updated
Revision history for this message
scoder (scoder) wrote :

I feel your pain. I'd happily deprecate the HTML cleaner and send current users to … something else.

An obvious issue with bleach is that it uses html5lib, a different parser that is known to be quite slow. There is support for exchanging data between html5lib and lxml, but that's not very efficient. Apart from that, users should go for whatever they want. Definitely if cleaning HTML in a secure way is something they need.

I think adding a note to the docs and eventually issuing a deprecation warning is ok. But there should be a reasonable migration path. Telling users to "go and figure out a way to rewrite your code somehow" isn't really cool.

I added a note to the docs (of a future release) for now.

https://github.com/lxml/lxml/commit/ac829d561c0bf71fb8cc704305ffc18bd26c6abb

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Tomáš Hrnčiar (thrnciar) wrote :

Hello,

I am Victor's colleague at Red Hat. We recently discussed this issue and would like to separate clean_html sources into a standalone package as an optional dependency. This way we could have lxml in RHEL without the threat of CVEs coming from clean_html.
Would you be comfortable with this solution? If so, what would you suggest to call the new package like - lxml-clean-html, clean-html or something else?

Revision history for this message
scoder (scoder) wrote :

There should probably be "lxml" in the name, given that it tightly depends on it. Maybe "lxml-html-clean", to also mention the "lxml.html" package that owns it?

Revision history for this message
Cory Gwin (gwincr11) wrote :

We have done some performance testing between bleach and lxml, it is worth noting that bleach is significantly slower than lxml in our experience increasing page load time by 300-400 milliseconds. See https://github.com/jupyter/nbconvert/issues/1892?notification_referrer_id=NT_kwDOAARsWrE0NzAwNjE2MjEwOjI4OTg4Mg&notifications_query=repo%3Ajupyter%2Fnbconvert#issuecomment-1294475221

Revision history for this message
Victor Stinner (vstinner) wrote :

> bleach is significantly slower than lxml

Well, make your choice between correctness (security) and performance :-D Today, I'm not aware of any open vulnerability in lxml, they are fixed quickly. But by design, bleach looks safer.

Revision history for this message
Cory Gwin (gwincr11) wrote :

We have actually implemented 2 layers at different levels because 2 layers at different levels was faster than bleach and also safer than just lxml :) which is silly.

Revision history for this message
Michal Čihař (nijel) wrote :

…and bleach is now deprecated as well, see https://github.com/mozilla/bleach/issues/698

Revision history for this message
frenzy (frenzy-madness) wrote :
Download full text (3.1 KiB)

I'm looking into this now and have some more info.

Bleach might be deprecated but developers promised that it will still receive security fixes and will stay compatible with new Pythons so it should still be safe to use.

Another alternative I see among projects moving away from bleach/lxml is nh3, a Python binding for ammonia project written in Rust. Ammonia seems to be security-focused, whitelist-based, and fast. nh3 uses the latest pyo3 version (compatible with Python 3.12) and provides wheels built for stable ABI so it should work for everybody. It also seems straightforward to port a code from bleach to nh3 and nh3 is also much faster.

https://github.com/rust-ammonia/ammonia
https://github.com/messense/nh3

I've also looked at projects using clean_html or Cleaner from lxml. I've used the top 5000 PyPI projects, grep.app (https://grep.app/) and looked also into sources of RPM packages in Fedora Linux which depend on python3-lxml package. I've omitted projects already mentioned above.

In Fedora Linux, I found only two occurrences:

* python-readability-lxml
* calibre (this package bundles its own version of readability)

in the top 5000 projects I've found:

* requests-html: https://github.com/kennethreitz/requests-html/blob/master/requests_html.py#L30

And via grep.app, where support for regexes is limited:

* https://github.com/ysim/songtext/blob/master/libsongtext/lyricwiki.py
* https://github.com/lorien/weblib/blob/master/weblib/feed.py
* https://github.com/ColdHeat/pybluemonday/blob/master/benchmarks.py (just some benchmark)
* https://github.com/nopper/twittomatic/blob/master/helpers/hadoop/wikipedia/words/cleanup.py
* https://github.com/python-gsoc/python-blogs/blob/master/aldryn_newsblog/utils/utilities.py
* https://github.com/Linbreux/wikmd/blob/main/wiki.py
* https://github.com/divio/aldryn-search/blob/master/aldryn_search/utils.py
* https://github.com/PacktPublishing/PythonDataAnalysisCookbook/blob/master/Chapter%205/processing_html.py
* https://github.com/divio/aldryn-newsblog/blob/master/aldryn_newsblog/utils/utilities.py (already archived)
* https://github.com/DMOJ/online-judge/blob/master/judge/migrations/0091_compiler_message_ansi2html.py
* https://github.com/janeczku/calibre-web/blob/master/cps/editbooks.py (handles ImportError)
* https://github.com/neuml/paperai/blob/master/examples/search.py
* https://github.com/khamidou/kite/blob/master/src/back/kite/maildir.py (already archived)
* https://github.com/NikolaiT/GoogleScraper/blob/master/GoogleScraper/caching.py
* https://github.com/kootenpv/sky/blob/master/sky/standalone/monitorPage.py
* https://github.com/anyant/rssant/blob/master/rssant_feedlib/processor.py

As next steps, we can:

* Add nh3 as an alternative to the documentation and deprecation warning to Cleaner and clean_html.
* Open issues for identified projects. (I can do that)
* Finish the lxml-html-clean package (https://github.com/hrnciar/lxml-html-clean) (We can do it together with Tomáš)
* Add the new package as an extra dependency for lxml so one can install `lxml[clean_html]` and get both lxml and lxml-clean-html installed.
* Remove the deprecated part and make a new release.

What do you think about this...

Read more...

Revision history for this message
frenzy (frenzy-madness) wrote :

I've prepared this message and I'll start posting it to projects we've found affected in the previous stages.

Subject: Consider switching from lxml's clean_html for enhanced security (and possibly performance)

I'd like to bring to your attention that we are discussing a possibility to remove lxml's clean_html functionality from lxml library. Over the past years, there have been several concerning security vulnerabilities discovered within the lxml library's clean_html functionality – [CVE-2021-43818](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-43818), [CVE-2021-28957](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28957), [CVE-2020-27783](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27783), [CVE-2018-19787](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-19787) and [CVE-2014-3146](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-3146).

The main problem is in the design. Because the lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with all new possibilities in HTML and JS.

Two viable alternatives worth considering are `bleach` and `nh3`. Here's why:

bleach:
* Bleach is a widely adopted Python library specifically designed for sanitizing and cleaning HTML input.
* It has a strong track record in terms of security – it's allowed-list-based.
* It was deprecated in January but it will still receive security updates, support for new Pythons and bugfixes, see [upstream issue](https://github.com/mozilla/bleach/issues/698#issue-1553425406).

nh3:
* nh3 is Python binding for the ammonia library. Ammonia is written in Rust and it's also allowed-list-based.
* Thanks to the Rust backend, nh3 is also significantly faster than bleach.
* Rust backend is nothing to be afraid of. nh3 uses the latest PyO3 compatible with Python 3.12 and provides wheels built on top of compatible ABI for different architectures and platforms.

We'll probably move the cleaning part of the lxml to a distinct project first so it will still be possible to use it but better is to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition anyhow and have a nice day.

---

Do you think there is anything to add or adjust?

Lumír

Revision history for this message
scoder (scoder) wrote :

Reads very good to me. Happy to see this message spread.

Seems worth adding a link back to this ticket, though, so that others can read up on the discussion and comment if needed.

Revision history for this message
frenzy (frenzy-madness) wrote :

Today, I've gone through all the projects we discovered during the research phases and open issues for them. I also added the link to this bug to the previously proposed message. Before opening an issue, I've verified once more that the projects really use the functionality we are gonna (re)move.

This is a log of open issues and problems I've found during the process:

* requests-lxml does not exists. I've found the described functionality in requests-html so it's probably just a typo here in the first comment.
* https://github.com/mercuree/html-telegraph-poster/issues/23
* https://github.com/codelucas/newspaper/issues/972
* https://github.com/buriy/python-readability/issues/179
* https://github.com/miso-belica/jusText/issues/46
* https://github.com/adbar/htmldate/issues/91
* https://github.com/adbar/trafilatura/issues/412
* https://github.com/TeamHG-Memex/html-text/issues/30
* https://pypi.org/project/item/ - I cannot find sources of this project
* https://github.com/scrapinghub/extruct/issues/209
* https://pypi.org/project/requests-html/ points to https://github.com/kennethreitz/requests-html which is a fork (13 commits behind) of https://github.com/psf/requests-html: https://github.com/psf/requests-html/issues/558
* https://github.com/ysim/songtext/issues/50
* https://github.com/lorien/weblib/ does not exist anymore, the same problem as with `item` project above by the same author
* https://github.com/ColdHeat/pybluemonday/issues/44
* https://github.com/nopper/twittomatic/issues/13
* https://github.com/python-gsoc/python-blogs/issues/538
* https://github.com/Linbreux/wikmd/issues/125
* https://github.com/divio/aldryn-search/issues/115
* https://github.com/PacktPublishing/PythonDataAnalysisCookbook/issues/6
* https://github.com/DMOJ/online-judge/issues/2284
* https://github.com/janeczku/calibre-web/issues/2874
* https://github.com/neuml/paperai/issues/69
* https://github.com/NikolaiT/GoogleScraper/issues/247
* https://github.com/kootenpv/sky/issues/18
* https://github.com/anyant/rssant/issues/139

Revision history for this message
frenzy (frenzy-madness) wrote :

I already have some responses and their sentiment is overall positive. Responders either consider the proposed alternatives or they don't need the output from clean_html to be secure. For the second group, we should make sure that the cleaner will be useful for them with minimal changes to their codebases.

Also, another possible alternative is pybluemonday which uses lxml only in benchmarks and might be another replacement.

Revision history for this message
frenzy (frenzy-madness) wrote :

I've prepared a PR for documentation: https://github.com/lxml/lxml/pull/384 and also a new dedicated project lxml_html_clean for the functionalities from lxml.html.clean: https://github.com/fedora-python/lxml_html_clean

It would be great if you could check them both. I'd then publish the new package to PyPI and we can start preparing lxml for removal for those parts.

Revision history for this message
frenzy (frenzy-madness) wrote :

It took me a little bit longer than expected but I finished the new project lxml_html_clean.

Github: https://github.com/fedora-python/lxml_html_clean
PyPI: https://pypi.org/project/lxml-html-clean/
Documentation: https://lxml-html-clean.readthedocs.io/en/latest/#

It was a little bit tricky but I managed to convince git filter-branch to preserve the whole history of relevant files clean.py and its tests.

The next step is to prepare a PR to remove that part from lxml. I plan to work on that ASAP.

Revision history for this message
scoder (scoder) wrote :

Thanks a lot for working on this. I'll target this to lxml 5.2.

Changed in lxml:
milestone: none → 5.2
Revision history for this message
scoder (scoder) wrote :

The module has been extracted into a separate library and will be removed from lxml 5.2.

See https://github.com/fedora-python/lxml_html_clean

Changed in lxml:
assignee: nobody → scoder (scoder)
status: Confirmed → Fix Committed
Revision history for this message
scoder (scoder) wrote :

Implemented and released in lxml 5.2.

Thanks for taking over the maintenance of the package, it'll have a better life outside of lxml.
That will make it clearer that it's just one choice amongst several, and one that is not meant for security relevant use cases.

summary: - Consider deprecating/removing clean_html() in favor of bleach?
+ Move lxml.html.clean into external project
Changed in lxml:
status: Fix Committed → Fix Released
Revision history for this message
frenzy (frenzy-madness) wrote :

I've posted an update about the newest lxml release to all open issues.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxml - 5.2.1-1

---------------
lxml (5.2.1-1) unstable; urgency=medium

  * New upstream version.

 -- Matthias Klose <email address hidden> Wed, 03 Apr 2024 22:07:13 +0200

Changed in lxml (Ubuntu):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers