Bug #1958539 “Move lxml.html.clean into external project” : Bugs : lxml

Victor Stinner (vstinner) on 2022-01-20

description:	updated
description:	updated

Revision history for this message

scoder (scoder) wrote on 2022-01-21:

#1

I feel your pain. I'd happily deprecate the HTML cleaner and send current users to … something else.

An obvious issue with bleach is that it uses html5lib, a different parser that is known to be quite slow. There is support for exchanging data between html5lib and lxml, but that's not very efficient. Apart from that, users should go for whatever they want. Definitely if cleaning HTML in a secure way is something they need.

I think adding a note to the docs and eventually issuing a deprecation warning is ok. But there should be a reasonable migration path. Telling users to "go and figure out a way to rewrite your code somehow" isn't really cool.

I added a note to the docs (of a future release) for now.

https://github.com/lxml/lxml/commit/ac829d561c0bf71fb8cc704305ffc18bd26c6abb

Changed in lxml:
importance:	Undecided → Medium
status:	New → Confirmed

Revision history for this message

Tomáš Hrnčiar (thrnciar) wrote on 2022-09-08:

#2

Hello,

I am Victor's colleague at Red Hat. We recently discussed this issue and would like to separate clean_html sources into a standalone package as an optional dependency. This way we could have lxml in RHEL without the threat of CVEs coming from clean_html.
Would you be comfortable with this solution? If so, what would you suggest to call the new package like - lxml-clean-html, clean-html or something else?

Revision history for this message

scoder (scoder) wrote on 2022-09-12:

#3

There should probably be "lxml" in the name, given that it tightly depends on it. Maybe "lxml-html-clean", to also mention the "lxml.html" package that owns it?

Revision history for this message

Cory Gwin (gwincr11) wrote on 2022-10-28:

#4

We have done some performance testing between bleach and lxml, it is worth noting that bleach is significantly slower than lxml in our experience increasing page load time by 300-400 milliseconds. See https://github.com/jupyter/nbconvert/issues/1892?notification_referrer_id=NT_kwDOAARsWrE0NzAwNjE2MjEwOjI4OTg4Mg&notifications_query=repo%3Ajupyter%2Fnbconvert#issuecomment-1294475221

Revision history for this message

Victor Stinner (vstinner) wrote on 2022-10-31:

#5

> bleach is significantly slower than lxml

Well, make your choice between correctness (security) and performance :-D Today, I'm not aware of any open vulnerability in lxml, they are fixed quickly. But by design, bleach looks safer.

Revision history for this message

Cory Gwin (gwincr11) wrote on 2022-10-31:

#6

We have actually implemented 2 layers at different levels because 2 layers at different levels was faster than bleach and also safer than just lxml :) which is silly.

Revision history for this message

Michal Čihař (nijel) wrote on 2023-01-24:

#7

…and bleach is now deprecated as well, see https://github.com/mozilla/bleach/issues/698

Revision history for this message

frenzy (frenzy-madness) wrote on 2023-08-10:

#8

Download full text (3.1 KiB)

I'm looking into this now and have some more info.

Bleach might be deprecated but developers promised that it will still receive security fixes and will stay compatible with new Pythons so it should still be safe to use.

Another alternative I see among projects moving away from bleach/lxml is nh3, a Python binding for ammonia project written in Rust. Ammonia seems to be security-focused, whitelist-based, and fast. nh3 uses the latest pyo3 version (compatible with Python 3.12) and provides wheels built for stable ABI so it should work for everybody. It also seems straightforward to port a code from bleach to nh3 and nh3 is also much faster.

https://github.com/rust-ammonia/ammonia
https://github.com/messense/nh3

I've also looked at projects using clean_html or Cleaner from lxml. I've used the top 5000 PyPI projects, grep.app (https://grep.app/) and looked also into sources of RPM packages in Fedora Linux which depend on python3-lxml package. I've omitted projects already mentioned above.

In Fedora Linux, I found only two occurrences:

* python-readability-lxml
* calibre (this package bundles its own version of readability)

in the top 5000 projects I've found:

* requests-html: https://github.com/kennethreitz/requests-html/blob/master/requests_html.py#L30

And via grep.app, where support for regexes is limited:

As next steps, we can:

* Add nh3 as an alternative to the documentation and deprecation warning to Cleaner and clean_html.
* Open issues for identified projects. (I can do that)
* Finish the lxml-html-clean package (https://github.com/hrnciar/lxml-html-clean) (We can do it together with Tomáš)
* Add the new package as an extra dependency for lxml so one can install `lxml[clean_html]` and get both lxml and lxml-clean-html installed.
* Remove the deprecated part and make a new release.

What do you think about this...

I'm looking into this now and have some more info.

Bleach might be deprecated but developers promised that it will still receive security fixes and will stay compatible with new Pythons so it should still be safe to use.

Another alternative I see among projects moving away from bleach/lxml is nh3, a Python binding for ammonia project written in Rust. Ammonia seems to be security-focused, whitelist-based, and fast. nh3 uses the latest pyo3 version (compatible with Python 3.12) and provides wheels built for stable ABI so it should work for everybody. It also seems straightforward to port a code from bleach to nh3 and nh3 is also much faster.

https://github.com/rust-ammonia/ammonia
https://github.com/messense/nh3

I've also looked at projects using clean_html or Cleaner from lxml. I've used the top 5000 PyPI projects, grep.app (https://grep.app/) and looked also into sources of RPM packages in Fedora Linux which depend on python3-lxml package. I've omitted projects already mentioned above.

In Fedora Linux, I found only two occurrences:

* python-readability-lxml
* calibre (this package bundles its own version of readability)

in the top 5000 projects I've found:

* requests-html: https://github.com/kennethreitz/requests-html/blob/master/requests_html.py#L30

And via grep.app, where support for regexes is limited:

* https://github.com/ysim/songtext/blob/master/libsongtext/lyricwiki.py
* https://github.com/lorien/weblib/blob/master/weblib/feed.py
* https://github.com/ColdHeat/pybluemonday/blob/master/benchmarks.py (just some benchmark)
* https://github.com/nopper/twittomatic/blob/master/helpers/hadoop/wikipedia/words/cleanup.py
* https://github.com/python-gsoc/python-blogs/blob/master/aldryn_newsblog/utils/utilities.py
* https://github.com/Linbreux/wikmd/blob/main/wiki.py
* https://github.com/divio/aldryn-search/blob/master/aldryn_search/utils.py
* https://github.com/PacktPublishing/PythonDataAnalysisCookbook/blob/master/Chapter%205/processing_html.py
* https://github.com/divio/aldryn-newsblog/blob/master/aldryn_newsblog/utils/utilities.py (already archived)
* https://github.com/DMOJ/online-judge/blob/master/judge/migrations/0091_compiler_message_ansi2html.py
* https://github.com/janeczku/calibre-web/blob/master/cps/editbooks.py (handles ImportError)
* https://github.com/neuml/paperai/blob/master/examples/search.py
* https://github.com/khamidou/kite/blob/master/src/back/kite/maildir.py (already archived)
* https://github.com/NikolaiT/GoogleScraper/blob/master/GoogleScraper/caching.py
* https://github.com/kootenpv/sky/blob/master/sky/standalone/monitorPage.py
* https://github.com/anyant/rssant/blob/master/rssant_feedlib/processor.py

As next steps, we can:

* Add nh3 as an alternative to the documentation and deprecation warning to Cleaner and clean_html.
* Open issues for identified projects. (I can do that)
* Finish the lxml-html-clean package (https://github.com/hrnciar/lxml-html-clean) (We can do it together with Tomáš)
* Add the new package as an extra dependency for lxml so one can install `lxml[clean_html]` and get both lxml and lxml-clean-html installed.
* Remove the deprecated part and make a new release.

What do you think about this plan?

Revision history for this message

frenzy (frenzy-madness) wrote on 2023-08-24:

#9

I've prepared this message and I'll start posting it to projects we've found affected in the previous stages.

Subject: Consider switching from lxml's clean_html for enhanced security (and possibly performance)

I'd like to bring to your attention that we are discussing a possibility to remove lxml's clean_html functionality from lxml library. Over the past years, there have been several concerning security vulnerabilities discovered within the lxml library's clean_html functionality – [CVE-2021-43818](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-43818), [CVE-2021-28957](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28957), [CVE-2020-27783](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27783), [CVE-2018-19787](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-19787) and [CVE-2014-3146](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-3146).

The main problem is in the design. Because the lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with all new possibilities in HTML and JS.

Two viable alternatives worth considering are `bleach` and `nh3`. Here's why:

bleach:
* Bleach is a widely adopted Python library specifically designed for sanitizing and cleaning HTML input.
* It has a strong track record in terms of security – it's allowed-list-based.
* It was deprecated in January but it will still receive security updates, support for new Pythons and bugfixes, see [upstream issue](https://github.com/mozilla/bleach/issues/698#issue-1553425406).

nh3:
* nh3 is Python binding for the ammonia library. Ammonia is written in Rust and it's also allowed-list-based.
* Thanks to the Rust backend, nh3 is also significantly faster than bleach.
* Rust backend is nothing to be afraid of. nh3 uses the latest PyO3 compatible with Python 3.12 and provides wheels built on top of compatible ABI for different architectures and platforms.

We'll probably move the cleaning part of the lxml to a distinct project first so it will still be possible to use it but better is to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition anyhow and have a nice day.

---

Do you think there is anything to add or adjust?

Lumír

I've prepared this message and I'll start posting it to projects we've found affected in the previous stages.

Subject: Consider switching from lxml's clean_html for enhanced security (and possibly performance)

I'd like to bring to your attention that we are discussing a possibility to remove lxml's clean_html functionality from lxml library. Over the past years, there have been several concerning security vulnerabilities discovered within the lxml library's clean_html functionality – [CVE-2021-43818](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-43818), [CVE-2021-28957](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28957), [CVE-2020-27783](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27783), [CVE-2018-19787](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-19787) and [CVE-2014-3146](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-3146).

The main problem is in the design. Because the lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with all new possibilities in HTML and JS.

Two viable alternatives worth considering are `bleach` and `nh3`. Here's why:

bleach:
* Bleach is a widely adopted Python library specifically designed for sanitizing and cleaning HTML input.
* It has a strong track record in terms of security – it's allowed-list-based.
* It was deprecated in January but it will still receive security updates, support for new Pythons and bugfixes, see [upstream issue](https://github.com/mozilla/bleach/issues/698#issue-1553425406).

nh3:
* nh3 is Python binding for the ammonia library. Ammonia is written in Rust and it's also allowed-list-based.
* Thanks to the Rust backend, nh3 is also significantly faster than bleach.
* Rust backend is nothing to be afraid of. nh3 uses the latest PyO3 compatible with Python 3.12 and provides wheels built on top of compatible ABI for different architectures and platforms.

We'll probably move the cleaning part of the lxml to a distinct project first so it will still be possible to use it but better is to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition anyhow and have a nice day.

---

Do you think there is anything to add or adjust?

Lumír

Revision history for this message

scoder (scoder) wrote on 2023-08-24:

#10

Reads very good to me. Happy to see this message spread.

Seems worth adding a link back to this ticket, though, so that others can read up on the discussion and comment if needed.

Revision history for this message

frenzy (frenzy-madness) wrote on 2023-08-30:

#11

Today, I've gone through all the projects we discovered during the research phases and open issues for them. I also added the link to this bug to the previously proposed message. Before opening an issue, I've verified once more that the projects really use the functionality we are gonna (re)move.

This is a log of open issues and problems I've found during the process:

Today, I've gone through all the projects we discovered during the research phases and open issues for them. I also added the link to this bug to the previously proposed message. Before opening an issue, I've verified once more that the projects really use the functionality we are gonna (re)move.

This is a log of open issues and problems I've found during the process:

* requests-lxml does not exists. I've found the described functionality in requests-html so it's probably just a typo here in the first comment.
* https://github.com/mercuree/html-telegraph-poster/issues/23
* https://github.com/codelucas/newspaper/issues/972
* https://github.com/buriy/python-readability/issues/179
* https://github.com/miso-belica/jusText/issues/46
* https://github.com/adbar/htmldate/issues/91
* https://github.com/adbar/trafilatura/issues/412
* https://github.com/TeamHG-Memex/html-text/issues/30
* https://pypi.org/project/item/ - I cannot find sources of this project
* https://github.com/scrapinghub/extruct/issues/209
* https://pypi.org/project/requests-html/ points to https://github.com/kennethreitz/requests-html which is a fork (13 commits behind) of https://github.com/psf/requests-html: https://github.com/psf/requests-html/issues/558
* https://github.com/ysim/songtext/issues/50
* https://github.com/lorien/weblib/ does not exist anymore, the same problem as with `item` project above by the same author
* https://github.com/ColdHeat/pybluemonday/issues/44
* https://github.com/nopper/twittomatic/issues/13
* https://github.com/python-gsoc/python-blogs/issues/538
* https://github.com/Linbreux/wikmd/issues/125
* https://github.com/divio/aldryn-search/issues/115
* https://github.com/PacktPublishing/PythonDataAnalysisCookbook/issues/6
* https://github.com/DMOJ/online-judge/issues/2284
* https://github.com/janeczku/calibre-web/issues/2874
* https://github.com/neuml/paperai/issues/69
* https://github.com/NikolaiT/GoogleScraper/issues/247
* https://github.com/kootenpv/sky/issues/18
* https://github.com/anyant/rssant/issues/139

Revision history for this message

frenzy (frenzy-madness) wrote on 2023-08-30:

#12

I already have some responses and their sentiment is overall positive. Responders either consider the proposed alternatives or they don't need the output from clean_html to be secure. For the second group, we should make sure that the cleaner will be useful for them with minimal changes to their codebases.

Also, another possible alternative is pybluemonday which uses lxml only in benchmarks and might be another replacement.

Revision history for this message

frenzy (frenzy-madness) wrote on 2023-10-23:

#13

I've prepared a PR for documentation: https://github.com/lxml/lxml/pull/384 and also a new dedicated project lxml_html_clean for the functionalities from lxml.html.clean: https://github.com/fedora-python/lxml_html_clean

It would be great if you could check them both. I'd then publish the new package to PyPI and we can start preparing lxml for removal for those parts.

Revision history for this message

frenzy (frenzy-madness) wrote on 2024-02-26:

#14

It took me a little bit longer than expected but I finished the new project lxml_html_clean.

Github: https://github.com/fedora-python/lxml_html_clean
PyPI: https://pypi.org/project/lxml-html-clean/
Documentation: https://lxml-html-clean.readthedocs.io/en/latest/#

It was a little bit tricky but I managed to convince git filter-branch to preserve the whole history of relevant files clean.py and its tests.

The next step is to prepare a PR to remove that part from lxml. I plan to work on that ASAP.

Revision history for this message

scoder (scoder) wrote on 2024-03-02:

#15

Thanks a lot for working on this. I'll target this to lxml 5.2.

Changed in lxml:
milestone:	none → 5.2

Revision history for this message

scoder (scoder) wrote on 2024-03-29:

#16

The module has been extracted into a separate library and will be removed from lxml 5.2.

See https://github.com/fedora-python/lxml_html_clean

Changed in lxml:
assignee:	nobody → scoder (scoder)
status:	Confirmed → Fix Committed

Revision history for this message

scoder (scoder) wrote on 2024-03-31:

#17

Implemented and released in lxml 5.2.

Thanks for taking over the maintenance of the package, it'll have a better life outside of lxml.
That will make it clearer that it's just one choice amongst several, and one that is not meant for security relevant use cases.

summary:	- Consider deprecating/removing clean_html() in favor of bleach? + Move lxml.html.clean into external project
Changed in lxml:
status:	Fix Committed → Fix Released

Revision history for this message

frenzy (frenzy-madness) wrote on 2024-04-02:

#18

I've posted an update about the newest lxml release to all open issues.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2024-04-17:

#19

This bug was fixed in the package lxml - 5.2.1-1

---------------
lxml (5.2.1-1) unstable; urgency=medium

* New upstream version.

-- Matthias Klose <email address hidden> Wed, 03 Apr 2024 22:07:13 +0200

Changed in lxml (Ubuntu):
status:	New → Fix Released

lxml

Move lxml.html.clean into external project

Bug Description

CVE References

Other bug subscribers

Remote bug watches

Affects		Status	Importance	Assigned to	Milestone
	lxml	Fix Released	Medium	scoder	lxml 5.2
	lxml (Ubuntu)	Fix Released	Undecided	Unassigned