2022-01-20 14:34:31 |
Victor Stinner |
bug |
|
|
added bug |
2022-01-20 14:36:41 |
Victor Stinner |
description |
Hi,
Recently at Red Hat, we had to fix (that is, backport the changes for) multiple lxml clean_html() security issues in the lxml versions that we maintain in Fedora and RHEL. It's a "whack-a-mole" game since the implementation is based on a block list.
Would it be possible to deprecate, or even consider removing, the clean_html() function and suggest that developers use the bleach project instead? The bleach project is based on an allow list and so is safer.
Bleach project: https://github.com/mozilla/bleach
"Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes"
Bleach seems quite popular: https://libraries.io/pypi/bleach says that 11.7K repositories and 586 packages depend on it.
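To illustrate why an allow list tends to be safer than a block list, here is a minimal sketch using only the Python standard library. It is not how bleach or lxml are implemented. The tag, attribute and scheme sets, and the names AllowListSanitizer, sanitize(), passes_block_list() and passes_allow_list(), are all hypothetical and chosen for illustration only.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for illustration; not the defaults of bleach or lxml.
ALLOWED_TAGS = {"a", "b", "em", "i", "p", "strong"}
ALLOWED_ATTRS = {"a": {"href", "title"}}
ALLOWED_SCHEMES = ("http://", "https://", "mailto:")
BLOCKED_TAGS = {"script", "style", "iframe"}


def passes_block_list(tag):
    """Block list: anything the author did not think of gets through."""
    return tag not in BLOCKED_TAGS


def passes_allow_list(tag):
    """Allow list: anything the author did not think of is rejected."""
    return tag in ALLOWED_TAGS


class AllowListSanitizer(HTMLParser):
    """Keep explicitly allowed tags and attributes, escape all other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if not passes_allow_list(tag):
            # Unknown markup is escaped rather than trusted.
            self.out.append(escape(self.get_starttag_text()))
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            value = value or ""
            # URLs must start with an allowed scheme (relative URLs are dropped too).
            if name == "href" and not value.lower().startswith(ALLOWED_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        closing = f"</{tag}>"
        self.out.append(closing if passes_allow_list(tag) else escape(closing))

    def handle_data(self, data):
        self.out.append(escape(data))


def sanitize(html_text):
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

With this sketch, a tag that neither list mentions (for example "svg") passes the block-list check but fails the allow-list check, which is the difference in failure mode between the two approaches.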
--
In the last 15 months, 3 vulnerabilities have been found in the lxml clean_html() function:
* 2021-12-12, CVE-2021-43818 (SVG):
https://github.com/lxml/lxml/security/advisories/GHSA-55x5-fj6c-h6m8
* 2021-03-21, CVE-2021-28957 (HTML action attribute):
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28957
* 2020-11-27, CVE-2020-27783 (lxml 4.6.2):
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27783
--
I ran a code search on the top 5000 PyPI projects (as of 2021-12-01).
I found the following 10 projects which use the lxml clean_html() function:
* requests-html: find() and xpath() use lxml clean_html() if their clean parameter is true (default: clean=False)
* html-telegraph-poster: html_telegraph_poster.converter.clean_article_html() uses lxml clean_html()
* newspaper3k: OutputFormatter.convert_to_html() always calls Parser.clean_article_html() which uses lxml clean_html()
* readability-lxml: Document._parse() uses lxml clean_html()
* jusText: jusText.core.preprocessor() uses lxml clean_html()
* htmldate: htmldate.core.find_date() uses lxml clean_html() with the comment "# clean before string search".
* trafilatura: tree_cleaning() uses lxml clean_html()
* html_text: _cleaned_html_tree() uses lxml clean_html(), function called by cleaned_selector() and extract_text()
* item: HTMLField uses lxml clean_html()
* extruct: LxmlMicrodataExtractor._extract_textContent() uses lxml clean_html()
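The report does not say which tool was used for the code search. As a rough sketch of how such a search could be done on Python sources, the following uses the standard ast module to find calls to anything named clean_html. The function name find_clean_html_calls() is hypothetical. A name-based match like this cannot tell lxml's clean_html() apart from unrelated functions with the same name, so each hit still needs manual review.

```python
import ast


def find_clean_html_calls(source, filename="<string>"):
    """Return sorted (line, column) pairs for calls to a callee named clean_html."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Plain name: clean_html(...); attribute access: module.clean_html(...)
        if isinstance(func, ast.Name):
            name = func.id
        else:
            name = getattr(func, "attr", None)
        if name == "clean_html":
            hits.append((node.lineno, node.col_offset))
    return sorted(hits)
```

Example usage on a small snippet:

```python
sample = (
    "from lxml.html.clean import clean_html\n"
    "import lxml.html.clean as cleaner\n"
    "a = clean_html(doc)\n"
    "b = cleaner.clean_html(doc)\n"
    "c = other(doc)\n"
)
print(find_clean_html_calls(sample))
```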
The "clean_html" code search also found projects which don't use lxml to clean HTML:
* nltk.util.clean_html() raises NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
* textblob.blob.BaseBlob(clean_html=False): the clean_html parameter raises an exception if it's true: NotImplementedError("clean_html has been deprecated. To remove HTML markup, use BeautifulSoup's get_text() function")
* django.utils.html.clean_html(): this undocumented function was removed in Django 1.8. See https://docs.djangoproject.com/en/dev/releases/1.7/ for details (the 1.7 release notes announce the deprecation).
* The django-html_sanitizer project is based on bleach.
* yt_dlp.utils.clean_html() uses 3 regex replacements and calls its unescapeHTML() function to replace HTML entities using a 4th regex
* recommender-xblock uses bleach.clean() |
|
|
2022-01-20 14:37:37 |
Victor Stinner |
description |
|
|
|
2022-01-21 16:57:42 |
scoder |
lxml: importance |
Undecided |
Medium |
|
2022-01-21 16:57:42 |
scoder |
lxml: status |
New |
Confirmed |
|
2022-10-28 13:24:50 |
Cory Gwin |
bug watch added |
|
https://github.com/jupyter/nbconvert/issues/1892 |
|
2023-01-24 08:16:32 |
Michal Čihař |
bug watch added |
|
https://github.com/mozilla/bleach/issues/698 |
|
2023-08-24 12:20:10 |
frenzy |
cve linked |
|
2014-3146 |
|
2023-08-24 12:20:10 |
frenzy |
cve linked |
|
2018-19787 |
|
2023-08-24 12:20:10 |
frenzy |
cve linked |
|
2020-27783 |
|
2023-08-24 12:20:10 |
frenzy |
cve linked |
|
2021-28957 |
|
2023-08-24 12:20:10 |
frenzy |
cve linked |
|
2021-43818 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/mercuree/html-telegraph-poster/issues/23 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/codelucas/newspaper/issues/972 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/buriy/python-readability/issues/179 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/miso-belica/jusText/issues/46 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/adbar/htmldate/issues/91 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/adbar/trafilatura/issues/412 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/TeamHG-Memex/html-text/issues/30 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/scrapinghub/extruct/issues/209 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/psf/requests-html/issues/558 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/ysim/songtext/issues/50 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/ColdHeat/pybluemonday/issues/44 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/nopper/twittomatic/issues/13 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/python-gsoc/python-blogs/issues/538 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/Linbreux/wikmd/issues/125 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/divio/aldryn-search/issues/115 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/PacktPublishing/PythonDataAnalysisCookbook/issues/6 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/DMOJ/online-judge/issues/2284 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/janeczku/calibre-web/issues/2874 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/neuml/paperai/issues/69 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/NikolaiT/GoogleScraper/issues/247 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/kootenpv/sky/issues/18 |
|
2023-08-30 16:56:43 |
frenzy |
bug watch added |
|
https://github.com/anyant/rssant/issues/139 |
|
2024-03-02 06:06:53 |
scoder |
lxml: milestone |
|
5.2 |
|
2024-03-29 20:14:36 |
scoder |
lxml: status |
Confirmed |
Fix Committed |
|
2024-03-29 20:14:36 |
scoder |
lxml: assignee |
|
scoder (scoder) |
|
2024-03-31 06:45:07 |
scoder |
summary |
Consider deprecating/removing clean_html() in favor of bleach? |
Move lxml.html.clean into external project |
|
2024-03-31 06:48:31 |
scoder |
lxml: status |
Fix Committed |
Fix Released |
|
2024-04-02 15:32:50 |
Matthias Klose |
bug task added |
|
lxml (Ubuntu) |
|
2024-04-17 17:41:06 |
Launchpad Janitor |
lxml (Ubuntu): status |
New |
Fix Released |
|