Activity log for bug #1958539

Date Who What changed Old value New value Message
2022-01-20 14:34:31 Victor Stinner bug added bug
2022-01-20 14:36:41 Victor Stinner description
Old value: same text as the new value below, apart from spacing and the opening words "Hi, Recantly at Red Hat".
New value:

Hi,

Recently at Red Hat, we had to fix (backport the changes for) multiple lxml clean_html() security issues in the lxml versions that we are maintaining in Fedora and RHEL. It's a "whack-a-mole" game since the implementation is based on a block list.

Would it be possible to deprecate, or even consider removing, the clean_html() function and suggest that developers use the bleach project instead? The bleach project is based on an allow list and so is safer.

Bleach project: https://github.com/mozilla/bleach
"Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes"

Bleach seems quite peopular [sic]: https://libraries.io/pypi/bleach says 11.7K repositories depend on it and 586 packages depend on it.

--

In the last 15 months, 3 vulnerabilities have been found in the lxml clean_html() function:

* 2021-12-12, CVE-2021-43818 (SVG): https://github.com/lxml/lxml/security/advisories/GHSA-55x5-fj6c-h6m8
* 2021-03-21, CVE-2021-28957 (HTML action attribute): https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28957
* 2020-11-27, CVE-2020-27783 (lxml 4.6.2): https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27783

--

I ran a code search on the PyPI top 5000 projects (as of 2021-12-01). I found the following 10 projects which use the lxml clean_html() method:

* requests-lxml: find() and xpath() use lxml clean_html() if their clean parameter is true (default: clean=False)
* html-telegraph-poster: html_telegraph_poster.converter.clean_article_html() uses lxml clean_html()
* newspaper3k: OutputFormatter.convert_to_html() always calls Parser.clean_article_html() which uses lxml clean_html()
* readability-lxml: Document._parse() uses lxml clean_html()
* jusText: jusText.core.preprocessor() uses lxml clean_html()
* htmldate: htmldate.core.find_date() uses lxml clean_html() with the comment "# clean before string search".
* trafilatura: tree_cleaning() uses lxml clean_html()
* html_text: _cleaned_html_tree() uses lxml clean_html(), function called by cleaned_selector() and extract_text()
* item: HTMLField uses lxml clean_html()
* extruct: LxmlMicrodataExtractor._extract_textContent() uses lxml clean_html()

The "clean_html" code search also found projects which don't use lxml to clean HTML:

* nltk.util.clean_html() raises NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
* textblock.blob.BaseBlob(clean_html=False): the parameter raises an exception if it's true: NotImplementedError("clean_html has been deprecated. To remove HTML markup, use BeautifulSoup's get_text() function")
* django.utils.html.clean_html(): undocumented function, removed in Django 1.8. See https://docs.djangoproject.com/en/dev/releases/1.7/ for details (it announces the deprecation).
* The django-html_sanitizer project is based on bleach.
* yt_dlp.utils.clean_html() uses 3 regex replacements and calls its unescapeHTML() function to replace HTML entities using a 4th regex
* recommender-xblock uses bleach.clean()
2022-01-20 14:37:37 Victor Stinner description
Old value: the description as set at 14:36:41, including the sentence "Bleach seems quite peopular: https://libraries.io/pypi/bleach says 11.7K repositories depend on it and 586 packages depend on it."
New value: the same text with "peopular" corrected to "popular"; nothing else changed.
2022-01-21 16:57:42 scoder lxml: importance Undecided Medium
2022-01-21 16:57:42 scoder lxml: status New Confirmed
2022-10-28 13:24:50 Cory Gwin bug watch added https://github.com/jupyter/nbconvert/issues/1892
2023-01-24 08:16:32 Michal Čihař bug watch added https://github.com/mozilla/bleach/issues/698
2023-08-24 12:20:10 frenzy cve linked 2014-3146
2023-08-24 12:20:10 frenzy cve linked 2018-19787
2023-08-24 12:20:10 frenzy cve linked 2020-27783
2023-08-24 12:20:10 frenzy cve linked 2021-28957
2023-08-24 12:20:10 frenzy cve linked 2021-43818
2023-08-30 16:56:43 frenzy bug watch added https://github.com/mercuree/html-telegraph-poster/issues/23
2023-08-30 16:56:43 frenzy bug watch added https://github.com/codelucas/newspaper/issues/972
2023-08-30 16:56:43 frenzy bug watch added https://github.com/buriy/python-readability/issues/179
2023-08-30 16:56:43 frenzy bug watch added https://github.com/miso-belica/jusText/issues/46
2023-08-30 16:56:43 frenzy bug watch added https://github.com/adbar/htmldate/issues/91
2023-08-30 16:56:43 frenzy bug watch added https://github.com/adbar/trafilatura/issues/412
2023-08-30 16:56:43 frenzy bug watch added https://github.com/TeamHG-Memex/html-text/issues/30
2023-08-30 16:56:43 frenzy bug watch added https://github.com/scrapinghub/extruct/issues/209
2023-08-30 16:56:43 frenzy bug watch added https://github.com/psf/requests-html/issues/558
2023-08-30 16:56:43 frenzy bug watch added https://github.com/ysim/songtext/issues/50
2023-08-30 16:56:43 frenzy bug watch added https://github.com/ColdHeat/pybluemonday/issues/44
2023-08-30 16:56:43 frenzy bug watch added https://github.com/nopper/twittomatic/issues/13
2023-08-30 16:56:43 frenzy bug watch added https://github.com/python-gsoc/python-blogs/issues/538
2023-08-30 16:56:43 frenzy bug watch added https://github.com/Linbreux/wikmd/issues/125
2023-08-30 16:56:43 frenzy bug watch added https://github.com/divio/aldryn-search/issues/115
2023-08-30 16:56:43 frenzy bug watch added https://github.com/PacktPublishing/PythonDataAnalysisCookbook/issues/6
2023-08-30 16:56:43 frenzy bug watch added https://github.com/DMOJ/online-judge/issues/2284
2023-08-30 16:56:43 frenzy bug watch added https://github.com/janeczku/calibre-web/issues/2874
2023-08-30 16:56:43 frenzy bug watch added https://github.com/neuml/paperai/issues/69
2023-08-30 16:56:43 frenzy bug watch added https://github.com/NikolaiT/GoogleScraper/issues/247
2023-08-30 16:56:43 frenzy bug watch added https://github.com/kootenpv/sky/issues/18
2023-08-30 16:56:43 frenzy bug watch added https://github.com/anyant/rssant/issues/139
2024-03-02 06:06:53 scoder lxml: milestone 5.2
2024-03-29 20:14:36 scoder lxml: status Confirmed Fix Committed
2024-03-29 20:14:36 scoder lxml: assignee scoder (scoder)
2024-03-31 06:45:07 scoder summary Consider deprecating/removing clean_html() in favor of bleach? Move lxml.html.clean into external project
2024-03-31 06:48:31 scoder lxml: status Fix Committed Fix Released
2024-04-02 15:32:50 Matthias Klose bug task added lxml (Ubuntu)
2024-04-17 17:41:06 Launchpad Janitor lxml (Ubuntu): status New Fix Released
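
The bug description argues that a block list is a "whack-a-mole" game while an allow list is safer. A minimal sketch of that difference follows, using only the Python standard library. It is not how lxml or bleach is implemented: the tag and attribute sets, the helper class, and the function names are all hypothetical and chosen for illustration, and the sketch does not validate URL schemes, which a real sanitizer would also need to do.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policies, for illustration only.
BLOCKED_TAGS = {"script", "iframe"}
ALLOWED_TAGS = {"p", "b", "i", "a"}
ALLOWED_ATTRS = {"a": {"href"}}


class _Filter(HTMLParser):
    """Re-emit markup, keeping only tags and attributes the predicates accept."""

    def __init__(self, keep_tag, keep_attr):
        super().__init__(convert_charrefs=True)
        self.keep_tag = keep_tag
        self.keep_attr = keep_attr
        self.out = []

    def handle_starttag(self, tag, attrs):
        if not self.keep_tag(tag):
            return
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if self.keep_attr(tag, name)
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if self.keep_tag(tag):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always re-escaped, including the body of a dropped tag.
        self.out.append(escape(data))


def _sanitize(html, keep_tag, keep_attr):
    parser = _Filter(keep_tag, keep_attr)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def block_list_clean(html):
    """Drop what is known to be bad; anything unanticipated passes through."""
    return _sanitize(
        html,
        keep_tag=lambda tag: tag not in BLOCKED_TAGS,
        keep_attr=lambda tag, name: not name.startswith("on"),
    )


def allow_list_clean(html):
    """Keep only what is known to be acceptable; everything else is dropped."""
    return _sanitize(
        html,
        keep_tag=lambda tag: tag in ALLOWED_TAGS,
        keep_attr=lambda tag, name: name in ALLOWED_ATTRS.get(tag, ()),
    )


if __name__ == "__main__":
    sample = '<p>Hi</p><form action="javascript:alert(1)"><b>ok</b></form>'
    print(block_list_clean(sample))
    print(allow_list_clean(sample))
```

With this sample, the block-list variant lets the form element and its action attribute through because neither was anticipated by the hypothetical block list, whereas the allow-list variant keeps only the p and b elements. Each newly discovered bypass forces a block list to grow, which is the maintenance burden the report describes.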