clean_html eats up all RAM and segfaults
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Undecided | Unassigned |
Bug Description
On at least one specific website, lxml.html.clean.clean_html eats up all available RAM and then segfaults.
A minimal reproducible example is below (you will need the attached HTML, which is the page source from "https:/…").
One unusual thing about this website is that it contains >6k lines of useless text data under the XPath "//*[@id=…".
To reproduce
# Use at most 2GB RAM to prevent freeze
ulimit -Sv 2000000
## Execute this
from lxml.html.clean import clean_html
with open("bug.html") as f:
    html = f.read()
clean_html(html)
##
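
The memory cap can also be set from inside the script, which makes the repro a single file. This is only a sketch of an equivalent setup (my addition, not part of the original report); it uses the stdlib resource module and therefore only works on Unix-like systems:

# Cap the address space to the same 2 GB soft limit as `ulimit -Sv 2000000`.
# Unix-only sketch; "bug.html" is the attachment from this report.
import resource
from lxml.html.clean import clean_html

limit_bytes = 2_000_000 * 1024  # ulimit -Sv is in KiB, RLIMIT_AS is in bytes
_, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

with open("bug.html") as f:
    html = f.read()

# Python-level allocations past the limit raise MemoryError; allocations made
# inside libxml2's C code can fail differently, which is presumably where the
# segfault comes from.
clean_html(html)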
# Version info
Python : sys.version_
lxml.etree : (4, 5, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
Changed in lxml: status: Triaged → Confirmed
Could you maybe try to cut down the file to some smaller example that reproduces this? That would make it clearer where to look for the problem.
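
For what it's worth, one way to start cutting the file down (a sketch only; "PLACEHOLDER_ID" stands in for the real id, which is truncated in the description above) is to drop the suspect subtree and write out a smaller file:

# Sketch: remove the suspected large subtree and save a smaller test file.
from lxml import html as lhtml

doc = lhtml.parse("bug.html")
for node in doc.xpath('//*[@id="PLACEHOLDER_ID"]'):
    node.getparent().remove(node)
doc.write("bug_smaller.html", method="html")

If the repro snippet above then runs without exhausting memory when pointed at bug_smaller.html, the removed subtree is the trigger and is itself a much smaller test case.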