clean_html eats up all RAM and segfaults

Bug #1889653 reported by Anselm
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

On at least one specific website, lxml.html.clean.clean_html eats up all RAM on my PC and then segfaults.

A minimal reproducible example is below (you will need the HTML attached, which is the HTML from "https://www.bpz-online.de/").

One unusual thing about this website is that it contains >6k lines of useless text data at the XPath "//*[@id="Container_Startseite"]/div[4]/div[4]/div[2]" - but I don't understand yet how this would cause the memory leak.

To reproduce

# Use at most 2GB RAM to prevent freeze
ulimit -Sv 2000000

## Execute this
from lxml.html.clean import clean_html

with open("bug.html") as f:
    html = f.read()

clean_html(html)
##

# Version info
Python : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 5, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

Revision history for this message
scoder (scoder) wrote :

Could you maybe try to cut down the file to some smaller example that reproduces this? That would make it clearer where to look for the problem.

Changed in lxml:
status: New → Triaged
Revision history for this message
Anselm (anselm--) wrote : Re: [Bug 1889653] Re: clean_html eats up all RAM and segfaults

I did not succeed in doing that. Instead, I found that cutting out lines 2000-60000 (which only contain HTML comments, newlines and tabs) makes it work.
So it seems that very long sequences of comments, tabs and newlines cause this crash.

Revision history for this message
scoder (scoder) wrote :

It runs through for me when I disable the comment discarding (comments=False).
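
For reference, a minimal sketch of that workaround, using the Cleaner class directly (assuming bug.html is the attached file):

from lxml.html.clean import Cleaner

# Workaround sketch: keep comments instead of discarding them, so the
# cleaner never has to splice tail text across the huge run of comments.
cleaner = Cleaner(comments=False)

with open("bug.html") as f:
    cleaned = cleaner.clean_html(f.read())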

It seems to eat up a lot of memory while discarding the run of comments (around line 400), for which it has to concatenate a lot of tail text.

        # in lxml.html.clean.Cleaner: each flagged element is dropped individually
        for el in _kill:
            el.drop_tree()

A consecutive run of elements to discard is the worst-case scenario for this simplistic algorithm, which discards one element after the other. This could be improved by 'inlining' the ".drop_tree()" method into the cleaner and letting it detect sequences of elements that share the same parent, so that the new text/tail is collected and generated only once after removing all of them. Basically: collect tail texts and elements, use parent.remove() to discard those elements, then set the new text/tail.

It's not entirely as trivial as that, because there is already a bit of ordering going on, but I think this is a reasonable direction to test out.
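
A rough sketch of that direction (the drop_run helper below is made up for illustration, not lxml API, and assumes the elements form a run of consecutive siblings):

from lxml import html

def drop_run(elements):
    # Hypothetical helper: drop a run of consecutive sibling elements,
    # merging their tail text in a single pass instead of re-splicing it
    # on every individual .drop_tree() call.
    if not elements:
        return
    parent = elements[0].getparent()
    previous = elements[0].getprevious()
    # Gather all tail text before touching the tree.
    combined = "".join(el.tail for el in elements if el.tail)
    for el in elements:
        parent.remove(el)
    if combined:
        if previous is not None:
            previous.tail = (previous.tail or "") + combined
        else:
            parent.text = (parent.text or "") + combined

# Toy check: dropping both <span>s keeps the surrounding text intact.
root = html.fromstring("<div>a<span>x</span>b<span>y</span>c</div>")
drop_run(root.findall("span"))
assert root.text_content() == "abc"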

Would you like to give it a try?

scoder (scoder)
Changed in lxml:
status: Triaged → Confirmed