clean_html eats up all RAM and segfaults

Bug #1889653 reported by Anselm
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

On at least one specific website, lxml.html.clean.clean_html eats up all RAM on my PC and then segfaults.

A minimal reproducible example is below (you will need the HTML attached, which is the HTML from "https://www.bpz-online.de/").

One unusual thing about this website is that it contains >6k lines of useless text data at the XPath "//*[@id="Container_Startseite"]/div[4]/div[4]/div[2]" - but I don't understand yet how this would cause the memory leak.

To reproduce

# Use at most 2GB RAM to prevent freeze
ulimit -Sv 2000000

## Execute this
from lxml.html.clean import clean_html

with open("bug.html") as f:
    html = f.read()

clean_html(html)
##

# Version info
Python : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 5, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

Revision history for this message
scoder (scoder) wrote :

Could you maybe try to cut down the file to some smaller example that reproduces this? That would make it clearer where to look for the problem.

Changed in lxml:
status: New → Triaged
Revision history for this message
Anselm (anselm--) wrote : Re: [Bug 1889653] Re: clean_html eats up all RAM and segfaults

I did not succeed in doing that. Instead, I found that cutting out lines 2000-60000 (which only contain HTML comments, newlines and tabs) makes it work.
So it seems that very long sequences of comments, tabs and newlines cause this crash.

Revision history for this message
scoder (scoder) wrote :

It runs through for me when I disable the comment discarding (comments=False).
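
For reference, a minimal sketch of that workaround, using the Cleaner class directly (assuming bug.html is the attached file):

from lxml.html.clean import Cleaner

# Workaround sketch: keep comments instead of discarding them, so the
# cleaner never has to splice tail text across the huge run of comments.
cleaner = Cleaner(comments=False)

with open("bug.html") as f:
    cleaned = cleaner.clean_html(f.read())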

It seems to eat up a lot of memory while discarding the run of comments (around line 400), for which it has to concatenate a lot of tail text.

        # in lxml.html.clean.Cleaner: each flagged element is dropped individually
        for el in _kill:
            el.drop_tree()

A consecutive run of elements to discard is the worst-case scenario for this simplistic algorithm, which discards one element after the other. This could be improved by 'inlining' the ".drop_tree()" method into the cleaner and letting it detect sequences of elements that share the same parent, so that the new text/tail is collected and generated only once after removing all of them. Basically: collect tail texts and elements, use parent.remove() to discard those elements, then set the new text/tail.

It's not entirely as trivial as that, because there is already a bit of ordering going on, but I think this is a reasonable direction to test out.
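
A rough sketch of that direction (the drop_run helper below is made up for illustration, not lxml API, and assumes the elements form a run of consecutive siblings):

from lxml import html

def drop_run(elements):
    # Hypothetical helper: drop a run of consecutive sibling elements,
    # merging their tail text in a single pass instead of re-splicing it
    # on every individual .drop_tree() call.
    if not elements:
        return
    parent = elements[0].getparent()
    previous = elements[0].getprevious()
    # Gather all tail text before touching the tree.
    combined = "".join(el.tail for el in elements if el.tail)
    for el in elements:
        parent.remove(el)
    if combined:
        if previous is not None:
            previous.tail = (previous.tail or "") + combined
        else:
            parent.text = (parent.text or "") + combined

# Toy check: dropping both <span>s keeps the surrounding text intact.
root = html.fromstring("<div>a<span>x</span>b<span>y</span>c</div>")
drop_run(root.findall("span"))
assert root.text_content() == "abc"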

Would you like to give it a try?

scoder (scoder)
Changed in lxml:
status: Triaged → Confirmed