lxml.html.fromstring huge memory leak after parsing a few incorrectly formed HTML pages

Bug #728924 reported by Creotiv
This bug affects 1 person
Affects: lxml
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

System:
Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux
Versions:
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 2, 8, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)

I'm writing a web spider and using the lxml library for parsing pages.
I found that after some time my spider eats all system memory (8 GB) and the server goes down. While investigating this bug I found that the problem is in lxml.html.fromstring. My tests show that after a few parser exceptions ("Document empty" and "htmlParseEntityRef: expecting ';'") it starts eating memory until the server goes down.
When I added a module that checks the HTML markup for correctness, the bug went away.

I tried to reproduce this bug in standalone code, but failed. Sorry.
But there is definitely something not right.
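
A standalone reproduction attempt of this kind looks roughly like the sketch below; the malformed samples and iteration counts are placeholders, not the spider's real input, so it may well not trigger the problem:

    # Rough reproduction sketch: repeatedly parse malformed HTML and watch the
    # process's peak resident set size.  The sample documents and loop counts
    # are placeholders, not the spider's real input.
    import resource
    from lxml import html
    from lxml.etree import ParserError, XMLSyntaxError

    BROKEN_PAGES = [
        "",                                   # empty input raises ParserError
        "<html><body>&broken entity</body>",  # malformed entity reference
    ]

    for i in range(100000):
        for page in BROKEN_PAGES:
            try:
                html.fromstring(page)
            except (ParserError, XMLSyntaxError):
                pass  # the exceptions themselves are expected here
        if i % 10000 == 0:
            rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print("iteration %d, peak RSS %d kB" % (i, rss_kb))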

Revision history for this message
Creotiv (creotiv) wrote :
Revision history for this message
scoder (scoder) wrote : Re: [Bug 728924] [NEW] lxml.html.fromstring huge memory leak after parsing a few incorrectly formed HTML pages

> lxml.etree : (2, 2, 8, 0)

Could you try the latest stable lxml release, i.e. 2.3?

Stefan

Revision history for this message
scoder (scoder) wrote :

Also, I can't reproduce any problems with the HTML page you posted. It neither fails to parse, nor does repeated parsing show any memory problems. There seems to be a potentially related problem in libxml2 2.7.6, but 2.7.7 and later should be fine.

Revision history for this message
Creotiv (creotiv) wrote :

OK, I will try version 2.3 and let you know.

Revision history for this message
Creotiv (creotiv) wrote :

I've tested with lxml version 2.3, but the bug remains.
Here are the new versions:

Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

I'm now testing one guess. I think this bug could be caused by use of the multiprocessing library, because my script runs in multiple instances (the instances are independent). When I run it in a single instance, the bug didn't appear (but may still exist).

Revision history for this message
Creotiv (creotiv) wrote :

My guess was wrong; even with one active process it eats memory.

Revision history for this message
scoder (scoder) wrote : Re: [Bug 728924] Re: lxml.html.fromstring huge memory leak after parsing a few incorrectly formed HTML pages

I see no errors when parsing the file you attached. Triggering exceptions
by parsing an empty file works for me, but neither of the two cases shows
any signs of a memory leak.

Please provide a test script and corresponding input data that reproduces
the problem, otherwise I can't help you.

Stefan

Revision history for this message
Creotiv (creotiv) wrote :

When I set a memory limit with ulimit -v and ulimit -m, I get this error:
"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored"

Revision history for this message
Creotiv (creotiv) wrote :
Revision history for this message
scoder (scoder) wrote :

I don't think this is related. I'm not surprised that you can trigger memory errors in arbitrary places by setting memory limits.

Revision history for this message
scoder (scoder) wrote :

bug report lacks a reproducible test case

Changed in lxml:
status: New → Incomplete
Creotiv (creotiv)
Changed in lxml:
status: Incomplete → New
Revision history for this message
Creotiv (creotiv) wrote :

I've tried to fix this situation by writing a watchdog process that recreates dead child processes.
But there is one problem. When I get the exception "Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored" I can't handle it, and what's more, it writes this exception to the error log until there is no free space left on disk.

How can I handle this error so that the process dies and the watchdog daemon can recreate it?
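
The watchdog loop itself is simple; a rough sketch of the idea, where parse_forever stands in for the actual spider worker:

    # Minimal watchdog sketch: run the parser worker in a child process and
    # restart it whenever it dies (for example after being killed for
    # exceeding the memory limit).  parse_forever is a placeholder for the
    # real spider worker, not actual project code.
    import time
    from multiprocessing import Process

    def parse_forever():
        pass  # placeholder: fetch pages and feed them to lxml.html.fromstring

    def run_with_watchdog():
        while True:
            worker = Process(target=parse_forever)
            worker.start()
            worker.join()  # returns once the worker exits or is killed
            print("worker exited with code %s, restarting" % worker.exitcode)
            time.sleep(1)  # brief pause before restarting

    if __name__ == "__main__":
        run_with_watchdog()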

Revision history for this message
scoder (scoder) wrote :

What do you mean by

"it write this exception in error log until there is free space on disk."

The error log is bounded in size, and it doesn't keep the exceptions alive, only their text. So the log can only grow up to a certain size and can't eat up all memory. I still don't see how this is related.
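
For instance, the per-parser log can be inspected after a failed parse and only contains the message texts; a small sketch using the documented error_log attribute:

    # The parser-level error log only stores the libxml2 message texts,
    # which can be inspected after a failed parse.
    from lxml import etree

    parser = etree.XMLParser()
    try:
        etree.fromstring("<broken", parser)
    except etree.XMLSyntaxError:
        pass

    for entry in parser.error_log:
        print("%s: %s" % (entry.level_name, entry.message))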

Revision history for this message
Creotiv (creotiv) wrote :

I set up the error log with up to 10 MB per file and up to 5 files. But lxml somehow writes 75 GB to the error log file.

I need this so that lxml keeps working until it hits a memory error and then dies, so the monitor process can create a new process with the lxml parser. This is not a very clean solution, but I haven't had time to find the real problem in the lxml parser.

For now, one solution I have is to monkeypatch lxml.etree._BaseErrorLog.

Revision history for this message
scoder (scoder) wrote :

Please provide the code that you use to set up the error log. ISTM that the problem is to be found there.

Revision history for this message
Creotiv (creotiv) wrote :

Also, I figured out why I have a memory leak. It's because I have error caching turned on, and when lxml starts writing errors it writes them into memory first. And I only have 8 GB of RAM, while the log can grow to 100 GB and more.
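
A sketch of the kind of configuration that caches records in memory before they reach the size-limited files on disk; the handler choice, file name, sizes and capacity are illustrative, not the actual setup:

    # Illustrative only: logging.handlers.MemoryHandler keeps records in RAM
    # and only forwards them to the size-limited RotatingFileHandler once
    # `capacity` records have accumulated or a record reaches `flushLevel`.
    import logging
    import logging.handlers

    file_handler = logging.handlers.RotatingFileHandler(
        "spider-errors.log", maxBytes=10 * 1024 * 1024, backupCount=5)

    cached_handler = logging.handlers.MemoryHandler(
        100000, flushLevel=logging.CRITICAL, target=file_handler)

    logging.getLogger().addHandler(cached_handler)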

Revision history for this message
scoder (scoder) wrote :

Ok, so, do I understand correctly that this problem is completely unrelated to lxml then?

I mean, your logging setup code doesn't even use lxml, and lxml itself does not use Python's logging by default.

Revision history for this message
Creotiv (creotiv) wrote :

Of course my logging setup does not use lxml; I gave it to you so you can see that I set a size limit for the log files. lxml is used by the application to parse HTML pages that are then analyzed.
But in some cases the lxml.html.fromstring method causes runaway memory consumption, and since I set a memory limit with "ulimit", it kills my hard drive by writing errors to the log file until there is no free space left on it (first problem). And I can't handle this error (second problem), so I can't tell my application to restart.

Now, to get control over the situation, I monkeypatch lxml.etree._BaseErrorLog so that every error kills the process. But this is a very dirty hack.

So I have two questions:
1) How can I handle a memory error from the lxml parser?
2) How can I set lxml to use Python logging so I can control it?

Revision history for this message
scoder (scoder) wrote :

Ok, this is starting to turn into a help request. The mailing list is the right place to ask for help, not the bug tracker. Also, reading the documentation helps in this case.
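
For reference, the documented way to send libxml2 messages to the standard logging module looks roughly like this (the logger name is arbitrary):

    # Route libxml2/lxml parser messages into Python's logging module via the
    # documented PyErrorLog; the logger name "lxml" is arbitrary.
    import logging
    from lxml import etree

    logging.basicConfig(level=logging.DEBUG)
    etree.use_global_python_log(etree.PyErrorLog(logger_name="lxml"))

    try:
        etree.fromstring("<broken")
    except etree.XMLSyntaxError:
        pass  # the parse errors are now reported through the "lxml" logger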

I'm setting this back to "incomplete" as I cannot reproduce the problem on my side and you still did not provide code that shows the memory problem.

Changed in lxml:
status: New → Incomplete
Revision history for this message
scoder (scoder) wrote :

Not related to lxml but to a specific error logging setup.

Changed in lxml:
status: Incomplete → Invalid