lxml.html.fromstring: huge memory leak after parsing a few incorrectly formed HTML pages
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
System:
Linux Debian-
Versions:
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 2, 8, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
I'm writing a web spider and using the lxml library to parse pages.
I found that after some time my spider eats all system memory (8 GB) and the server goes down. While investigating this bug I found that the problem is in lxml.html.
When I added a module that checks the HTML markup for correctness, the bug went away.
I tried to reproduce this bug in standalone code, but failed. Sorry.
But there is definitely something not right.
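For anyone trying to narrow this down, a standalone check could look roughly like the sketch below. It is not the reporter's code: `measure_growth`, `check_lxml` and the sample markup are hypothetical names and data chosen for illustration. It parses the same malformed documents repeatedly and records the process's peak resident memory after each round, using the Unix-only `resource` module (which fits the Linux system described above). Peak RSS is used because allocations made inside libxml2 are not visible to Python-level tools.

```python
import gc
import resource


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_growth(parse, documents, rounds=50):
    """Parse every document `rounds` times; return one peak-RSS sample per round.

    `parse` is any callable taking a markup string, so the same harness
    can be run against different parsers or lxml versions.
    """
    samples = []
    for _ in range(rounds):
        for doc in documents:
            tree = parse(doc)
            del tree  # drop the only reference so the tree can be freed
        gc.collect()  # rule out uncollected reference cycles
        samples.append(peak_rss())
    return samples


def check_lxml(documents, rounds=50):
    """Run the harness against lxml.html.fromstring (imported lazily)."""
    import lxml.html
    return measure_growth(lxml.html.fromstring, documents, rounds)


# Hypothetical malformed inputs; real pages from the failing crawl
# would be far more useful here.
BROKEN_PAGES = [
    "<html><body><p>unclosed <b>bold</div></body>",
    "<table><tr><td>cell<td>cell</table></p></span>",
]
```

Calling `check_lxml(BROKEN_PAGES)` and comparing the first and last samples gives a rough signal: a peak that keeps climbing across rounds suggests memory is being retained, while a flat series suggests the leak lies elsewhere, for example in references the spider itself keeps to parsed trees.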
Changed in lxml:
status: Incomplete → New
> lxml.etree : (2, 2, 8, 0)
Could you try the latest stable lxml release, i.e. 2.3?
Stefan