Freeze in parser.feed

Bug #1674545 reported by Miko on 2017-03-21
This bug affects 1 person
Affects: lxml
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have been debugging this issue for the last 4 days without success. It occurs consistently in my production environment every 1-3 days. While frozen, the program still uses CPU, strace shows no system calls coming from the program, and it still responds to signals such as USR2, which I used to get a backtrace during a freeze.

I am still collecting data about the circumstances surrounding the freeze, but since that is taking a long time, I am posting the problematic piece of code here first, in case the problem is the way I'm using lxml rather than lxml itself!

The docs specifically warn that parser.close() must be called before using parsed elements, otherwise the behavior is undefined. In my case, the code only closes the parser when it finds an element I'm interested in; otherwise I never use the target element (assigned to "res"), so it shouldn't matter whether I close the parser, correct?

### simplified sample (not intended to be run as it could take days)
import asyncio
import aiohttp
from lxml.etree import HTMLPullParser

link = 'https://www.booking.com'
headers_override = {}
params = {}
timeout = 15

try:
 get_req = await asyncio.wait_for(aiohttp.get(link, headers=headers_override, params=params, max_redirects=2), timeout)
 res = None

 if get_req.status == 200:
  parser = HTMLPullParser(tag='table')
  events = parser.read_events()
  async for data in get_req.content.iter_chunked(131072):
   # freezes on parser.feed... working on getting what "data" is in this case
   parser.feed(data)
   for event, ele in events:
    if ele.get('id') == 'maxotel_rooms':
     parser.close()
     res = ele
     break
except asyncio.TimeoutError as e:
 pass
except (aiohttp.errors.ClientResponseError, aiohttp.errors.ClientOSError, aiohttp.errors.ServerDisconnectedError) as e:
 pass
else:
 await get_req.release()

 # if res is not None ... use it

###
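
For anyone trying to reproduce this without the network, here is a minimal sketch of the same feed/read_events pattern, run on an in-memory document. The function name find_table, the sample HTML and the 16-byte chunk size are made up for illustration and are not from the production code. One difference from the sample above: there, the break only leaves the inner for loop, so the outer loop keeps calling parser.feed() after parser.close(). The sketch stops feeding once a match is found. Whether that has anything to do with the freeze is not established.

```python
from lxml import etree


def find_table(chunks, wanted_id):
    """Feed chunks to a pull parser; return the first <table> with wanted_id, or None.

    Hypothetical helper for illustration only.
    """
    parser = etree.HTMLPullParser(events=("end",), tag="table")

    def scan():
        # Drain the events collected so far and look for the wanted table.
        for _event, element in parser.read_events():
            if element.get("id") == wanted_id:
                return element
        return None

    found = None
    for data in chunks:
        parser.feed(data)
        found = scan()
        if found is not None:
            break  # leave the feeding loop too, not just the event loop

    # Finish the parse before the element is used.
    try:
        parser.close()
    except etree.XMLSyntaxError:
        pass  # e.g. no data was fed at all

    if found is None:
        found = scan()  # events that only became available on close()
    return found


# Made-up sample document, split into small chunks to mimic a streamed response.
html = (b"<html><body><table id='other'><tr><td>a</td></tr></table>"
        b"<table id='maxotel_rooms'><tr><td>b</td></tr></table>"
        b"<p>tail</p></body></html>")
chunks = [html[i:i + 16] for i in range(0, len(html), 16)]

result = find_table(chunks, "maxotel_rooms")
print(result.get("id") if result is not None else None)
```

Logging len(data) (or a hash of data) just before each parser.feed() call in the real program would be one cheap way to capture the chunk that triggers the hang without needing a debug build.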

I am attempting to run my program with a debug build of python in order to see the exact data that makes the freeze occur. I have seen a low-level gdb backtrace that at least shows that parser.feed is where the program is stuck. Getting a debug build of python working with a debug version of lxml has proven to be a lot of guess-and-check work... once I get it working, hopefully I can get a good core dump and see exactly what data is being fed that causes the parser to freeze!

Python : sys.version_info(major=3, minor=5, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 7, 3, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

Miko (fxrock2002) wrote :

I may have accidentally run the problematic code on a different setup. It may have been this:

lxml.etree : (3, 6, 4, 0)
and python 3.5.2

Once I confirm that it was in fact these versions that caused the freeze, and not the newer ones, I'll update this report.

Miko (fxrock2002) wrote :

OK, this issue does not occur with lxml.etree 3.7.3 and python 3.5.3. It was, however, occurring with lxml.etree 3.6.4 and python 3.5.2. My mistake for giving the wrong affected version in the original report.

scoder (scoder) wrote :

Closing as fixed as it appears to be working with a recent lxml version.

Changed in lxml:
status: New → Fix Released