lxml

segfault with lxml.html.fromstring

Bug #966761 reported by Guillaume VIRY on 2012-03-28

This bug affects 1 person

Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned

Bug Description

Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree : (2, 3, 4, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

This bug started occurring with version 2.3, on Windows only as far as I can see (my code works fine on Linux, on both Debian and Arch Linux).

Let's consider some raw HTML data like the following.

data = """<html>
<meta http-equiv="Content-Type" content="text/html; charset=euc-jp"/>
<body>
<p>Some string here in eucjp (japanese)</p>
</body>
</html>"""

If the string is encoded in UTF-8, everything is fine (even though that encoding differs from the one declared in charset).
If the string is encoded as declared (EUC-JP), I get a segfault after calling

lxml.html.fromstring(data)

Converting data to unicode makes things work, but I remember that former versions of lxml used to detect the encoding when available.
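
The workaround mentioned above (decoding to unicode before parsing) can be sketched as follows. This is a minimal Python 3 sketch, not lxml's own detection logic: the helper name `decode_html`, the 2048-byte sniffing window, and the UTF-8 fallback are all assumptions made for illustration.

```python
import re

# Matches the charset value in a <meta> declaration such as
# content="text/html; charset=euc-jp" (illustrative pattern, not exhaustive).
_META_CHARSET = re.compile(br'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def decode_html(raw, fallback="utf-8"):
    """Decode raw HTML bytes using the charset declared near the top of the page.

    Hypothetical helper: falls back to `fallback` when no charset is
    declared, the codec is unknown, or the bytes do not match the declaration.
    """
    match = _META_CHARSET.search(raw[:2048])
    name = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(name)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or mismatching bytes: decode leniently instead.
        return raw.decode(fallback, errors="replace")
```

The returned string can then be handed to lxml.html.fromstring, which is the case reported above as working.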

Guillaume VIRY (guillaume-viry) wrote on 2012-03-29:

#1

A small correction: that behavior appears in 2.3.1, not 2.3.0. The latter does not segfault.

scoder (scoder) wrote on 2012-04-20:

#2

Could you provide proper test data for this?

Guillaume VIRY (guillaume-viry) wrote on 2012-04-21:

#3

Actually, here is a small piece of code that works with 2.3.0 and segfaults with later versions:

import urllib2, html.lxml
response = urllib2.urlopen("http://mixi.jp")
html = response.read()
doc = lxml.html.fromstring(html)

Before, the fromstring method was able to detect the EUC-JP encoding and create the etree object without any problem.
Now it seems it no longer does. The code above would work if html is encoded in UTF-8 (no matter the "encoding" specified in the HTML), or else with a unicode variable.
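
For anyone who does not want to depend on automatic detection, one option is to state the encoding explicitly through a parser object. The sketch below uses lxml.html.HTMLParser with its `encoding` argument; the function name `parse_html_bytes` is made up for illustration, and nothing in this report confirms that this avoids the crash on the affected Windows builds.

```python
def parse_html_bytes(raw, encoding):
    """Parse raw HTML bytes with an explicitly chosen encoding.

    Hypothetical helper: `encoding` overrides whatever the page declares.
    """
    # Imported inside the function so this sketch can be loaded
    # even where lxml is not installed.
    import lxml.html

    parser = lxml.html.HTMLParser(encoding=encoding)
    return lxml.html.fromstring(raw, parser=parser)


# Example call, assuming `raw` holds the undecoded bytes of an EUC-JP page:
# doc = parse_html_bytes(raw, "euc-jp")
```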

Guillaume VIRY (guillaume-viry) wrote on 2012-04-21:

#4

Sorry, I guess you spotted the mistake yourself; it should of course be

import lxml.html

Guillaume VIRY (guillaume-viry) wrote on 2012-07-30:

#5

Was there any problem reproducing this?

scoder (scoder) wrote on 2012-07-30:

#6

Given the page you mentioned, the following works for me (at least in the latest master branch):

>>> import lxml.html as h
>>> with open("page.html") as f:
...     el = h.fromstring(f.read())  # same with parse()
...
>>> from lxml import etree
>>> print(etree.tostring(el, method="text", encoding='utf-8'))

scoder (scoder) wrote on 2012-07-30:

#7

Oh, and I can't test it on Windows, sorry (the above was on Linux). Where did you get your installation from?

Guillaume VIRY (guillaume-viry) wrote on 2012-07-30:

#8

So far, I don't remember having any problem on Linux, perhaps because my consoles are already set up for UTF-8, which is not the case on my Windows installs (Japanese and French, so cp932 and latin-1 encodings).
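
To compare the Linux and Windows setups described above, the encodings Python itself reports can be printed on each machine. This is only a diagnostic sketch using the standard library; the function name `encoding_report` is made up for illustration, and whether the console encoding is actually related to the crash is not established in this report.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python reports for this environment."""
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
        # sys.stdout may be replaced by an object without an encoding attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in sorted(encoding_report().items()):
    print("%-10s %s" % (name, value))
```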

As for where I got my lxml build from, it came from this site:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
With a 64-bit OS, it was pretty troublesome to set up a 64-bit compile chain based on MinGW, so my choice was either to stay on 2.3 using the version available on PyPI or to give those builds a try.

scoder (scoder) on 2012-09-29

Changed in lxml:
status:	New → Triaged

scoder (scoder) wrote on 2017-09-19:

#9

Closing. Since this ticket hasn't been updated for 5 years, I'll just assume that it was fixed in libxml2 since then.

Changed in lxml:
status:	Triaged → Invalid
