segfault with lxml.html.fromstring

Bug #966761 reported by Guillaume VIRY
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned

Bug Description

Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree : (2, 3, 4, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

This bug started occurring with version 2.3, on Windows only as far as I can see (the same code works fine on Linux, on both Debian and Arch Linux).

Let's consider some raw HTML data like the following.

data = """<html>
<meta http-equiv="Content-Type" content="text/html; charset=euc-jp"/>
<body>
<p>Some string here in eucjp (japanese)</p>
</body>
</html>"""

If the string is encoded in UTF-8, everything is fine (even though that differs from the encoding declared in charset).
If the string is encoded as declared (EUC-JP), I get a segfault after calling

lxml.html.fromstring(data)

Converting data to unicode makes things work, but I remember that earlier versions of lxml used to detect the encoding when it was declared.
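The unicode workaround mentioned above can be sketched roughly as follows. This is only an illustration, written for Python 3, and not what lxml does internally: the helper names (sniff_charset, to_unicode, parse_html_bytes) are hypothetical, and the regular expression is a crude approximation of charset detection rather than a full HTML encoding sniffer.

```python
import re

# Rough pattern for a 'charset=...' declaration in raw bytes (an assumption,
# not a complete implementation of HTML encoding detection).
_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)


def sniff_charset(raw, default="utf-8"):
    """Return the first declared charset in the first 2048 bytes, else default."""
    match = _CHARSET_RE.search(raw[:2048])
    return match.group(1).decode("ascii").lower() if match else default


def to_unicode(raw, default="utf-8"):
    """Decode raw HTML bytes with the declared charset, falling back to default."""
    charset = sniff_charset(raw, default)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The declared name is not a codec Python knows about.
        return raw.decode(default, errors="replace")


def parse_html_bytes(raw):
    """Hand lxml an already-decoded string instead of encoded bytes."""
    import lxml.html  # imported lazily so the helpers above work without lxml
    return lxml.html.fromstring(to_unicode(raw))
```

With this, parse_html_bytes(data) would pass lxml.html.fromstring a decoded string, which is the case reported as working in this description.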

Tags: html
Guillaume VIRY (guillaume-viry) wrote :

A small correction: this behavior appears in 2.3.1, not 2.3.0. The latter does not segfault.

scoder (scoder) wrote :

Could you provide proper test data for this?

Guillaume VIRY (guillaume-viry) wrote :

Actually, here is a small piece of code that works with 2.3.0 and segfaults with later versions:

import urllib2, html.lxml
response = urllib2.urlopen("http://mixi.jp")
html = response.read()
doc = lxml.html.fromstring(html)

Before, the fromstring method was able to detect the EUC-JP encoding and create the etree object without any problem.
Now it seems it no longer can. The code above works if html is encoded in UTF-8 (regardless of the encoding declared in the HTML), or if it is a unicode object.
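For a page fetched over HTTP, one way to obtain a unicode object before parsing is to decode the body using the charset from the Content-Type response header. The following is a minimal Python 3 sketch under that assumption; the helper names are hypothetical, and whether a given server (including the one above) actually sends a charset in its header is not something this thread confirms.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the lowercased charset, or default if absent.
    return msg.get_content_charset(default)


def decode_body(body, content_type, default="utf-8"):
    """Decode response bytes using the header's charset, else the default."""
    charset = charset_from_content_type(content_type, default) if content_type else default
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # The header named a codec Python does not know.
        return body.decode(default, errors="replace")


def fetch_text(url):
    """Fetch a URL and return its body as a decoded string."""
    with urlopen(url) as response:
        return decode_body(response.read(), response.headers.get("Content-Type"))
```

The string returned by fetch_text(url) could then be passed to lxml.html.fromstring in place of the raw bytes.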

Guillaume VIRY (guillaume-viry) wrote :

Sorry, I guess you spotted the mistake yourself; it should of course be

import lxml.html

Guillaume VIRY (guillaume-viry) wrote :

Was there any problem reproducing this?

scoder (scoder) wrote :

Given the page you mentioned, the following works for me (at least on the latest master branch):

>>> import lxml.html as h
>>> with open("page.html") as f:
...     el = h.fromstring(f.read())  # same with parse()
...
>>> from lxml import etree
>>> print(etree.tostring(el, method="text", encoding='utf-8'))

scoder (scoder) wrote :

Oh, and I can't test it on Windows, sorry (the above was on Linux). Where did you get your installation from?

Guillaume VIRY (guillaume-viry) wrote :

So far, I don't remember having any problems on Linux, perhaps because my consoles are already set up for UTF-8, which is not the case on my Windows installs (Japanese and French, so cp932 and latin-1 encodings).

As for where I got my lxml build, it came from this site:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
Since I have a 64-bit OS, and getting a 64-bit MinGW-based compile chain working was quite troublesome, my choice was either to stay on 2.3 using the version available on PyPI or to try those builds.

scoder (scoder)
Changed in lxml:
status: New → Triaged
scoder (scoder) wrote :

Closing. Since this ticket hasn't been updated for 5 years, I'll just assume that it was fixed in libxml2 since then.

Changed in lxml:
status: Triaged → Invalid