segfault with lxml.html.fromstring
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | Invalid | Undecided | Unassigned | | |
Bug Description
Python : sys.version_
lxml.etree : (2, 3, 4, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
This bug started occurring with version 2.3, on Windows only as far as I can see (the same code works fine on Linux, on both Debian and Arch Linux).
Consider some raw HTML data like the following.
data = """<html>
<meta http-equiv=
<body>
<p>Some string here in eucjp (japanese)</p>
</body>
</html>"""
If the string is encoded in UTF-8, everything is fine (even though that encoding differs from the one declared in charset).
If the string is encoded as expected (euc-jp), I get a segfault after doing
lxml.html.
Converting data to unicode makes things work, but I remember that earlier versions of lxml used to detect the encoding when it was available.
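For anyone hitting the same crash, the unicode workaround might look roughly like the sketch below. The sample document, its charset declaration, and the helper name decode_html are assumptions made for illustration, not the data from this report; the only lxml call used is lxml.html.fromstring, and it is skipped if lxml is not installed.

```python
# Minimal sketch of the workaround: decode the bytes yourself, then give
# lxml a unicode string so it does not have to detect the encoding.

# Hypothetical sample document (an assumption, not the reporter's data),
# encoded as EUC-JP to mimic the failing input.
RAW = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=euc-jp"></head>'
    '<body><p>\u65e5\u672c\u8a9e</p></body></html>'
).encode("euc-jp")


def decode_html(raw, encoding="euc-jp"):
    """Hypothetical helper: turn raw bytes into a unicode string."""
    return raw.decode(encoding)


text = decode_html(RAW)

try:
    import lxml.html  # third-party; assumed to be installed
except ImportError:
    root = None  # lxml missing: only the decoding step is demonstrated
else:
    # Parse the already-decoded text instead of the raw EUC-JP bytes.
    root = lxml.html.fromstring(text)
    print(root.findtext(".//p"))
```

This only sidesteps the crash when the encoding is known in advance; it does not restore the automatic detection described above.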
Changed in lxml: | |
status: | New → Triaged |
A small correction: this behavior appears in 2.3.1, not 2.3.0. The latter does not segfault.