strange title text due to the position of meta charset

Bug #1613969 reported by Shun-ichi Goto
This bug affects 2 people
Affects  Status     Importance  Assigned to  Milestone
lxml     Confirmed  Medium      Unassigned

Bug Description

Using: Python 2.7.11 on win32 (32-bit), lxml 3.6.1 (details are at the end of this message)

While parsing old-style HTML with lxml.html.parse(),
I got a strange string or an error when extracting the title text with root.cssselect('title')[0].text.
After some testing, I found that the cause is the order of <title> and <meta charset=xxx>.
The result is correct when the meta charset appears before the title,
but I get garbled text or an exception when the title comes before the meta.

Here is an example (both files are UTF-8):

---[good case: ok.html]---
<!DOCTYPE html>
<html lang="ja" class="col2r">
  <head>
    <meta charset="UTF-8" />
    <title>漢字テキスト</title>
  </head>
  <body>
    hello
  </body>
</html>
-----------------

---[bad case: ng.html]---
<!DOCTYPE html>
<html lang="ja" class="col2r">
  <head>
    <title>漢字テキスト</title>
    <meta charset="UTF-8" />
  </head>
  <body>
    hello
  </body>
</html>
----------------

This is the actual result in ipython:

----
In [1]: lxml.html.parse(file('ok.html')).getroot().cssselect('title')[0].text
Out[1]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8' ## <-- good byte sequence

In [2]: lxml.html.parse(file('ng.html')).getroot().cssselect('title')[0].text
Out[2]: u'\xe6\xbc\xa2\xe5\xad\x97\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88' # <-- bad!!

In [3]: lxml.html.parse(file('ok.html')).getroot().cssselect('title')[0].text.encode('utf-8')
Out[3]: '\xe6\xbc\xa2\xe5\xad\x97\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88' # <- hmm, same to bad seq
----

As results [2] and [3] show, the bad case seems to use the raw UTF-8 byte sequence as a unicode string,
as if the encoding were latin-1 (a raw 8-bit charset).
I don't know whether the parse result of ng.html should be the same as that of ok.html.
BTW, with a different title text, the behaviour differs depending on whether I parse from a file() object or a StringIO() object:
the former behaves like the ng case above, while the latter raises UnicodeDecodeError when retrieving the string via the text property.
I don't know why.
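
The latin-1 reading of results [2] and [3] can be checked without lxml. The sketch below is only an illustration: the helper names misdecode_as_latin1 and repair_mojibake are made up for this example and are not part of lxml. It reproduces the string from Out[2] by decoding the UTF-8 bytes as latin-1, and reverses the round trip as a possible way to detect and repair an ng-style result after the fact.

```python
# -*- coding: utf-8 -*-
# Expected title text, as shown in Out[1].
GOOD = u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8'


def misdecode_as_latin1(text):
    """Simulate the ng case: UTF-8 bytes read as if they were latin-1."""
    return text.encode('utf-8').decode('latin-1')


def repair_mojibake(text):
    """Hypothetical helper: undo a latin-1 mis-decoding of UTF-8 bytes.

    Returns the input unchanged when it cannot be such a mis-decoding
    (characters outside latin-1, or bytes that are not valid UTF-8).
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


bad = misdecode_as_latin1(GOOD)   # same code points as Out[2]
fixed = repair_mojibake(bad)      # back to the text from Out[1]
```

Plain ASCII text passes through unchanged. A short genuine latin-1 string whose bytes happen to form valid UTF-8 would be altered, so this is a heuristic and not a reliable test.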

---[version information]---
Python : sys.version_info(major=2, minor=7, micro=11, releaselevel='final', serial=0)
lxml.etree : (3, 6, 1, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
-----------------------------
(lxml is installed as win32 binary whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/)

Shun-ichi Goto (shunichi-goto) wrote :

Is this documented behaviour?
- http://lxml.de/parsing.html#parsing-html
- http://lxml.de/parsing.html#python-unicode-strings

How can I detect and work around such an ng case?

Shun-ichi Goto (shunichi-goto) wrote :

I got the expected result in both cases by using lxml.html.HTMLParser.
Is this a valid workaround?

----[test in ipython]---
In [1]: p = lxml.html.HTMLParser()

In [2]: p.feed(file('ok.html').read())

In [3]: p.close().cssselect('title')[0].text
Out[3]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8'

In [4]: p = lxml.html.HTMLParser()

In [5]: p.feed(file('ng.html').read())

In [6]: p.close().cssselect('title')[0].text
Out[6]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8'
----

scoder (scoder) wrote :

Yes, that's a known problem. The HTML parser in libxml2 defaults to Latin-1 initially and only switches to the given charset when it sees the <meta> tag, without looking back.

It's often better to provide known encodings explicitly, or to run a separate encoding detection and decoding step first, before passing the HTML into the parser as Unicode text.

Would be good to find a way to improve this.
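
A minimal sketch of the "separate detection and decoding step" suggested above, for illustration only. The helper names (sniff_charset, decode_html, parse_html_bytes) are hypothetical, and the regular expression is a deliberate simplification, not the full HTML encoding-sniffing algorithm. The point is that the charset is looked up over the whole head before anything is decoded, so the position of <meta> relative to <title> no longer matters.

```python
import codecs
import re

# Simplified pattern: finds charset=... inside any <meta ...> tag.
# This is an assumption for the sketch, not what libxml2 or browsers do.
_META_CHARSET = re.compile(
    br'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def sniff_charset(data, fallback='utf-8', scan_bytes=4096):
    """Return a Python codec name declared in the first scan_bytes bytes."""
    match = _META_CHARSET.search(data[:scan_bytes])
    if match:
        try:
            return codecs.lookup(match.group(1).decode('ascii')).name
        except LookupError:
            pass  # unknown charset label: fall through to the fallback
    return codecs.lookup(fallback).name


def decode_html(data, fallback='utf-8'):
    """Decode raw HTML bytes to Unicode text before any parsing happens."""
    return data.decode(sniff_charset(data, fallback), 'replace')


def parse_html_bytes(data, fallback='utf-8'):
    """Parse already-decoded text with lxml (imported lazily here)."""
    import lxml.html
    return lxml.html.fromstring(decode_html(data, fallback))
```

When the encoding is already known, the other option mentioned above is simpler: pass it explicitly, for example lxml.html.parse('ng.html', parser=lxml.html.HTMLParser(encoding='utf-8')).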

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed

Sergei Kholodilov (fat-crocodile) wrote :

>> The HTML parser in libxml2 defaults to Latin-1 initially and only switches to the given charset when it sees the <meta> tag, without looking back.

In my experience it doesn't switch to the right charset even after <meta>.
