lxml

strange title text due to the position of meta charset

Bug #1613969 reported by Shun-ichi Goto on 2016-08-17

This bug affects 2 people

Affects  Status     Importance  Assigned to  Milestone
lxml     Confirmed  Medium      Unassigned

Bug Description

Using: Python 2.7.11 on win32 (32-bit), lxml 3.6.1 (details are at the end of this message)

While parsing old-style HTML with lxml.html.parse(),
I got a strange string or an error when extracting the title text with root.cssselect('title')[0].text.
After some testing, I figured out that the cause is the order of <title> and <meta charset=xxx>.
The result is good when meta charset appears before title,
but I get strange text or an exception when title comes before meta.

Here is an example (both files are UTF-8):

---[good case: ok.html]---
<!DOCTYPE html>
<html lang="ja" class="col2r">
  <head>
    <meta charset="UTF-8" />
    <title>漢字テキスト</title>
  </head>
  <body>
    hello
  </body>
</html>
-----------------

---[bad case: ng.html]---
<!DOCTYPE html>
<html lang="ja" class="col2r">
  <head>
    <title>漢字テキスト</title>
    <meta charset="UTF-8" />
  </head>
  <body>
    hello
  </body>
</html>
----------------

This is the actual result in ipython:

----
In [1]: lxml.html.parse(file('ok.html')).getroot().cssselect('title')[0].text
Out[1]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8' ## <-- good byte sequence

In [2]: lxml.html.parse(file('ng.html')).getroot().cssselect('title')[0].text
Out[2]: u'\xe6\xbc\xa2\xe5\xad\x97\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88' # <-- bad!!

In [3]: lxml.html.parse(file('ok.html')).getroot().cssselect('title')[0].text.encode('utf-8')
Out[3]: '\xe6\xbc\xa2\xe5\xad\x97\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88' # <- hmm, same to bad seq
----

As results [2] and [3] show, the bad case seems to use the raw UTF-8 byte sequence as a unicode string,
as if the encoding were latin-1 (a raw 8-bit charset).
I don't know whether the parse result of ng.html should be the same as that of ok.html.
BTW, with other title text, the behaviour differs between parsing from a file() object and from a StringIO() object:
the former behaves like the ng case above, while the latter raises UnicodeDecodeError when retrieving the string via the text property.
I don't know why.
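
The latin-1 observation above can be checked without lxml. The following is a minimal sketch, written for Python 3 (the transcript above is Python 2), and the helper name `repair_mojibake` is made up for illustration. It reproduces the string from Out[2] by decoding the UTF-8 bytes as latin-1, and reverses the damage by going the other way. It is only a heuristic for detecting the ng case, not something lxml itself provides.

```python
def repair_mojibake(text):
    """Undo 'UTF-8 bytes decoded as latin-1' damage, if that is what text is.

    Hypothetical helper: returns the repaired string when the round trip
    succeeds, otherwise returns the input unchanged.
    """
    try:
        # latin-1 maps code points 0-255 back to the same byte values,
        # so this recovers the original bytes of a mis-decoded string.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters above U+00FF (already decoded
        # correctly) or the bytes are not valid UTF-8 (genuine latin-1).
        return text


good = "漢字テキスト"
# Simulate what the parser appears to do in the ng case.
bad = good.encode("utf-8").decode("latin-1")

print(repr(bad))
print(repair_mojibake(bad) == good)
```

Correctly decoded text passes through unchanged, so the helper can be applied to the title from either file.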

---[version information]---
Python : sys.version_info(major=2, minor=7, micro=11, releaselevel='final', serial=0)
lxml.etree : (3, 6, 1, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
-----------------------------
(lxml is installed as win32 binary whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/)

Shun-ichi Goto (shunichi-goto) wrote on 2016-08-17:

ok/ng html files (26.0 KiB, application/zip)

Shun-ichi Goto (shunichi-goto) wrote on 2016-08-17:

Is this documented behaviour?
- http://lxml.de/parsing.html#parsing-html
- http://lxml.de/parsing.html#python-unicode-strings

How can I detect and work around such an ng case?

Shun-ichi Goto (shunichi-goto) wrote on 2016-08-17:

I got the expected result by using lxml.html.HTMLParser in both cases.
Is this a valid workaround?

----[test in ipython]---
In [1]: p = lxml.html.HTMLParser()

In [2]: p.feed(file('ok.html').read())

In [3]: p.close().cssselect('title')[0].text
Out[3]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8'

In [4]: p = lxml.html.HTMLParser()

In [5]: p.feed(file('ng.html').read())

In [6]: p.close().cssselect('title')[0].text
Out[6]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8'
----
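
A variation on the transcript above, for the case where the file's encoding is known in advance: the parser can be given the encoding explicitly, so the position of the meta tag should no longer matter. This is a sketch for Python 3 that assumes the input is UTF-8; `read_title` is a name made up for illustration, while the `encoding` keyword of `lxml.html.HTMLParser` and the `parser` argument of `lxml.html.parse()` are part of lxml.

```python
def read_title(path, encoding="utf-8"):
    """Return the <title> text of the HTML file at path.

    Hypothetical helper; assumes the caller knows the file's encoding.
    """
    # Imported here so that this helper is the only part needing lxml.
    import lxml.html

    # An explicit encoding overrides the parser's own charset guessing.
    parser = lxml.html.HTMLParser(encoding=encoding)
    root = lxml.html.parse(path, parser=parser).getroot()
    return root.findtext(".//title")
```

Usage would be `read_title('ng.html')`, which is expected to give the same title as `read_title('ok.html')`.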

scoder (scoder) wrote on 2017-08-13:

Yes, that's a known problem. The HTML parser in libxml2 defaults to Latin-1 initially and only switches to the given charset when it sees the <meta> tag, without looking back.

It's often better to provide known encodings explicitly, or to run a separate encoding detection and decoding step first, before passing the HTML into the parser as Unicode text.

Would be good to find a way to improve this.
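
The "separate encoding detection and decoding step" mentioned above could look roughly like the following. This is only a sketch under stated assumptions: it is written for Python 3, the names `sniff_meta_charset` and `decode_html` are made up for illustration, the regular expression is a simplification of real charset sniffing, and the 4096-byte scan window is an arbitrary choice. It looks for a charset in any meta tag near the start of the raw bytes, wherever that tag sits relative to the title, and decodes the whole document before the parser sees it.

```python
import codecs
import re

# Simplified pattern: matches <meta charset="..."> as well as
# <meta http-equiv=... content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw, default="utf-8"):
    """Return the codec name declared in a meta tag, or default."""
    match = META_CHARSET_RE.search(raw[:4096])
    if match is None:
        return default
    try:
        # Normalise the label and reject names Python has no codec for.
        return codecs.lookup(match.group(1).decode("ascii")).name
    except LookupError:
        return default


def decode_html(raw, default="utf-8"):
    """Decode raw HTML bytes to text using the sniffed charset."""
    return raw.decode(sniff_meta_charset(raw, default), errors="replace")


# The decoded text can then be handed to the parser as Unicode, e.g.:
#   root = lxml.html.fromstring(decode_html(open('ng.html', 'rb').read()))
```

A dedicated detection library would be more robust than this regular expression; the point is only that decoding happens before parsing.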