strange title text due to the position of meta charset

Bug #1613969 reported by Shun-ichi Goto on 2016-08-17
This bug affects 2 people
Affects: lxml
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

Using: Python 2.7.11 on win32 (32-bit), lxml 3.6.1 (details are at the end of this message).

While parsing old-style HTML with lxml.html.parse(),
I got a strange string or an error when extracting the title text via root.cssselect('title')[0].text.
After some testing, I figured out that the cause is the order of <title> and <meta charset=xxx>.
The result is good when meta charset appears before title,
but I get garbled text or an exception when title comes before meta.

Here is an example (both files are UTF-8):

---[good case: ok.html]---
<!DOCTYPE html>
<html lang="ja" class="col2r">
  <head>
    <meta charset="UTF-8" />
    <title>漢字テキスト</title>
  </head>
  <body>
    hello
  </body>
</html>
-----------------

---[bad case: ng.html]---
<!DOCTYPE html>
<html lang="ja" class="col2r">
  <head>
    <title>漢字テキスト</title>
    <meta charset="UTF-8" />
  </head>
  <body>
    hello
  </body>
</html>
----------------

This is the actual result in ipython:

----
In [1]: lxml.html.parse(file('ok.html')).getroot().cssselect('title')[0].text
Out[1]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8' ## <-- good byte sequence

In [2]: lxml.html.parse(file('ng.html')).getroot().cssselect('title')[0].text
Out[2]: u'\xe6\xbc\xa2\xe5\xad\x97\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88' # <-- bad!!

In [3]: lxml.html.parse(file('ok.html')).getroot().cssselect('title')[0].text.encode('utf-8')
Out[3]: '\xe6\xbc\xa2\xe5\xad\x97\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88' # <- hmm, same to bad seq
----

As results [2] and [3] show, the bad case seems to use the raw UTF-8 byte sequence as a unicode string,
as if the encoding were Latin-1 (a raw 8-bit charset).
I don't know whether the parse result of ng.html should be the same as that of ok.html.
BTW, with another title text, the behaviour differs depending on whether I parse from a file() object or a StringIO() object.
The former behaves like the ng case above; the latter raises UnicodeDecodeError when retrieving the string via the text property.
I don't know why.
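
The relationship between results [2] and [3] can be reproduced without lxml. The following is a minimal sketch in Python 3 syntax (the report itself uses Python 2); the variable names are illustrative only. It encodes the title as UTF-8 and then decodes those bytes as Latin-1, which yields one code point per byte, the same 18-character string seen in Out[2].

```python
# Python 3 sketch; the title is the one used in ok.html / ng.html.
title = "\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8"

# The file on disk holds the UTF-8 encoding of the title (18 bytes).
utf8_bytes = title.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 never fails: every byte value maps
# to the code point with the same number, so 18 bytes become 18 characters.
misdecoded = utf8_bytes.decode("latin-1")

print(len(title), len(utf8_bytes), len(misdecoded))
print(ascii(misdecoded))
```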

---[version information]---
Python : sys.version_info(major=2, minor=7, micro=11, releaselevel='final', serial=0)
lxml.etree : (3, 6, 1, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
-----------------------------
(lxml is installed as win32 binary whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/)

Shun-ichi Goto (shunichi-goto) wrote :

Is this documented behaviour?
- http://lxml.de/parsing.html#parsing-html
- http://lxml.de/parsing.html#python-unicode-strings

How can I detect and work around such an ng case?
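
One possible after-the-fact check, sketched in Python 3 under the assumption that the damage is exactly "UTF-8 bytes read as Latin-1": re-encode the suspicious string as Latin-1 and try to decode the result as UTF-8. The helper name repair_latin1_mojibake is hypothetical and not part of lxml. This is a heuristic, since a genuine Latin-1 string could in rare cases also happen to be valid UTF-8.

```python
def repair_latin1_mojibake(text):
    """Return text re-decoded as UTF-8 if it looks like UTF-8 read as Latin-1.

    Hypothetical helper for illustration; returns the input unchanged
    whenever the round trip is not possible.
    """
    try:
        # Only strings whose code points are all <= U+00FF can be
        # the result of a Latin-1 decode.
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        return text
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so this was probably real Latin-1 text.
        return text


# The garbled title from the ng.html case.
bad = "\xe6\xbc\xa2\xe5\xad\x97\xe3\x83\x86\xe3\x82\xad\xe3\x82\xb9\xe3\x83\x88"
fixed = repair_latin1_mojibake(bad)
```

A string is probably affected when the helper returns something different from its input.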

Shun-ichi Goto (shunichi-goto) wrote :

I got the expected result by using lxml.html.HTMLParser in both cases.
Is this a valid workaround?

----[test in ipython]---
In [1]: p = lxml.html.HTMLParser()

In [2]: p.feed(file('ok.html').read())

In [3]: p.close().cssselect('title')[0].text
Out[3]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8'

In [4]: p = lxml.html.HTMLParser()

In [5]: p.feed(file('ng.html').read())

In [6]: p.close().cssselect('title')[0].text
Out[6]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8'
----

scoder (scoder) wrote :

Yes, that's a known problem. The HTML parser in libxml2 defaults to Latin-1 initially and only switches to the given charset when it sees the <meta> tag, without looking back.

It's often better to provide known encodings explicitly, or to run a separate encoding detection and decoding step first, before passing the HTML into the parser as Unicode text.
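
A rough sketch of such a separate detection and decoding step, in Python 3 and using only the standard library. The function name decode_html, the default of UTF-8, and the 4096-byte scan window are assumptions made for illustration; real-world detection (BOMs, HTTP headers, statistical guessing) is more involved than this regex.

```python
import codecs
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def decode_html(raw, default="utf-8", scan_bytes=4096):
    """Decode raw HTML bytes using the declared meta charset, wherever
    in the first scan_bytes bytes it appears (hypothetical helper)."""
    encoding = default
    match = _META_CHARSET.search(raw[:scan_bytes])
    if match:
        candidate = match.group(1).decode("ascii")
        try:
            codecs.lookup(candidate)  # reject names Python does not know
            encoding = candidate
        except LookupError:
            pass
    # Undecodable bytes become U+FFFD instead of raising.
    return raw.decode(encoding, errors="replace")
```

The returned string can then be handed to the parser as Unicode text, for example via lxml.html.fromstring(). When the encoding is already known, the other option mentioned above is to pass it explicitly, for example with the encoding argument of HTMLParser.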

Would be good to find a way to improve this.

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed

>> The HTML parser in libxml2 defaults to Latin-1 initially and only switches to the given charset when it sees the <meta> tag, without looking back.

In my experience it doesn't switch to the right charset even after <meta>.
