lxml HTML parser mangles documents whose <meta> tags define the charset as other than UTF-8

Reported by Leonard Richardson on 2012-04-03
This bug affects 4 people
Affects: Beautiful Soup (Importance: Undecided, Assigned to: Unassigned)
Affects: beautifulsoup4 (Ubuntu) (Importance: Undecided, Assigned to: Unassigned)
Nominated for Precise by Dmitry Shachnev

Bug Description

---
from bs4 import BeautifulSoup

markup = '''<?xml version="1.0" encoding="ISO-8859-1"?>
                <!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
                <head>
                <title>test</title>
                <meta http-equiv="content-type" content="text/html;
charset=ascii" />
                <meta http-equiv="expires" content="-1" />
                <meta http-equiv="cache-control" content="no-cache" />
                <meta http-equiv="pragma" content="no-cache" />
                </head>
                <body >
                <p>Actual content.</p>
                </body>
                </html>'''

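# Output #1: the <meta> tag declares the charset as ascii.
soup = BeautifulSoup(markup, "lxml")
print(soup.prettify())

# Output #2: same document with the <meta> charset changed to utf8.
# No parser is named here, so Beautiful Soup picks its default
# (lxml when it is installed).
soup = BeautifulSoup(markup.replace("ascii", "utf8"))
print(soup.prettify())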
---

Output #1:

<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html>
 <head>
  <title>
   test
  </title>
  <meta content="text/html;
charset=utf-8" http-equiv="content-type"/>
  <meta content="-1" http-equiv="expires"/>
  <meta content="no-cache" http-equiv="cache-control"/>
  <meta content="no-cache" http-equiv="pragma"/>
 </head>
 <body>
  <p>
   / h e a d &gt;
                                                                   b o d y &gt;
                                                                   p &gt; A c t u a l c o n t e n t . / p &gt;
                                                                   / b o d y &gt;
                                                                   / h t m l &gt;
  </p>
 </body>
</html>

---

Output #2:

<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html>
 <head>
  <title>
   test
  </title>
  <meta content="text/html;
charset=utf-8" http-equiv="content-type"/>
  <meta content="-1" http-equiv="expires"/>
  <meta content="no-cache" http-equiv="cache-control"/>
  <meta content="no-cache" http-equiv="pragma"/>
 </head>
 <body>
  <p>
   Actual content.
  </p>
 </body>
</html>

---

This problem does not occur when parsing the document with lxml's XML parser.

I believe this is a problem with lxml's feed interface. I already know there's a bug in feed() when it is given Unicode data (bug 963936), but I haven't heard back from the developers since filing that bug.

If this is the same bug or a related one, that would explain why the problem only happens when the HTML document contains a <meta> tag defining the charset as something other than UTF-8. The HTML parser probably has some code for rewriting the <meta> tag, and that code doesn't play well with Unicode data. The XML parser doesn't trigger the bug because it leaves the <meta> tag alone.

Changing the lxml HTML tree builder to remove the 963936 workaround fixes the problem.

Leonard Richardson (leonardr) wrote :

For the record, this happened when the data was split into 512-byte chunks and fed into feed() one chunk at a time. The fix was to pass all the data in at once.
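
To make the two feeding patterns concrete, here is a minimal sketch and not the actual Beautiful Soup tree builder code. The helper names (iter_chunks, feed_in_chunks, feed_all_at_once, compare_with_lxml) are made up for illustration. The only assumption about the parser is that it has feed() and close(), which lxml's etree.HTMLParser does.

```python
def iter_chunks(data, size=512):
    """Yield successive slices of `data`, each at most `size` items long.

    Items are bytes for a bytes input and characters for a str input.
    """
    for start in range(0, len(data), size):
        yield data[start:start + size]


def feed_in_chunks(parser, data, size=512):
    """Feed `data` to the parser one chunk at a time (the failing pattern)."""
    for chunk in iter_chunks(data, size):
        parser.feed(chunk)
    return parser.close()


def feed_all_at_once(parser, data):
    """Feed `data` to the parser in a single call (the pattern used by the fix)."""
    parser.feed(data)
    return parser.close()


def compare_with_lxml(markup):
    """Parse `markup` both ways with lxml's HTML parser and serialize both trees.

    Hypothetical helper: requires lxml, and whether the two results differ
    depends on the installed lxml version.
    """
    from lxml import etree  # imported lazily so the helpers above need no lxml
    chunked = feed_in_chunks(etree.HTMLParser(), markup)
    whole = feed_all_at_once(etree.HTMLParser(), markup)
    return etree.tostring(chunked), etree.tostring(whole)
```

Calling compare_with_lxml(markup) with the markup from the bug description should show whether a given lxml install is affected. I have not checked which versions are.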

summary: - lxml parser mangles documents whose <meta> tags define the charset as
- other than UTF-8
+ lxml HTML parser mangles documents whose <meta> tags define the charset
+ as other than UTF-8
Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Michael Pitra (mortomanos) wrote :

I can reproduce the issue. It seems to have been introduced with the 512-byte chunks you mentioned (changelog from 4.0.1 to 4.0.2 in lxml parser).
Calling str.replace('ISO-8859-1', 'utf-8') on the markup before parsing makes it go away.
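
A slightly more general form of that workaround, as a rough sketch only: the regular expressions and the function names (declared_encodings, force_utf8_declarations) are my own and not part of Beautiful Soup or lxml. It assumes the markup is already a decoded text string, so only the declarations are rewritten and the data itself is not transcoded.

```python
import re

# encoding="..." inside an XML declaration such as <?xml version="1.0" encoding="ISO-8859-1"?>
XML_DECL_RE = re.compile(r'(<\?xml[^>]*?encoding=["\'])([^"\']+)(["\'])', re.I)
# charset=... anywhere inside a <meta> tag, quoted or not
META_CHARSET_RE = re.compile(r'(<meta[^>]*?charset=["\']?\s*)([\w.:-]+)', re.I)


def declared_encodings(markup):
    """Return the encodings named by the XML declaration and by <meta> tags."""
    found = [m.group(2) for m in XML_DECL_RE.finditer(markup)]
    found += [m.group(2) for m in META_CHARSET_RE.finditer(markup)]
    return found


def force_utf8_declarations(markup):
    """Rewrite every declared encoding to utf-8, leaving the rest untouched."""
    markup = XML_DECL_RE.sub(r'\g<1>utf-8\g<3>', markup)
    return META_CHARSET_RE.sub(r'\g<1>utf-8', markup)
```

declared_encodings() can be used to check whether a document declares anything other than UTF-8 before deciding to rewrite it.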

Changed in beautifulsoup (Ubuntu):
status: New → Fix Released
Bernhard Reiter (ockham-razor) wrote :
affects: beautifulsoup (Ubuntu) → beautifulsoup4 (Ubuntu)
Bernhard Reiter (ockham-razor) wrote :

Okay, obviously this isn't as trivial as the patch I posted suggests.
http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/revision/323#NEWS.txt suggests that we need some more changes.

Bernhard Reiter (ockham-razor) wrote :

From an email conversation with leonardr:

The relevant changes should all be in revno 305. The diff is very large (~800 lines) and I don't know how well it would apply to 4.0.2, but that's where to look for it.

Actually, the changes are extensive enough that if you applied them to 4.0.2 I wouldn't feel comfortable calling the result "4.0.2." The change involves API changes, most notably to the UnicodeDammit class. That's why this release was called 4.3.0 instead of 4.2.2. I don't know how you deal with such things, but I wanted you to know.

Bernhard Reiter (ockham-razor) wrote :

People on #ubuntu-bugs suggest this is too big for an SRU; the alternative would be a backport.
(Not sure if I'm going to tackle this then.)
