lxml HTML parser mangles documents whose <meta> tags define the charset as other than UTF-8

Bug #972466 reported by Leonard Richardson
This bug affects 4 people
Affects                  Status        Importance  Assigned to  Milestone
Beautiful Soup           Fix Released  Undecided   Unassigned
beautifulsoup4 (Ubuntu)  Fix Released  Undecided   Unassigned
Nominated for Precise by Dmitry Shachnev

Bug Description

---
from bs4 import BeautifulSoup

markup = '''<?xml version="1.0" encoding="ISO-8859-1"?>
                <!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
                <head>
                <title>test</title>
                <meta http-equiv="content-type" content="text/html;
charset=ascii" />
                <meta http-equiv="expires" content="-1" />
                <meta http-equiv="cache-control" content="no-cache" />
                <meta http-equiv="pragma" content="no-cache" />
                </head>
                <body >
                <p>Actual content.</p>
                </body>
                </html>'''

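# Output #1: parse with lxml's HTML parser; the <meta> charset is "ascii".
soup = BeautifulSoup(markup, "lxml")
print(soup.prettify())

# Output #2: same document with the <meta> charset declared as UTF-8.
soup = BeautifulSoup(markup.replace("ascii", "utf8"), "lxml")
print(soup.prettify())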
---

Output #1:

<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html>
 <head>
  <title>
   test
  </title>
  <meta content="text/html;
charset=utf-8" http-equiv="content-type"/>
  <meta content="-1" http-equiv="expires"/>
  <meta content="no-cache" http-equiv="cache-control"/>
  <meta content="no-cache" http-equiv="pragma"/>
 </head>
 <body>
  <p>
   / h e a d &gt;
                                                                   b o d y &gt;
                                                                   p &gt; A c t u a l c o n t e n t . / p &gt;
                                                                   / b o d y &gt;
                                                                   / h t m l &gt;
  </p>
 </body>
</html>

---

Output #2:

<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html>
 <head>
  <title>
   test
  </title>
  <meta content="text/html;
charset=utf-8" http-equiv="content-type"/>
  <meta content="-1" http-equiv="expires"/>
  <meta content="no-cache" http-equiv="cache-control"/>
  <meta content="no-cache" http-equiv="pragma"/>
 </head>
 <body>
  <p>
   Actual content.
  </p>
 </body>
</html>

---

This problem does not occur when parsing the document with lxml's XML parser.

I believe this is a problem with lxml's feed interface. I already know there's a bug in feed() when it is given Unicode data (bug 963936), but I haven't heard back from the developers since filing that bug.

If this is the same bug or a related one, that would explain why the problem only happens when the HTML document contains a <meta> tag defining the charset as something other than UTF-8. The HTML parser probably has some code for rewriting the <meta> tag, which doesn't play well with Unicode data. The XML parser doesn't trigger the bug because it leaves the <meta> tag alone.

Changing the lxml HTML tree builder to remove the 963936 workaround fixes the problem.

Leonard Richardson (leonardr) wrote :

For the record, this happened when the data was split into 512-byte chunks and fed into feed() one chunk at a time. The fix was to pass all the data in at once.
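
The two feeding strategies described above can be sketched as follows. This is a minimal illustration, not the Beautiful Soup tree builder code: `feed_in_chunks`, `feed_all_at_once` and `TextCollector` are hypothetical names, and the standard library's `html.parser.HTMLParser` stands in for lxml's parser so the example runs without lxml installed. lxml's `etree.HTMLParser` exposes the same `feed()`/`close()` interface, so the helpers would accept it as well. A well-behaved parser gives the same result either way; the bug was that lxml did not when fed Unicode data.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Stand-in parser (stdlib, not lxml) that records the text it sees."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def text(self):
        return "".join(self.pieces).strip()


def feed_in_chunks(parser, data, chunk_size=512):
    """Pass data to parser.feed() one chunk_size slice at a time."""
    for start in range(0, len(data), chunk_size):
        parser.feed(data[start:start + chunk_size])
    parser.close()


def feed_all_at_once(parser, data):
    """Pass the whole document to parser.feed() in a single call."""
    parser.feed(data)
    parser.close()


markup = "<html><body><p>Actual content.</p></body></html>"

# A tiny chunk size forces tags and text to be split across feed() calls.
chunked = TextCollector()
feed_in_chunks(chunked, markup, chunk_size=8)

whole = TextCollector()
feed_all_at_once(whole, markup)

print(chunked.text() == whole.text())
```

Running both strategies over the same input and comparing the results is a cheap way to check whether a given parser is sensitive to chunk boundaries.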

summary: - lxml parser mangles documents whose <meta> tags define the charset as
- other than UTF-8
+ lxml HTML parser mangles documents whose <meta> tags define the charset
+ as other than UTF-8
Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Michael Pitra (mortomanos) wrote :

I can reproduce the issue, and it seems to have been introduced with the 512-byte chunks you mentioned (see the changelog from 4.0.1 to 4.0.2 for the lxml parser).
With str.replace('ISO-8859-1', 'utf-8') applied to the markup, the problem goes away.
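
A slightly more general form of that workaround might look like the sketch below. It assumes the markup is already a decoded Unicode string, so only the declared encoding needs changing, not the data itself. The function name `declare_utf8` and both regular expressions are illustrative assumptions, not part of Beautiful Soup or lxml.

```python
import re

# Encoding named in an XML declaration: <?xml version="1.0" encoding="...">
XML_DECL_ENCODING = re.compile(
    r'(<\?xml[^>]*?encoding=["\'])[^"\']+(["\'])', re.I)
# charset=... inside a <meta> tag (content attribute or charset attribute).
META_CHARSET = re.compile(
    r'(<meta[^>]*?charset=["\']?)[A-Za-z0-9_.:-]+', re.I)


def declare_utf8(markup):
    """Return markup with its declared encodings rewritten to utf-8.

    Only the declarations change; the text itself is not transcoded.
    """
    markup = XML_DECL_ENCODING.sub(r'\g<1>utf-8\g<2>', markup)
    return META_CHARSET.sub(r'\g<1>utf-8', markup)


sample = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
          '<meta http-equiv="content-type" '
          'content="text/html; charset=ascii" />')
fixed = declare_utf8(sample)
print(fixed)
```

The result could then be handed to BeautifulSoup(fixed, "lxml") as in the original report. This only sidesteps the symptom on affected versions; it does not address the underlying feed() problem.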

Changed in beautifulsoup (Ubuntu):
status: New → Fix Released
Bernhard Reiter (ockham-razor) wrote :
affects: beautifulsoup (Ubuntu) → beautifulsoup4 (Ubuntu)
Bernhard Reiter (ockham-razor) wrote :

Okay, obviously this isn't as trivial as the patch I posted suggests.
http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/revision/323#NEWS.txt suggests that we need some more changes.

Bernhard Reiter (ockham-razor) wrote :

From an email conversation with leonardr:

The relevant changes should all be in revno 305. The diff is very large (~800 lines) and I don't know how well it would apply to 4.0.2, but that's where to look for it.

Actually, the changes are extensive enough that if you applied them to 4.0.2 I wouldn't feel comfortable calling the result "4.0.2." The change involves API changes, most notably to the UnicodeDammit class. That's why this release was called 4.3.0 instead of 4.2.2. I don't know how you deal with such things, but I wanted you to know.

Bernhard Reiter (ockham-razor) wrote :

People on #ubuntu-bugs suggest this is too big for an SRU; the alternative would be a backport.
(Not sure if I'm going to tackle this then.)
