Beautiful Soup

lxml HTML parser mangles documents whose <meta> tags define the charset as other than UTF-8

Bug #972466 reported by Leonard Richardson on 2012-04-03

This bug affects 4 people

Affects                   Status         Importance   Assigned to
Beautiful Soup            Fix Released   Undecided    Unassigned
beautifulsoup4 (Ubuntu)   Fix Released   Undecided    Unassigned
Nominated for Precise by Dmitry Shachnev

Bug Description

---
from bs4 import BeautifulSoup

markup = '''<?xml version="1.0" encoding="ISO-8859-1"?>
                <!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
                <head>
                <title>test</title>
                <meta http-equiv="content-type" content="text/html;
charset=ascii" />
                <meta http-equiv="expires" content="-1" />
                <meta http-equiv="cache-control" content="no-cache" />
                <meta http-equiv="pragma" content="no-cache" />
                </head>
                <body >
                <p>Actual content.</p>
                </body>
                </html>'''

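# Output #1: the <meta> tag declares charset=ascii.
soup = BeautifulSoup(markup, "lxml")
print(soup.prettify())

# Output #2: the same document, but with the <meta> tag declaring charset=utf8.
soup = BeautifulSoup(markup.replace("ascii", "utf8"), "lxml")
print(soup.prettify())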
---

Output #1:

<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html>
<head>
  <title>
   test
  </title>
  <meta content="text/html;
charset=utf-8" http-equiv="content-type"/>
  <meta content="-1" http-equiv="expires"/>
  <meta content="no-cache" http-equiv="cache-control"/>
  <meta content="no-cache" http-equiv="pragma"/>
</head>
<body>
  <p>
   / h e a d >
                                                                   b o d y >
                                                                   p > A c t u a l c o n t e n t . / p >
                                                                   / b o d y >
                                                                   / h t m l >
  </p>
</body>
</html>

---

Output #2:

---

This problem does not occur when parsing the document with lxml's XML parser.

I believe this is a problem with lxml's feed interface. I already know there's a bug in feed() when it is given Unicode data (bug 963936), but I haven't heard back from the developers since filing that bug.
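
The spaced-out text in Output #1 looks like what you get when text held in a wide encoding is read back one byte per character. Here is a minimal sketch of that symptom using only the standard library. It does not call lxml, the helper name `misread_wide_text` is hypothetical, and the choice of UTF-16-LE and Latin-1 is an assumption made for illustration, not a confirmed account of what feed() does internally.

```python
def misread_wide_text(text, wide="utf-16-le", narrow="latin-1"):
    """Encode text in a wide encoding, then decode it as a single-byte one.

    Hypothetical helper: the encodings are assumptions chosen to reproduce
    the look of Output #1, not something taken from lxml's source.
    """
    raw = text.encode(wide)
    # In UTF-16-LE every ASCII character is followed by a NUL byte, so a
    # single-byte decoder sees a NUL character after each real character.
    decoded = raw.decode(narrow)
    # Show each NUL as a space so the pattern is visible when printed.
    return decoded.replace("\x00", " ")


print(repr(misread_wide_text("</head>")))
```

If something like this is going on, it would fit with feed() mishandling Unicode input, but that remains a guess until the lxml developers confirm it.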

If this is the same bug or a related one, that would explain why the problem only happens when the HTML document contains a <meta> tag defining the charset as something other than UTF-8. The HTML parser probably has some code for rewriting the <meta> tag, and that code doesn't play well with Unicode data. The XML parser doesn't trigger the bug because it leaves the <meta> tag alone.
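
To check whether a given document meets the triggering condition described above, something like the following standard-library sketch may help. The function names and the regular expression are hypothetical, and the check is a rough approximation: it looks for the first `charset=` inside any <meta> tag and does not parse the HTML properly.

```python
import codecs
import re

# Assumed pattern: finds charset=... inside a <meta> tag. The negated
# character class also matches newlines, so wrapped attributes are handled.
META_CHARSET = re.compile(
    r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE
)


def declared_meta_charset(markup):
    """Return the charset named in the first matching <meta> tag, or None."""
    match = META_CHARSET.search(markup)
    return match.group(1) if match else None


def declares_non_utf8_charset(markup):
    """True if a <meta> tag declares a charset that is not UTF-8."""
    charset = declared_meta_charset(markup)
    if charset is None:
        return False
    try:
        # codecs.lookup normalizes aliases such as "utf8" and "UTF-8".
        return codecs.lookup(charset).name != "utf-8"
    except LookupError:
        # A name Python does not recognize is certainly not UTF-8.
        return True


doc = ('<head><meta http-equiv="content-type" content="text/html;\n'
       'charset=ascii" /></head>')
print(declared_meta_charset(doc))
print(declares_non_utf8_charset(doc))
print(declares_non_utf8_charset(doc.replace("ascii", "utf8")))
```

Documents for which `declares_non_utf8_charset` returns True are the ones that would be expected to hit the problem when parsed as Unicode with the lxml HTML parser.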

Changing the lxml HTML tree builder to remove the 963936 workaround fixes the problem.