lxml HTML parser mangles documents whose <meta> tags define the charset as other than UTF-8
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
beautifulsoup4 (Ubuntu) | Fix Released | Undecided | Unassigned |
Bug Description
---
markup = '''<?xml version="1.0" encoding=
1.0//EN" "http://
charset=ascii" />
soup = BeautifulSoup(
print soup.prettify()
soup = BeautifulSoup(
print soup.prettify()
---
Output #1:
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://
<html>
<head>
<title>
test
</title>
<meta content="text/html;
charset=utf-8" http-equiv=
<meta content="-1" http-equiv=
<meta content="no-cache" http-equiv=
<meta content="no-cache" http-equiv=
</head>
<body>
<p>
/ h e a d >
</p>
</body>
</html>
---
Output #2:
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile
1.0//EN" "http://
<html>
<head>
<title>
test
</title>
<meta content="text/html;
charset=utf-8" http-equiv=
<meta content="-1" http-equiv=
<meta content="no-cache" http-equiv=
<meta content="no-cache" http-equiv=
</head>
<body>
<p>
Actual content.
</p>
</body>
</html>
---
This problem does not occur when parsing the document with lxml's XML parser.
I believe this is a problem with lxml's feed interface. I already know there's a bug in feed() when given Unicode data (bug 963936), but I haven't heard back from the developers since filing that bug.
Assuming this is the same bug, or a related one, would explain why the problem only happens when the HTML document contains a <meta> tag defining the charset as something other than UTF-8. The HTML parser probably has some code for rewriting the <meta> tag, and that code doesn't play well with Unicode data. The XML parser doesn't trigger the bug because it leaves the <meta> tag alone.
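If that reading is right, the contrast is between running the same non-UTF-8 <meta> document through lxml's HTML parser and its XML parser. A minimal sketch (the markup and element lookups here are illustrative, not the original test document, and with a fixed lxml both parsers should now agree):

```python
from lxml import etree

# Illustrative HTML document whose <meta> tag declares a charset
# other than UTF-8 (here ascii), mirroring the bug report.
MARKUP = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=ascii"/>'
    b'</head><body><p>Actual content.</p></body></html>'
)

# lxml's HTML parser inspects (and may rewrite) the <meta> charset
# declaration, which is the code path suspected of mangling the body.
html_root = etree.fromstring(MARKUP, etree.HTMLParser())

# The XML parser leaves the <meta> tag alone, which is why the
# original bug never appeared on this code path.
xml_root = etree.fromstring(MARKUP, etree.XMLParser())
```

With the fix released, both trees should contain the intact body text.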
Changing the lxml HTML tree builder to remove the workaround for bug 963936 fixes the problem.
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Changed in beautifulsoup (Ubuntu):
status: New → Fix Released
affects: beautifulsoup (Ubuntu) → beautifulsoup4 (Ubuntu)
For the record, this happened when the data was split into 512-byte chunks and fed into feed() one chunk at a time. The fix was to pass all the data in at once.