Comment 3 for bug 1034883

Leonard Richardson (leonardr) wrote:

To make a long story short, this is a bug in lxml. Specifically, it's bug 963936, which prevents any substantial Unicode XML document from being parsed by lxml through the feed() interface.

Version 4.0.1 of Beautiful Soup worked around this by passing the document to lxml in 512-character chunks. This workaround exposed Beautiful Soup bug 972466: HTML documents whose <meta> tags declared an encoding other than UTF-8 became mangled, due to an unknown bug in lxml (possibly bug 963936 itself, possibly a different bug; certainly a related one, since both involve Unicode data).

Your NZB file starts with this line:

 <?xml version="1.0" encoding="iso-8859-1" ?>

Replace it with this line and the file will parse:

 <?xml version="1.0" encoding="utf8" ?>

It looks like you've reproduced bug 972466 with an XML document. The 'encoding' attribute in the XML declaration triggers the same underlying lxml bug as does the <meta> tag declaration in an HTML document.

For HTML documents, the solution (in Beautiful Soup 4.0.2) was to remove the workaround for bug 963936, since lxml's HTML parser isn't affected by that bug. lxml's XML parser is affected, and the bug has yet to be fixed upstream. Bug 963936 is much worse than this one, so the Beautiful Soup workaround needs to stay in place.

I recommend parsing your documents as HTML. Both the built-in parser and lxml's HTML parser find all the <segment> tags in the document.
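A minimal sketch of that approach (the NZB fragment below is invented for illustration; real NZB files are larger):

```python
from bs4 import BeautifulSoup

# A hypothetical NZB-like fragment, with the problematic declaration.
nzb = '''<?xml version="1.0" encoding="iso-8859-1" ?>
<nzb xmlns="http://www.newzbin.com/DTD/2003/nzb">
 <file subject="example">
  <segments>
   <segment bytes="1024" number="1">part1@example.com</segment>
   <segment bytes="1024" number="2">part2@example.com</segment>
  </segments>
 </file>
</nzb>'''

# Parse as HTML with the built-in parser; passing "lxml" here
# (lxml's HTML parser) should work just as well.
soup = BeautifulSoup(nzb, "html.parser")
segments = soup.find_all("segment")
print(len(segments))  # → 2
```

The HTML parsers don't choke on the encoding declaration, so all the <segment> tags come through.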

You can also use UnicodeDammit to convert the documents to Unicode, then replace 'encoding="iso-8859-1"' with 'encoding="utf8"'. The lxml XML parser will then parse the documents correctly.
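That workflow might look like this (again with an invented sample document; the "xml" parser requires lxml to be installed):

```python
from bs4 import BeautifulSoup, UnicodeDammit

# Hypothetical NZB-style document encoded as ISO-8859-1 bytes.
raw = ('<?xml version="1.0" encoding="iso-8859-1" ?>\n'
       '<nzb><file subject="caf\xe9"><segments>'
       '<segment number="1">part1@example.com</segment>'
       '</segments></file></nzb>').encode("iso-8859-1")

# Step 1: let UnicodeDammit detect the encoding and decode to Unicode.
dammit = UnicodeDammit(raw)
markup = dammit.unicode_markup

# Step 2: rewrite the declaration so it no longer claims iso-8859-1.
markup = markup.replace('encoding="iso-8859-1"', 'encoding="utf8"')

# Step 3: hand the fixed Unicode document to lxml's XML parser.
soup = BeautifulSoup(markup, "xml")
print(soup.find("segment").get_text())
```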